Intel® Ethernet Controller E810 Application Device Queues (ADQ)

Configuration Guide

ID 609008
Date 04/03/2023
Version 2.8

ATS Server Setup

The following variables are used in the examples in this section:

$iface

The PF interface in use.

$num_queues_tc0

The number of queues for default traffic.

$num_queues_tc1

The number of queues for the application traffic class.

$ip_addr

The IP Address of the interface under test.

$app_port

The TCP port of the ATS application being run on the SUT (for example, 80, 8080, or 8888; the default port for HTTP traffic is 80).

$cgroup_name

The name for the application group.

$iface_bdf

The network interface BDF notation (Bus:Device.Function) used by devlink.

$cpu_sockets

The list of sockets on the SUT (for example, 0,1 for a 2-socket system).

$tc1_qps_per_poller

The number of queues per poller thread.

$tc1_timeout_value

The timeout value for poller threads (a nonzero integer value in jiffies; default value 10000).

$pathtoicepackage

The path to the ice driver package.

${pathtotc}

The path to the tc command.

$core_number

The specific core number.

$pid

The process identification number.

$prio

The priority of the flow-based traffic control group.

$core_range_0-$core_range_1

The range of CPU cores assigned to the cgroup.

Note: Use of the independent poller feature with Apache Traffic Server is currently supported on the PF only (not supported on VFs).

  1. Perform general system OS install and setup.
    1. Complete the ADQ install and setup in ADQ System Under Test (SUT) Installation.

      Note: A SUT Linux kernel later than v5.12 is strongly recommended for optimal performance, and ice driver version 1.9.x is required for the Independent Pollers feature of ADQ 2.0.

      Note: For best ATS performance with ADQ, the following default ice driver settings should be used on the ATS Server:

      • Turn off packet optimizer: ethtool --set-priv-flags $iface channel-pkt-inspect-optimize off
      • Enable hardware tc offload: ethtool -K $iface hw-tc-offload on

    2. Enable threaded mode napi poll. For the interface that is being used, set:

      echo 1 > /sys/class/net/$iface/threaded

    3. Make two changes to the system tuning:

      1. Enable the latency-performance tuned profile.

        tuned-adm profile latency-performance

        Note: The tuned-adm utility is not installed by default in RHEL 9.0 systems. Install it with the command yum install tuned.

        Check that the settings are applied correctly:

        cat /etc/tuned/active_profile

        Output: latency-performance

        cat /etc/tuned/profile_mode

        Output: manual

      2. Set the CPU scaling governor to performance mode.

        x86_energy_perf_policy performance

        Check that the settings are applied correctly:

        cat /etc/tuned/active_profile

        Output: latency-performance

        cat /etc/tuned/profile_mode

        Output: manual

        cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

        Output: performance

        Note: The performance value is reported once for each CPU thread on the server.

    For best performance with Apache Traffic Server, set net.core.netdev_max_backlog and net.ipv4.tcp_max_syn_backlog as follows:

    sysctl -w net.core.netdev_max_backlog=250000
    sysctl -w net.ipv4.tcp_max_syn_backlog=250000

    Also set:

    sysctl -w net.core.busy_poll=0

    Note: Many settings in General System Tuning and those listed above do not persist between reboots and might need to be reapplied; a sketch of one way to reapply them is shown below.
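    One possible way to reapply these settings after a reboot is a small script run at boot (for example, from rc.local or a systemd unit). The following is only a sketch, built from the commands in this section; the interface name ens801f0 is a placeholder for $iface and should be adjusted to your setup:

    #!/bin/bash
    # Sketch: reapply ADQ-related tuning that does not persist across reboots.
    iface=ens801f0                 # placeholder; use your actual interface
    tuned-adm profile latency-performance
    x86_energy_perf_policy performance
    ethtool --set-priv-flags $iface channel-pkt-inspect-optimize off
    ethtool -K $iface hw-tc-offload on
    echo 1 > /sys/class/net/$iface/threaded
    sysctl -w net.core.netdev_max_backlog=250000
    sysctl -w net.ipv4.tcp_max_syn_backlog=250000
    sysctl -w net.core.busy_poll=0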
  2. Perform ATS build.

    1. Download the latest ATS version from the Apache Traffic Server project (for example, the project's download page or its GitHub repository).

    2. Compile and install ATS.

      cd trafficserver-${VER}

      Note: $VER is the trafficserver version that was downloaded.

      autoreconf -if
      ./configure
      make
      make install
    3. Set the number of ATS threads to be equal to $num_queues_tc1 using records.config.

      The records.config file, by default located in /usr/local/etc/trafficserver/, is a list of configurable variables used by the Traffic Server software.

      Field:

      CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1

      Set the multiplier to the ratio of the number of ADQ queues ($num_queues_tc1) to the total number of cores.

      Example:

      If the total number of cores is 144 and the number of queues ($num_queues_tc1) is 126, the multiplier would be 0.875 and the config line would be:

      CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 0.875
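      A quick way to compute the multiplier is to divide $num_queues_tc1 by the total number of cores. The snippet below is only a sketch using the example values above:

      total_cores=144        # example value from above
      num_queues_tc1=126     # example value from above
      awk -v q=$num_queues_tc1 -v c=$total_cores 'BEGIN { printf "%.3f\n", q/c }'
      # Prints 0.875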
  3. [ADQ only] Create Traffic Classes (TCs) on the interface under test. Note that TC1 should have the same number of queues as the number of threads that will be started by the ATS application.

    ${pathtotc}/tc qdisc add dev $iface root mqprio num_tc 2 map 0 1 queues $num_queues_tc0@0 $num_queues_tc1@$num_queues_tc0 hw 1 mode channel
    ${pathtotc}/tc qdisc add dev $iface clsact

    Note: Due to timing issues, applying TC filters immediately after the tc qdisc add command might result in the filters not being offloaded in hardware. An error in dmesg is logged if the filter fails to add properly. It is recommended to wait five seconds after tc qdisc add before adding TC filters.

    sleep 5
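    Example (hypothetical values, for illustration only): with 2 default queues ($num_queues_tc0=2) and 126 application queues ($num_queues_tc1=126), the commands would be:

    ${pathtotc}/tc qdisc add dev $iface root mqprio num_tc 2 map 0 1 queues 2@0 126@2 hw 1 mode channel
    ${pathtotc}/tc qdisc add dev $iface clsact
    sleep 5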
  4. [ADQ only] Create one TC filter on the interface under test.

    ${pathtotc}/tc filter add dev $iface protocol ip ingress prio 1 flower dst_ip $ip_addr/32 ip_proto tcp dst_port $app_port skip_sw hw_tc 1

    Note: The /32 in dst_ip $ip_addr/32 is not the subnet of the network being used, but the subnet of the filter you are creating. In other words, /32 indicates a single IP Address being filtered. It is recommended to use /32 when creating filters to limit the addresses being filtered.

    Note: ATS by default serves HTTP on port 8080 and HTTPS on port 8443. These defaults can be changed in /usr/local/etc/trafficserver/records.config.
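    Example (hypothetical values, for illustration only): with the interface IP Address 192.168.1.10 and ATS serving HTTP on port 8080, the filter would be:

    ${pathtotc}/tc filter add dev $iface protocol ip ingress prio 1 flower dst_ip 192.168.1.10/32 ip_proto tcp dst_port 8080 skip_sw hw_tc 1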
  5. [ADQ only] Confirm TC configuration:
    1. Check the TC filter.

      ${pathtotc}/tc filter show dev $iface ingress

    2. Check that TCs were created correctly.

      ${pathtotc}/tc qdisc show dev $iface
  6. For best performance with Apache Traffic Server, set the interrupt moderation rate to this static value for Tx and Rx.

    ethtool --coalesce ${iface} adaptive-rx off rx-usecs 100
    ethtool --coalesce ${iface} adaptive-tx off tx-usecs 100
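    To confirm the values were applied, the coalesce settings can be read back (rx-usecs and tx-usecs should both report 100):

    ethtool --show-coalesce ${iface}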
  7. Configure Independent pollers.

    ice version 1.9.x (and later):

    iface_bdf=$(ethtool -i ${iface} | grep bus-info | awk '{print $2}')
    devlink dev param set pci/${iface_bdf} name tc1_qps_per_poller value $tc1_qps_per_poller cmode runtime
    devlink dev param set pci/${iface_bdf} name tc1_poller_timeout value $tc1_timeout_value cmode runtime

    Note: A kernel with devlink param support is required for ice-1.8.x and later. See Install OS and Update Kernel (If Needed) to determine OS and kernel requirements.

    Note: Valid devlink param flags include tc1_qps_per_poller and tc1_poller_timeout through tc15_qps_per_poller and tc15_poller_timeout, to configure pollers on up to 16 TCs (the maximum number of TCs).
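    Example (hypothetical values, for illustration only): assuming the interface BDF resolves to 0000:af:00.0, 4 queues per poller, and the default timeout of 10000 jiffies:

    devlink dev param set pci/0000:af:00.0 name tc1_qps_per_poller value 4 cmode runtime
    devlink dev param set pci/0000:af:00.0 name tc1_poller_timeout value 10000 cmode runtime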
  8. Configure Flow Director.

    devlink dev param set pci/${iface_bdf} name tc1_inline_fd value true cmode runtime
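    A devlink parameter can typically be read back to confirm it was applied, for example:

    devlink dev param show pci/${iface_bdf} name tc1_inline_fd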
  9. Run the set_irq_affinity script for the interface under test.

    ${pathtoicepackage}/scripts/set_irq_affinity -X7 all $iface
  10. Configure symmetric queues on the interface using the script included in the scripts folder of the ice package.

    ${pathtoicepackage}/scripts/set_xps_rxqs $iface
  11. [ADQ only] Note: The system under test can be configured with a single NIC or multiple NICs; choose the option below that matches the configuration.
    1. Single NIC configuration

      Configure specific cores to run the dedicated poller threads to optimize CPU usage and improve SUT stability. On a 2-socket system with one NIC, use cores on the CPU located on the same NUMA node as the NIC.

      #!/bin/bash

      # Run a command and stop on failure.
      CHK () {
        "$@"
        if [ $? -ne 0 ]; then
          echo "Error with ${1}, stopping now!" >&2
          exit 1
        fi
      }

      # Pin the napi poller kernel threads to ${core_number}.
      # Assumes the napi thread PIDs are consecutive starting at start_pid.
      pin_napi_threads () {
        num_threads=$(ps -aef | grep "\[napi\/" | wc -l)
        start_pid=$(ps -ae | grep napi | head -n 1 | cut -d '?' -f 1)
        for (( i = 0; i < $num_threads; i++ ))
        do
          printf -v pid "%d" $((start_pid+i))
          printf -v cpu "%d" $((2*i+1))
          CHK taskset -p -c "${core_number}" $pid
        done
      }

      pin_napi_threads

      Note: More than one CPU core can be assigned to the poller threads; the number of cores must equal the number of configured poller threads, with the core numbers separated by commas.

      Example for 8 poller threads and therefore 8 cores:

      CHK taskset -p -c "36,37,38,39,40,41,42,43" $pid
      Note: When small objects are being transferred, more poller threads usually yield better performance (for example, 1 poller thread per 8-16 queues for 10 kB objects); for larger objects (>1 MB), a smaller number of poller threads (and therefore dedicated cores) can be used.
    2. Multiple NIC configuration

      If your system under test has multiple NICs, ensure that the system is balanced (that is, it has the same number of NICs per NUMA node). Queues should be divided equally among the NICs, and poller threads should be placed on each NUMA node in equal numbers and assigned to the NIC residing on the same NUMA node.

      Example:

      # Pin the poller threads of the first NIC ($iface1).
      num_threads=$(ps -aef | grep "\[napi\/$iface1" | wc -l)
      start_pid=$(ps -ae | grep napi\/$iface1 | head -n 1 | cut -d '?' -f 1)
      for (( i = 0; i < $num_threads; i++ ))
      do
        printf -v pid "%d" $((start_pid+i))
        printf -v cpu "%d" $((2*i+1))
        CHK taskset -p -c "${core_number}" $pid
      done

      # Pin the poller threads of the second NIC ($iface2).
      num_threads=$(ps -aef | grep "\[napi\/$iface2" | wc -l)
      start_pid=$(ps -ae | grep napi\/$iface2 | head -n 1 | cut -d '?' -f 1)
      for (( i = 0; i < $num_threads; i++ ))
      do
        printf -v pid "%d" $((start_pid+i))
        printf -v cpu "%d" $((2*i+1))
        CHK taskset -p -c "${core_number}" $pid
      done
  12. [ADQ only] In this example, we are using cgroups for pinning Apache Traffic Server threads and poller threads to specific cores. Other methods can be used to achieve the same effect.

    Create the cgroup and map it to the interface under test. Set the priority for processes belonging to the cgroup. The $prio value should map to the position of the TC targeted in Step 3, where the traffic classes (TCs) were created. See Create TCs for reference on TC priority mapping.

    cgcreate -g cpuset,memory,net_prio:${cgroup_name}
    cgset -r net_prio.ifpriomap="$iface $prio" ${cgroup_name}
    cgset -r cpuset.mems="${cpu_sockets}" ${cgroup_name}
    cgset -r cpuset.cpus="${core_range_0}-${core_range_1}" ${cgroup_name}
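    Example (hypothetical values, for illustration only): a cgroup named ats (matching the name used in Step 13), priority 1 to target TC1, memory from socket 0, and cores 2-35:

    cgcreate -g cpuset,memory,net_prio:ats
    cgset -r net_prio.ifpriomap="$iface 1" ats
    cgset -r cpuset.mems="0" ats
    cgset -r cpuset.cpus="2-35" ats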
  13. Start ATS either in the cgroup for ADQ, or without ADQ if testing baseline performance.
    1. [ADQ Only] Start ATS in the cgroup with the same number of threads that were configured for TC1.

      Example:

      /usr/local/bin/trafficserver start
      cgclassify -g cpuset,net_prio:ats $(ps -e | grep TS_MAIN | awk '{ print $1 }')
      cgclassify -g cpuset,net_prio:ats $(ps -e | grep ET_NET | awk '{ print $1 }')
    2. [Baseline] Start ATS.

      Example:

      /usr/local/bin/trafficserver start
  14. [ADQ only] Verify cgroup configuration:
    1. Show that the correct interface is mapped to the expected priority.

      cat /sys/fs/cgroup/net_prio/${cgroup_name}/net_prio.ifpriomap

    2. Show the Process IDs being run in the cgroup and match them to the application Process IDs.

      cat /sys/fs/cgroup/net_prio/${cgroup_name}/tasks
  15. [ADQ only] Troubleshooting: While the test is running, verify that ADQ traffic is on the correct queues.

    While ADQ application traffic is running, watch ethtool statistics to check that only the ADQ queues are being used (that is, have significant traffic) with busy poll (pkt_busy_poll) for ADQ traffic. If non busy poll (pkt_not_busy_poll) has significant counts and/or if traffic is not confined to ADQ queues, recheck the configuration steps carefully.

    watch -d -n 0.5 "ethtool -S $iface | grep busy | column"

    See Configure Intel Ethernet Flow Director Settings for example watch output.

    Note:If the benefits of ADQ enabled versus disabled are not as dramatic as expected even though the queues appear to be well aligned, it is possible that the performance is limited by the processing power of the client systems. If client CPU shows greater than an average of 80% CPU utilization on the CPU cores in use, it is probable that the client is becoming overloaded.

    To achieve maximum performance benefits, try increasing the number of client systems or their processing power.

    After the test finishes, remove the cgroup.

    jobs -p | xargs kill &> /dev/null
    cgdelete -g net_prio:${cgroup_name}

    After the test finishes, remove the ADQ configuration following steps in Clear the ADQ Configuration.