Intel® Ethernet Controller E810 Application Device Queues (ADQ)
Configuration Guide
ATS Server Setup
The following variables are used in the examples in this section:
$iface | The PF interface in use.
$num_queues_tc0 | The number of queues for default traffic.
$num_queues_tc1 | The number of queues for the application traffic class.
$ip_addr | The IP address of the interface under test.
$app_port | The TCP port of the ATS application being run on the SUT (for example, 80, 8080, 8888; the default port for HTTP traffic is 80).
$cgroup_name | The name for the application group.
$iface_bdf | The network interface BDF notation (Bus:Device.Function) used by devlink.
$cpu_sockets | The list of sockets on the SUT (0, 1 for a 2-socket system).
$tc1_qps_per_poller | The number of queues per poller thread.
$tc1_timeout_value | The timeout value for poller threads (nonzero integer value in jiffies; default value 10000).
$pathtoicepackage | The path to the ice driver package.
${pathtotc} | The path to the tc command.
$core_number | The specific core number.
$pid | The process identification number.
$prio | The priority of the flow-based traffic control group.
$core_range_0-$core_range_1 | The range of CPU cores assigned to the cgroup.
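For reference, the following is a purely illustrative way these variables might be set for one test session. The interface name, IP address, port, and core ranges are assumptions, not values taken from this guide:
iface=ens801f0                 # example PF interface name (assumption)
num_queues_tc0=2               # queues for default traffic
num_queues_tc1=126             # queues for the ATS traffic class (matches the records.config example below)
ip_addr=192.168.1.10           # example IP address of the interface under test (assumption)
app_port=8080                  # example ATS HTTP port
cgroup_name=ats                # cgroup name also used in the ATS start example below
prio=1                         # priority mapped to TC1
cpu_sockets=0,1                # sockets on a 2-socket SUT
core_range_0=2                 # example first core assigned to the cgroup (assumption)
core_range_1=127               # example last core assigned to the cgroup (assumption)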
- Perform general system OS install and setup.
- Complete the ADQ install and setup in ADQ System Under Test (SUT) Installation.
Note: SUT Linux kernel later than v5.12 is strongly recommended for optimal performance, and ice driver version 1.9.x is required for the Independent Pollers feature of ADQ 2.0.
Note: For best ATS performance with ADQ, the following default ice driver settings should be used on the ATS Server:
- Enable threaded mode napi poll.
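The command for enabling threaded mode napi poll is not shown at this step. On kernels that support threaded napi (v5.12 and later), one generic way to enable it is the per-device sysfs attribute; this is a sketch based on that kernel interface, not a setting reproduced from this guide, and it is applied after the interface is up:
echo 1 > /sys/class/net/${iface}/threaded    # 1 enables threaded napi poll, 0 disables it (assumed sysfs knob, kernels >= 5.12)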
Two changes to the system tuning are required.
- Enable the latency-performance tuned profile.
tuned-adm profile latency-performance
Note: The tuned daemon is not installed by default on RHEL 9.0 systems. Install it with the command yum install tuned.
Check that the settings are applied correctly:
cat /etc/tuned/active_profile
Output:
latency-performance
cat /etc/tuned/profile_mode
Output:
manual
- Set the CPU scaling governor to performance mode.
x86_energy_perf_policy performance
Check that the settings are applied correctly:
cat /etc/tuned/active_profile
Output:
latency-performance
cat /etc/tuned/profile_mode
Output:
manual
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Output:
performance
Note: The performance value is reported for each CPU thread on the server.
For best performance with Apache Traffic Server, set net.core.netdev_max_backlog and net.ipv4.tcp_max_syn_backlog as follows:
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=250000
Also set:
sysctl -w net.core.busy_poll=0
Note: Many settings in General System Tuning and those listed above do not persist between reboots and might need to be reapplied.
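Because these sysctl values do not persist across reboots, they can optionally be written to a drop-in file so they are reapplied at boot. This is a sketch assuming a standard /etc/sysctl.d layout; the file name is arbitrary and not part of this guide:
cat <<'EOF' > /etc/sysctl.d/90-ats-adq.conf
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 250000
net.core.busy_poll = 0
EOF
sysctl --system    # reload all sysctl configuration files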
- Perform ATS build.
- Download the latest ATS version from one of the following locations:
- Download the latest ATS version from GitHub:
Example:
git clone https://github.com/apache/trafficserver
Or
- Download the latest stable ATS release from apache.org:
Example:
wget https://dlcdn.apache.org/trafficserver/trafficserver-9.1.2.tar.bz2
Untar the package:
tar xf trafficserver-9.1.2.tar.bz2
- Compile and install ATS.
cd trafficserver-${VER}
Note: $VER is the trafficserver version that was downloaded.
autoreconf -if
./configure
make
make install
- Set the number of ATS threads to be equal to $num_queues_tc1 using records.config.
The records.config file, by default located in /usr/local/etc/trafficserver/, is a list of configurable variables used by the Traffic Server software.
Field:
CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 1
Set the multiplier according to the total number of cores versus the number of ADQ queues.
Example:
If the total number of cores is 144 and the number of queues ($num_queues_tc1) is 126, the multiplier would be 0.875 and the config line would be:
CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 0.875
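The scale value is simply $num_queues_tc1 divided by the total number of cores. A quick way to compute it on the SUT is sketched below (illustrative helper, not part of the guide; 126 queues on 144 cores prints 0.875 as in the example above):
total_cores=$(nproc --all)
awk -v q="$num_queues_tc1" -v c="$total_cores" 'BEGIN { printf "%.3f\n", q / c }'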
- [ADQ only] Create Traffic Classes (TCs) on the interface under test. Note that TC1 should have the same number of queues as the number of threads that will be started by the ATS application.
${pathtotc}/tc qdisc add dev $iface root mqprio num_tc 2 map 0 1 queues $num_queues_tc0@0 $num_queues_tc1@$num_queues_tc0 hw 1 mode channel
${pathtotc}/tc qdisc add dev $iface clsact
Note: Due to timing issues, applying TC filters immediately after the tc qdisc add command might result in the filters not being offloaded in hardware. An error is logged in dmesg if the filter fails to add properly. It is recommended to wait five seconds after tc qdisc add before adding TC filters.
sleep 5
- [ADQ only] Create one TC filter on the interface under test.
${pathtotc}/tc filter add dev $iface protocol ip ingress prio 1 flower dst_ip $ip_addr/32 ip_proto tcp dst_port $app_port skip_sw hw_tc 1
Note: The /32 in dst_ip $ip_addr/32 is not the subnet of the network being used, but the subnet of the filter being created. In other words, /32 indicates a single IP address being filtered. It is recommended to use /32 when creating filters to limit the addresses being filtered.
Note: ATS by default uses port 8080 for HTTP and port 8443 for HTTPS. This can be changed in /usr/local/etc/trafficserver/records.config.
- [ADQ only] Confirm the TC configuration (one way to do this is sketched below):
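A sketch using the standard tc show commands (the guide's own confirmation commands are not reproduced at this step):
${pathtotc}/tc qdisc show dev $iface
${pathtotc}/tc filter show dev $iface ingress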
- For best performance with Apache Traffic Server, set the interrupt moderation rate to this static value for Tx and Rx.
ethtool --coalesce ${iface} adaptive-rx off rx-usecs 100
ethtool --coalesce ${iface} adaptive-tx off tx-usecs 100
- Configure independent pollers.
ice version 1.9.x (and later):
iface_bdf=$(ethtool -i ${iface} | grep bus-info | awk '{print $2}')
devlink dev param set pci/${iface_bdf} name tc1_qps_per_poller value $tc1_qps_per_poller cmode runtime
devlink dev param set pci/${iface_bdf} name tc1_poller_timeout value $tc1_timeout_value cmode runtime
Note: A kernel with devlink param support is required for ice-1.8.x and later. See Install OS and Update Kernel (If Needed) to determine OS and kernel requirements.
Note: Valid devlink param flags include tc1_qps_per_poller and tc1_poller_timeout through tc15_qps_per_poller and tc15_poller_timeout, to configure pollers on up to 16 TCs (the maximum number of TCs).
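To confirm the poller parameters took effect, they can be read back with devlink (standard devlink syntax, shown here as a sketch rather than a step from the guide):
devlink dev param show pci/${iface_bdf} name tc1_qps_per_poller
devlink dev param show pci/${iface_bdf} name tc1_poller_timeout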
- Configure Flow Director.
devlink dev param set pci/${iface_bdf} name tc1_inline_fd value true cmode runtime
- Run the set_irq_affinity script for the interface under test.
${pathtoicepackage}/scripts/set_irq_affinity -X7 all $iface
- Configure symmetric queues on the interface using the script included in the scripts folder of the ice package.
${pathtoicepackage}/scripts/set_xps_rxqs $iface
- [ADQ only] Pin the poller threads to dedicated CPU cores.
Note: The system under test can be configured with a single NIC or multiple NICs; choose the option based on the configuration.
- Single NIC configuration
Configure specific cores to be responsible for dedicated poller threads to optimize CPU usage and improve SUT stability. On a 2-socket system with one NIC, use cores on the CPU located on the same NUMA node as the NIC.
#!/bin/bash
CHK () {
  "$@"
  if [ $? -ne 0 ]; then
    echo "Error with ${1}, stopping now!" >&2
    exit 1
  fi
}
pin_napi_threads () {
  # Count the napi poller kernel threads and find the PID of the first one
  num_threads=$(ps -aef | grep "\[napi\/" | wc -l)
  start_pid=$(ps -ae | grep napi | head -n 1 | cut -d '?' -f 1)
  # Pin each consecutive poller thread PID to the chosen core(s)
  for (( i = 0; i < $num_threads; i++ ))
  do
    printf -v pid "%d" $((start_pid+i))
    printf -v cpu "%d" $((2*i+1))
    CHK taskset -p -c "${core_number}" $pid
  done
}
pin_napi_threads
Note: More than one CPU core can be assigned to the poller threads; the number of cores must equal the number of configured poller threads, and the core numbers must be separated by commas. Example for 8 poller threads and therefore 8 cores:
CHK taskset -p -c "36,37,38,39,40,41,42,43" $pid
Note: When small objects are being transferred, more poller threads usually yield better performance (for example, 1 poller thread per 8-16 queues for 10 kB objects); for larger objects (>1 MB), a smaller number of poller threads (and therefore dedicated cores) can be used.
- Multiple NIC configuration
If your system under test has multiple NICs, ensure that the system is balanced (the same number of NICs per NUMA node). Queues should be divided equally across the NICs, and an equal number of poller threads should be placed on each NUMA node and assigned to the NICs residing on the same NUMA node as those poller threads.
Example:
num_threads=$(ps -aef | grep "\[napi\/$iface1" | wc -l)
start_pid=$(ps -ae | grep napi\/$iface1 | head -n 1 | cut -d '?' -f 1)
for (( i = 0; i < $num_threads; i++ ))
do
  printf -v pid "%d" $((start_pid+i))
  printf -v cpu "%d" $((2*i+1))
  CHK taskset -p -c "${core_number}" $pid
done
num_threads=$(ps -aef | grep "\[napi\/$iface2" | wc -l)
start_pid=$(ps -ae | grep napi\/$iface2 | head -n 1 | cut -d '?' -f 1)
for (( i = 0; i < $num_threads; i++ ))
do
  printf -v pid "%d" $((start_pid+i))
  printf -v cpu "%d" $((2*i+1))
  CHK taskset -p -c "${core_number}" $pid
done
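To confirm the pinning took effect, the current affinity of each poller thread can be listed. This sketch reuses the same ps pattern as the scripts above together with taskset in read-only mode:
for pid in $(ps -ae | grep napi | awk '{ print $1 }'); do taskset -pc "$pid"; done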
- [ADQ only] In this example, we are using cgroups for pinning Apache Traffic Server threads and poller threads to specific cores. Other methods can be used to achieve the same effect.
Create the cgroup and map it to the interface under test. Set the priority for processes belonging to the cgroup. The $prio value should map to the position of the targeted TC created in Step 3 (Create Traffic Classes (TCs)). See Create TCs for reference on TC priority mapping.
cgcreate -g cpuset,memory,net_prio:${cgroup_name}
cgset -r net_prio.ifpriomap="$iface $prio" ${cgroup_name}
cgset -r cpuset.mems="${cpu_sockets}" ${cgroup_name}
cgset -r cpuset.cpus="${core_range_0}-${core_range_1}" ${cgroup_name}
- Start ATS either in the cgroup for ADQ, or without ADQ if testing baseline performance.
- [ADQ Only] Start ATS in the cgroup with the same number of threads that were configured for TC1.
Example:
/usr/local/bin/trafficserver start
cgclassify -g cpuset,net_prio:ats $(ps -e | grep TS_MAIN | awk '{ print $1 }')
cgclassify -g cpuset,net_prio:ats $(ps -e | grep ET_NET | awk '{ print $1 }')
- [Baseline] Start ATS.
Example:
/usr/local/bin/trafficserver start
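To confirm that the number of ATS network threads matches $num_queues_tc1, the ET_NET threads can be counted. This is a sketch that assumes ATS names its exec threads with an ET_NET prefix (as in the grep patterns above) and uses the ps thread view:
ps -eLo comm | grep -c ET_NET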
- [ADQ only] Verify the cgroup configuration (one possible check is sketched below):
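Assuming the cgroup v1 controllers configured above are mounted under /sys/fs/cgroup, the settings and the ATS process classification can be read back, for example (sketch, not reproduced from the guide):
cat /sys/fs/cgroup/net_prio/${cgroup_name}/net_prio.ifpriomap
cat /sys/fs/cgroup/cpuset/${cgroup_name}/cpuset.cpus
cat /proc/$(ps -e | grep TS_MAIN | awk '{ print $1 }' | head -n 1)/cgroup    # shows the cgroups the ATS main process belongs to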
- [ADQ only] Troubleshooting: While the test is running, verify that ADQ traffic is on the correct queues.
While ADQ application traffic is running, watch ethtool statistics to check that only the ADQ queues are being used (that is, have significant traffic) with busy poll (pkt_busy_poll) for ADQ traffic. If non-busy poll (pkt_not_busy_poll) has significant counts and/or if traffic is not confined to ADQ queues, recheck the configuration steps carefully.
watch -d -n 0.5 "ethtool -S $iface | grep busy | column"
See Configure Intel Ethernet Flow Director Settings for example watch output.
Note:If the benefits of ADQ enabled versus disabled are not as dramatic as expected even though the queues appear to be well aligned, it is possible that the performance is limited by the processing power of the client systems. If client CPU shows greater than an average of 80% CPU utilization on the CPU cores in use, it is probable that the client is becoming overloaded. To achieve maximum performance benefits, try increasing the number of or the processing power of the client systems.
After the test finishes, remove the cgroup.
jobs -p | xargs kill &> /dev/null
cgdelete -g net_prio:${cgroup_name}
After the test finishes, remove the ADQ configuration following the steps in Clear the ADQ Configuration.
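The Clear the ADQ Configuration steps are not reproduced in this section. They typically include deleting the TC filters and qdiscs created above, for example (sketch only; follow the referenced section for the complete procedure):
${pathtotc}/tc qdisc del dev $iface clsact
${pathtotc}/tc qdisc del dev $iface root mqprio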