* [PATCH] pktgen: add a new sample script for 40G and above link testing
@ 2017-08-25  2:24 Robert Hoo
  2017-08-25  9:19 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 12+ messages in thread
From: Robert Hoo @ 2017-08-25  2:24 UTC (permalink / raw)
  To: robert.hu; +Cc: Robert Ho

From: Robert Ho <robert.hu@intel.com>

It's hard to benchmark 40G+ network bandwidth using ordinary
tools like iperf and netperf. I tried the pktgen multiqueue sample
scripts, but they still could not reach line rate.
I then derived this NUMA-aware IRQ affinity sample script from the
multiqueue sample and successfully benchmarked a 40G link. I think it can
also serve as a reference for 100G, though I don't have a device to test with.

This script simply does the following:
Detects the NUMA node that $DEV belongs to.
Binds each thread (a processor from that NUMA node) to one $DEV queue's
IRQ affinity, in a 1:1 mapping.
The number of threads given with '-t' determines how many queues are
used.

Tested with Intel XL710 NIC with Cisco 3172 switch.

Results are slightly better still if the irqbalance service is turned
off before running the script.

References:
https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf

Signed-off-by: Robert Hoo <robert.hu@intel.com>
---
 ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh

diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
new file mode 100755
index 0000000..f0ee25c
--- /dev/null
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -0,0 +1,132 @@
+#!/bin/bash
+#
+# Multiqueue: Using pktgen threads for sending on multiple CPUs
+#  * adding devices to kernel threads which are in the same NUMA node
+#  * binding each device queue's irq affinity to a thread, 1:1 mapping
+#  * notice the naming scheme for keeping device names unique
+#  * naming scheme: dev@thread_number
+#  * flow variation via random UDP source port
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+#
+# Required param: -i dev in $DEV
+source ${basedir}/parameters.sh
+
+get_iface_node()
+{
+	echo `cat /sys/class/net/$1/device/numa_node`
+}
+
+get_iface_irqs()
+{
+	local IFACE=$1
+	local queues="${IFACE}-.*TxRx"
+
+	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
+		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
+	    done)
+	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
+
+	echo $irqs
+}
+
+get_node_cpus()
+{
+	local node=$1
+	local node_cpu_list
+	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
+			/sys/devices/system/node/node$node/cpulist`
+
+	for cpu_range in $node_cpu_range_list
+	do
+		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
+	done
+
+	echo $node_cpu_list
+}
+
+
+# Base Config
+DELAY="0"        # Zero means max speed
+COUNT="20000000"   # Zero means indefinitely
+[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
+
+# Flow variation random source port between min and max
+UDP_MIN=9
+UDP_MAX=109
+
+node=`get_iface_node $DEV`
+irq_array=(`get_iface_irqs $DEV`)
+cpu_array=(`get_node_cpus $node`)
+
+[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
+	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
+
+# (example of setting default params in your script)
+if [ -z "$DEST_IP" ]; then
+    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
+fi
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((i = 0; i < $THREADS; i++)); do
+    # The device name is extended with @name, using thread number to
+    # make them unique, but any name will do.
+    # Set the queue's irq affinity to this $thread (processor)
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
+    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
+
+    # Remove all other devices and add_device $dev to thread
+    pg_thread $thread "rem_device_all"
+    pg_thread $thread "add_device" $dev
+
+    # select queue and bind the queue and $dev in 1:1 relationship
+    queue_num=$i
+    echo "queue number is $queue_num"
+    pg_set $dev "queue_map_min $queue_num"
+    pg_set $dev "queue_map_max $queue_num"
+
+    # Notice config queue to map to cpu (mirrors smp_processor_id())
+    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
+    pg_set $dev "flag QUEUE_MAP_CPU"
+
+    # Base config of dev
+    pg_set $dev "count $COUNT"
+    pg_set $dev "clone_skb $CLONE_SKB"
+    pg_set $dev "pkt_size $PKT_SIZE"
+    pg_set $dev "delay $DELAY"
+
+    # Flag example disabling timestamping
+    pg_set $dev "flag NO_TIMESTAMP"
+
+    # Destination
+    pg_set $dev "dst_mac $DST_MAC"
+    pg_set $dev "dst$IP6 $DEST_IP"
+
+    # Setup random UDP port src range
+    pg_set $dev "flag UDPSRC_RND"
+    pg_set $dev "udp_src_min $UDP_MIN"
+    pg_set $dev "udp_src_max $UDP_MAX"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((i = 0; i < $THREADS; i++)); do
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo "Device: $dev"
+    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done
-- 
1.8.3.1


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25  2:24 [PATCH] pktgen: add a new sample script for 40G and above link testing Robert Hoo
@ 2017-08-25  9:19 ` Jesper Dangaard Brouer
  2017-08-25 14:24   ` Waskiewicz Jr, Peter
  2017-09-01 13:48   ` Robert Hoo
  0 siblings, 2 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-25  9:19 UTC (permalink / raw)
  To: Robert Hoo; +Cc: brouer, robert.hu, netdev


(please don't use BCC on the netdev list, replies might miss the list in cc)

Comments inlined below:

On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@intel.com> wrote:

> From: Robert Ho <robert.hu@intel.com>
> 
> It's hard to benchmark 40G+ network bandwidth using ordinary
> tools like iperf, netperf. I then tried with pktgen multiqueue sample
> scripts, but still cannot reach line rate.

The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
Thus, the performance will suffer.

See the samples that use the burst feature:
  pktgen_sample03_burst_single_flow.sh
  pktgen_sample05_flow_per_thread.sh

With the pktgen "burst" feature, I can easily generate 40G.  Generating
100G is also possible, but often you will hit some HW limits before the
pktgen limit.  I experienced hitting both (1) PCIe Gen3 x8 limit, and (2)
memory bandwidth limit.
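
A minimal sketch of the two per-device settings mentioned above, assuming the pg_set helper from functions.sh is used to apply them. The function name burst_settings and the values 32 and 100000 are illustrative assumptions, not tuned recommendations:

```shell
#!/bin/bash
# Sketch only: print the pktgen per-device commands for bursting and
# skb cloning. burst_settings is a hypothetical helper name.
burst_settings()
{
    local burst=$1 clone_skb=$2
    # "burst N"     : HW level bursting of SKBs
    # "clone_skb N" : number of copies of the same packet
    printf 'burst %s\nclone_skb %s\n' "$burst" "$clone_skb"
}

# Usage on a live system (needs root, the pktgen module and functions.sh):
#   burst_settings 32 100000 | while read -r cmd; do
#       pg_set "$dev" "$cmd"
#   done
```

Keeping the command generation separate from pg_set makes it possible to inspect what would be written before touching /proc/net/pktgen.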


> I then derived this NUMA awared irq affinity sample script from
> multi-queue sample one, successfully benchmarked 40G link. I think this can
> also be useful for 100G reference, though I haven't got device to test.

Okay, so your issue was really related to NUMA irq affinity.  I do feel
that IRQ tuning lives outside the realm of the pktgen scripts, but
looking closer at your script, it doesn't look like you change the
IRQ setting, which is good.

You introduce some helper functions that make it possible to extract
NUMA information in the shell script code, really cool.  I would like
to see these functions integrated into the functions.sh file.

 
> This script simply does:
> Detect $DEV's NUMA node belonging.
> Bind each thread (processor from that NUMA node) with each $DEV queue's
> irq affinity, 1:1 mapping.
> How many '-t' threads input determines how many queues will be
> utilized.
> 
> Tested with Intel XL710 NIC with Cisco 3172 switch.
> 
> It would be even slightly better if the irqbalance service is turned
> off outside.

Yes, if you don't turn off (kill) irqbalance, it will move the
IRQs around behind your back...

 
> Referrences:
> https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> 
> Signed-off-by: Robert Hoo <robert.hu@intel.com>
> ---
>  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> 
> diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> new file mode 100755
> index 0000000..f0ee25c
> --- /dev/null
> +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> @@ -0,0 +1,132 @@
> +#!/bin/bash
> +#
> +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> +#  * adding devices to kernel threads which are in the same NUMA node
> +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> +#  * notice the naming scheme for keeping device names unique
> +#  * nameing scheme: dev@thread_number
> +#  * flow variation via random UDP source port
> +#
> +basedir=`dirname $0`
> +source ${basedir}/functions.sh
> +root_check_run_with_sudo "$@"
> +#
> +# Required param: -i dev in $DEV
> +source ${basedir}/parameters.sh
> +
> +get_iface_node()
> +{
> +	echo `cat /sys/class/net/$1/device/numa_node`

Here you could use the following shell trick to avoid using "cat":

 echo $(</sys/class/net/$1/device/numa_node)

It looks like you don't handle the case of -1, which indicates a non-NUMA
system.  You need to use something like:

get_iface_node()
{
	local node=$(</sys/class/net/$1/device/numa_node)
	if [[ $node == -1 ]]; then
		echo 0
	else
		echo $node
	fi
}
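
A slightly more defensive variant, as a sketch only. SYSFS_NET is a hypothetical override that exists purely so the function can be exercised against a scratch directory, and falling back to node 0 when the file is missing (as for interfaces without a backing device) is an assumption, not something the patch does:

```shell
#!/bin/bash
# SYSFS_NET is a hypothetical override used to make the sketch testable;
# on a real system it stays at /sys/class/net.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

get_iface_node()
{
    local file="$SYSFS_NET/$1/device/numa_node"
    local node=0

    # Assumption: if the file is absent, fall back to node 0.
    [ -r "$file" ] && node=$(<"$file")
    # -1 means no NUMA node is reported; treat it as node 0.
    [ "$node" = "-1" ] && node=0
    echo "$node"
}
```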


> +}
> +
> +get_iface_irqs()
> +{
> +	local IFACE=$1
> +	local queues="${IFACE}-.*TxRx"
> +
> +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> +	    done)

Nice that you handle all these different methods.  I personally look
in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename $(dirname $(dirname $F))) ;
   # Substring pattern removal
   hwq_nr=${irqname#*-*-}
   echo $hwq_nr > $F
   #grep . -H $F;
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list

Maybe I should switch to use:
   /sys/class/net/$IFACE/device/msi_irqs/*
 

> +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"

In the error case you should let the script die.  There is a helper
function for this called "err" (where first arg is the exitcode, which
is useful to detect the reason your script failed).
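
A minimal sketch of the lookup dying on failure. The err function below is a stand-in written for this example so that it is self-contained (the real script would source functions.sh instead); PROC_INTERRUPTS is a hypothetical override for testing, and exit code 3 is an arbitrary choice:

```shell
#!/bin/bash
# Stand-in for the "err" helper; assumed to behave like the one in
# samples/pktgen/functions.sh (print to stderr, exit with given code).
err()
{
    local exitcode=$1
    shift
    echo "ERROR: $*" >&2
    exit "$exitcode"
}

# Hypothetical override so the sketch can read a scratch file.
PROC_INTERRUPTS=${PROC_INTERRUPTS:-/proc/interrupts}

get_iface_irqs()
{
    local iface=$1
    local irqs

    irqs=$(grep -- "$iface" "$PROC_INTERRUPTS" | cut -f1 -d:)
    # Die instead of printing a message into the caller's IRQ list.
    [ -z "$irqs" ] && err 3 "Could not find interrupts for $iface"
    # Unquoted on purpose: collapses the list onto one line.
    echo $irqs
}
```

Note that when the function runs inside a command substitution, exit only ends that subshell, so the caller still has to check the status, for example `irqs=$(get_iface_irqs "$DEV") || exit $?`.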


> +	echo $irqs
> +}

> +get_node_cpus()
> +{
> +	local node=$1
> +	local node_cpu_list
> +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> +			/sys/devices/system/node/node$node/cpulist`
> +
> +	for cpu_range in $node_cpu_range_list
> +	do
> +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> +	done
> +
> +	echo $node_cpu_list
> +}
> +
> +
> +# Base Config
> +DELAY="0"        # Zero means max speed
> +COUNT="20000000"   # Zero means indefinitely
> +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> +
> +# Flow variation random source port between min and max
> +UDP_MIN=9
> +UDP_MAX=109
> +
> +node=`get_iface_node $DEV`
> +irq_array=(`get_iface_irqs $DEV`)
> +cpu_array=(`get_node_cpus $node`)

Nice trick to generate an array.
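
A sketch of the cpulist expansion that feeds such an array, written as a standalone function. The name expand_cpulist is hypothetical. One detail it handles explicitly: a lone CPU such as "8" is passed to seq as both start and end, because `seq 8` on its own would count from 1:

```shell
#!/bin/bash
# Expand a kernel cpulist string such as "0-3,8-11" into one CPU
# number per word. expand_cpulist is a hypothetical helper name.
expand_cpulist()
{
    local cpulist=$1
    local range list=""

    for range in ${cpulist//,/ }; do
        # "0-3" -> seq 0 3 ; a single CPU "8" -> seq 8 8
        list="$list $(seq -s ' ' "${range%-*}" "${range#*-}")"
    done

    # Unquoted on purpose: collapses the extra whitespace.
    echo $list
}

# Usage: cpu_array=($(expand_cpulist "$(</sys/devices/system/node/node0/cpulist)"))
```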

> +
> +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> +
> +# (example of setting default params in your script)
> +if [ -z "$DEST_IP" ]; then
> +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> +fi
> +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> +
> +# General cleanup everything since last run
> +pg_ctrl "reset"
> +
> +# Threads are specified with parameter -t value in $THREADS
> +for ((i = 0; i < $THREADS; i++)); do
> +    # The device name is extended with @name, using thread number to
> +    # make then unique, but any name will do.
> +    # Set the queue's irq affinity to this $thread (processor)
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@${thread}
> +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> +
> +    # Add remove all other devices and add_device $dev to thread
> +    pg_thread $thread "rem_device_all"
> +    pg_thread $thread "add_device" $dev
> +
> +    # select queue and bind the queue and $dev in 1:1 relationship
> +    queue_num=$i
> +    echo "queue number is $queue_num"
> +    pg_set $dev "queue_map_min $queue_num"
> +    pg_set $dev "queue_map_max $queue_num"
> +
> +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> +    pg_set $dev "flag QUEUE_MAP_CPU"
> +
> +    # Base config of dev
> +    pg_set $dev "count $COUNT"
> +    pg_set $dev "clone_skb $CLONE_SKB"
> +    pg_set $dev "pkt_size $PKT_SIZE"
> +    pg_set $dev "delay $DELAY"
> +
> +    # Flag example disabling timestamping
> +    pg_set $dev "flag NO_TIMESTAMP"
> +
> +    # Destination
> +    pg_set $dev "dst_mac $DST_MAC"
> +    pg_set $dev "dst$IP6 $DEST_IP"
> +
> +    # Setup random UDP port src range
> +    pg_set $dev "flag UDPSRC_RND"
> +    pg_set $dev "udp_src_min $UDP_MIN"
> +    pg_set $dev "udp_src_max $UDP_MAX"
> +done
> +
> +# start_run
> +echo "Running... ctrl^C to stop" >&2
> +pg_ctrl "start"
> +echo "Done" >&2
> +
> +# Print results
> +for ((i = 0; i < $THREADS; i++)); do
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@${thread}
> +    echo "Device: $dev"
> +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> +done



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25  9:19 ` Jesper Dangaard Brouer
@ 2017-08-25 14:24   ` Waskiewicz Jr, Peter
  2017-08-25 14:59     ` Jesper Dangaard Brouer
  2017-09-01 13:57     ` Robert Hoo
  2017-09-01 13:48   ` Robert Hoo
  1 sibling, 2 replies; 12+ messages in thread
From: Waskiewicz Jr, Peter @ 2017-08-25 14:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Hu, Robert; +Cc: robert.hu, netdev

On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>
>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>
>> It would be even slightly better if the irqbalance service is turned
>> off outside.
> 
> Yes, if you don't turn-off (kill) irqbalance it will move around the
> IRQs behind your back...

Or you can use the --banirq option to irqbalance to ignore your device's 
interrupts as targets for balancing.

Cheers,
-PJ


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25 14:24   ` Waskiewicz Jr, Peter
@ 2017-08-25 14:59     ` Jesper Dangaard Brouer
  2017-08-25 15:11       ` Waskiewicz Jr, Peter
  2017-09-01 13:57     ` Robert Hoo
  1 sibling, 1 reply; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-25 14:59 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter; +Cc: Hu, Robert, robert.hu, netdev, brouer

On Fri, 25 Aug 2017 14:24:28 +0000
"Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:

> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
> >>
> >> Tested with Intel XL710 NIC with Cisco 3172 switch.
> >>
> >> It would be even slightly better if the irqbalance service is turned
> >> off outside.  
> > 
> > Yes, if you don't turn-off (kill) irqbalance it will move around the
> > IRQs behind your back...  
> 
> Or you can use the --banirq option to irqbalance to ignore your device's 
> interrupts as targets for balancing.

It might be worth mentioning that --banirq=X is specified for each IRQ
that you want to exclude, and --banirq is simply specified multiple
times on the command line.
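
A small sketch of building those repeated options from a list of IRQ numbers. The helper name banirq_args is hypothetical; only the --banirq=N option itself comes from irqbalance:

```shell
#!/bin/bash
# Print one --banirq=N argument per IRQ number given.
banirq_args()
{
    local irq args=()

    for irq in "$@"; do
        args+=("--banirq=$irq")
    done
    echo "${args[@]}"
}

# Usage (needs root; the IRQ numbers are illustrative):
#   irqbalance $(banirq_args 45 46 47)
```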

Is it possible to tell a running irqbalance that I want to exclude an
extra IRQ (just before I do my manual adjustment)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25 14:59     ` Jesper Dangaard Brouer
@ 2017-08-25 15:11       ` Waskiewicz Jr, Peter
  0 siblings, 0 replies; 12+ messages in thread
From: Waskiewicz Jr, Peter @ 2017-08-25 15:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Hu, Robert, robert.hu, netdev

On 8/25/17 10:59 AM, Jesper Dangaard Brouer wrote:
> On Fri, 25 Aug 2017 14:24:28 +0000
> "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:
> 
>> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>>>
>>>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>>>
>>>> It would be even slightly better if the irqbalance service is turned
>>>> off outside.
>>>
>>> Yes, if you don't turn-off (kill) irqbalance it will move around the
>>> IRQs behind your back...
>>
>> Or you can use the --banirq option to irqbalance to ignore your device's
>> interrupts as targets for balancing.
> 
> It might be worth mentioning that --banirq=X is specified for each IRQ
> that you want to exclude, and --banirq is simply specified multiple
> times on the command line.
> 
> Is it possible to tell a running irqbalance that I want to excluded an
> extra IRQ? (just before I do my manual adjustment).

It isn't possible today, since we don't have a way to attach a 
foreground/oneshot irqbalance run to a currently-running daemon.  That's 
an interesting feature enhancement...I can add it to our list as a 
feature request so I don't forget about it.  That way I can also get 
Neil's thoughts on this.

-PJ


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25  9:19 ` Jesper Dangaard Brouer
  2017-08-25 14:24   ` Waskiewicz Jr, Peter
@ 2017-09-01 13:48   ` Robert Hoo
  1 sibling, 0 replies; 12+ messages in thread
From: Robert Hoo @ 2017-09-01 13:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, davem, tariqt, kyle.leet

On Fri, 2017-08-25 at 11:19 +0200, Jesper Dangaard Brouer wrote:
> (please don't use BCC on the netdev list, replies might miss the list in cc)
> 
> Comments inlined below:
> 
> On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@intel.com> wrote:
> 
> > From: Robert Ho <robert.hu@intel.com>
> > 
> > It's hard to benchmark 40G+ network bandwidth using ordinary
> > tools like iperf, netperf. I then tried with pktgen multiqueue sample
> > scripts, but still cannot reach line rate.
> 
> The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
> Thus, the performance will suffer.
> 
> See the samples that use the burst feature:
>   pktgen_sample03_burst_single_flow.sh
>   pktgen_sample05_flow_per_thread.sh
> 
> With the pktgen "burst" feature, I can easily generate 40G.  Generating
> 100G is also possible, but often you will hit some HW limits before the
> pktgen limit.  I experienced hitting both (1) PCIe Gen3 x8 limit, and (2)
> memory bandwidth limit.

Thanks, Jesper, for the review. Sorry for the late reply; I do this part time.

I just tried 'pktgen_sample03_burst_single_flow.sh' and
'pktgen_sample05_flow_per_thread.sh' with these commands:
	./pktgen_sample05_flow_per_thread.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107
	./pktgen_sample03_burst_single_flow.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107

Indeed, they can achieve nearly 40G, though still slightly less than my
script: pktgen_sample03 and pktgen_sample05 reach approximately
38xxxMb/sec ~ 39xxxMb/sec, while my script reaches 40xxxMb/sec ~ 41xxxMb/sec
(with threads >= 2).

So a general question: is it still necessary to continue my
sample06_numa_awared_queue_irq_affinity work, given that sample03 and
sample05 already come close to the 40G line rate?

> 
> 
> > I then derived this NUMA awared irq affinity sample script from
> > multi-queue sample one, successfully benchmarked 40G link. I think this can
> > also be useful for 100G reference, though I haven't got device to test.
> 
> Okay, so your issue was really related to NUMA irq affinity.  I do feel
> that IRQ tuning lives outside the realm of the pktgen scripts, but
> looking closer at your script, I it doesn't look like you change the
> IRQ setting which is good.  

Sorry, I don't quite understand the above. I did change the IRQ affinities;
see "echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list".
Would you prefer that I not change them? I can restore them to their original
values at the end of the script.
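
A minimal sketch of how such a save and restore could look. The function names and the PROC_IRQ override are hypothetical and exist only for illustration; on a real system the path is /proc/irq and writing to it needs root:

```shell
#!/bin/bash
# PROC_IRQ is a hypothetical override so the sketch can run against a
# scratch directory instead of the real /proc/irq.
PROC_IRQ=${PROC_IRQ:-/proc/irq}

# Indexed by IRQ number; holds the original smp_affinity_list content.
saved_affinity=()

save_irq_affinity()
{
    local irq
    for irq in "$@"; do
        saved_affinity[$irq]=$(<"$PROC_IRQ/$irq/smp_affinity_list")
    done
}

restore_irq_affinity()
{
    local irq
    for irq in "${!saved_affinity[@]}"; do
        echo "${saved_affinity[$irq]}" > "$PROC_IRQ/$irq/smp_affinity_list"
    done
}

# Usage inside the sample script, before the affinities are changed:
#   save_irq_affinity "${irq_array[@]}"
#   trap restore_irq_affinity EXIT
```

Registering the restore with trap means the original affinities also come back when the run is interrupted with ctrl-C.
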
> 
> You introduce some helper functions take makes it possible to extract
> NUMA information in the shell script code, really cool.  I would like
> to see these functions being integrated into the function.sh file.

Yes, that is doable if you, as maintainer, think so.
> 
>  
> > This script simply does:
> > Detect $DEV's NUMA node belonging.
> > Bind each thread (processor from that NUMA node) with each $DEV queue's
> > irq affinity, 1:1 mapping.
> > How many '-t' threads input determines how many queues will be
> > utilized.
> > 
> > Tested with Intel XL710 NIC with Cisco 3172 switch.
> > 
> > It would be even slightly better if the irqbalance service is turned
> > off outside.
> 
> Yes, if you don't turn-off (kill) irqbalance it will move around the
> IRQs behind your back...

Yes, although in my experiments it turned out to affect the result very little.
> 
>  
> > Referrences:
> > https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> > http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> > 
> > Signed-off-by: Robert Hoo <robert.hu@intel.com>
> > ---
> >  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
> >  1 file changed, 132 insertions(+)
> >  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > 
> > diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > new file mode 100755
> > index 0000000..f0ee25c
> > --- /dev/null
> > +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > @@ -0,0 +1,132 @@
> > +#!/bin/bash
> > +#
> > +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> > +#  * adding devices to kernel threads which are in the same NUMA node
> > +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> > +#  * notice the naming scheme for keeping device names unique
> > +#  * nameing scheme: dev@thread_number
> > +#  * flow variation via random UDP source port
> > +#
> > +basedir=`dirname $0`
> > +source ${basedir}/functions.sh
> > +root_check_run_with_sudo "$@"
> > +#
> > +# Required param: -i dev in $DEV
> > +source ${basedir}/parameters.sh
> > +
> > +get_iface_node()
> > +{
> > +	echo `cat /sys/class/net/$1/device/numa_node`
> 
> Here you could use the following shell trick to avoid using "cat":
> 
>  echo $(</sys/class/net/$1/device/numa_node)

Thanks for the tip. Indeed, this is more concise.
> 
> It looks like you don't handle the case of -1, which indicate non-NUMA
> system.  You need to use something like::
> 
> get_iface_node()
> {
>     local node=$(</sys/class/net/$1/device/numa_node)
>     if [[ $node == -1 ]]; then
> 	echo 0
>     else
> 	echo $node
>     fi
> }

Yes, I can amend this in v2.
> 
> 
> > +}
> > +
> > +get_iface_irqs()
> > +{
> > +	local IFACE=$1
> > +	local queues="${IFACE}-.*TxRx"
> > +
> > +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> > +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> > +	    done)
> 
> Nice that you handle all these different methods.  I personally look
> in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):
> 
> echo " --- Align IRQs ---"
> # I've named my NICs ixgbe1 + ixgbe2
> for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
>    # Extract irqname e.g. "ixgbe2-TxRx-2"
>    irqname=$(basename $(dirname $(dirname $F))) ;
>    # Substring pattern removal
>    hwq_nr=${irqname#*-*-}
>    echo $hwq_nr > $F
>    #grep . -H $F;
> done
> grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
> 
> Maybe I should switch to use:
>    /sys/class/net/$IFACE/device/msi_irqs/*
>  
> 
> > +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
> 
> In the error case you should let the script die.  There is a helper
> function for this called "err" (where first arg is the exitcode, which
> is useful to detect the reason your script failed).

Yes, I noticed that helper function and changed some of my original "echo Error"s;
I missed this one during code cleanup. I can amend it in v2.
> 
> 
> > +	echo $irqs
> > +}
> 
> > +get_node_cpus()
> > +{
> > +	local node=$1
> > +	local node_cpu_list
> > +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> > +			/sys/devices/system/node/node$node/cpulist`
> > +
> > +	for cpu_range in $node_cpu_range_list
> > +	do
> > +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> > +	done
> > +
> > +	echo $node_cpu_list
> > +}
> > +
> > +
> > +# Base Config
> > +DELAY="0"        # Zero means max speed
> > +COUNT="20000000"   # Zero means indefinitely
> > +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> > +
> > +# Flow variation random source port between min and max
> > +UDP_MIN=9
> > +UDP_MAX=109
> > +
> > +node=`get_iface_node $DEV`
> > +irq_array=(`get_iface_irqs $DEV`)
> > +cpu_array=(`get_node_cpus $node`)
> 
> Nice trick to generate an array.
> 
> > +
> > +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> > +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> > +
> > +# (example of setting default params in your script)
> > +if [ -z "$DEST_IP" ]; then
> > +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> > +fi
> > +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> > +
> > +# General cleanup everything since last run
> > +pg_ctrl "reset"
> > +
> > +# Threads are specified with parameter -t value in $THREADS
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    # The device name is extended with @name, using thread number to
> > +    # make then unique, but any name will do.
> > +    # Set the queue's irq affinity to this $thread (processor)
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> > +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> > +
> > +    # Add remove all other devices and add_device $dev to thread
> > +    pg_thread $thread "rem_device_all"
> > +    pg_thread $thread "add_device" $dev
> > +
> > +    # select queue and bind the queue and $dev in 1:1 relationship
> > +    queue_num=$i
> > +    echo "queue number is $queue_num"
> > +    pg_set $dev "queue_map_min $queue_num"
> > +    pg_set $dev "queue_map_max $queue_num"
> > +
> > +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> > +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> > +    pg_set $dev "flag QUEUE_MAP_CPU"
> > +
> > +    # Base config of dev
> > +    pg_set $dev "count $COUNT"
> > +    pg_set $dev "clone_skb $CLONE_SKB"
> > +    pg_set $dev "pkt_size $PKT_SIZE"
> > +    pg_set $dev "delay $DELAY"
> > +
> > +    # Flag example disabling timestamping
> > +    pg_set $dev "flag NO_TIMESTAMP"
> > +
> > +    # Destination
> > +    pg_set $dev "dst_mac $DST_MAC"
> > +    pg_set $dev "dst$IP6 $DEST_IP"
> > +
> > +    # Setup random UDP port src range
> > +    pg_set $dev "flag UDPSRC_RND"
> > +    pg_set $dev "udp_src_min $UDP_MIN"
> > +    pg_set $dev "udp_src_max $UDP_MAX"
> > +done
> > +
> > +# start_run
> > +echo "Running... ctrl^C to stop" >&2
> > +pg_ctrl "start"
> > +echo "Done" >&2
> > +
> > +# Print results
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo "Device: $dev"
> > +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> > +done
> 
> 
> 


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25 14:24   ` Waskiewicz Jr, Peter
  2017-08-25 14:59     ` Jesper Dangaard Brouer
@ 2017-09-01 13:57     ` Robert Hoo
  1 sibling, 0 replies; 12+ messages in thread
From: Robert Hoo @ 2017-09-01 13:57 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter; +Cc: Jesper Dangaard Brouer, netdev

On Fri, 2017-08-25 at 14:24 +0000, Waskiewicz Jr, Peter wrote:
> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
> >>
> >> Tested with Intel XL710 NIC with Cisco 3172 switch.
> >>
> >> It would be even slightly better if the irqbalance service is turned
> >> off outside.
> > 
> > Yes, if you don't turn-off (kill) irqbalance it will move around the
> > IRQs behind your back...
> 
> Or you can use the --banirq option to irqbalance to ignore your device's 
> interrupts as targets for balancing.

Oh, I wasn't aware of this parameter. I will be glad to try it later.
Meanwhile, in my test above, the irqbalance service affected the result
very little.
> 
> Cheers,
> -PJ


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-27  8:25 ` Tariq Toukan
@ 2017-09-01 13:53   ` Robert Hoo
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Hoo @ 2017-09-01 13:53 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: davem, brouer, kyle.leet, netdev

On Sun, 2017-08-27 at 11:25 +0300, Tariq Toukan wrote:
> 
> On 25/08/2017 12:26 PM, Robert Hoo wrote:
> > (Sorry for yesterday's wrong sending, I finally fixed my MTA and git
> > send-email settings.)
> > 
> > It's hard to benchmark 40G+ network bandwidth using ordinary
> > tools like iperf, netperf (see reference 1).
> > Pktgen, packet generator from Kernel sapce, shall be a candidate.
> > I then tried with pktgen multiqueue sample scripts, but still
> > cannot reach line rate.
> 
> Try samples 03 and 04.

Thanks, Tariq, for the review. Sorry for the late reply; I do this part time.

Yes, I just tried samples 03 and 04. They can approximately reach the 40G
line rate, though still slightly less than my script :) (see my reply
to Jesper).
> 
> > I then derived this NUMA-aware irq affinity sample script from the
> > multiqueue sample and successfully benchmarked a 40G link. I think it can
> > also be a useful reference for 100G, though I don't have a device to test yet.
> > 
> > This script simply does the following:
> >  * Detects the NUMA node that $DEV belongs to.
> >  * Binds each thread (a processor from that NUMA node) to the irq
> >    affinity of one $DEV queue, in a 1:1 mapping.
> >  * The number of threads given with '-t' determines how many queues
> >    are used.
> 
> I agree this is an essential capability.
> This was the main reason I added support for the -f argument.
> Using it, I could choose cores on the local NUMA node, especially for a
> single thread, or when the node's cores are numbered sequentially.

Indeed, this argument is very helpful.
Sorry, I didn't take it into consideration in v1. I should handle the
case where the user specifies '-f'. I can improve this in v2.
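
A rough sketch of what honoring '-f' could look like, assuming '-f' is
treated as an offset into the NUMA node's CPU list rather than as an
absolute CPU number. select_cpus and its arguments are hypothetical names
for illustration only; this is not the v2 code:

```shell
#!/bin/bash
# Hypothetical helper: print $2 CPUs from a node-local CPU list,
# starting at list offset $1.
#   $1 = first thread offset (stand-in for the value given with -f)
#   $2 = number of threads   (stand-in for the value given with -t)
#   remaining arguments = CPUs of the device's NUMA node
select_cpus()
{
	local first_thread=$1
	local threads=$2
	local out=""
	shift 2

	# Refuse requests that run past the end of the node's CPU list.
	if [ $((first_thread + threads)) -gt $# ]; then
		echo "Error: need $threads CPUs from offset $first_thread, node has $#" >&2
		return 1
	fi

	# Skip the CPUs before the offset, then collect the requested count.
	shift "$first_thread"
	while [ "$threads" -gt 0 ]; do
		out="$out $1"
		shift
		threads=$((threads - 1))
	done

	printf '%s\n' "${out# }"
}

# Example with a made-up node whose CPUs are 0-3 and 18-19:
#   select_cpus 2 3 0 1 2 3 18 19
```

Treating '-f' as a list offset keeps every selected CPU on the device's
node even when the node's CPU numbers are not contiguous.
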
> 
> > 
> > Tested with an Intel XL710 NIC and a Cisco 3172 switch.
> > 
> > Results are slightly better if the irqbalance service is turned
> > off before running the script.
> > 
> > References:
> > https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> > http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> > 
> > Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> > ---
> 
> Regards,
> Tariq Toukan


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25  9:26 Robert Hoo
  2017-08-25  9:47 ` Jesper Dangaard Brouer
@ 2017-08-27  8:25 ` Tariq Toukan
  2017-09-01 13:53   ` Robert Hoo
  1 sibling, 1 reply; 12+ messages in thread
From: Tariq Toukan @ 2017-08-27  8:25 UTC (permalink / raw)
  To: Robert Hoo, davem, tariqt, brouer, kyle.leet; +Cc: netdev, robert.hu



On 25/08/2017 12:26 PM, Robert Hoo wrote:
> (Sorry for yesterday's wrong sending, I finally fixed my MTA and git
> send-email settings.)
> 
> It's hard to benchmark 40G+ network bandwidth using ordinary
> tools like iperf and netperf (see reference 1).
> Pktgen, the packet generator in kernel space, should be a candidate.
> I first tried the pktgen multiqueue sample scripts, but still
> could not reach line rate.

Try samples 03 and 04.

> I then derived this NUMA-aware irq affinity sample script from the
> multiqueue sample and successfully benchmarked a 40G link. I think it can
> also be a useful reference for 100G, though I don't have a device to test yet.
> 
> This script simply does the following:
>  * Detects the NUMA node that $DEV belongs to.
>  * Binds each thread (a processor from that NUMA node) to the irq
>    affinity of one $DEV queue, in a 1:1 mapping.
>  * The number of threads given with '-t' determines how many queues
>    are used.

I agree this is an essential capability.
This was the main reason I added support for the -f argument.
Using it, I could choose cores on the local NUMA node, especially for a
single thread, or when the node's cores are numbered sequentially.

> 
> Tested with an Intel XL710 NIC and a Cisco 3172 switch.
> 
> Results are slightly better if the irqbalance service is turned
> off before running the script.
> 
> References:
> https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> 
> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
> ---

Regards,
Tariq Toukan


* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
  2017-08-25  9:26 Robert Hoo
@ 2017-08-25  9:47 ` Jesper Dangaard Brouer
  2017-08-27  8:25 ` Tariq Toukan
  1 sibling, 0 replies; 12+ messages in thread
From: Jesper Dangaard Brouer @ 2017-08-25  9:47 UTC (permalink / raw)
  To: Robert Hoo; +Cc: davem, tariqt, kyle.leet, netdev, robert.hu, brouer

On Fri, 25 Aug 2017 17:26:36 +0800
Robert Hoo <robert.hu@linux.intel.com> wrote:

> (Sorry for yesterday's wrong sending, I finally fixed my MTA and git
> send-email settings.)

Please see my reply in:
  http://lkml.kernel.org/r/20170825111921.061713c8@redhat.com

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* [PATCH] pktgen: add a new sample script for 40G and above link testing
@ 2017-08-25  9:26 Robert Hoo
  2017-08-25  9:47 ` Jesper Dangaard Brouer
  2017-08-27  8:25 ` Tariq Toukan
  0 siblings, 2 replies; 12+ messages in thread
From: Robert Hoo @ 2017-08-25  9:26 UTC (permalink / raw)
  To: davem, tariqt, brouer, kyle.leet; +Cc: netdev, robert.hu, Robert Hoo

(Sorry for yesterday's wrong sending, I finally fixed my MTA and git
send-email settings.)

It's hard to benchmark 40G+ network bandwidth using ordinary
tools like iperf and netperf (see reference 1).
Pktgen, the packet generator in kernel space, should be a candidate.
I first tried the pktgen multiqueue sample scripts, but still
could not reach line rate.
I then derived this NUMA-aware irq affinity sample script from the
multiqueue sample and successfully benchmarked a 40G link. I think it can
also be a useful reference for 100G, though I don't have a device to test yet.

This script simply does the following:
 * Detects the NUMA node that $DEV belongs to.
 * Binds each thread (a processor from that NUMA node) to the irq
   affinity of one $DEV queue, in a 1:1 mapping.
 * The number of threads given with '-t' determines how many queues
   are used.

Tested with an Intel XL710 NIC and a Cisco 3172 switch.

Results are slightly better if the irqbalance service is turned
off before running the script.

References:
https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
---
 ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh

diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
new file mode 100755
index 0000000..f0ee25c
--- /dev/null
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -0,0 +1,132 @@
+#!/bin/bash
+#
+# Multiqueue: Using pktgen threads for sending on multiple CPUs
+#  * adding devices to kernel threads which are in the same NUMA node
+#  * bind each device queue's irq affinity to a thread, 1:1 mapping
+#  * notice the naming scheme for keeping device names unique
+#  * naming scheme: dev@thread_number
+#  * flow variation via random UDP source port
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+#
+# Required param: -i dev in $DEV
+source ${basedir}/parameters.sh
+
+get_iface_node()
+{
+	echo `cat /sys/class/net/$1/device/numa_node`
+}
+
+get_iface_irqs()
+{
+	local IFACE=$1
+	local queues="${IFACE}-.*TxRx"
+
+	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
+		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
+	    done)
+	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE" >&2
+
+	echo $irqs
+}
+
+get_node_cpus()
+{
+	local node=$1
+	local node_cpu_list
+	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
+			/sys/devices/system/node/node$node/cpulist`
+
+	for cpu_range in $node_cpu_range_list
+	do
+		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
+	done
+
+	echo $node_cpu_list
+}
+
+
+# Base Config
+DELAY="0"        # Zero means max speed
+COUNT="20000000"   # Zero means indefinitely
+[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
+
+# Flow variation random source port between min and max
+UDP_MIN=9
+UDP_MAX=109
+
+node=`get_iface_node $DEV`
+irq_array=(`get_iface_irqs $DEV`)
+cpu_array=(`get_node_cpus $node`)
+
+[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
+	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
+
+# (example of setting default params in your script)
+if [ -z "$DEST_IP" ]; then
+    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
+fi
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((i = 0; i < $THREADS; i++)); do
+    # The device name is extended with @name, using thread number to
+    # make them unique, but any name will do.
+    # Set the queue's irq affinity to this $thread (processor)
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
+    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
+
+    # Remove all other devices and add_device $dev to thread
+    pg_thread $thread "rem_device_all"
+    pg_thread $thread "add_device" $dev
+
+    # select queue and bind the queue and $dev in 1:1 relationship
+    queue_num=$i
+    echo "queue number is $queue_num"
+    pg_set $dev "queue_map_min $queue_num"
+    pg_set $dev "queue_map_max $queue_num"
+
+    # Notice config queue to map to cpu (mirrors smp_processor_id())
+    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
+    pg_set $dev "flag QUEUE_MAP_CPU"
+
+    # Base config of dev
+    pg_set $dev "count $COUNT"
+    pg_set $dev "clone_skb $CLONE_SKB"
+    pg_set $dev "pkt_size $PKT_SIZE"
+    pg_set $dev "delay $DELAY"
+
+    # Flag example disabling timestamping
+    pg_set $dev "flag NO_TIMESTAMP"
+
+    # Destination
+    pg_set $dev "dst_mac $DST_MAC"
+    pg_set $dev "dst$IP6 $DEST_IP"
+
+    # Setup random UDP port src range
+    pg_set $dev "flag UDPSRC_RND"
+    pg_set $dev "udp_src_min $UDP_MIN"
+    pg_set $dev "udp_src_max $UDP_MAX"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((i = 0; i < $THREADS; i++)); do
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo "Device: $dev"
+    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done
-- 
1.8.3.1


* [PATCH] pktgen: add a new sample script for 40G and above link testing
@ 2017-08-24 12:06 Robert Hoo
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Hoo @ 2017-08-24 12:06 UTC (permalink / raw)
  To: robert.hu; +Cc: Robert Ho

From: Robert Ho <robert.hu@intel.com>

It's hard to benchmark 40G+ network bandwidth using ordinary
tools like iperf and netperf (see reference 1).
Pktgen, the packet generator in kernel space, should be a candidate.
I first tried the pktgen multiqueue sample scripts, but still
could not reach line rate.
I then derived this NUMA-aware irq affinity sample script from the
multiqueue sample and successfully benchmarked a 40G link. I think it can
also be a useful reference for 100G, though I don't have a device to test yet.

This script simply does the following:
 * Detects the NUMA node that $DEV belongs to.
 * Binds each thread (a processor from that NUMA node) to the irq
   affinity of one $DEV queue, in a 1:1 mapping.
 * The number of threads given with '-t' determines how many queues
   are used.

Tested with an Intel XL710 NIC and a Cisco 3172 switch.

Results are slightly better if the irqbalance service is turned
off before running the script.

References:
https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf

Signed-off-by: Robert Hoo <robert.hu@intel.com>
---
 ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh

diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
new file mode 100755
index 0000000..f0ee25c
--- /dev/null
+++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
@@ -0,0 +1,132 @@
+#!/bin/bash
+#
+# Multiqueue: Using pktgen threads for sending on multiple CPUs
+#  * adding devices to kernel threads which are in the same NUMA node
+#  * bind each device queue's irq affinity to a thread, 1:1 mapping
+#  * notice the naming scheme for keeping device names unique
+#  * naming scheme: dev@thread_number
+#  * flow variation via random UDP source port
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+#
+# Required param: -i dev in $DEV
+source ${basedir}/parameters.sh
+
+get_iface_node()
+{
+	echo `cat /sys/class/net/$1/device/numa_node`
+}
+
+get_iface_irqs()
+{
+	local IFACE=$1
+	local queues="${IFACE}-.*TxRx"
+
+	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
+	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
+		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
+	    done)
+	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE" >&2
+
+	echo $irqs
+}
+
+get_node_cpus()
+{
+	local node=$1
+	local node_cpu_list
+	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
+			/sys/devices/system/node/node$node/cpulist`
+
+	for cpu_range in $node_cpu_range_list
+	do
+		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
+	done
+
+	echo $node_cpu_list
+}
+
+
+# Base Config
+DELAY="0"        # Zero means max speed
+COUNT="20000000"   # Zero means indefinitely
+[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
+
+# Flow variation random source port between min and max
+UDP_MIN=9
+UDP_MAX=109
+
+node=`get_iface_node $DEV`
+irq_array=(`get_iface_irqs $DEV`)
+cpu_array=(`get_node_cpus $node`)
+
+[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
+	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
+
+# (example of setting default params in your script)
+if [ -z "$DEST_IP" ]; then
+    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
+fi
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((i = 0; i < $THREADS; i++)); do
+    # The device name is extended with @name, using thread number to
+    # make them unique, but any name will do.
+    # Set the queue's irq affinity to this $thread (processor)
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
+    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
+
+    # Remove all other devices and add_device $dev to thread
+    pg_thread $thread "rem_device_all"
+    pg_thread $thread "add_device" $dev
+
+    # select queue and bind the queue and $dev in 1:1 relationship
+    queue_num=$i
+    echo "queue number is $queue_num"
+    pg_set $dev "queue_map_min $queue_num"
+    pg_set $dev "queue_map_max $queue_num"
+
+    # Notice config queue to map to cpu (mirrors smp_processor_id())
+    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
+    pg_set $dev "flag QUEUE_MAP_CPU"
+
+    # Base config of dev
+    pg_set $dev "count $COUNT"
+    pg_set $dev "clone_skb $CLONE_SKB"
+    pg_set $dev "pkt_size $PKT_SIZE"
+    pg_set $dev "delay $DELAY"
+
+    # Flag example disabling timestamping
+    pg_set $dev "flag NO_TIMESTAMP"
+
+    # Destination
+    pg_set $dev "dst_mac $DST_MAC"
+    pg_set $dev "dst$IP6 $DEST_IP"
+
+    # Setup random UDP port src range
+    pg_set $dev "flag UDPSRC_RND"
+    pg_set $dev "udp_src_min $UDP_MIN"
+    pg_set $dev "udp_src_max $UDP_MAX"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((i = 0; i < $THREADS; i++)); do
+    thread=${cpu_array[$i]}
+    dev=${DEV}@${thread}
+    echo "Device: $dev"
+    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done
-- 
1.8.3.1


end of thread

Thread overview: 12+ messages
2017-08-25  2:24 [PATCH] pktgen: add a new sample script for 40G and above link testing Robert Hoo
2017-08-25  9:19 ` Jesper Dangaard Brouer
2017-08-25 14:24   ` Waskiewicz Jr, Peter
2017-08-25 14:59     ` Jesper Dangaard Brouer
2017-08-25 15:11       ` Waskiewicz Jr, Peter
2017-09-01 13:57     ` Robert Hoo
2017-09-01 13:48   ` Robert Hoo
  -- strict thread matches above, loose matches on Subject: below --
2017-08-25  9:26 Robert Hoo
2017-08-25  9:47 ` Jesper Dangaard Brouer
2017-08-27  8:25 ` Tariq Toukan
2017-09-01 13:53   ` Robert Hoo
2017-08-24 12:06 Robert Hoo
