netdev.vger.kernel.org archive mirror
* [PATCH RFC net-next] netif_receive_skb performance
@ 2015-04-29  2:11 Alexei Starovoitov
  2015-04-29  2:11 ` [PATCH RFC net-next] pktgen: introduce 'rx' mode Alexei Starovoitov
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29  2:11 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jamal Hadi Salim,
	John Fastabend, netdev

Hi,

there were many requests for performance numbers in the past, but not
everyone has access to 10/40G NICs, and we need a common way to talk
about RX path performance without the overhead of driver RX. That's
especially important when making changes to netif_receive_skb.

One would think that using pktgen to xmit into veth would do the trick,
but that's not the case: the clone/burst parameters are not available there,
so such an approach doesn't stress the RX path.
Patch 1 introduces an 'rx' mode for pktgen which, instead of sending
packets via ndo_start_xmit, delivers them to the stack via netif_receive_skb.

Its typical usage:
$ sudo ./pktgen.sh eth0
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
  43033682pps 20656Mb/sec (20656167360bps) errors: 10000000

which says that netif_receive_skb->ip_rcv->kfree_skb can drop cloned
packets at a rate of 43 million packets per second.
'perf report' looks as expected:
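As a sanity check on the reported numbers, the pps figure follows directly from the packet count and elapsed microseconds in the Result line. A standalone illustration (not pktgen code; pktgen's own rounding differs in the last digits):

```c
/* Packets per second from a pktgen Result line: 'count' packets
 * delivered in 'usec' microseconds. 64-bit math avoids overflow
 * for realistic counts. */
static unsigned long long pps_from(unsigned long long count,
				   unsigned long long usec)
{
	return count * 1000000ULL / usec;
}

/* Throughput in bits per second for a given packet size in bytes. */
static unsigned long long bps_from(unsigned long long pps,
				   unsigned long long pkt_bytes)
{
	return pps * pkt_bytes * 8ULL;
}
```

For the run above, pps_from(10000000, 232376) gives roughly 43.03 Mpps, and 43033682 pps of 60-byte packets is exactly the reported 20656167360 bps.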
  37.69%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
  25.81%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
   7.22%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   5.68%  kpktgend_0   [pktgen]          [k] pktgen_thread_worker

In this case the pktgen script is configured to use skb->dmac != eth0's mac,
so skb->pkt_type == PACKET_OTHERHOST and the skbs are dropped immediately
by ip_rcv, as expected.

Configuring dmac == eth0's mac, we see 6.5 Mpps and this 'perf report':
  21.97%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
   9.64%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
   8.44%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   7.19%  kpktgend_0   [kernel.vmlinux]  [k] __skb_clone
   6.89%  kpktgend_0   [kernel.vmlinux]  [k] fib_validate_source
   5.36%  kpktgend_0   [kernel.vmlinux]  [k] ip_route_input_noref
   5.18%  kpktgend_0   [kernel.vmlinux]  [k] udp_v4_early_demux
   4.57%  kpktgend_0   [kernel.vmlinux]  [k] consume_skb
   4.42%  kpktgend_0   [kernel.vmlinux]  [k] skb_release_data
   3.90%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv_finish

The profile dump looks as expected for RX of UDP packets without a local
socket, except for the presence of __skb_clone. It's there because pktgen
does skb->users += burst, and the first thing ip_rcv does is skb_share_check.
So it's not exactly representative of normal UDP receive, but it is precise
enough to simulate UDP receive with taps on eth0, which do skb_clone as well.
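The skb_share_check() interaction can be modeled in plain userspace C. This is a hypothetical sketch (toy types, not the kernel's implementation): a layer that needs exclusive ownership of a buffer with refcount > 1 must drop its shared reference and work on a clone.

```c
#include <stdlib.h>

struct buf {
	int users;		/* models skb->users */
	int is_clone;
};

/* Models skb_share_check(): if the buffer is exclusively owned, use it
 * directly; otherwise release the shared reference and return a fresh
 * clone that this layer alone owns. */
static struct buf *share_check(struct buf *b)
{
	struct buf *clone;

	if (b->users == 1)
		return b;
	clone = malloc(sizeof(*clone));
	if (!clone)
		return NULL;
	clone->users = 1;
	clone->is_clone = 1;
	b->users--;		/* models kfree_skb() dropping one ref */
	return clone;
}
```

With pktgen's skb->users += burst, every ip_rcv invocation takes the clone path, which is why __skb_clone shows up in the profile above.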

My main goal was to benchmark ingress qdisc.
So here are the numbers:
raw netif_receive_skb->ip_rcv->kfree_skb - 43 Mpps
adding ingress qdisc to eth0 drops performance to - 26 Mpps
adding 'u32 match u32 0 0' drops it further to - 22.4 Mpps
All as expected.

Now let's remove the ingress spin_lock (the goal of John's patches) - 24.5 Mpps
Note this is single-core receive. The boost from the removal will be much higher
on a real NIC with multiple cores servicing RX IRQs.

With my experimental replacement of ingress_queue/sch_ingress with
ingress_filter_list and 'u32 match u32 0 0' classifier - 26.2 Mpps

Experimental ingress_filter_list and JITed bpf 'return 0' program - 27.2 Mpps

So there is definitely room for further improvements in ingress
qdisc beyond dropping spin_lock.

A few other numbers for comparison, with dmac == eth0's mac:
no qdisc, with conntrack and empty iptables - 2.2 Mpps
   7.65%  kpktgend_0   [nf_conntrack]    [k] nf_conntrack_in
   7.62%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
   5.44%  kpktgend_0   [kernel.vmlinux]  [k] __call_rcu.constprop.63
   3.71%  kpktgend_0   [kernel.vmlinux]  [k] nf_iterate
   3.59%  kpktgend_0   [ip_tables]       [k] ipt_do_table

no qdisc, unload conntrack, keep empty iptables - 5.4 Mpps 
  18.17%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
   8.31%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   7.97%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
   7.53%  kpktgend_0   [ip_tables]       [k] ipt_do_table

no qdisc, unload conntrack, unload iptables - 6.5 Mpps
  21.97%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
   9.64%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
   8.44%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
   7.19%  kpktgend_0   [kernel.vmlinux]  [k] __skb_clone
   6.89%  kpktgend_0   [kernel.vmlinux]  [k] fib_validate_source

After I'm done with ingress qdisc improvements, I'm planning
to look at netif_receive_skb itself, since it looks a bit too hot.

Alexei Starovoitov (1):
  pktgen: introduce 'rx' mode

 net/core/pktgen.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29  2:11 [PATCH RFC net-next] netif_receive_skb performance Alexei Starovoitov
@ 2015-04-29  2:11 ` Alexei Starovoitov
  2015-04-29  4:14   ` Eric Dumazet
  2015-04-29  5:23 ` [PATCH RFC net-next] netif_receive_skb performance Eric Dumazet
  2015-04-29  9:37 ` Daniel Borkmann
  2 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29  2:11 UTC (permalink / raw)
  To: David S. Miller
  Cc: Eric Dumazet, Daniel Borkmann, Thomas Graf, Jamal Hadi Salim,
	John Fastabend, netdev

Introduce 'RX' mode for pktgen which generates the packets
using familiar pktgen commands, but feeds them into netif_receive_skb()
instead of ndo_start_xmit().
It can be used to benchmark different kernel RX paths.

Sample script 'pktgen.sh':
#!/bin/bash
function pgset() {
  local result

  echo $1 > $PGDEV

  result=`cat $PGDEV | fgrep "Result: OK:"`
  if [ "$result" = "" ]; then
    cat $PGDEV | fgrep Result:
  fi
}

ETH=$1

PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device $ETH"

PGDEV=/proc/net/pktgen/$ETH
pgset "rx"
pgset "clone_skb 1000"
pgset "pkt_size 60"
pgset "dst 99.1.1.2"
pgset "dst_mac 90:e2:ba:6e:a8:e5"
pgset "count 10000000"
pgset "burst 32"

PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
cat /proc/net/pktgen/$ETH

Usage:
$ sudo ./pktgen.sh eth2
...
Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
  43033682pps 20656Mb/sec (20656167360bps) errors: 10000000

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 net/core/pktgen.c |   30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 508155b283dd..4f6c56bca550 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -203,6 +203,7 @@
 #define F_NODE          (1<<15)	/* Node memory alloc*/
 #define F_UDPCSUM       (1<<16)	/* Include UDP checksum */
 #define F_NO_TIMESTAMP  (1<<17)	/* Don't timestamp packets (default TS) */
+#define F_DO_RX		(1<<18) /* generate packets for RX */
 
 /* Thread control flag bits */
 #define T_STOP        (1<<0)	/* Stop run */
@@ -1080,7 +1081,7 @@ static ssize_t pktgen_if_write(struct file *file,
 		len = num_arg(&user_buffer[i], 10, &value);
 		if (len < 0)
 			return len;
-		if ((value > 0) &&
+		if ((value > 0) && !(pkt_dev->flags & F_DO_RX) &&
 		    (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
 			return -ENOTSUPP;
 		i += len;
@@ -1089,6 +1090,12 @@ static ssize_t pktgen_if_write(struct file *file,
 		sprintf(pg_result, "OK: clone_skb=%d", pkt_dev->clone_skb);
 		return count;
 	}
+	if (!strcmp(name, "rx")) {
+		pkt_dev->flags |= F_DO_RX;
+
+		sprintf(pg_result, "OK: RX is ON");
+		return count;
+	}
 	if (!strcmp(name, "count")) {
 		len = num_arg(&user_buffer[i], 10, &value);
 		if (len < 0)
@@ -1134,7 +1141,7 @@ static ssize_t pktgen_if_write(struct file *file,
 			return len;
 
 		i += len;
-		if ((value > 1) &&
+		if ((value > 1) && !(pkt_dev->flags & F_DO_RX) &&
 		    (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
 			return -ENOTSUPP;
 		pkt_dev->burst = value < 1 ? 1 : value;
@@ -3349,11 +3356,29 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 		pkt_dev->last_pkt_size = pkt_dev->skb->len;
 		pkt_dev->allocated_skbs++;
 		pkt_dev->clone_count = 0;	/* reset counter */
+		if (pkt_dev->flags & F_DO_RX)
+			pkt_dev->skb->protocol = eth_type_trans(pkt_dev->skb,
+								pkt_dev->skb->dev);
 	}
 
 	if (pkt_dev->delay && pkt_dev->last_ok)
 		spin(pkt_dev, pkt_dev->next_tx);
 
+	if (pkt_dev->flags & F_DO_RX) {
+		local_bh_disable();
+		atomic_add(burst, &pkt_dev->skb->users);
+		do {
+			ret = netif_receive_skb(pkt_dev->skb);
+			if (ret == NET_RX_DROP)
+				pkt_dev->errors++;
+			pkt_dev->last_ok = 1;
+			pkt_dev->sofar++;
+			pkt_dev->seq_num++;
+		} while (--burst > 0);
+		local_bh_enable();
+		goto out;
+	}
+
 	txq = skb_get_tx_queue(odev, pkt_dev->skb);
 
 	local_bh_disable();
@@ -3403,6 +3428,7 @@ unlock:
 
 	local_bh_enable();
 
+out:
 	/* If pkt_dev->count is zero, then run forever */
 	if ((pkt_dev->count != 0) && (pkt_dev->sofar >= pkt_dev->count)) {
 		pktgen_wait_for_skb(pkt_dev);
-- 
1.7.9.5


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29  2:11 ` [PATCH RFC net-next] pktgen: introduce 'rx' mode Alexei Starovoitov
@ 2015-04-29  4:14   ` Eric Dumazet
  2015-04-29 21:55     ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-04-29  4:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:

> +	if (pkt_dev->flags & F_DO_RX) {
> +		local_bh_disable();
> +		atomic_add(burst, &pkt_dev->skb->users);
> +		do {
> +			ret = netif_receive_skb(pkt_dev->skb);
> +			if (ret == NET_RX_DROP)
> +				pkt_dev->errors++;
> +			pkt_dev->last_ok = 1;
> +			pkt_dev->sofar++;
> +			pkt_dev->seq_num++;
> +		} while (--burst > 0);
> +		local_bh_enable();
> +		goto out;
> +	}
> +

This looks buggy.

skb can be put on a queue, so skb->next and skb->prev cannot be reused,
or queues will be corrupted.

Note that on TX, it is possible to have the same issue if you use a
virtual device like bonding, and skb is queued on a slave qdisc.

(That's why we have this IFF_TX_SKB_SHARING flag)
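The corruption comes from skb->next/prev being *intrusive* list linkage: a single skb can only sit on one queue at a time, so a second enqueue silently unlinks it from the first. A hypothetical userspace sketch of the failure mode (toy types, not the kernel's sk_buff_head):

```c
#include <stddef.h>

struct node {
	struct node *next;	/* intrusive linkage, like skb->next */
	int id;
};

struct queue {
	struct node *head, *tail;
};

static void enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;		/* clobbers whatever list n was on */
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
}

static int queue_len(const struct queue *q)
{
	int len = 0;
	const struct node *n;

	for (n = q->head; n; n = n->next)
		len++;
	return len;
}
```

Enqueueing the same node on a second queue resets its next pointer, so the first queue loses every element linked behind that node: exactly the corruption a shared skb with users > 1 risks once something queues it.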


* Re: [PATCH RFC net-next] netif_receive_skb performance
  2015-04-29  2:11 [PATCH RFC net-next] netif_receive_skb performance Alexei Starovoitov
  2015-04-29  2:11 ` [PATCH RFC net-next] pktgen: introduce 'rx' mode Alexei Starovoitov
@ 2015-04-29  5:23 ` Eric Dumazet
  2015-04-29 22:15   ` Alexei Starovoitov
  2015-04-29  9:37 ` Daniel Borkmann
  2 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-04-29  5:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
> Hi,
> 
> there were many requests for performance numbers in the past, but not
> everyone has access to 10/40G nics and we need a common way to talk
> about RX path performance without overhead of driver RX. That's
> especially important when making changes to netif_receive_skb.

Well, in real life, fetching the RX descriptor and packet headers is
the main cost, and skb->users == 1.

So it's nice trying to optimize netif_receive_skb(), but make sure you
have something that can really exercise same code flows/stalls,
otherwise you'll be tempted by wrong optimizations.

I would for example use a ring buffer, so that each skb you provide to
netif_receive_skb() has cold cache lines (at least skb->head if you want
to mimic build_skb() or napi_get_frags()/napi_reuse_skb() behavior)
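Eric's ring-buffer suggestion could look roughly like this in userspace (a sketch with assumed toy types; the point is that cycling through a pool larger than the cache working set guarantees each buffer comes back cache-cold):

```c
#define POOL_SIZE 4096		/* sized to exceed the L1/L2 working set */

struct fake_pkt {
	unsigned char data[256];
};

static struct fake_pkt pool[POOL_SIZE];
static unsigned int pool_idx;

/* Hand out preallocated packets round-robin; by the time an entry is
 * reused, its cache lines have long been evicted by the others. */
static struct fake_pkt *next_pkt(void)
{
	return &pool[pool_idx++ % POOL_SIZE];
}
```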

Also, this model of flooding one cpu (no irqs, no context switches) masks
latencies caused by code size, since the icache is fully populated with a
very specialized working set.

If we want to pursue this model (like user space (DPDK and alike
frameworks)), we might have to design a very different model than the
IRQ driven one, by dedicating one or multiple cpu threads to run
networking code with no state transition.


* Re: [PATCH RFC net-next] netif_receive_skb performance
  2015-04-29  2:11 [PATCH RFC net-next] netif_receive_skb performance Alexei Starovoitov
  2015-04-29  2:11 ` [PATCH RFC net-next] pktgen: introduce 'rx' mode Alexei Starovoitov
  2015-04-29  5:23 ` [PATCH RFC net-next] netif_receive_skb performance Eric Dumazet
@ 2015-04-29  9:37 ` Daniel Borkmann
  2015-04-29 22:20   ` Alexei Starovoitov
  2 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2015-04-29  9:37 UTC (permalink / raw)
  To: Alexei Starovoitov, David S. Miller
  Cc: Eric Dumazet, Thomas Graf, Jamal Hadi Salim, John Fastabend, netdev

On 04/29/2015 04:11 AM, Alexei Starovoitov wrote:
...
> It's typical usage:
> $ sudo ./pktgen.sh eth0
> ...
> Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
>    43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
...
> My main goal was to benchmark ingress qdisc.
> So here are the numbers:
> raw netif_receive_skb->ip_rcv->kfree_skb - 43 Mpps
> adding ingress qdisc to eth0 drops performance to - 26 Mpps
> adding 'u32 match u32 0 0' drops if further to - 22.4 Mpps
> All as expected.
>
> Now let's remove ingress spin_lock (the goal of John's patches) - 24.5 Mpps
> Note this is single core receive. The boost from removal will be much higher
> on a real nic with multiple cores servicing rx irqs.
>
> With my experimental replacement of ingress_queue/sch_ingress with
> ingress_filter_list and 'u32 match u32 0 0' classifier - 26.2 Mpps
>
> Experimental ingress_filter_list and JITed bpf 'return 0' program - 27.2 Mpps
>
> So there is definitely room for further improvements in ingress
> qdisc beyond dropping spin_lock.

Is the below the case where the conntracker has always a miss and thus
each time needs to create new entries, iow pktgen DoS with random IPs?

> Few other numbers for comparison with dmac == eth0 mac:
> no qdisc, with conntrack and empty iptables - 2.2 Mpps
>     7.65%  kpktgend_0   [nf_conntrack]    [k] nf_conntrack_in
>     7.62%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>     5.44%  kpktgend_0   [kernel.vmlinux]  [k] __call_rcu.constprop.63
>     3.71%  kpktgend_0   [kernel.vmlinux]  [k] nf_iterate
>     3.59%  kpktgend_0   [ip_tables]       [k] ipt_do_table
>
> no qdisc, unload conntrack, keep empty iptables - 5.4 Mpps
>    18.17%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>     8.31%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>     7.97%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>     7.53%  kpktgend_0   [ip_tables]       [k] ipt_do_table
>
> no qdisc, unload conntrack, unload iptables - 6.5 Mpps
>    21.97%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>     9.64%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>     8.44%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>     7.19%  kpktgend_0   [kernel.vmlinux]  [k] __skb_clone
>     6.89%  kpktgend_0   [kernel.vmlinux]  [k] fib_validate_source
>
> After I'm done with ingress qdisc improvements, I'm planning
> to look at netif_receive_skb itself, since it looks a bit too hot.
...


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29  4:14   ` Eric Dumazet
@ 2015-04-29 21:55     ` Alexei Starovoitov
  2015-04-29 22:19       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 21:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On 4/28/15 9:14 PM, Eric Dumazet wrote:
> On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
>
>> +	if (pkt_dev->flags & F_DO_RX) {
>> +		local_bh_disable();
>> +		atomic_add(burst, &pkt_dev->skb->users);
>> +		do {
>> +			ret = netif_receive_skb(pkt_dev->skb);
>> +			if (ret == NET_RX_DROP)
>> +				pkt_dev->errors++;
>> +			pkt_dev->last_ok = 1;
>> +			pkt_dev->sofar++;
>> +			pkt_dev->seq_num++;
>> +		} while (--burst > 0);
>> +		local_bh_enable();
>> +		goto out;
>> +	}
>> +
>
> This looks buggy.
>
> skb can be put on a queue, so skb->next and skb->prev cannot be reused,
> or queues will be corrupted.

I don't see the bug yet.
Any layer that wants to do such queueing should do skb_share_check
first. Just like ip_rcv does. So everything in IP world should
work fine, because it will be operating on clean cloned skb.

> Note that on TX, it is possible to have the same issue if you use a
> virtual device like bonding, and skb is queued on a slave qdisc.
>
> (Thats why we have this IFF_TX_SKB_SHARING flag)

yep, that's why xmit into veth is not useful for benchmarking RX:
the hottest functions are alloc_skb and fill_packet, while
netif_receive_skb is not even seen in the top 10.


* Re: [PATCH RFC net-next] netif_receive_skb performance
  2015-04-29  5:23 ` [PATCH RFC net-next] netif_receive_skb performance Eric Dumazet
@ 2015-04-29 22:15   ` Alexei Starovoitov
  0 siblings, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 22:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On 4/28/15 10:23 PM, Eric Dumazet wrote:
> On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
>> Hi,
>>
>> there were many requests for performance numbers in the past, but not
>> everyone has access to 10/40G nics and we need a common way to talk
>> about RX path performance without overhead of driver RX. That's
>> especially important when making changes to netif_receive_skb.
>
> Well, in real life, having to fetch RX descriptor and packet headers are
> the main cost, and skb->users == 1.

yes. you're describing the main cost of overall RX including drivers.
This pktgen rx mode is aiming to benchmark RX _after_ drivers.
I'm assuming driver vendors equally care a lot about performance of
their bits.

> So its nice trying to optimize netif_receive_skb(), but make sure you
> have something that can really exercise same code flows/stalls,
> otherwise you'll be tempted by wrong optimizations.
>
> I would for example use a ring buffer, so that each skb you provide to
> netif_receive_skb() has cold cache lines (at least skb->head if you want
> to mimic build_skb() or napi_get_frags()/napi_reuse_skb() behavior)

agree as well, but cache-cold benchmarking is not a substitute for
cache-hot. Both are valuable, and numbers from both shouldn't be blindly
used to make decisions.
This pktgen rx mode simulates copybreak and/or small packets, where the
skb->data/head/... pointers and the packet data itself are cache hot,
since the driver's copybreak logic just touched them.
The ring-buffer approach with cold skbs is useful as well, but it will
benchmark a different codepath through netif_receive_skb.
I think in the end we need both. This patch tackles the simple case first.

> Also, this model of flooding one cpu (no irqs, no context switch) mask
> latencies caused by code size, since icache is fully populated, with a
> very specialized working set.
>
> If we want to pursue this model (like user space (DPDK and alike
> frameworks)), we might have to design a very different model than the
> IRQ driven one, by dedicating one or multiple cpu threads to run
> networking code with no state transition.

well, that's a very different discussion. I would like to see this type
of model implemented in the kernel, where we can dedicate a core to
network-only processing. Though I think irq+napi are good enough for
doing batch processing of a lot of packets. My numbers show
that netif_receive_skb+ingress_qdisc+cls/act can do tens of millions of
packets per second. imo that's a great base. We need skb alloc/free
and driver RX path to catch up. TX is already in good shape. Then
overall we'll have very capable packet processing machine from
one physical interface into another.


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 21:55     ` Alexei Starovoitov
@ 2015-04-29 22:19       ` Eric Dumazet
  2015-04-29 22:38         ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-04-29 22:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On Wed, 2015-04-29 at 14:55 -0700, Alexei Starovoitov wrote:
> On 4/28/15 9:14 PM, Eric Dumazet wrote:
> > On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
> >
> >> +	if (pkt_dev->flags & F_DO_RX) {
> >> +		local_bh_disable();
> >> +		atomic_add(burst, &pkt_dev->skb->users);
> >> +		do {
> >> +			ret = netif_receive_skb(pkt_dev->skb);
> >> +			if (ret == NET_RX_DROP)
> >> +				pkt_dev->errors++;
> >> +			pkt_dev->last_ok = 1;
> >> +			pkt_dev->sofar++;
> >> +			pkt_dev->seq_num++;
> >> +		} while (--burst > 0);
> >> +		local_bh_enable();
> >> +		goto out;
> >> +	}
> >> +
> >
> > This looks buggy.
> >
> > skb can be put on a queue, so skb->next and skb->prev cannot be reused,
> > or queues will be corrupted.
> 
> don't see the bug yet.
> Any layer that wants to do such queueing should do skb_share_check
> first. Just like ip_rcv does. So everything in IP world should
> work fine, because it will be operating on clean cloned skb.

Really this is what _you_ think is needed, so that your patch can fly.

In current state of the stack, the skb_share_check() is done where we
know that packet _might_ be delivered to multiple end points
(deliver_skb() does atomic_inc(&skb->users) )

But RPS/RFS/GRO do not care about your new rule.

Yes, before reaching __netif_receive_skb_core(), packets are supposed to
belong to the stack. We are supposed to queue them, without adding a
check for skb->users being one or not, and eventually add an expensive
memory allocation/copy.

We are not going to add an extra check just to make pktgen rx fast.
pktgen will have to comply with existing rules.


* Re: [PATCH RFC net-next] netif_receive_skb performance
  2015-04-29  9:37 ` Daniel Borkmann
@ 2015-04-29 22:20   ` Alexei Starovoitov
  0 siblings, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 22:20 UTC (permalink / raw)
  To: Daniel Borkmann, David S. Miller
  Cc: Eric Dumazet, Thomas Graf, Jamal Hadi Salim, John Fastabend, netdev

On 4/29/15 2:37 AM, Daniel Borkmann wrote:
>
> Is the below the case where the conntracker has always a miss and thus
> each time needs to create new entries, iow pktgen DoS with random IPs?

not really. As far as I understand, it's not doing much, just being
invoked as part of the default code path. Not sure. This was the default
number on my setup with all modules loaded. I have empty
iptables/nat/ct rules. I mentioned it since that is what most Linux
users will see by default from their distro.

>> Few other numbers for comparison with dmac == eth0 mac:
>> no qdisc, with conntrack and empty iptables - 2.2 Mpps
>>     7.65%  kpktgend_0   [nf_conntrack]    [k] nf_conntrack_in
>>     7.62%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>>     5.44%  kpktgend_0   [kernel.vmlinux]  [k] __call_rcu.constprop.63
>>     3.71%  kpktgend_0   [kernel.vmlinux]  [k] nf_iterate
>>     3.59%  kpktgend_0   [ip_tables]       [k] ipt_do_table
>>
>> no qdisc, unload conntrack, keep empty iptables - 5.4 Mpps
>>    18.17%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>>     8.31%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>>     7.97%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>>     7.53%  kpktgend_0   [ip_tables]       [k] ipt_do_table
>>
>> no qdisc, unload conntrack, unload iptables - 6.5 Mpps
>>    21.97%  kpktgend_0   [kernel.vmlinux]  [k] fib_table_lookup
>>     9.64%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
>>     8.44%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
>>     7.19%  kpktgend_0   [kernel.vmlinux]  [k] __skb_clone
>>     6.89%  kpktgend_0   [kernel.vmlinux]  [k] fib_validate_source


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 22:19       ` Eric Dumazet
@ 2015-04-29 22:38         ` Alexei Starovoitov
  2015-04-29 22:56           ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 22:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On 4/29/15 3:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-29 at 14:55 -0700, Alexei Starovoitov wrote:
>> On 4/28/15 9:14 PM, Eric Dumazet wrote:
>>> On Tue, 2015-04-28 at 19:11 -0700, Alexei Starovoitov wrote:
>>>
>>>
>>> This looks buggy.
>>>
>>> skb can be put on a queue, so skb->next and skb->prev cannot be reused,
>>> or queues will be corrupted.
>>
>> don't see the bug yet.
>> Any layer that wants to do such queueing should do skb_share_check
>> first. Just like ip_rcv does. So everything in IP world should
>> work fine, because it will be operating on clean cloned skb.
>
> Really this is what _you_ think is needed, so that your patch can fly.
>
> In current state of the stack, the skb_share_check() is done where we
> know that packet _might_ be delivered to multiple end points
> (deliver_skb() does atomic_inc(&skb->users) )
>
> But RPS/RFS/GRO do not care of your new rule.
>
> Yes, before reaching __netif_receive_skb_core(), packets are supposed to
> belong to the stack. We are supposed to queue them, without adding a
> check for skb->users being one or not, and eventually add an expensive
> memory allocation/copy.
>
> We are not going to add an extra check just to make pktgen rx fast.
> pktgen will have to comply to existing rules.

I'm not making and not suggesting any new rules.
ip_rcv is doing this skb_share_check() not because of pktgen rx,
but because there can be taps and deliver_skb() as you said.
GRO has a different interface, and this pktgen mode cannot benchmark it.
RPS/RFS is not benchmarkable by this approach either.
To me this is all fine. I'm not trying to do a universal
benchmarking tool. This one is dumb and simple and primarily
oriented to benchmark changes to netif_receive_skb and ingress
qdisc only. I'm not suggesting to use it everywhere.
I already mentioned in cover letter:
"The profile dump looks as expected for RX of UDP packets
without local socket except presence of __skb_clone."
Clearly I'm not suggesting to use pktgen rx to optimize IP stack
and not suggesting at all that stack should assume users!=1
when skb hits netif_receive_skb. Today at the beginning of
netif_receive_skb we know that users==1 without checking.
I'm not changing that assumption.
Just like the pktgen xmit path cheats a little bit while
benchmarking TX, I'm cheating a little bit with users != 1 on RX.


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 22:38         ` Alexei Starovoitov
@ 2015-04-29 22:56           ` Eric Dumazet
  2015-04-29 23:28             ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-04-29 22:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On Wed, 2015-04-29 at 15:38 -0700, Alexei Starovoitov wrote:

> I'm not making and not suggesting any new rules.
> ip_rcv is doing this skb_share_check() not because of pktgen rx,
> but because there can be taps and deliver_skb() as you said.
> gro has a different interface and this pktgen cannot benchmark it.
> rps/rfs is not benchmarkable but this approach either.
> To me this is all fine. I'm not trying to do a universal
> benchmarking tool. This one is dumb and simple and primarily
> oriented to benchmark changes to netif_receive_skb and ingress
> qdisc only. I'm not suggesting to use it everywhere.
> I already mentioned in cover letter:
> "The profile dump looks as expected for RX of UDP packets
> without local socket except presence of __skb_clone."
> Clearly I'm not suggesting to use pktgen rx to optimize IP stack
> and not suggesting at all that stack should assume users!=1
> when skb hits netif_receive_skb. Today at the beginning of
> netif_receive_skb we know that users==1 without checking.
> I'm not changing that assumption.
> Just like pktgen xmit path is cheating little bit while
> benchmarking TX, I'm cheating a little bit with users!=1 on RX.

Look, you don't have to write all this, just fix the bug I mentioned to
you about RPS / RFS. (details in Documentation/networking/scaling.txt )

Simply read the code in enqueue_to_backlog(); it obviously needs to
own skb->next/prev.

So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
way around it.


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 22:56           ` Eric Dumazet
@ 2015-04-29 23:28             ` Alexei Starovoitov
  2015-04-29 23:39               ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 23:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On 4/29/15 3:56 PM, Eric Dumazet wrote:
>
> So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
> way around it.

if I only knew how to do it...
The cost of continuously allocating skbs is way higher than
netif_receive_skb itself. Such a benchmarking tool would measure the
speed of skb alloc/free instead of the speed of netif_receive_skb.
Are you suggesting pre-allocating tens of millions of skbs and
then feeding them in one go? The profile will be dominated by
cache misses in the first few lines of __netif_receive_skb_core()
where it accesses skb->dev,data,head. Doesn't sound too useful either.
Other thoughts?


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 23:28             ` Alexei Starovoitov
@ 2015-04-29 23:39               ` Eric Dumazet
  2015-04-29 23:59                 ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-04-29 23:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On Wed, 2015-04-29 at 16:28 -0700, Alexei Starovoitov wrote:
> On 4/29/15 3:56 PM, Eric Dumazet wrote:
> >
> > So pktgen in RX mode MUST deliver skb with skb->users = 1, there is no
> > way around it.
> 
> if I only knew how to do it...
> The cost of continuously allocating skbs is way higher than
> netif_receive_skb itself. Such benchmarking tool would measure the
> speed of skb alloc/free instead of speed of netif_receive_skb.
> Are you suggesting to pre-allocate 10s of millions of skbs and
> then feed them in one go? The profile will be dominated by
> cache misses in the first few lines of __netif_receive_skb_core()
> where it accesses skb->dev,data,head. Doesn't sound too useful either.
> Other thoughts?

The code I copy-pasted from your patch is buggy. Not the whole thing.

You have to replace it with something smarter.

This should not be hard really.

Zap the 'burst/clone' thing, this is not going to work for RX.
TX was okay, not RX. 

You could for instance do :

atomic_inc(&skb->users);
netif_receive_skb(skb);
if (atomic_read(&skb->users) != 1) {
	/* This is too bad, I can not recycle this skb because it is still used */
	consume_skb(skb);
	/* allocate a fresh new skb */
	skb = ...
} else {
	/* Yeah ! Lets celebrate, cost of reusing this skb was one atomic op */
}
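The pattern sketched above can be modeled with plain integers in userspace (hypothetical types; the atomic_inc/consume_skb stand-ins are ordinary operations here): take one extra reference before delivery, then recycle only if the consumer gave its reference back.

```c
struct buf {
	int users;		/* models atomic skb->users */
};

/* Deliver b to a consumer that either drops the reference it was given
 * (free path) or keeps it (queued somewhere). Returns 1 if the buffer
 * may be reused, 0 if the caller must allocate a fresh one. */
static int deliver_and_try_recycle(struct buf *b,
				   void (*consumer)(struct buf *))
{
	b->users++;		/* models atomic_inc(&skb->users) */
	consumer(b);		/* models netif_receive_skb(skb) */
	if (b->users != 1) {
		b->users--;	/* models consume_skb(skb): drop our ref */
		return 0;	/* still referenced elsewhere: no recycle */
	}
	return 1;		/* sole owner again: reuse for one atomic op */
}

static void consumer_frees(struct buf *b)  { b->users--; }	/* kfree path */
static void consumer_queues(struct buf *b) { (void)b; }		/* keeps its ref */
```

In the free path the buffer comes back with users == 1 and can be reused; in the queued path the extra reference is released and a fresh allocation is needed.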


* Re: [PATCH RFC net-next] pktgen: introduce 'rx' mode
  2015-04-29 23:39               ` Eric Dumazet
@ 2015-04-29 23:59                 ` Alexei Starovoitov
  0 siblings, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2015-04-29 23:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Eric Dumazet, Daniel Borkmann, Thomas Graf,
	Jamal Hadi Salim, John Fastabend, netdev

On 4/29/15 4:39 PM, Eric Dumazet wrote:
>
> Zap the 'burst/clone' thing, this is not going to work for RX.
> TX was okay, not RX.
>
> You could for instance do :
>
> atomic_inc(&skb->users);
> netif_receive_skb(skb);
> if (atomic_read(&skb->users) != 1) {
> 	/* This is too bad, I can not recycle this skb because it is still used */
> 	consume_skb(skb);
> 	/* allocate a fresh new skb */
> 	skb = ...
> } else {
> 	/* Yeah ! Lets celebrate, cost of reusing this skb was one atomic op */
> }

ahh, great! I think I'm starting to understand.
then the following should be ok as well?

atomic_add(burst, &skb->users);
do {
	netif_receive_skb(skb);
	if (atomic_read(&skb->users) != burst) {
		/* too bad, can not recycle */
		atomic_sub(burst - 1, &skb->users);
		consume_skb(skb);
		/* allocate a fresh new skb */
		skb = ...
		/* and get out of this loop */
	} else {
		/* Yeah! the cost of reusing this skb was
		 * one atomic op amortized over 'burst' iterations */
	}
} while (--burst > 0);

