* [PATCH v2 net-next] net: sched: run ingress qdisc without locks
@ 2015-05-02  5:27 Alexei Starovoitov
  2015-05-03 15:42 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2015-05-02  5:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: John Fastabend, Jamal Hadi Salim, Daniel Borkmann, netdev

From: John Fastabend <john.r.fastabend@intel.com>

TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow-on patches.
This is the last patch from that series; it finally drops the
ingress spin_lock.

Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

In the two-cpu case, when both cores are receiving traffic on the same
device and going through the same ingress+u32 setup, the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---

v1->v2: add From:John tag, Sob, Ack

 net/core/dev.c          |    2 --
 net/sched/sch_ingress.c |    5 +++--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 97a15ae8d07a..862875ec8f2f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3538,10 +3538,8 @@ static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 
 	q = rcu_dereference(rxq->qdisc);
 	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
 		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
 	}
 
 	return result;
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb85686a..a89cc3278bfb 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -65,11 +65,11 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 
 	result = tc_classify(skb, fl, &res);
 
-	qdisc_bstats_update(sch, skb);
+	qdisc_bstats_update_cpu(sch, skb);
 	switch (result) {
 	case TC_ACT_SHOT:
 		result = TC_ACT_SHOT;
-		qdisc_qstats_drop(sch);
+		qdisc_qstats_drop_cpu(sch);
 		break;
 	case TC_ACT_STOLEN:
 	case TC_ACT_QUEUED:
@@ -91,6 +91,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	net_inc_ingress_queue();
+	sch->flags |= TCQ_F_CPUSTATS;
 
 	return 0;
 }
-- 
1.7.9.5
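
A note on the stats helpers touched above: once the qdisc lock is gone,
several CPUs can enqueue into the same ingress qdisc at the same time, so
the shared bstats/qstats counters would race.  Setting TCQ_F_CPUSTATS and
switching to the *_cpu stats helpers moves the counters to per-cpu storage
that each CPU updates without locking.  A rough sketch of the idea follows;
the struct and function names are made up for illustration and are not the
actual kernel helpers:

#include <linux/percpu.h>
#include <linux/skbuff.h>

struct sketch_stats {
	u64	bytes;
	u64	packets;
	u32	drops;
};

static DEFINE_PER_CPU(struct sketch_stats, sketch_stats);

/* Fast path, called from the ingress softirq: each CPU only touches
 * its own counters, so no qdisc spin_lock is needed.
 */
static inline void sketch_bstats_update(const struct sk_buff *skb)
{
	struct sketch_stats *s = this_cpu_ptr(&sketch_stats);

	s->bytes   += skb->len;
	s->packets += 1;
}

static inline void sketch_qstats_drop(void)
{
	this_cpu_inc(sketch_stats.drops);
}

/* Slow path (stats dump): sum the per-cpu counters. */
static u64 sketch_total_packets(void)
{
	u64 total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += per_cpu(sketch_stats, cpu).packets;
	return total;
}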


* Re: [PATCH v2 net-next] net: sched: run ingress qdisc without locks
  2015-05-02  5:27 [PATCH v2 net-next] net: sched: run ingress qdisc without locks Alexei Starovoitov
@ 2015-05-03 15:42 ` Jesper Dangaard Brouer
  2015-05-04  5:12   ` Alexei Starovoitov
  0 siblings, 1 reply; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2015-05-03 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: brouer, David S. Miller, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, netdev

On Fri,  1 May 2015 22:27:28 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> From: John Fastabend <john.r.fastabend@intel.com>
> 
> TC classifiers/actions were converted to RCU by John in the series:
> http://thread.gmane.org/gmane.linux.network/329739/focus=329739
> and many follow-on patches.
> This is the last patch from that series; it finally drops the
> ingress spin_lock.

I absolutely love this change.  It is a huge step for ingress
scalability.


> Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

I was actually expecting to see a higher performance boost.

 (processing cost per packet)
 (1/(22.9*10^6)*10^9) = 43.67 ns
 (1/(24.5*10^6)*10^9) = 40.82 ns
 improvement diff     = -2.85 ns

The patch is removing two atomic operations, spin_{un,}lock, which I
have benchmarked[1] to cost approx 14ns on my system.  Your system
likely is faster, but not that much (p.s. benchmark your own system
with [1])

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
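
For reference, the measurement in [1] is essentially a tight loop around
spin_lock/spin_unlock, timed over many iterations and divided down to a
per-operation cost.  A minimal sketch of that style of micro-benchmark as a
throw-away kernel module (names are illustrative; this is not the actual
time_bench_sample.c code):

#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/ktime.h>
#include <linux/math64.h>

static DEFINE_SPINLOCK(bench_lock);

static int __init lock_bench_init(void)
{
	const u64 loops = 10 * 1000 * 1000;
	u64 i, start, stop;

	start = ktime_get_ns();
	for (i = 0; i < loops; i++) {
		/* the lock/unlock pair whose cost we are estimating */
		spin_lock(&bench_lock);
		spin_unlock(&bench_lock);
	}
	stop = ktime_get_ns();

	pr_info("spin_lock+unlock: ~%llu ns per operation\n",
		div64_u64(stop - start, loops));
	return 0;	/* or return a negative value so the module unloads itself */
}

static void __exit lock_bench_exit(void)
{
}

module_init(lock_bench_init);
module_exit(lock_bench_exit);
MODULE_LICENSE("GPL");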

> In the two-cpu case, when both cores are receiving traffic on the same
> device and going through the same ingress+u32 setup, the performance jumps
> from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

This looks good for scalability :-)))

> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v2 net-next] net: sched: run ingress qdisc without locks
  2015-05-03 15:42 ` Jesper Dangaard Brouer
@ 2015-05-04  5:12   ` Alexei Starovoitov
  2015-05-04 11:04     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 5+ messages in thread
From: Alexei Starovoitov @ 2015-05-04  5:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, netdev

On 5/3/15 8:42 AM, Jesper Dangaard Brouer wrote:
>
> I was actually expecting to see a higher performance boost.
 > improvement diff     = -2.85 ns
...
> The patch is removing two atomic operations, spin_{un,}lock, which I
> have benchmarked[1] to cost approx 14ns on my system.  Your system
> likely is faster, but not that much (p.s. benchmark your own system
> with [1])
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

I have tried your tight-loop spin_lock test on my box and it showed:
time_bench: Type:spin_lock_unlock Per elem: 40 cycles(tsc) 11.070 ns
and yet the total single-cpu gain from removing spin_lock/unlock
in the ingress path is smaller than 11 ns. I think this observation is
telling us that tight-loop benchmarking is inherently flawed.
I'm guessing that the uops that cmpxchg is broken into can execute in
parallel with the uops of other insns, so a tight loop of the same
sequence of uops has more ALU dependencies, whereas in a more normal
insn flow these uops can mix and match better. It would be great if
Intel microarch experts could chime in.


* Re: [PATCH v2 net-next] net: sched: run ingress qdisc without locks
  2015-05-04  5:12   ` Alexei Starovoitov
@ 2015-05-04 11:04     ` Jesper Dangaard Brouer
  2015-05-05  1:27       ` Alexei Starovoitov
  0 siblings, 1 reply; 5+ messages in thread
From: Jesper Dangaard Brouer @ 2015-05-04 11:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, netdev, brouer

On Sun, 03 May 2015 22:12:43 -0700
Alexei Starovoitov <ast@plumgrid.com> wrote:

> On 5/3/15 8:42 AM, Jesper Dangaard Brouer wrote:
> >
> > I was actually expecting to see a higher performance boost.
>  > improvement diff     = -2.85 ns
> ...
> > The patch is removing two atomic operations, spin_{un,}lock, which I
> > have benchmarked[1] to cost approx 14ns on my system.  Your system
> > likely is faster, but not that much (p.s. benchmark your own system
> > with [1])
> >
> > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> 
> I have tried your tight-loop spin_lock test on my box and it showed:
> time_bench: Type:spin_lock_unlock Per elem: 40 cycles(tsc) 11.070 ns
> and yet the total single-cpu gain from removing spin_lock/unlock
> in the ingress path is smaller than 11 ns. I think this observation is
> telling us that tight-loop benchmarking is inherently flawed.
> I'm guessing that the uops that cmpxchg is broken into can execute in
> parallel with the uops of other insns, so a tight loop of the same
> sequence of uops has more ALU dependencies, whereas in a more normal
> insn flow these uops can mix and match better. It would be great if
> Intel microarch experts could chime in.

How do you activate the ingress code path?

I'm just doing (is this enough?):
 export DEV=eth4
 tc qdisc add dev $DEV handle ffff: ingress
 

I re-ran the experiment, and I can also only show a 2.68 ns
improvement.  This is rather strange, and I cannot explain it.

The lock clearly shows up in perf report[1] with 12.23% raw_spin_lock,
and in perf report[2] it is clearly gone, yet we don't see a 12%
improvement in performance, only around 4.7%.

Before activating qdisc ingress code : 25.3Mpps (25398057)
Activating qdisc ingress with lock   : 16.9Mpps (16989315)
Activating qdisc ingress without lock: 17.8Mpps (17800496)

(1/17800496*10^9)-(1/16989315*10^9) = -2.68 ns

The "cost" of activating the ingress qdisc is also interesting:
 (1/25398057*10^9)-(1/16989315*10^9) = -19.49 ns
 (1/25398057*10^9)-(1/17800496*10^9) = -16.81 ns
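
(For completeness, the conversion used above is just ns/packet = 10^9/pps;
a tiny stand-alone C program restating the same arithmetic:)

#include <stdio.h>

static double ns_per_pkt(double pps)
{
	return 1e9 / pps;		/* nanoseconds spent per packet */
}

int main(void)
{
	double base   = 25398057;	/* no ingress qdisc attached  */
	double locked = 16989315;	/* ingress qdisc with lock    */
	double nolock = 17800496;	/* ingress qdisc without lock */

	/* lock removal:            ~ -2.68 ns per packet  */
	printf("%6.2f ns\n", ns_per_pkt(nolock) - ns_per_pkt(locked));
	/* ingress cost (locked):   ~ -19.49 ns per packet */
	printf("%6.2f ns\n", ns_per_pkt(base) - ns_per_pkt(locked));
	/* ingress cost (lockless): ~ -16.81 ns per packet */
	printf("%6.2f ns\n", ns_per_pkt(base) - ns_per_pkt(nolock));
	return 0;
}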

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

My setup
 * Tested on top of commit 4749c3ef854
 * gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)
 * CPU E5-2695(ES) @ 2.8GHz

[1] perf report with ingress qlock

 Samples: 2K of event 'cycles', Event count (approx.): 1762298819
   Overhead  Command        Shared Object     Symbol
 +   35.86%  kpktgend_0     [kernel.vmlinux]  [k] __netif_receive_skb_core
 +   17.81%  kpktgend_0     [kernel.vmlinux]  [k] kfree_skb
 +   12.23%  kpktgend_0     [kernel.vmlinux]  [k] _raw_spin_lock
    - _raw_spin_lock
       + 93.54% __netif_receive_skb_core
       + 6.46% __netif_receive_skb
 +    5.45%  kpktgend_0     [sch_ingress]     [k] ingress_enqueue
 +    4.65%  kpktgend_0     [pktgen]          [k] pktgen_thread_worker
 +    4.23%  kpktgend_0     [kernel.vmlinux]  [k] ip_rcv
 +    3.95%  kpktgend_0     [kernel.vmlinux]  [k] tc_classify_compat
 +    3.71%  kpktgend_0     [kernel.vmlinux]  [k] tc_classify
 +    3.03%  kpktgend_0     [kernel.vmlinux]  [k] netif_receive_skb_internal
 +    2.65%  kpktgend_0     [kernel.vmlinux]  [k] netif_receive_skb_sk
 +    1.97%  kpktgend_0     [kernel.vmlinux]  [k] __netif_receive_skb
 +    0.71%  kpktgend_0     [kernel.vmlinux]  [k] __local_bh_enable_ip
 +    0.28%  kpktgend_0     [kernel.vmlinux]  [k] kthread_should_stop

[2] perf report without ingress qlock

 Samples: 2K of event 'cycles', Event count (approx.): 1633499063
   Overhead  Command       Shared Object        Symbol
 +   39.29%  kpktgend_0    [kernel.vmlinux]     [k] __netif_receive_skb_core
 +   19.24%  kpktgend_0    [kernel.vmlinux]     [k] kfree_skb
 +   11.05%  kpktgend_0    [sch_ingress]        [k] ingress_enqueue
 +    4.69%  kpktgend_0    [kernel.vmlinux]     [k] tc_classify
 +    4.48%  kpktgend_0    [kernel.vmlinux]     [k] ip_rcv
 +    4.43%  kpktgend_0    [kernel.vmlinux]     [k] tc_classify_compat
 +    4.19%  kpktgend_0    [pktgen]             [k] pktgen_thread_worker
 +    3.50%  kpktgend_0    [kernel.vmlinux]     [k] netif_receive_skb_internal
 +    2.61%  kpktgend_0    [kernel.vmlinux]     [k] netif_receive_skb_sk
 +    2.26%  kpktgend_0    [kernel.vmlinux]     [k] __netif_receive_skb
 +    0.43%  kpktgend_0    [kernel.vmlinux]     [k] __local_bh_enable_ip
 +    0.13%  swapper       [kernel.vmlinux]     [k] mwait_idle


* Re: [PATCH v2 net-next] net: sched: run ingress qdisc without locks
  2015-05-04 11:04     ` Jesper Dangaard Brouer
@ 2015-05-05  1:27       ` Alexei Starovoitov
  0 siblings, 0 replies; 5+ messages in thread
From: Alexei Starovoitov @ 2015-05-05  1:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, John Fastabend, Jamal Hadi Salim,
	Daniel Borkmann, netdev

On 5/4/15 4:04 AM, Jesper Dangaard Brouer wrote:
>
> How do you activate the ingress code path?
>
> I'm just doing (is this enough?):
>   export DEV=eth4
>   tc qdisc add dev $DEV handle ffff: ingress

Yes, plus my numbers also include the u32 classifier.

> I re-ran the experiment, and I can also only show a 2.68 ns
> improvement.  This is rather strange, and I cannot explain it.
>
> The lock clearly shows up in perf report[1] with 12.23% raw_spin_lock,
> and in perf report[2] it is clearly gone, yet we don't see a 12%
> improvement in performance, only around 4.7%.

It's indeed puzzling. Hopefully Intel experts can chime in.

> The "cost" of activating the ingress qdisc is also interesting:
>   (1/25398057*10^9)-(1/16989315*10^9) = -19.49 ns
>   (1/25398057*10^9)-(1/17800496*10^9) = -16.81 ns

Yep, we're working hard on reducing it.
Btw, the cost of enabling RPS without using it is ~8 ns.
Our line-rate goal is still a bit far off, but hopefully we're getting closer :)

