netdev.vger.kernel.org archive mirror
* Re: [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset
  2015-05-10 16:59 ` [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset Pablo Neira Ayuso
@ 2015-05-10 16:58   ` Sergei Shtylyov
  0 siblings, 0 replies; 32+ messages in thread
From: Sergei Shtylyov @ 2015-05-10 16:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netdev; +Cc: davem, ast, jhs, daniel

Hello.

On 5/10/2015 7:59 PM, Pablo Neira Ayuso wrote:

> This fixes 4577139 ("net: use jump label patching for ingress qdisc in
> __netif_receive_skb_core").

> The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
> there is no way these definitions can be of any use at all.

> Fixes: 4577139 ("net: use jump label patching for ingress qdisc in __netif_receive_skb_core")

    The commit ID is supposed to have 12 digits for this tag.
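
    For reference, the kernel convention is a 12-character abbreviated SHA-1
    followed by the quoted subject line. A sketch of how to produce one (the
    full SHA below is a made-up placeholder, not the real commit referenced
    above):

    ```shell
    # Inside a kernel tree, the usual way to generate the tag is:
    #   git log -1 --abbrev=12 --format='Fixes: %h ("%s")' <commit>
    # The abbreviation itself is just the first 12 hex digits of the SHA-1:
    full=0123456789abcdef0123456789abcdef01234567   # placeholder SHA, not the real commit
    short=$(printf '%s' "$full" | cut -c1-12)
    printf 'Fixes: %s ("net: use jump label patching for ingress qdisc")\n' "$short"
    ```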

> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

WBR, Sergei

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 0/2 net-next] critical ingress path performance improvements
@ 2015-05-10 16:59 Pablo Neira Ayuso
  2015-05-10 16:59 ` [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset Pablo Neira Ayuso
  2015-05-10 16:59 ` [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs Pablo Neira Ayuso
  0 siblings, 2 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 16:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, jhs, daniel

This patchset improves performance of the super-critical ingress path by moving
the qdisc ingress code to sch_ingress, where it really belongs.

This patchset also removes the Qdisc->enqueue() indirection since this does not
enqueue anything.

This is roughly 92199pps more, ~0.42% better performance on my old box.

As has been stated before, most users don't need this:

24824a09 ("net: dynamic ingress_queue allocation")

So inlining this code into the core introduces a penalty for everyone.

Please apply, thanks!

Pablo Neira Ayuso (2):
  net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset
  net: move qdisc ingress filtering code where it belongs

 include/linux/netdevice.h |    3 +++
 include/linux/rtnetlink.h |    9 +--------
 net/core/dev.c            |   45 ++++++++++++---------------------------------
 net/sched/sch_ingress.c   |   43 +++++++++++++++++++++++++++++++++++++++----
 4 files changed, 55 insertions(+), 45 deletions(-)

-- 
1.7.10.4


* [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset
  2015-05-10 16:59 [PATCH 0/2 net-next] critical ingress path performance improvements Pablo Neira Ayuso
@ 2015-05-10 16:59 ` Pablo Neira Ayuso
  2015-05-10 16:58   ` Sergei Shtylyov
  2015-05-10 16:59 ` [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs Pablo Neira Ayuso
  1 sibling, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 16:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, jhs, daniel

This fixes 4577139 ("net: use jump label patching for ingress qdisc in
__netif_receive_skb_core").

The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
there is no way these definitions can be of any use at all.

Fixes: 4577139 ("net: use jump label patching for ingress qdisc in __netif_receive_skb_core")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/rtnetlink.h |    8 --------
 1 file changed, 8 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 7b8e260..bd29ab4 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -82,14 +82,6 @@ struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
 #ifdef CONFIG_NET_CLS_ACT
 void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
-#else
-static inline void net_inc_ingress_queue(void)
-{
-}
-
-static inline void net_dec_ingress_queue(void)
-{
-}
 #endif
 
 extern void rtnetlink_init(void);
-- 
1.7.10.4


* [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 16:59 [PATCH 0/2 net-next] critical ingress path performance improvements Pablo Neira Ayuso
  2015-05-10 16:59 ` [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset Pablo Neira Ayuso
@ 2015-05-10 16:59 ` Pablo Neira Ayuso
  2015-05-10 17:25   ` Eric Dumazet
  2015-05-10 17:45   ` Alexei Starovoitov
  1 sibling, 2 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 16:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, jhs, daniel

The qdisc ingress filtering code is embedded into the core, most likely because
at that time we had no RCU in place to define a hook. This is semantically
wrong, as it violates the most basic rules of encapsulation.

On top of that, this special qdisc does not enqueue anything at all, so we can
skip the enqueue indirection from qdisc_enqueue_root(), which does things
that we don't need.

This reduces pollution of the super-critical ingress path, which most
users don't need, as has been stated many times before,
e.g. 24824a09 ("net: dynamic ingress_queue allocation").

As a result, this improves performance in the super-critical ingress:

Before:

Result: OK: 4767946(c4767946+d0) usec, 100000000 (60byte,0frags)
  20973388pps 10067Mb/sec (10067226240bps) errors: 100000000

After:

Result: OK: 4747078(c4747078+d0) usec, 100000000 (60byte,0frags)
  21065587pps 10111Mb/sec (10111481760bps) errors: 100000000

This is roughly 92199pps more, ~0.42% better performance on my old box.

Using pktgen rx injection, perf shows I'm profiling the right thing.

    36.12%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
    18.46%  kpktgend_0  [kernel.kallsyms]  [k] atomic_dec_and_test
    15.87%  kpktgend_0  [kernel.kallsyms]  [k] deliver_ptype_list_skb
     5.04%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
     4.81%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
     4.11%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
     3.89%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
     3.44%  kpktgend_0  [kernel.kallsyms]  [k] __rcu_read_unlock
     2.89%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
     2.14%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
     2.14%  kpktgend_0  [kernel.kallsyms]  [k] __rcu_read_lock
     0.57%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netdevice.h |    3 +++
 include/linux/rtnetlink.h |    1 +
 net/core/dev.c            |   45 ++++++++++++---------------------------------
 net/sched/sch_ingress.c   |   43 +++++++++++++++++++++++++++++++++++++++----
 4 files changed, 55 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1899c74..85288b5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1299,6 +1299,9 @@ enum netdev_priv_flags {
 #define IFF_IPVLAN_MASTER		IFF_IPVLAN_MASTER
 #define IFF_IPVLAN_SLAVE		IFF_IPVLAN_SLAVE
 
+typedef struct sk_buff *qdisc_ingress_hook_t(struct sk_buff *skb);
+extern qdisc_ingress_hook_t __rcu *qdisc_ingress_hook;
+
 /**
  *	struct net_device - The DEVICE structure.
  *		Actually, this whole structure is a big mistake.  It mixes I/O
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bd29ab4..7c204f2 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -82,6 +82,7 @@ struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
 #ifdef CONFIG_NET_CLS_ACT
 void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
+int net_ingress_queue_count(void);
 #endif
 
 extern void rtnetlink_init(void);
diff --git a/net/core/dev.c b/net/core/dev.c
index 862875e..14a07ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1644,6 +1644,12 @@ void net_dec_ingress_queue(void)
 	static_key_slow_dec(&ingress_needed);
 }
 EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
+
+int net_ingress_queue_count(void)
+{
+	return static_key_count(&ingress_needed);
+}
+EXPORT_SYMBOL_GPL(net_ingress_queue_count);
 #endif
 
 static struct static_key netstamp_needed __read_mostly;
@@ -3521,37 +3527,17 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
 #ifdef CONFIG_NET_CLS_ACT
-/* TODO: Maybe we should just force sch_ingress to be compiled in
- * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
- * a compare and 2 stores extra right now if we dont have it on
- * but have CONFIG_NET_CLS_ACT
- * NOTE: This doesn't stop any functionality; if you dont have
- * the ingress scheduler, you just can't add policies on ingress.
- *
- */
-static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
-{
-	int result = TC_ACT_OK;
-	struct Qdisc *q;
-
-	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
-
-	q = rcu_dereference(rxq->qdisc);
-	if (q != &noop_qdisc) {
-		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			result = qdisc_enqueue_root(skb, q);
-	}
-
-	return result;
-}
+qdisc_ingress_hook_t __rcu *qdisc_ingress_hook __read_mostly;
+EXPORT_SYMBOL_GPL(qdisc_ingress_hook);
 
 static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 					 struct packet_type **pt_prev,
 					 int *ret, struct net_device *orig_dev)
 {
-	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
+	qdisc_ingress_hook_t *ingress_hook;
 
-	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
+	ingress_hook = rcu_dereference(qdisc_ingress_hook);
+	if (ingress_hook == NULL)
 		return skb;
 
 	if (*pt_prev) {
@@ -3559,14 +3545,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 		*pt_prev = NULL;
 	}
 
-	switch (ing_filter(skb, rxq)) {
-	case TC_ACT_SHOT:
-	case TC_ACT_STOLEN:
-		kfree_skb(skb);
-		return NULL;
-	}
-
-	return skb;
+	return ingress_hook(skb);
 }
 #endif
 
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index a89cc32..0c20f89 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -54,9 +54,9 @@ static struct tcf_proto __rcu **ingress_find_tcf(struct Qdisc *sch,
 	return &p->filter_list;
 }
 
-/* --------------------------- Qdisc operations ---------------------------- */
+/* ------------------------------------------------------------- */
 
-static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+static int ingress_filter(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct ingress_qdisc_data *p = qdisc_priv(sch);
 	struct tcf_result res;
@@ -86,10 +86,42 @@ static int ingress_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	return result;
 }
 
-/* ------------------------------------------------------------- */
+static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
+{
+	int result = TC_ACT_OK;
+	struct Qdisc *q = rcu_dereference(rxq->qdisc);
+
+	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
+
+	if (q != &noop_qdisc) {
+		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+			result = ingress_filter(skb, q);
+	}
+
+	return result;
+}
+
+static struct sk_buff *qdisc_ingress_filter(struct sk_buff *skb)
+{
+	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
+
+	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
+		return skb;
+
+	switch (ing_filter(skb, rxq)) {
+	case TC_ACT_SHOT:
+	case TC_ACT_STOLEN:
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	return skb;
+}
 
 static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
 {
+	if (net_ingress_queue_count() == 0)
+		rcu_assign_pointer(qdisc_ingress_hook, qdisc_ingress_filter);
 	net_inc_ingress_queue();
 	sch->flags |= TCQ_F_CPUSTATS;
 
@@ -102,6 +134,10 @@ static void ingress_destroy(struct Qdisc *sch)
 
 	tcf_destroy_chain(&p->filter_list);
 	net_dec_ingress_queue();
+	if (net_ingress_queue_count() == 0) {
+		rcu_assign_pointer(qdisc_ingress_hook, NULL);
+		synchronize_rcu();
+	}
 }
 
 static int ingress_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -132,7 +168,6 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
 	.cl_ops		=	&ingress_class_ops,
 	.id		=	"ingress",
 	.priv_size	=	sizeof(struct ingress_qdisc_data),
-	.enqueue	=	ingress_enqueue,
 	.init		=	ingress_init,
 	.destroy	=	ingress_destroy,
 	.dump		=	ingress_dump,
-- 
1.7.10.4


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 16:59 ` [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs Pablo Neira Ayuso
@ 2015-05-10 17:25   ` Eric Dumazet
  2015-05-10 17:45   ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-05-10 17:25 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, ast, jhs, daniel

On Sun, 2015-05-10 at 18:59 +0200, Pablo Neira Ayuso wrote:

> On top of that, this special qdisc does not enqueue anything at all, so we can
> skip the enqueue indirection from qdisc_enqueue_root() which is doing things
> that we don't need.

Note that we can get rid of qdisc_enqueue_root() completely, as
net/sched/sch_netem.c does not need it either.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 16:59 ` [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs Pablo Neira Ayuso
  2015-05-10 17:25   ` Eric Dumazet
@ 2015-05-10 17:45   ` Alexei Starovoitov
  2015-05-10 17:59     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-10 17:45 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netdev; +Cc: davem, jhs, daniel

On 5/10/15 9:59 AM, Pablo Neira Ayuso wrote:
> The qdisc ingress filtering code is embedded into the core most likely because
> at that time we had no RCU in place to define a hook. This is semantically
> wrong as this violates the most basic rules of encapsulation.

Yet another attempt to sneak in 'qdisc_ingress_hook' to kill TC ?
Just add another hook for netfilter. Seriously. Enough of these
politics.

> On top of that, this special qdisc does not enqueue anything at all, so we can
> skip the enqueue indirection from qdisc_enqueue_root() which is doing things
> that we don't need.

Daniel's patch does that as well, but in much cleaner way.
Looks like you're stealing our ideas to hide overhead that you're
adding to ingress qdisc. Not cool.

> This reduces the pollution of the super-critical ingress path, where
> most users don't need this as it has been stated many times before.
> e.g. 24824a09 ("net: dynamic ingress_queue allocation").

Again, Daniel's patch accelerates the super-critical ingress path even more.
Care to carefully read it first?

> As a result, this improves performance in the super-critical ingress:
>
> Before:
>
> Result: OK: 4767946(c4767946+d0) usec, 100000000 (60byte,0frags)
>    20973388pps 10067Mb/sec (10067226240bps) errors: 100000000
>
> After:
>
> Result: OK: 4747078(c4747078+d0) usec, 100000000 (60byte,0frags)
>    21065587pps 10111Mb/sec (10111481760bps) errors: 100000000
>
> This is roughly 92199pps, ~0.42% more performance on my old box.

funny, how the gain from removal of qdisc_enqueue_root() is offset
by the extra overhead added. Compare your 0.42% with the clean gains achieved by
Daniel's patch set.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 17:45   ` Alexei Starovoitov
@ 2015-05-10 17:59     ` Pablo Neira Ayuso
  2015-05-10 18:05       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 17:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, davem, jhs, daniel

On Sun, May 10, 2015 at 10:45:42AM -0700, Alexei Starovoitov wrote:
> On 5/10/15 9:59 AM, Pablo Neira Ayuso wrote:
> >The qdisc ingress filtering code is embedded into the core most likely because
> >at that time we had no RCU in place to define a hook. This is semantically
> >wrong as this violates the most basic rules of encapsulation.
> 
> Yet another attempt to sneak in 'qdisc_ingress_hook' to kill TC ?
> Just add another hook for netfilter. Seriously. Enough of these
> politics.

Absolutely not. I will not kill TC, because people like jamal like it,
and that's argument enough for me to keep it there.

I have to ask you to stop harassing me all over with non-technical
comments: "evil", "funny", ...

I'm getting quite enough of this, you stop that.

> Again, Daniel's patch accelerates super-critical ingress path even more.
> Care to carefully read it first?

No, Daniel is *not* benchmarking __netif_receive_skb_core() with no
filtering at all.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 17:59     ` Pablo Neira Ayuso
@ 2015-05-10 18:05       ` Alexei Starovoitov
  2015-05-10 18:24         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-10 18:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, jhs, daniel

On 5/10/15 10:59 AM, Pablo Neira Ayuso wrote:
> On Sun, May 10, 2015 at 10:45:42AM -0700, Alexei Starovoitov wrote:
>> On 5/10/15 9:59 AM, Pablo Neira Ayuso wrote:
>>> The qdisc ingress filtering code is embedded into the core most likely because
>>> at that time we had no RCU in place to define a hook. This is semantically
>>> wrong as this violates the most basic rules of encapsulation.
>>
>> Yet another attempt to sneak in 'qdisc_ingress_hook' to kill TC ?
>> Just add another hook for netfilter. Seriously. Enough of these
>> politics.
>
> Absolutely not. I will not kill TC because people like jamal likes it,
> and that's more than an argument to me to keep it there.
>
> I have to ask you to stop harassing me all over with non-technical
> comments: "evil", "funny", ...

Please, I never called you 'evil'. Though we're arguing, it's ok,
because we both want the best for the kernel. We're just not on the
same page yet.
'funny' also doesn't apply to you.
If you feel offended, I'm sorry. I didn't mean it at all.

> I'm getting quite enough of this, you stop that.

agree. let's stick to exact technical points.
So, please, state clearly why you are so insistent on combining the
existing tc hook and a future netfilter hook into one, which creates
long-term headaches? What is wrong with two hooks?

>> Again, Daniel's patch accelerates super-critical ingress path even more.
>> Care to carefully read it first?
>
> No, Daniel is *not* benchmarking the netif_received_core() with no
> filtering at all.

sorry, not true. We did benchmark all combinations. Daniel posted
his, I'll send numbers from my box as well.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 18:05       ` Alexei Starovoitov
@ 2015-05-10 18:24         ` Pablo Neira Ayuso
  2015-05-10 18:47           ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 18:24 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, davem, jhs, daniel

On Sun, May 10, 2015 at 11:05:28AM -0700, Alexei Starovoitov wrote:
> On 5/10/15 10:59 AM, Pablo Neira Ayuso wrote:
> >No, Daniel is *not* benchmarking the netif_received_core() with no
> >filtering at all.
> 
> sorry, not true. We did benchmark all combinations. Daniel posted
> his, I'll send numbers from my box as well.

Daniel said:

"The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps."

That explicitly refers to u32, hence qdisc ingress, so he did *not*
post any numbers for the use case I'm indicating.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 18:24         ` Pablo Neira Ayuso
@ 2015-05-10 18:47           ` Alexei Starovoitov
  2015-05-10 19:00             ` Pablo Neira Ayuso
  2015-05-10 20:40             ` Daniel Borkmann
  0 siblings, 2 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-10 18:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, jhs, daniel

On 5/10/15 11:24 AM, Pablo Neira Ayuso wrote:
> On Sun, May 10, 2015 at 11:05:28AM -0700, Alexei Starovoitov wrote:
>> On 5/10/15 10:59 AM, Pablo Neira Ayuso wrote:
>>> No, Daniel is *not* benchmarking the netif_received_core() with no
>>> filtering at all.
>>
>> sorry, not true. We did benchmark all combinations. Daniel posted
>> his, I'll send numbers from my box as well.
>
> Daniel said:
>
> "The extra indirection layers however, are not necessary for calling
> into ingress qdisc. pktgen calling locally into netif_receive_skb()
> with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
> E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps."
>
> That explicitly refers to u32, hence qdisc ingress, so he did *not*
> post any number of the use case I'm indicating.

I think I'm starting to understand your concern.
You've read the patch in a way that it slows down netif_receive
_without_ ingress qdisc? Of course, that's not the case.

Here are the number from my box:
before:
no ingress - 37.6
ingress on other dev - 36.5
ingress on this dev - 28.8
ingress on this dev + u32 - 24.1

after Daniel's two patches:
no ingress - 37.6
ingress on other dev - 36.5
ingress on this dev - 36.5
ingress on this dev + u32 - 25.2

so when ingress qdisc is not used, the difference is zero.
When ingress qdisc is added to another device - difference is zero.
The last two numbers that we wanted to accelerate and we did.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 18:47           ` Alexei Starovoitov
@ 2015-05-10 19:00             ` Pablo Neira Ayuso
  2015-05-10 19:06               ` Alexei Starovoitov
  2015-05-10 20:40             ` Daniel Borkmann
  1 sibling, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 19:00 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, davem, jhs, daniel

On Sun, May 10, 2015 at 11:47:01AM -0700, Alexei Starovoitov wrote:
> On 5/10/15 11:24 AM, Pablo Neira Ayuso wrote:
> >On Sun, May 10, 2015 at 11:05:28AM -0700, Alexei Starovoitov wrote:
> >>On 5/10/15 10:59 AM, Pablo Neira Ayuso wrote:
> >>>No, Daniel is *not* benchmarking the netif_received_core() with no
> >>>filtering at all.
> >>
> >>sorry, not true. We did benchmark all combinations. Daniel posted
> >>his, I'll send numbers from my box as well.
> >
> >Daniel said:
> >
> >"The extra indirection layers however, are not necessary for calling
> >into ingress qdisc. pktgen calling locally into netif_receive_skb()
> >with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
> >E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps."
> >
> >That explicitly refers to u32, hence qdisc ingress, so he did *not*
> >post any number of the use case I'm indicating.
> 
> I think I'm starting to understand your concern.
> You've read the patch in a way that it slows down netif_receive
> _without_ ingress qdisc?

No. Regarding my patchset, what I said was:

"This patch improves performance of the super-critical ingress path by
moving the qdisc ingress code to sch_ingress, where this really
belongs."

The code inlined into the core ingress path seems to have an impact on
people who don't need it, even with the static key.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 19:00             ` Pablo Neira Ayuso
@ 2015-05-10 19:06               ` Alexei Starovoitov
  2015-05-10 19:20                 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-10 19:06 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, jhs, daniel

On 5/10/15 12:00 PM, Pablo Neira Ayuso wrote:
>
> The inlined code into the ingress core path seems to have an impact to
> people that don't need this, even with the static key.

two emails ago you accused me of non-technical comments, and
now I've posted real numbers that show no impact on users who don't
enable ingress, and you still say 'seems to have an impact' ?!
I'm speechless.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 19:06               ` Alexei Starovoitov
@ 2015-05-10 19:20                 ` Pablo Neira Ayuso
  2015-05-10 19:37                   ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 19:20 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, davem, jhs, daniel

On Sun, May 10, 2015 at 12:06:34PM -0700, Alexei Starovoitov wrote:
> On 5/10/15 12:00 PM, Pablo Neira Ayuso wrote:
> >
> >The inlined code into the ingress core path seems to have an impact to
> >people that don't need this, even with the static key.
> 
> two emails ago you've accused me of non-technical comments and
> now I've posted real numbers that show no impact on users that don't
> enable ingress and you still say 'seems to have an impact' ?!
> I'm speechless.

No. At the risk of repeating myself: the existing approach that
inlines handle_ing() into __netif_receive_skb_core(), and your
approach, since it persists in that, has a performance impact on
everyone.

It's quite clear from my patchset description.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 19:20                 ` Pablo Neira Ayuso
@ 2015-05-10 19:37                   ` Alexei Starovoitov
  2015-05-10 19:50                     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-10 19:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, davem, jhs, daniel

On 5/10/15 12:20 PM, Pablo Neira Ayuso wrote:
> On Sun, May 10, 2015 at 12:06:34PM -0700, Alexei Starovoitov wrote:
>> On 5/10/15 12:00 PM, Pablo Neira Ayuso wrote:
>>>
>>> The inlined code into the ingress core path seems to have an impact to
>>> people that don't need this, even with the static key.
>>
>> two emails ago you've accused me of non-technical comments and
>> now I've posted real numbers that show no impact on users that don't
>> enable ingress and you still say 'seems to have an impact' ?!
>> I'm speechless.
>
> No. On the danger of repeating myself: The existing approach that
> inlines handle_ing() into __netif_receive_skb_core(), and your
> approach since it's persists on that, has an impact in performance on
> everyone in the earth.

Another non-technical 'guess' ?

baseline with Daniel's two patches:
    text	   data	    bss	    dec	    hex	filename
10605509	1885208	1400832	13891549	 d3f7dd	vmlinux

then with:
-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
+static noinline struct sk_buff *handle_ing(struct sk_buff *skb,

    text	   data	    bss	    dec	    hex	filename
10605572	1885208	1400832	13891612	 d3f81c	vmlinux

so not inlining handle_ing() may actually have an impact on everyone,
because .text gets bigger, though only marginally.

btw, after removing the 'inline' keyword gcc still inlines it automatically,
and looking at the above numbers gcc is making the right call.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 19:37                   ` Alexei Starovoitov
@ 2015-05-10 19:50                     ` Pablo Neira Ayuso
  2015-05-10 21:31                       ` Daniel Borkmann
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 19:50 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, davem, jhs, daniel

On Sun, May 10, 2015 at 12:37:10PM -0700, Alexei Starovoitov wrote:
> On 5/10/15 12:20 PM, Pablo Neira Ayuso wrote:
[...]
> Another non-technical 'guess' ?
> 
> baseline with Daniel's two patches:
>    text	   data	    bss	    dec	    hex	filename
> 10605509	1885208	1400832	13891549	 d3f7dd	vmlinux
> 
> then with:
> -static inline struct sk_buff *handle_ing(struct sk_buff *skb,
> +static noinline struct sk_buff *handle_ing(struct sk_buff *skb,
> 
>    text	   data	    bss	    dec	    hex	filename
> 10605572	1885208	1400832	13891612	 d3f81c	vmlinux
> 
> so not inlining handle_ing() actually may have an impact on everyone,
> because .text gets bigger. Though only marginally.
> 
> btw, after removing 'inline' keyword gcc still inlines it automatically
> and looking at the above numbers gcc is doing the right call.

Please, stop this.

The numbers show that the existing approach, and yours, result
in less performance for everyone who doesn't need to filter from
ingress. We have to move ingress to where it belongs.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 18:47           ` Alexei Starovoitov
  2015-05-10 19:00             ` Pablo Neira Ayuso
@ 2015-05-10 20:40             ` Daniel Borkmann
  1 sibling, 0 replies; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-10 20:40 UTC (permalink / raw)
  To: Alexei Starovoitov, Pablo Neira Ayuso; +Cc: netdev, davem, jhs

On 05/10/2015 08:47 PM, Alexei Starovoitov wrote:
...
> You've read the patch in a way that it slows down netif_receive
> _without_ ingress qdisc? Of course, that's not the case.
>
> Here are the number from my box:
> before:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 28.8
> ingress on this dev + u32 - 24.1
>
> after Daniel's two patches:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 36.5
> ingress on this dev + u32 - 25.2
>
> so when ingress qdisc is not used, the difference is zero.
> When ingress qdisc is added to another device - difference is zero.
> The last two numbers that we wanted to accelerate and we did.

+1, exactly!


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 19:50                     ` Pablo Neira Ayuso
@ 2015-05-10 21:31                       ` Daniel Borkmann
  2015-05-10 21:44                         ` Daniel Borkmann
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-10 21:31 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Alexei Starovoitov; +Cc: netdev, davem, jhs

On 05/10/2015 09:50 PM, Pablo Neira Ayuso wrote:
...
> The numbers show that the existing approach and your approach results
> in less performance for everyone that don't need to filter from
> ingress. We have to move ingress to where it belongs.

Your cleanup in patch 1 is okay, thanks for spotting it Pablo.

I agree with you on the qdisc_enqueue_root(), it's not needed, which I
removed in my set as well. Please note that my set doesn't introduce a
regression, it improves ingress performance however.

If there's no ingress user, then that code path is simply *nop*'ed out.
If ingress is present on one device but not on others, it also
doesn't make anything slower compared to the current state. And you can also
always compile out CONFIG_NET_CLS_ACT (which we actually could make more
fine-grained), if you really care.

A next possible step would be to get rid of the ingress netdev queue so
we can also reduce memory overhead. The only thing that is needed is
the classifier list, which is then invoked; we have all stated that
many times previously.

My other concern is that if we export the qdisc_ingress_hook function
pointer, out-of-tree modules can simply do
rcu_assign_pointer(qdisc_ingress_hook, my_own_handler) to transparently
implement their own hook, hm.

Best,
Daniel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 21:31                       ` Daniel Borkmann
@ 2015-05-10 21:44                         ` Daniel Borkmann
  2015-05-10 23:43                           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-10 21:44 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Alexei Starovoitov; +Cc: netdev, davem, jhs

On 05/10/2015 11:31 PM, Daniel Borkmann wrote:
> On 05/10/2015 09:50 PM, Pablo Neira Ayuso wrote:
> ...
>> The numbers show that the existing approach and your approach result
>> in less performance for everyone who doesn't need to filter from
>> ingress. We have to move ingress to where it belongs.
>
> Your cleanup in patch 1 is okay, thanks for spotting it Pablo.
>
> I agree with you on qdisc_enqueue_root(): it's not needed, and I
> removed it in my set as well. Please note that my set doesn't introduce
> a regression; on the contrary, it improves ingress performance.
>
> If there's no ingress user, then that code path is simply *nop*'ed out.
> If there's an ingress present on one device but not on others, it also
> doesn't make anything slower compared to the current state. And you can
> always compile out CONFIG_NET_CLS_ACT (which we actually could make more
> fine-grained), if you really care.

But I am still wondering: does your machine have static_key support?
If nothing is enabled, the code runs through a straight-line code path;
it's just a nop that is there.

> A next possible step would be to get rid of the ingress netdev queue so
> we can also reduce memory overhead. The only thing that is needed is
> the classifier list, which is then invoked; we have all stated that
> many times previously.
>
> My other concern is that if we export the qdisc_ingress_hook function
> pointer, out-of-tree modules can simply do
> rcu_assign_pointer(qdisc_ingress_hook, my_own_handler) to transparently
> implement their own hook, hm.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 21:44                         ` Daniel Borkmann
@ 2015-05-10 23:43                           ` Pablo Neira Ayuso
  2015-05-11  5:57                             ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-10 23:43 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, netdev, davem, jhs

On Sun, May 10, 2015 at 11:44:15PM +0200, Daniel Borkmann wrote:
> On 05/10/2015 11:31 PM, Daniel Borkmann wrote:
> >On 05/10/2015 09:50 PM, Pablo Neira Ayuso wrote:
> >...
> >>The numbers show that the existing approach and your approach result
> >>in less performance for everyone who doesn't need to filter from
> >>ingress. We have to move ingress to where it belongs.
> >
> >Your cleanup in patch 1 is okay, thanks for spotting it Pablo.
> >
> >I agree with you on qdisc_enqueue_root(): it's not needed, and I
> >removed it in my set as well. Please note that my set doesn't introduce
> >a regression; on the contrary, it improves ingress performance.
> >
> >If there's no ingress user, then that code path is simply *nop*'ed out.
> >If there's an ingress present on one device but not on others, it also
> >doesn't make anything slower compared to the current state. And you can
> >always compile out CONFIG_NET_CLS_ACT (which we actually could make more
> >fine-grained), if you really care.
> 
> But I am still wondering, does your machine have static_key support?

Yes:

CONFIG_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y

$ scripts/gcc-goto.sh gcc
y

> If nothing is enabled, the code runs through a straight-line code path,
> it's a nop that is there.

The noop is patched to an unconditional branch to skip that code, but
the code is still there in that path, even if it's dormant.

What the numbers show is rather simple: The more code is in the path,
the less performance you get, and the qdisc ingress specific code
embedded there is reducing performance for people that are not using
qdisc ingress, hence it should go where it belongs. The static key
cannot save you from that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-10 23:43                           ` Pablo Neira Ayuso
@ 2015-05-11  5:57                             ` Alexei Starovoitov
  2015-05-11 12:49                               ` Jamal Hadi Salim
                                                 ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-11  5:57 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Daniel Borkmann; +Cc: netdev, davem, jhs

On 5/10/15 4:43 PM, Pablo Neira Ayuso wrote:
>
> The noop is patched to an unconditional branch to skip that code, but
> the code is still there in that path, even if it's dormant.
>
> What the numbers show is rather simple: The more code is in the path,
> the less performance you get, and the qdisc ingress specific code
> embedded there is reducing performance for people that are not using
> qdisc ingress, hence it should go where it belongs. The static key
> cannot save you from that.

Hmm, did I miss these numbers?

My numbers are showing the opposite. There is no degradation whatsoever.
To recap, here are the numbers from my box:
no ingress - 37.6
ingress on other dev - 36.5
ingress on this dev - 28.8
ingress on this dev + u32 - 24.1

after Daniel's two patches:
no ingress - 37.6
ingress on other dev - 36.5
ingress on this dev - 36.5
ingress on this dev + u32 - 25.2

Explanation of the first lines:
'no ingress' means pure netif_receive_skb with drop in ip_rcv.
This is the case when ingress qdisc is not attached to any of the
devices.
Here static_key is off, so:
         if (static_key_false(&ingress_needed)) {
                 ... never reaches here ...
                 skb = handle_ing(skb, &pt_prev, &ret, orig_dev);

So the code path is the same, and the numbers before and after
prove that this is the case.

'ingress on other dev' means that ingress qdisc is attached to
some other device. Meaning that static_key is now on and
handle_ing() after two patches does:
cl = rcu_dereference_bh(skb->dev->ingress_cl_list);
if (!cl)
   return skb; ... returns here ...

Prior to two Daniel's patches handle_ing() does:
  rxq = rcu_dereference(skb->dev->ingress_queue);
  if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
     return skb; ... returns here ...

so the number of instructions is the same for the 'ingress on other dev'
case too.
Not surprisingly, the before and after numbers for 'no ingress' and
'ingress on other dev' are exactly the same.

It sounds like you're saying that code that is not even being executed
is somehow affecting the speed? How is that possible?
What kind of benchmark do you run? I really want to get to the bottom
of it. We cannot use 'drive-by reviewing' and 'gut feel' to make
decisions. If you think my numbers are wrong, please rerun them.
I'm using pktgen with 'xmit_mode netif_receive'. If there is some
other benchmark, please bring it on. Everyone will benefit.
If you think pktgen as a test is not representative of real numbers,
sure, that's a different discussion. Let's talk about what the right
test to use would be. Just claiming that inlining a function into
netif_receive_skb hurts performance is bogus. Inlining hurts
when the size of the executed code grows beyond the I-cache size. That's
clearly not the case here. The compiler moves the inlined handle_ing()
into the cold path of netif_receive_skb. So for anyone who doesn't
enable the ingress qdisc there is no difference, since that code doesn't
even get into the I-cache.
Here is snippet from net/core/dev.s:
	/* below is line 3696: if (static_key_false(&ingress_needed)) */
         1:.byte 0x0f,0x1f,0x44,0x00,0
         .pushsection __jump_table,  "aw"
          .balign 8
          .quad 1b, .L1756, ingress_needed       #,
         .popsection
	/* below is line 3702: skb->tc_verd = 0; */
	movw    $0, 150(%rbx)   #, skb_310->tc_verd

As you can see, the inlined handle_ing() is not in the fall-through path
and is placed by gcc at the end of netif_receive_skb_core at the label
L1756. So I cannot possibly see how it can affect performance.
Even if we make handle_ing() twice as big, there still won't be
any difference for users that don't use the ingress qdisc.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11  5:57                             ` Alexei Starovoitov
@ 2015-05-11 12:49                               ` Jamal Hadi Salim
  2015-05-11 12:58                               ` Daniel Borkmann
  2015-05-11 13:32                               ` Pablo Neira Ayuso
  2 siblings, 0 replies; 32+ messages in thread
From: Jamal Hadi Salim @ 2015-05-11 12:49 UTC (permalink / raw)
  To: Alexei Starovoitov, Pablo Neira Ayuso, Daniel Borkmann; +Cc: netdev, davem


We need to agree on what the benchmark is. The metrics below are
fine for whoever is doing measurements (expecting different kernels
and hardware to behave differently, so a kernel config will help):

1) no ingress (if the drop is in ip_rcv, does it look at some header?)
2) ingress on other dev (this should mean the #1 drop still applies)
3) ingress on this dev (#1 drop in effect)
4) ingress on this dev (#1 drop in effect)
5) ingress on this dev + drop at u32 classifier/action.

Variations of the above:
A) Current kernel
B) ingress static key introduced
C) Pablo's vs Daniel's settings

The one thing I see in Pablo's comments is that extra code will slow
things down. I see Alexei refuting that below, since that code is
not being executed.
So add some variation with extra code to see what happens.

It is hard, but please stay calm and carry on; we all want
what's best.

cheers,
jamal

On 05/11/15 01:57, Alexei Starovoitov wrote:
> On 5/10/15 4:43 PM, Pablo Neira Ayuso wrote:
>>
>> The noop is patched to an unconditional branch to skip that code, but
>> the code is still there in that path, even if it's dormant.
>>
>> What the numbers show is rather simple: The more code is in the path,
>> the less performance you get, and the qdisc ingress specific code
>> embedded there is reducing performance for people that are not using
>> qdisc ingress, hence it should go where it belongs. The static key
>> cannot save you from that.
>
> Hmm, did I miss these numbers?
>
> My numbers are showing the opposite. There is no degradation whatsoever.
> To recap, here are the numbers from my box:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 28.8
> ingress on this dev + u32 - 24.1
>
> after Daniel's two patches:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 36.5
> ingress on this dev + u32 - 25.2
>
> Explanation of the first lines:
> 'no ingress' means pure netif_receive_skb with drop in ip_rcv.
> This is the case when ingress qdisc is not attached to any of the
> devices.
> Here static_key is off, so:
>          if (static_key_false(&ingress_needed)) {
>                  ... never reaches here ...
>                  skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
>
> So the code path is the same, and the numbers before and after
> prove that this is the case.
>
> 'ingress on other dev' means that ingress qdisc is attached to
> some other device. Meaning that static_key is now on and
> handle_ing() after two patches does:
> cl = rcu_dereference_bh(skb->dev->ingress_cl_list);
> if (!cl)
>    return skb; ... returns here ...
>
> Prior to two Daniel's patches handle_ing() does:
>   rxq = rcu_dereference(skb->dev->ingress_queue);
>   if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
>      return skb; ... returns here ...
>
> so the number of instructions is the same for the 'ingress on other dev'
> case too.
> Not surprisingly, the before and after numbers for 'no ingress' and
> 'ingress on other dev' are exactly the same.
>
> It sounds like you're saying that code that is not even being executed
> is somehow affecting the speed? How is that possible?
> What kind of benchmark do you run? I really want to get to the bottom
> of it. We cannot use 'drive-by reviewing' and 'gut feel' to make
> decisions. If you think my numbers are wrong, please rerun them.
> I'm using pktgen with 'xmit_mode netif_receive'. If there is some
> other benchmark, please bring it on. Everyone will benefit.
> If you think pktgen as a test is not representative of real numbers,
> sure, that's a different discussion. Let's talk about what the right
> test to use would be. Just claiming that inlining a function into
> netif_receive_skb hurts performance is bogus. Inlining hurts
> when the size of the executed code grows beyond the I-cache size. That's
> clearly not the case here. The compiler moves the inlined handle_ing()
> into the cold path of netif_receive_skb. So for anyone who doesn't
> enable the ingress qdisc there is no difference, since that code doesn't
> even get into the I-cache.
> Here is snippet from net/core/dev.s:
>      /* below is line 3696: if (static_key_false(&ingress_needed)) */
>          1:.byte 0x0f,0x1f,0x44,0x00,0
>          .pushsection __jump_table,  "aw"
>           .balign 8
>           .quad 1b, .L1756, ingress_needed       #,
>          .popsection
>      /* below is line 3702: skb->tc_verd = 0; */
>      movw    $0, 150(%rbx)   #, skb_310->tc_verd
>
> As you can see, the inlined handle_ing() is not in the fall-through path
> and is placed by gcc at the end of netif_receive_skb_core at the label
> L1756. So I cannot possibly see how it can affect performance.
> Even if we make handle_ing() twice as big, there still won't be
> any difference for users that don't use the ingress qdisc.
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11  5:57                             ` Alexei Starovoitov
  2015-05-11 12:49                               ` Jamal Hadi Salim
@ 2015-05-11 12:58                               ` Daniel Borkmann
  2015-05-11 17:15                                 ` Alexei Starovoitov
  2015-05-11 13:32                               ` Pablo Neira Ayuso
  2 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-11 12:58 UTC (permalink / raw)
  To: Alexei Starovoitov, Pablo Neira Ayuso; +Cc: netdev, davem, jhs

On 05/11/2015 07:57 AM, Alexei Starovoitov wrote:
...
> It sounds like you're saying that code that is not even being executed
> is somehow affecting the speed? How is that possible?
> What kind of benchmark do you run? I really want to get to the bottom
> of it.

I'm also curious and would like to get it clarified, so I've added the
below silly test patch.

Since the ingress path is *not* executed due to the static key, it just
bloats up the handle_ing code, which is the whole purpose of this test.

The claim is: since handle_ing() is inlined, it adds run-time overhead
*even* if the static key prevents its execution.

From the disassembly, I've verified that the printk's below are not
optimized out and that handle_ing() is inlined.

The *disabled* code path gets added below the normal execution path of
__netif_receive_skb_core(), in other words, after the return location
of 'expected' execution.

    text	   data	    bss	    dec	    hex	filename
   57036	   2204	   2858	  62098	   f292	net/core/dev.o.before-printk
   60180	   2204	   2858	  65242	   feda	net/core/dev.o.after-printk

Before printk change:

Samples: 50K of event 'cycles:k', Event count (approx.): 45068616248
+   40.51%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
+   31.28%   kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
+    6.79%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
+    6.65%   kpktgend_0  [pktgen]           [k] pktgen_thread_worker
+    6.62%   kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
+    3.41%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
+    3.21%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
+    0.95%   kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip
+    0.40%   kpktgend_0  [kernel.kallsyms]  [k] kthread_should_stop
+    0.02%   kpktgend_0  [kernel.kallsyms]  [k] _cond_resched

After printk-change:

Samples: 50K of event 'cycles:k', Event count (approx.): 45200368743
+   40.11%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
+   31.09%   kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
+    6.93%   kpktgend_0  [pktgen]           [k] pktgen_thread_worker
+    6.75%   kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
+    6.58%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
+    3.49%   kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
+    3.39%   kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
+    0.96%   kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip
+    0.49%   kpktgend_0  [kernel.kallsyms]  [k] kthread_should_stop
+    0.04%   kpktgend_0  [kernel.kallsyms]  [k] _cond_resched

I would consider the differences after the decimal point as noise.

After your patch set:

Samples: 50K of event 'cycles:k', Event count (approx.): 45160667741
+   40.49%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
+   31.21%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
+    6.94%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
+    6.63%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
+    6.63%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
+    3.30%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
+    3.30%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
+    0.96%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip
+    0.37%  kpktgend_0  [kernel.kallsyms]  [k] kthread_should_stop
+    0.03%  kpktgend_0  [kernel.kallsyms]  [k] _cond_resched

For *all* three, I reliably get ~40.0 Mpps with the benchmark.

In your perf report I see in addition ...

     18.46%  kpktgend_0  [kernel.kallsyms]  [k] atomic_dec_and_test
     15.87%  kpktgend_0  [kernel.kallsyms]  [k] deliver_ptype_list_skb

... so I have no idea how/where that possibly interferes with your
before/after numbers.

Best & thanks,
Daniel

diff --git a/net/core/dev.c b/net/core/dev.c
index 862875e..7ec18fd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3545,11 +3545,178 @@ static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
  	return result;
  }

-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
+static __always_inline struct sk_buff *handle_ing(struct sk_buff *skb,
  					 struct packet_type **pt_prev,
  					 int *ret, struct net_device *orig_dev)
  {
  	struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
+	int i = 0;
+
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);
+	printk("XXX %d\n", i++);

  	if (!rxq || rcu_access_pointer(rxq->qdisc) == &noop_qdisc)
  		return skb;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11  5:57                             ` Alexei Starovoitov
  2015-05-11 12:49                               ` Jamal Hadi Salim
  2015-05-11 12:58                               ` Daniel Borkmann
@ 2015-05-11 13:32                               ` Pablo Neira Ayuso
  2015-05-11 14:35                                 ` Eric Dumazet
  2 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-11 13:32 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Daniel Borkmann, netdev, davem, jhs

[-- Attachment #1: Type: text/plain, Size: 2475 bytes --]

On Sun, May 10, 2015 at 10:57:44PM -0700, Alexei Starovoitov wrote:
> On 5/10/15 4:43 PM, Pablo Neira Ayuso wrote:
> >
> >The noop is patched to an unconditional branch to skip that code, but
> >the code is still there in that path, even if it's dormant.
> >
> >What the numbers show is rather simple: The more code is in the path,
> >the less performance you get, and the qdisc ingress specific code
> >embedded there is reducing performance for people that are not using
> >qdisc ingress, hence it should go where it belongs. The static key
> >cannot save you from that.
> 
> Hmm, did I miss these numbers?
> 
> My numbers are showing the opposite. There is no degradation whatsoever.
> To recap, here are the numbers from my box:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 28.8
> ingress on this dev + u32 - 24.1
> 
> after Daniel's two patches:
> no ingress - 37.6
> ingress on other dev - 36.5
> ingress on this dev - 36.5
> ingress on this dev + u32 - 25.2

I'm refering to the numbers and the scenario that I describe in:
http://patchwork.ozlabs.org/patch/470512/

ie. no qdisc ingress before my patch vs. no qdisc ingress after my patch.

that shows that moving that code to sch_ingress results in a
performance improvements.

Some more numbers. I intentionally added more static key code that
depends on the ingress_needed key (see the attached patch); the numbers
show a clear performance drop:

Result: OK: 5049126(c5049126+d0) usec, 100000000 (60byte,0frags)
  19805404pps 9506Mb/sec (9506593920bps) errors: 100000000

Remember that the base (the results with my patch applied) is:

Result: OK: 4747078(c4747078+d0) usec, 100000000 (60byte,0frags)
  21065587pps 10111Mb/sec (10111481760bps) errors: 100000000

That's 1 Mpps less because of the additional code under the static key.

So I'm measuring an impact from the use of static keys. Yes, I have
jump_label support, as I told Daniel, and I can reproduce these
numbers over and over again. Perf also shows that I'm measuring the
right thing.

I'm running exactly the same pktgen test with netif_receive, using the
script from the patch description, so it's basically the same test
here.

My old box is a Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz

lscpu on debian shows:

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

Please, carefully read my patch description. I think you can rebuild
your work on top of this patch. Thanks.

[-- Attachment #2: more-static-keys.patch --]
[-- Type: text/x-diff, Size: 1072 bytes --]

diff --git a/net/core/dev.c b/net/core/dev.c
index f5aeaf5..2a2047b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3689,6 +3689,13 @@ ncls:
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
 		goto drop;
 
+	if (static_key_false(&ingress_needed)) {
+		int i = 0;
+
+		for (i = 0; i < 10; i++)
+			pr_info("this code depends on qdisc ingress\n");
+	}
+
 	if (skb_vlan_tag_present(skb)) {
 		if (pt_prev) {
 			ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -3721,6 +3728,13 @@ ncls:
 		}
 	}
 
+	if (static_key_false(&ingress_needed)) {
+		int i = 0;
+
+		for (i = 0; i < 10; i++)
+			pr_info("this also depends on qdisc ingress\n");
+	}
+
 	if (unlikely(skb_vlan_tag_present(skb))) {
 		if (skb_vlan_tag_get_id(skb))
 			skb->pkt_type = PACKET_OTHERHOST;
@@ -3748,6 +3762,13 @@ ncls:
 				       &skb->dev->ptype_specific);
 	}
 
+	if (static_key_false(&ingress_needed)) {
+		int i = 0;
+
+		for (i = 0; i < 10; i++)
+			pr_info("more code that depends on qdisc ingress\n");
+	}
+
 	if (pt_prev) {
 		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 			goto drop;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 13:32                               ` Pablo Neira Ayuso
@ 2015-05-11 14:35                                 ` Eric Dumazet
  2015-05-11 23:02                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2015-05-11 14:35 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, davem, jhs

On Mon, 2015-05-11 at 15:32 +0200, Pablo Neira Ayuso wrote:

> My old box is a Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
> 
> lscpu on debian shows:
> 
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              3072K
> 
> Please, carefully read my patch description. I think you can rebuild
> your work on top of this patch. Thanks.

Yes, I guess even this simple patch might make your box faster.

We might have to add a noinline_for_icache_sake


# size net/core/dev.o net/core/dev.o.after
   text	   data	    bss	    dec	    hex	filename
  57028	   2260	   2864	  62152	   f2c8	net/core/dev.o
  56468	   2260	   2864	  61592	   f098	net/core/dev.o.after

diff --git a/net/core/dev.c b/net/core/dev.c
index 862875ec8f2f30dc39fec7bb6c00c75e202ebeaa..03080ed5e639a938715e6d499949fa1ab12a8908 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1765,7 +1765,7 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
-static inline int deliver_skb(struct sk_buff *skb,
+static noinline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
@@ -3545,7 +3545,7 @@ static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 	return result;
 }
 
-static inline struct sk_buff *handle_ing(struct sk_buff *skb,
+static noinline struct sk_buff *handle_ing(struct sk_buff *skb,
 					 struct packet_type **pt_prev,
 					 int *ret, struct net_device *orig_dev)
 {

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 12:58                               ` Daniel Borkmann
@ 2015-05-11 17:15                                 ` Alexei Starovoitov
  0 siblings, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-11 17:15 UTC (permalink / raw)
  To: Daniel Borkmann, Pablo Neira Ayuso; +Cc: netdev, davem, jhs

On 5/11/15 5:58 AM, Daniel Borkmann wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
>
> -static inline struct sk_buff *handle_ing(struct sk_buff *skb,
> +static __always_inline struct sk_buff *handle_ing(struct sk_buff *skb,
>                        struct packet_type **pt_prev,
>                        int *ret, struct net_device *orig_dev)
>   {
>       struct netdev_queue *rxq = rcu_dereference(skb->dev->ingress_queue);
> +    int i = 0;
> +
> +    printk("XXX %d\n", i++);
> +    printk("XXX %d\n", i++);
   .. lots of printk...

That's an interesting test! I tried it out as well:

current baseline:
37711847pps 18101Mb/sec (18101686560bps) errors: 10000000
37776912pps 18132Mb/sec (18132917760bps) errors: 10000000
37700180pps 18096Mb/sec (18096086400bps) errors: 10000000
37730169pps 18110Mb/sec (18110481120bps) errors: 10000000

with massive printk bloating in _inlined_ handle_ing:
37744223pps 18117Mb/sec (18117227040bps) errors: 10000000
37718786pps 18105Mb/sec (18105017280bps) errors: 10000000
37742087pps 18116Mb/sec (18116201760bps) errors: 10000000
37727777pps 18109Mb/sec (18109332960bps) errors: 10000000

No performance difference, as expected, which matches what Daniel is seeing.

Then I've tried to do 'noinline' for handle_ing():
36818072pps 17672Mb/sec (17672674560bps) errors: 10000000
36828761pps 17677Mb/sec (17677805280bps) errors: 10000000
36840106pps 17683Mb/sec (17683250880bps) errors: 10000000
36885403pps 17704Mb/sec (17704993440bps) errors: 10000000

This drop, when static_key is supposed to protect handle_ing(),
was totally unexpected.
So I started digging into assembler before and after.
It turned out that with handle_ing inlined, GCC can see what is
happening with the pt_prev and ret pointers, so the asm looks like:
movl    $1, %r15d       #, ret
xorl    %r12d, %r12d    # pt_prev

when handle_ing is not inlined, the asm of netif_receive_skb has:
movl    $1, -68(%rbp)   #, ret
movq    $0, -64(%rbp)   #, pt_prev

To test it further I've tried:
+static noinline struct sk_buff *handle_ing_finish(struct sk_buff *skb,
+                                                 struct tcf_proto *cl)
...
static inline struct sk_buff *handle_ing(struct sk_buff *skb,
+                                        struct packet_type **pt_prev,
+                                        int *ret, struct net_device *orig_dev)
+{
+       struct tcf_proto *cl = rcu_dereference_bh(skb->dev->ingress_cl_list);
+
+       if (!cl)
+               return skb;
+       if (*pt_prev) {
+               *ret = deliver_skb(skb, *pt_prev, orig_dev);
+               *pt_prev = NULL;
+       }
+       return handle_ing_finish(skb, cl);
+}

so tc ingress part would not be inlined, but deliver_skb bits are.
The performance went back to normal:
37701570pps 18096Mb/sec (18096753600bps) errors: 10000000
37752444pps 18121Mb/sec (18121173120bps) errors: 10000000
37719331pps 18105Mb/sec (18105278880bps) errors: 10000000

Unfortunately this last experiment hurts ingress+u32 case
that dropped from 25.2 Mpps to 24.5 Mpps.

Will keep digging into it more. Stay tuned.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 14:35                                 ` Eric Dumazet
@ 2015-05-11 23:02                                   ` Alexei Starovoitov
  2015-05-11 23:30                                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-11 23:02 UTC (permalink / raw)
  To: Eric Dumazet, Pablo Neira Ayuso; +Cc: Daniel Borkmann, netdev, davem, jhs

On 5/11/15 7:35 AM, Eric Dumazet wrote:
>
> -static inline int deliver_skb(struct sk_buff *skb,
> +static noinline int deliver_skb(struct sk_buff *skb,
>   			      struct packet_type *pt_prev,
>   			      struct net_device *orig_dev)

I have tried the above change alone and didn't see any difference
for the simple 'netif_receive + drop in ip_rcv' test.
Not sure whether it's actually worth doing. I would leave it as-is.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 23:02                                   ` Alexei Starovoitov
@ 2015-05-11 23:30                                     ` Eric Dumazet
  2015-05-12  3:21                                       ` Alexei Starovoitov
  2015-05-12 12:55                                       ` Pablo Neira Ayuso
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2015-05-11 23:30 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Pablo Neira Ayuso, Daniel Borkmann, netdev, davem, jhs

On Mon, 2015-05-11 at 16:02 -0700, Alexei Starovoitov wrote:
> On 5/11/15 7:35 AM, Eric Dumazet wrote:
> >
> > -static inline int deliver_skb(struct sk_buff *skb,
> > +static noinline int deliver_skb(struct sk_buff *skb,
> >   			      struct packet_type *pt_prev,
> >   			      struct net_device *orig_dev)
> 
> have tried the above only and didn't see any difference
> for simple 'netif_receive + drop in ip_rcv' test.
> Not sure whether it's actually worth doing. I would leave it as-is.

Yes, this was probably too aggressive.

unlikely() or static_key_false() do not move code away enough; the whole
function, including its unused code, pollutes the icache.

Code size increased a lot, while L1/L2 caches on CPUs are about the same
as 6 years ago.

For example, commit 7866a621043fbaca3d7389e9b9f69dd1a2e5a855
helped a given workload, but probably made things slower for most common
cases.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 23:30                                     ` Eric Dumazet
@ 2015-05-12  3:21                                       ` Alexei Starovoitov
  2015-05-12 12:55                                       ` Pablo Neira Ayuso
  1 sibling, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-12  3:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pablo Neira Ayuso, Daniel Borkmann, netdev, davem, jhs

On 5/11/15 4:30 PM, Eric Dumazet wrote:
>
> For example , commit 7866a621043fbaca3d7389e9b9f69dd1a2e5a855
> helped a given workload, but probably made things slower for most common
> cases.

Yes, indeed. Reverting it improves 'netif_receive + drop in ip_rcv'
from 41.1 to 42.6 Mpps. I've been trying to come up with a simple way
to roll the global ptype_all into skb->dev->ptype_all, but registering
a device notifier seems overkill to remove one loop from netif_receive.
Also tried to partially remove pt_prev logic from the first half
of netif_receive and keep it only for deliver_ptype_list_skb() part,
but that didn't help.
Then tried this:
-static int __netif_receive_skb(struct sk_buff *skb)
+static inline int __netif_receive_skb(struct sk_buff *skb)
...
-static int netif_receive_skb_internal(struct sk_buff *skb)
+static inline int netif_receive_skb_internal(struct sk_buff *skb)

it helped to go from 41.1 to 43.1 Mpps, but the size increase is not negligible:
    text	   data	    bss	    dec	    hex	filename
   55990	   1667	   2856	  60513	   ec61	dev.o.base
   56403	   1907	   2856	  61166	   eeee	dev.o.inline

inlining only one of them (either __netif_receive_skb or
netif_receive_skb_internal) gives minimal gain.
Still exploring...


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-11 23:30                                     ` Eric Dumazet
  2015-05-12  3:21                                       ` Alexei Starovoitov
@ 2015-05-12 12:55                                       ` Pablo Neira Ayuso
  2015-05-12 13:27                                         ` Daniel Borkmann
  1 sibling, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2015-05-12 12:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, davem, jhs

Hi Eric,

On Mon, May 11, 2015 at 04:30:38PM -0700, Eric Dumazet wrote:
> On Mon, 2015-05-11 at 16:02 -0700, Alexei Starovoitov wrote:
> > On 5/11/15 7:35 AM, Eric Dumazet wrote:
> > >
> > > -static inline int deliver_skb(struct sk_buff *skb,
> > > +static noinline int deliver_skb(struct sk_buff *skb,
> > >   			      struct packet_type *pt_prev,
> > >   			      struct net_device *orig_dev)
> > 
> > have tried the above only and didn't see any difference
> > for simple 'netif_receive + drop in ip_rcv' test.
> > Not sure whether it's actually worth doing. I would leave it as-is.
> 
> Yes, this was probably too aggressive.

I tested this noinline patch and I got a bit less performance here in
my 32K i-cache testbed.

> unlikely() or static_key_false() are no moving code away enough, whole
> function including unused code pollutes icache.
>
> Code size increased a lot, while L1/L2 caches on cpu are about the same
> than 6 years ago.

OK, so that explains why I'm getting more performance with the patch
that move the qdisc ingress code using the indirection, since that
helped to move code away, right?

BTW, looking at the emails, Daniel said:

[...]
> After your patch set:
>
> Samples: 50K of event 'cycles:k', Event count (approx.): 45160667741
> +   40.49%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
> +   31.21%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
> +    6.94%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
> +    6.63%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
> +    6.63%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
> +    3.30%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
> +    3.30%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
> +    0.96%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip
> +    0.37%  kpktgend_0  [kernel.kallsyms]  [k] kthread_should_stop
> +    0.03%  kpktgend_0  [kernel.kallsyms]  [k] _cond_resched
>
> For *all* three, I reliably get ~40.0 Mpps with the benchmark.

@Daniel, Alexei: Are you getting the same numbers with the indirection?

What's the i-cache size in your testbed?

Thanks.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-12 12:55                                       ` Pablo Neira Ayuso
@ 2015-05-12 13:27                                         ` Daniel Borkmann
  2015-05-12 21:22                                           ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-12 13:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Eric Dumazet; +Cc: Alexei Starovoitov, netdev, davem, jhs

Hi Pablo,

On 05/12/2015 02:55 PM, Pablo Neira Ayuso wrote:
> On Mon, May 11, 2015 at 04:30:38PM -0700, Eric Dumazet wrote:
>> On Mon, 2015-05-11 at 16:02 -0700, Alexei Starovoitov wrote:
>>> On 5/11/15 7:35 AM, Eric Dumazet wrote:
>>>>
>>>> -static inline int deliver_skb(struct sk_buff *skb,
>>>> +static noinline int deliver_skb(struct sk_buff *skb,
>>>>    			      struct packet_type *pt_prev,
>>>>    			      struct net_device *orig_dev)
>>>
>>> have tried the above only and didn't see any difference
>>> for simple 'netif_receive + drop in ip_rcv' test.
>>> Not sure whether it's actually worth doing. I would leave it as-is.
>>
>> Yes, this was probably too aggressive.
>
> I tested this noinline patch and I got a bit less performance here in
> my 32K i-cache testbed.
>
>> unlikely() or static_key_false() are no moving code away enough, whole
>> function including unused code pollutes icache.
>>
>> Code size increased a lot, while L1/L2 caches on cpu are about the same
>> than 6 years ago.
>
> OK, so that explains why I'm getting more performance with the patch
> that move the qdisc ingress code using the indirection, since that
> helped to move code away, right?
>
> BTW, looking at the emails, Daniel said:
>
> [...]
>> After your patch set:
>>
>> Samples: 50K of event 'cycles:k', Event count (approx.): 45160667741
>> +   40.49%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>> +   31.21%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
>> +    6.94%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
>> +    6.63%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
>> +    6.63%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
>> +    3.30%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_sk
>> +    3.30%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb
>> +    0.96%  kpktgend_0  [kernel.kallsyms]  [k] __local_bh_enable_ip
>> +    0.37%  kpktgend_0  [kernel.kallsyms]  [k] kthread_should_stop
>> +    0.03%  kpktgend_0  [kernel.kallsyms]  [k] _cond_resched
>>
>> For *all* three, I reliably get ~40.0 Mpps with the benchmark.
>
> @Daniel, Alexei: Are you getting the same numbers with the indirection?

The above I tested was basically your set from here on top of net-next
when ingress is disabled. It didn't show an effect (similarly as the
printk bloat test patch on my side). So, judging from your observations,
the printk bloat would also make things slower on your side, right?

> What's the i-cache size in your testbed?

For the Xeon E3-1240, I get (via lscpu):

L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K

Best,
Daniel


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-12 13:27                                         ` Daniel Borkmann
@ 2015-05-12 21:22                                           ` Alexei Starovoitov
  2015-05-12 21:43                                             ` Daniel Borkmann
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2015-05-12 21:22 UTC (permalink / raw)
  To: Daniel Borkmann, Pablo Neira Ayuso, Eric Dumazet; +Cc: netdev, davem, jhs

On 5/12/15 6:27 AM, Daniel Borkmann wrote:
>
>> What's the i-cache size in your testbed?
>
> For the Xeon E3-1240, I get (via lscpu):
>
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              8192K

my E5-1630 v3 @ 3.70GHz:
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K

I think it's not the CPU that is causing the discrepancies
between our numbers, but a difference in compilers or flags.

Looking at Pablo's perf profile:
     36.12%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
     18.46%  kpktgend_0  [kernel.kallsyms]  [k] atomic_dec_and_test
     15.87%  kpktgend_0  [kernel.kallsyms]  [k] deliver_ptype_list_skb
      5.04%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
      4.81%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
      4.11%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
      3.89%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv

It means that deliver_ptype_list_skb() is not inlined, which is odd,
and atomic_dec_and_test() from kfree_skb() is not inlined either.
Both functions are marked 'static inline'. So I suspect the kernel was
compiled with a broken gcc or with CONFIG_CC_OPTIMIZE_FOR_SIZE set.
If gcc is old/broken, it's really bad, since it can be mis-optimizing
a bunch of other things.
If optimize_for_size is set, it's not great for performance
either, since the compiler will be trying way too hard to squeeze
code size, losing performance left and right.
btw, there is a patch pending on lkml to make
atomic_dec_and_test() __always_inline.

-Os also causes static_key to ignore 'unlikely', so all cold
branches are generated as fall-through, which causes I-cache misses.
I've looked at net/core/dev.s with -Os and it's not pretty.
bstats_update, deliver_skb, deliver_ptype_list_skb are all not inlined.

There was a thread on lkml recently to request better behaving -Os from
gcc guys, but I think it didn't go anywhere.


* Re: [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs
  2015-05-12 21:22                                           ` Alexei Starovoitov
@ 2015-05-12 21:43                                             ` Daniel Borkmann
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Borkmann @ 2015-05-12 21:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Pablo Neira Ayuso, Eric Dumazet; +Cc: netdev, davem, jhs

On 05/12/2015 11:22 PM, Alexei Starovoitov wrote:
> On 5/12/15 6:27 AM, Daniel Borkmann wrote:
>>
>>> What's the i-cache size in your testbed?
>>
>> For the Xeon E3-1240, I get (via lscpu):
>>
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              8192K
>
> my E5-1630 v3 @ 3.70GHz:
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              10240K
>
> I think it's not the CPU that is causing the discrepancies
> between our numbers, but a difference in compilers or flags.
>
> Looking at Pablo's perf profile:
>      36.12%  kpktgend_0  [kernel.kallsyms]  [k] __netif_receive_skb_core
>      18.46%  kpktgend_0  [kernel.kallsyms]  [k] atomic_dec_and_test
>      15.87%  kpktgend_0  [kernel.kallsyms]  [k] deliver_ptype_list_skb
>       5.04%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
>       4.81%  kpktgend_0  [kernel.kallsyms]  [k] netif_receive_skb_internal
>       4.11%  kpktgend_0  [kernel.kallsyms]  [k] kfree_skb
>       3.89%  kpktgend_0  [kernel.kallsyms]  [k] ip_rcv
>
> It means that deliver_ptype_list_skb() is not inlined, which is odd
> and atomic_dec_and_test() from kfree_skb() is also not inlined either.
> Both functions are marked 'static inline'. So I suspect the kernel was
> compiled with some broken gcc or CONFIG_CC_OPTIMIZE_FOR_SIZE is set.
> If gcc is old/broken, it's really bad, since it can be mis-optimizing
> bunch of other things.

There was a recent lkml thread from Hagen wrt bad inlining heuristics
of gcc:

   https://lkml.org/lkml/2015/4/20/637
   https://lkml.org/lkml/2015/4/23/598

"Here is the situation: the inlining problem occur with the 4.9.x
  branch - I tried to reproduce it with 4.8.x and saw *no* problems."

[ I was using: gcc (GCC) 4.8.3 20140624 (Red Hat 4.8.3-1) ]

> If optimize_for_size is set, then it's not great for performance
> either, since compiler will be trying way too hard to squeeze
> code size and losing performance left and right.
> btw, there is patch pending on lkml to make
> atomic_dec_and_test() __always_inline.
>
> -Os is also causing static_key to ignore 'unlikely', so all cold
> branches are generated as fall through which causing I-cache misses.
> I've looked at net/core/dev.s with -Os and it's not pretty.
> bstats_update, deliver_skb, deliver_ptype_list_skb are all not inlined.
>
> There was a thread on lkml recently to request better behaving -Os from
> gcc guys, but I think it didn't go anywhere.


Thread overview: 32+ messages
2015-05-10 16:59 [PATCH 0/2 net-next] critical ingress path performance improvements Pablo Neira Ayuso
2015-05-10 16:59 ` [PATCH 1/2 net-next] net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset Pablo Neira Ayuso
2015-05-10 16:58   ` Sergei Shtylyov
2015-05-10 16:59 ` [PATCH 2/2 net-next] net: move qdisc ingress filtering code where it belongs Pablo Neira Ayuso
2015-05-10 17:25   ` Eric Dumazet
2015-05-10 17:45   ` Alexei Starovoitov
2015-05-10 17:59     ` Pablo Neira Ayuso
2015-05-10 18:05       ` Alexei Starovoitov
2015-05-10 18:24         ` Pablo Neira Ayuso
2015-05-10 18:47           ` Alexei Starovoitov
2015-05-10 19:00             ` Pablo Neira Ayuso
2015-05-10 19:06               ` Alexei Starovoitov
2015-05-10 19:20                 ` Pablo Neira Ayuso
2015-05-10 19:37                   ` Alexei Starovoitov
2015-05-10 19:50                     ` Pablo Neira Ayuso
2015-05-10 21:31                       ` Daniel Borkmann
2015-05-10 21:44                         ` Daniel Borkmann
2015-05-10 23:43                           ` Pablo Neira Ayuso
2015-05-11  5:57                             ` Alexei Starovoitov
2015-05-11 12:49                               ` Jamal Hadi Salim
2015-05-11 12:58                               ` Daniel Borkmann
2015-05-11 17:15                                 ` Alexei Starovoitov
2015-05-11 13:32                               ` Pablo Neira Ayuso
2015-05-11 14:35                                 ` Eric Dumazet
2015-05-11 23:02                                   ` Alexei Starovoitov
2015-05-11 23:30                                     ` Eric Dumazet
2015-05-12  3:21                                       ` Alexei Starovoitov
2015-05-12 12:55                                       ` Pablo Neira Ayuso
2015-05-12 13:27                                         ` Daniel Borkmann
2015-05-12 21:22                                           ` Alexei Starovoitov
2015-05-12 21:43                                             ` Daniel Borkmann
2015-05-10 20:40             ` Daniel Borkmann
