* significant napi_gro_frags() slowdown
@ 2014-03-27 16:39 Michal Schmidt
  2014-03-27 16:47 ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Schmidt @ 2014-03-27 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Jerry Chu, Eric Dumazet, David S. Miller

Hello,

I received a report about a performance regression on a simple workload 
of receiving a single TCP stream using a be2net NIC. Previously the 
reporter was able to reach more than 9 Gb/s, but now he sees less than 7 
Gb/s.

On my system the performance loss was not so obvious, but profiling with 
perf showed significant time spent in copying memory in 
__pskb_pull_tail(). Here is an example perf histogram from 3.14-rc8:

-  14.03%         swapper  [kernel.kallsyms]        [k] memcpy
    - memcpy
       - 99.99% __pskb_pull_tail
          - 46.16% tcp_gro_receive
               tcp4_gro_receive
               inet_gro_receive
               dev_gro_receive
               napi_gro_frags
               be_process_rx
               be_poll
               net_rx_action
               __do_softirq
             + irq_exit
          - 31.18% napi_gro_frags
               be_process_rx
               be_poll
               net_rx_action
               __do_softirq
               irq_exit
               do_IRQ
             + ret_from_intr
          - 22.66% inet_gro_receive
               dev_gro_receive
               napi_gro_frags
               be_process_rx
               be_poll
               net_rx_action
               __do_softirq
               irq_exit
               do_IRQ
             + ret_from_intr
-  13.44%       netserver  [kernel.kallsyms]        [k] copy_user_generic_string
    - copy_user_generic_string
    - skb_copy_datagram_iovec
       + 56.70% skb_copy_datagram_iovec
       + 32.27% tcp_recvmsg
       + 11.03% tcp_rcv_established
-   5.64%         swapper  [kernel.kallsyms]        [k] __pskb_pull_tail
    - __pskb_pull_tail
       + 48.58% tcp_gro_receive
       + 24.48% inet_gro_receive
       + 24.11% napi_gro_frags
       + 1.28% tcp4_gro_receive
       + 0.91% be_process_rx
       + 0.64% dev_gro_receive
+   5.13%         swapper  [kernel.kallsyms]        [k] skb_copy_bits
...

Bisection identified this first bad commit:

     commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36
     Author: Jerry Chu <hkchu@google.com>
     Date:   Wed Dec 11 20:53:45 2013 -0800

         net-gro: Prepare GRO stack for the upcoming tunneling support

Before this commit, the GRO code could access the received packets' 
headers quickly via NAPI_GRO_CB(skb)->frag0. After the commit, this 
optimization no longer applies: napi_frags_skb() now always pulls the 
Ethernet header out of the first frag. As a result, the .gro_receive 
functions take their slow paths and always pull the remaining headers 
with __pskb_pull_tail().
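
For context, here is a simplified sketch of the header-access pattern 
used by .gro_receive handlers (loosely modeled on inet_gro_receive(); 
illustrative only, not a verbatim kernel excerpt). When frag0 is unset 
or too short, skb_gro_header_slow() falls back to pskb_may_pull(), and 
that is where the __pskb_pull_tail()/memcpy time in the profile above 
comes from:

static struct sk_buff **example_gro_receive(struct sk_buff **head,
                                            struct sk_buff *skb)
{
        const struct iphdr *iph;
        unsigned int off = skb_gro_offset(skb);
        unsigned int hlen = off + sizeof(*iph);

        iph = skb_gro_header_fast(skb, off);    /* frag0 fast path */
        if (skb_gro_header_hard(skb, hlen)) {
                /* frag0 unset or too short: pull the header into
                 * skb->head. This is the path that ends up in
                 * __pskb_pull_tail() and shows up as memcpy above. */
                iph = skb_gro_header_slow(skb, hlen, off);
                if (unlikely(!iph))
                        return NULL;
        }

        /* ... actual header checks and flow matching elided ... */
        return NULL;
}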

To demonstrate the call chains, let's see the function graph traced by 
ftrace.
Before the commit:

  13)               |  napi_gro_frags() {
  13)   0.038 us    |    skb_gro_reset_offset();
  13)               |    dev_gro_receive() {
  13)               |      inet_gro_receive() {
  13)               |        tcp4_gro_receive() {
  13)               |          tcp_gro_receive() {
  13)   0.059 us    |            skb_gro_receive();
  13)   0.385 us    |          }
  13)   0.675 us    |        }
  13)   1.012 us    |      }
  13)   1.336 us    |    }
  13)   0.040 us    |    napi_reuse_skb.isra.57();
  13)   2.235 us    |  }

After the commit:

   7)               |  napi_gro_frags() {
   7)               |    __pskb_pull_tail() {
   7)   0.204 us    |      skb_copy_bits();
   7)   0.551 us    |    }
   7)   0.046 us    |    eth_type_trans();
   7)               |    dev_gro_receive() {
   7)               |      inet_gro_receive() {
   7)               |        __pskb_pull_tail() {
   7)   0.095 us    |          skb_copy_bits();
   7)   0.412 us    |        }
   7)               |        tcp4_gro_receive() {
   7)               |          tcp_gro_receive() {
   7)               |            __pskb_pull_tail() {
   7)   0.095 us    |              skb_copy_bits();
   7)   0.410 us    |            }
   7)               |            __pskb_pull_tail() {
   7)   0.095 us    |              skb_copy_bits();
   7)   0.412 us    |            }
   7)   0.055 us    |            skb_gro_receive();
   7)   1.771 us    |          }
   7)   2.077 us    |        }
   7)   3.078 us    |      }
   7)   3.403 us    |    }
   7)   0.043 us    |    napi_reuse_skb.isra.72();
   7)   5.152 us    |  }

Here are typical times spent in napi_gro_frags(), measured with ftrace 
while tracing napi_gro_frags() as a leaf function (child functions are 
not traced, which avoids their tracing overhead).
Before the commit:

...
  12)   0.130 us    |  napi_gro_frags();
  12)   0.129 us    |  napi_gro_frags();
  12)   0.132 us    |  napi_gro_frags();
  12)   0.128 us    |  napi_gro_frags();
  12)   0.126 us    |  napi_gro_frags();
  12)   0.129 us    |  napi_gro_frags();
  12)   1.423 us    |  napi_gro_frags();
...

After the commit:

...
  20)   0.812 us    |  napi_gro_frags();
  20)   0.820 us    |  napi_gro_frags();
  20)   0.811 us    |  napi_gro_frags();
  20)   0.812 us    |  napi_gro_frags();
  20)   0.814 us    |  napi_gro_frags();
  20)   0.819 us    |  napi_gro_frags();
  20)   1.526 us    |  napi_gro_frags();
...

This should affect not just be2net, but all drivers that use 
napi_gro_frags().
Any suggestions on how to restore the frag0 optimization?

Thanks,
Michal


* Re: significant napi_gro_frags() slowdown
  2014-03-27 16:39 significant napi_gro_frags() slowdown Michal Schmidt
@ 2014-03-27 16:47 ` Eric Dumazet
  2014-03-27 17:05   ` Michal Schmidt
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-03-27 16:47 UTC (permalink / raw)
  To: Michal Schmidt; +Cc: netdev, Jerry Chu, Eric Dumazet, David S. Miller

On Thu, 2014-03-27 at 17:39 +0100, Michal Schmidt wrote:
> Hello,
> 
> I received a report about a performance regression on a simple workload 
> of receiving a single TCP stream using a be2net NIC. Previously the 
> reporter was able to reach more than 9 Gb/s, but now he sees less than 7 
> Gb/s.

Yeah, we are aware of this, and we had some discussions, but not yet got
a fix.

Are you running a 64 or 32bit kernel BTW ?


* Re: significant napi_gro_frags() slowdown
  2014-03-27 16:47 ` Eric Dumazet
@ 2014-03-27 17:05   ` Michal Schmidt
  2014-03-27 17:21     ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Schmidt @ 2014-03-27 17:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Eric Dumazet, David S. Miller

On 27.3.2014 17:47, Eric Dumazet wrote:
> On Thu, 2014-03-27 at 17:39 +0100, Michal Schmidt wrote:
>> I received a report about a performance regression on a simple workload
>> of receiving a single TCP stream using a be2net NIC. Previously the
>> reporter was able to reach more than 9 Gb/s, but now he sees less than 7
>> Gb/s.
>
> Yeah, we are aware of this, and we had some discussions, but not yet got
> a fix.
>
> Are you running a 64 or 32bit kernel BTW ?

Both my reporter and I are running 64bit kernels.

Thanks,
Michal


* Re: significant napi_gro_frags() slowdown
  2014-03-27 17:05   ` Michal Schmidt
@ 2014-03-27 17:21     ` Eric Dumazet
  2014-03-30  4:28       ` [PATCH net-next] net-gro: restore frag0 optimization Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-03-27 17:21 UTC (permalink / raw)
  To: Michal Schmidt; +Cc: netdev, Jerry Chu, Eric Dumazet, David S. Miller

On Thu, 2014-03-27 at 18:05 +0100, Michal Schmidt wrote:
> On 27.3.2014 17:47, Eric Dumazet wrote:
> > On Thu, 2014-03-27 at 17:39 +0100, Michal Schmidt wrote:
> >> I received a report about a performance regression on a simple workload
> >> of receiving a single TCP stream using a be2net NIC. Previously the
> >> reporter was able to reach more than 9 Gb/s, but now he sees less than 7
> >> Gb/s.
> >
> > Yeah, we are aware of this, and we had some discussions, but not yet got
> > a fix.
> >
> > Are you running a 64 or 32bit kernel BTW ?
> 
> Both my reporter and I are running 64bit kernels.

OK I'll cook a patch asap, thanks for the heads up ;) 


* [PATCH net-next] net-gro: restore frag0 optimization
  2014-03-27 17:21     ` Eric Dumazet
@ 2014-03-30  4:28       ` Eric Dumazet
  2014-03-31 20:27         ` David Miller
  2014-04-01 16:40         ` Michal Schmidt
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Dumazet @ 2014-03-30  4:28 UTC (permalink / raw)
  To: Michal Schmidt; +Cc: netdev, Eric Dumazet, David S. Miller, Jerry Chu

From: Eric Dumazet <edumazet@google.com>

The main difference between napi_frags_skb() and napi_gro_receive() is
that the latter is called after the Ethernet header has already been
pulled by the NIC driver (eth_type_trans() was called before
napi_gro_receive()).

Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the
upcoming tunneling support") tried to remove this difference by calling
eth_type_trans() from napi_frags_skb() instead of doing this later from
napi_frags_finish().

The goal was that napi_gro_complete() could call
ptype->callbacks.gro_complete(skb, 0) (offset of the first network
header = 0).

Also, the xxx_gro_receive() handlers all use off = skb_gro_offset(skb)
to point to their own header, for the current skb and the ones held in
gro_list.

The problem is that this cleanup work defeated the frag0 optimization:
it turns out the consecutive pskb_may_pull() calls are too expensive.

This patch brings back the frag0 stuff in napi_frags_skb().

As all skbs now have their MAC header in the skb head, we no longer
need skb_gro_mac_header().

Reported-by: Michal Schmidt <mschmidt@redhat.com>
Fixes: 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
---
David, this patch is definitely risky; I targeted net-next for staging
it. I tested it with an mlx4 NIC.

BTW, I believe the BUG_ON(skb->end - skb->tail < grow); could trigger
if the headers are bigger than what the driver allocated for skb->head.

After the encapsulation support added by Jerry, this might happen?

We probably need to tweak skb_gro_reset_offset() to cap frag0_len and
force the slow path. I'll cook a patch.
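
For reference, a minimal sketch of the kind of frag0_len cap described
above, assuming the 3.14-era shape of skb_gro_reset_offset() in
net/core/dev.c; this illustrates the idea only and is not the actual
follow-up patch:

static void skb_gro_reset_offset(struct sk_buff *skb)
{
        const struct skb_shared_info *pinfo = skb_shinfo(skb);
        const skb_frag_t *frag0 = &pinfo->frags[0];

        NAPI_GRO_CB(skb)->data_offset = 0;
        NAPI_GRO_CB(skb)->frag0 = NULL;
        NAPI_GRO_CB(skb)->frag0_len = 0;

        if (skb_mac_header(skb) == skb_tail_pointer(skb) &&
            pinfo->nr_frags &&
            !PageHighMem(skb_frag_page(frag0))) {
                NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
                /* Illustration: cap frag0_len to the room left in
                 * skb->head, so oversized headers take the slow path
                 * instead of hitting the BUG_ON() in
                 * gro_pull_from_frag0(). */
                NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
                                                    skb_frag_size(frag0),
                                                    skb->end - skb->tail);
        }
}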

 include/linux/netdevice.h |    5 -
 net/core/dev.c            |   96 ++++++++++++++++++++++++------------
 2 files changed, 64 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 06287c110241..4a99bf43d619 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2018,11 +2018,6 @@ static inline void *skb_gro_header_slow(struct sk_buff *skb, unsigned int hlen,
 	return skb->data + offset;
 }
 
-static inline void *skb_gro_mac_header(struct sk_buff *skb)
-{
-	return NAPI_GRO_CB(skb)->frag0 ?: skb_mac_header(skb);
-}
-
 static inline void *skb_gro_network_header(struct sk_buff *skb)
 {
 	return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
diff --git a/net/core/dev.c b/net/core/dev.c
index cf92139b229c..039681b5402c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3833,10 +3833,10 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 		diffs |= p->vlan_tci ^ skb->vlan_tci;
 		if (maclen == ETH_HLEN)
 			diffs |= compare_ether_header(skb_mac_header(p),
-						      skb_gro_mac_header(skb));
+						      skb_mac_header(skb));
 		else if (!diffs)
 			diffs = memcmp(skb_mac_header(p),
-				       skb_gro_mac_header(skb),
+				       skb_mac_header(skb),
 				       maclen);
 		NAPI_GRO_CB(p)->same_flow = !diffs;
 	}
@@ -3859,6 +3859,27 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
 	}
 }
 
+static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
+{
+	struct skb_shared_info *pinfo = skb_shinfo(skb);
+
+	BUG_ON(skb->end - skb->tail < grow);
+
+	memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
+
+	skb->data_len -= grow;
+	skb->tail += grow;
+
+	pinfo->frags[0].page_offset += grow;
+	skb_frag_size_sub(&pinfo->frags[0], grow);
+
+	if (unlikely(!skb_frag_size(&pinfo->frags[0]))) {
+		skb_frag_unref(skb, 0);
+		memmove(pinfo->frags, pinfo->frags + 1,
+			--pinfo->nr_frags * sizeof(pinfo->frags[0]));
+	}
+}
+
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
 	struct sk_buff **pp = NULL;
@@ -3867,6 +3888,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	struct list_head *head = &offload_base;
 	int same_flow;
 	enum gro_result ret;
+	int grow;
 
 	if (!(skb->dev->features & NETIF_F_GRO))
 		goto normal;
@@ -3874,7 +3896,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	if (skb_is_gso(skb) || skb_has_frag_list(skb))
 		goto normal;
 
-	skb_gro_reset_offset(skb);
 	gro_list_prepare(napi, skb);
 	NAPI_GRO_CB(skb)->csum = skb->csum; /* Needed for CHECKSUM_COMPLETE */
 
@@ -3938,27 +3959,9 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	ret = GRO_HELD;
 
 pull:
-	if (skb_headlen(skb) < skb_gro_offset(skb)) {
-		int grow = skb_gro_offset(skb) - skb_headlen(skb);
-
-		BUG_ON(skb->end - skb->tail < grow);
-
-		memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
-
-		skb->tail += grow;
-		skb->data_len -= grow;
-
-		skb_shinfo(skb)->frags[0].page_offset += grow;
-		skb_frag_size_sub(&skb_shinfo(skb)->frags[0], grow);
-
-		if (unlikely(!skb_frag_size(&skb_shinfo(skb)->frags[0]))) {
-			skb_frag_unref(skb, 0);
-			memmove(skb_shinfo(skb)->frags,
-				skb_shinfo(skb)->frags + 1,
-				--skb_shinfo(skb)->nr_frags * sizeof(skb_frag_t));
-		}
-	}
-
+	grow = skb_gro_offset(skb) - skb_headlen(skb);
+	if (grow > 0)
+		gro_pull_from_frag0(skb, grow);
 ok:
 	return ret;
 
@@ -4026,6 +4029,8 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
 	trace_napi_gro_receive_entry(skb);
 
+	skb_gro_reset_offset(skb);
+
 	return napi_skb_finish(dev_gro_receive(napi, skb), skb);
 }
 EXPORT_SYMBOL(napi_gro_receive);
@@ -4054,12 +4059,16 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_get_frags);
 
-static gro_result_t napi_frags_finish(struct napi_struct *napi, struct sk_buff *skb,
-			       gro_result_t ret)
+static gro_result_t napi_frags_finish(struct napi_struct *napi,
+				      struct sk_buff *skb,
+				      gro_result_t ret)
 {
 	switch (ret) {
 	case GRO_NORMAL:
-		if (netif_receive_skb_internal(skb))
+	case GRO_HELD:
+		__skb_push(skb, ETH_HLEN);
+		skb->protocol = eth_type_trans(skb, skb->dev);
+		if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
 			ret = GRO_DROP;
 		break;
 
@@ -4068,7 +4077,6 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi, struct sk_buff *
 		napi_reuse_skb(napi, skb);
 		break;
 
-	case GRO_HELD:
 	case GRO_MERGED:
 		break;
 	}
@@ -4076,17 +4084,41 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi, struct sk_buff *
 	return ret;
 }
 
+/* Upper GRO stack assumes network header starts at gro_offset=0
+ * Drivers could call both napi_gro_frags() and napi_gro_receive()
+ * We copy ethernet header into skb->data to have a common layout.
+ */
 static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 {
 	struct sk_buff *skb = napi->skb;
+	const struct ethhdr *eth;
+	unsigned int hlen = sizeof(*eth);
 
 	napi->skb = NULL;
 
-	if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr)))) {
-		napi_reuse_skb(napi, skb);
-		return NULL;
+	skb_reset_mac_header(skb);
+	skb_gro_reset_offset(skb);
+
+	eth = skb_gro_header_fast(skb, 0);
+	if (unlikely(skb_gro_header_hard(skb, hlen))) {
+		eth = skb_gro_header_slow(skb, hlen, 0);
+		if (unlikely(!eth)) {
+			napi_reuse_skb(napi, skb);
+			return NULL;
+		}
+	} else {
+		gro_pull_from_frag0(skb, hlen);
+		NAPI_GRO_CB(skb)->frag0 += hlen;
+		NAPI_GRO_CB(skb)->frag0_len -= hlen;
 	}
-	skb->protocol = eth_type_trans(skb, skb->dev);
+	__skb_pull(skb, hlen);
+
+	/*
+	 * This works because the only protocols we care about don't require
+	 * special handling.
+	 * We'll fix it up properly in napi_frags_finish()
+	 */
+	skb->protocol = eth->h_proto;
 
 	return skb;
 }


* Re: [PATCH net-next] net-gro: restore frag0 optimization
  2014-03-30  4:28       ` [PATCH net-next] net-gro: restore frag0 optimization Eric Dumazet
@ 2014-03-31 20:27         ` David Miller
  2014-03-31 21:01           ` Eric Dumazet
  2014-04-01 16:40         ` Michal Schmidt
  1 sibling, 1 reply; 10+ messages in thread
From: David Miller @ 2014-03-31 20:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: mschmidt, netdev, edumazet, hkchu

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 29 Mar 2014 21:28:21 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Main difference between napi_frags_skb() and napi_gro_receive() is that
> the later is called while ethernet header was already pulled by the NIC
> driver (eth_type_trans() was called before napi_gro_receive())
> 
> Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the
> upcoming tunneling support") tried to remove this difference by calling
> eth_type_trans() from napi_frags_skb() instead of doing this later from
> napi_frags_finish()
> 
> Goal was that napi_gro_complete() could call 
> ptype->callbacks.gro_complete(skb, 0)  (offset of first network header =
> 0)
> 
> Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to
> point to their own header, for the current skb and ones held in gro_list
> 
> Problem is this cleanup work defeated the frag0 optimization:
> It turns out the consecutive pskb_may_pull() calls are too expensive.
> 
> This patch brings back the frag0 stuff in napi_frags_skb().
> 
> As all skb have their mac header in skb head, we no longer need
> skb_gro_mac_header()
> 
> Reported-by: Michal Schmidt <mschmidt@redhat.com>
> Fixes: 299603e8370a ("net-gro: Prepare GRO stack for the upcoming tunneling support")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jerry Chu <hkchu@google.com>
> ---
> David, this patch is definitely risky, I targeted net-next for staging
> it. I tested it with mlx4 NIC.
> 
> BTW, I believe the BUG_ON(skb->end - skb->tail < grow); could trigger
> if the headers are bigger than what the driver allocated for skb->head.
> 
> After encapsulation support added by Jerry, this might happen ?
> 
> We probably need to tweak skb_gro_reset_offset() to cap frag0_len to
> force calling slow path. I'll cook a patch.

Ok, applied, thanks for taking care of this Eric.

We can send to -stable after it's gotten a lot of testing and
auditing.


* Re: [PATCH net-next] net-gro: restore frag0 optimization
  2014-03-31 20:27         ` David Miller
@ 2014-03-31 21:01           ` Eric Dumazet
  2014-03-31 21:19             ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-03-31 21:01 UTC (permalink / raw)
  To: David Miller; +Cc: mschmidt, netdev, edumazet, hkchu

On Mon, 2014-03-31 at 16:27 -0400, David Miller wrote:

> Ok, applied, thanks for taking care of this Eric.
> 
> We can send to -stable after it's gotten a lot of testing and
> auditing.

Thanks David

BTW, I found an off-by-one error in skb_gro_header_hard()

TCP pure ACK packets always hit the 'slow' path, if I understand this
correctly.

I'll formally submit the following when I have tested it.


diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 775cc956ff78..7484d6f3d6cb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1998,9 +1998,10 @@ static inline void *skb_gro_header_fast(struct sk_buff *skb,
 	return NAPI_GRO_CB(skb)->frag0 + offset;
 }
 
-static inline int skb_gro_header_hard(struct sk_buff *skb, unsigned int hlen)
+static inline bool skb_gro_header_hard(const struct sk_buff *skb,
+				       unsigned int hlen)
 {
-	return NAPI_GRO_CB(skb)->frag0_len < hlen;
+	return NAPI_GRO_CB(skb)->frag0_len <= hlen;
 }
 
 static inline void *skb_gro_header_slow(struct sk_buff *skb, unsigned int hlen,


* Re: [PATCH net-next] net-gro: restore frag0 optimization
  2014-03-31 21:01           ` Eric Dumazet
@ 2014-03-31 21:19             ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2014-03-31 21:19 UTC (permalink / raw)
  To: David Miller; +Cc: mschmidt, netdev, edumazet, hkchu

On Mon, 2014-03-31 at 14:01 -0700, Eric Dumazet wrote:

> 
> BTW, I found a off-by-one error in skb_gro_header_hard()


>  
> -static inline int skb_gro_header_hard(struct sk_buff *skb, unsigned int hlen)
> +static inline bool skb_gro_header_hard(const struct sk_buff *skb,
> +				       unsigned int hlen)
>  {
> -	return NAPI_GRO_CB(skb)->frag0_len < hlen;
> +	return NAPI_GRO_CB(skb)->frag0_len <= hlen;
>  }


Oh well, scratch this. This strange function name fooled me again.

bool slowpath = hlen > NAPI_GRO_CB(skb)->frag0_len;


* Re: [PATCH net-next] net-gro: restore frag0 optimization
  2014-03-30  4:28       ` [PATCH net-next] net-gro: restore frag0 optimization Eric Dumazet
  2014-03-31 20:27         ` David Miller
@ 2014-04-01 16:40         ` Michal Schmidt
  2014-04-01 17:16           ` Eric Dumazet
  1 sibling, 1 reply; 10+ messages in thread
From: Michal Schmidt @ 2014-04-01 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Eric Dumazet, David S. Miller, Jerry Chu

On 03/30/2014 06:28 AM, Eric Dumazet wrote:
> Main difference between napi_frags_skb() and napi_gro_receive() is that
> the later is called while ethernet header was already pulled by the NIC
> driver (eth_type_trans() was called before napi_gro_receive())
> 
> Jerry Chu in commit 299603e8370a ("net-gro: Prepare GRO stack for the
> upcoming tunneling support") tried to remove this difference by calling
> eth_type_trans() from napi_frags_skb() instead of doing this later from
> napi_frags_finish()
> 
> Goal was that napi_gro_complete() could call 
> ptype->callbacks.gro_complete(skb, 0)  (offset of first network header =
> 0)
> 
> Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to
> point to their own header, for the current skb and ones held in gro_list
> 
> Problem is this cleanup work defeated the frag0 optimization:
> It turns out the consecutive pskb_may_pull() calls are too expensive.
> 
> This patch brings back the frag0 stuff in napi_frags_skb().
> 
> As all skb have their mac header in skb head, we no longer need
> skb_gro_mac_header()

Eric,
thank you. The patch improves performance, though it's still not as
fast as it was before the commit "net-gro: Prepare GRO stack for the
upcoming tunneling support". In repeated netperf runs my reporter now
sees occasional results above 9 Gb/s, but on average it's only 7 Gb/s.

With your patch, only the Ethernet headers (and not the other headers)
are copied into the skbs' heads, but this is done for all skbs.
Previously (before Jerry's patch) no copying was needed for skbs that
were GRO_MERGED. Is this correct?

Regards,
Michal


* Re: [PATCH net-next] net-gro: restore frag0 optimization
  2014-04-01 16:40         ` Michal Schmidt
@ 2014-04-01 17:16           ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2014-04-01 17:16 UTC (permalink / raw)
  To: Michal Schmidt; +Cc: netdev, Eric Dumazet, David S. Miller, Jerry Chu

On Tue, 2014-04-01 at 18:40 +0200, Michal Schmidt wrote:

> Eric,
> thank you. The patch improves the performance. Though it's still not as
> fast as it was before the commit "net-gro: Prepare GRO stack for the
> upcoming tunneling support". In repeated netperf runs my reporter now
> sees occasional results above 9 Gb/s, but on average it's only 7 Gb/s.
> 
> With your patch only Ethernet headers (and not other headers) are copied
> into skbs' heads, but this is done for all skbs. Previously (before
> Jerry's patch) no copying was needed for skbs that were GRO_MERGED.
> Is this correct?

Note that TCP performance of a single flow is very dependent on
various factors, especially on a NUMA platform.

For example, consuming less CPU can have a side effect: it allows the
CPU to enter deeper sleep states during idle periods.

If the memcpy(xxx, yyy, 14) shows up as a problem, it means we would
have hit the issue later anyway.

That's why some drivers do a prefetch(frame) well before calling
napi_gro_frags().
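
As a rough illustration of that prefetch pattern (hypothetical driver
code, not taken from be2net or any other specific driver; the RX
descriptor handling and sizes are simplified):

static void example_rx_frame(struct napi_struct *napi, struct page *page,
                             unsigned int offset, unsigned int len)
{
        struct sk_buff *skb = napi_get_frags(napi);

        if (unlikely(!skb))
                return;

        /* Warm the cache line holding the Ethernet/IP headers early, so
         * the GRO stack (and the small header memcpy) does not stall on
         * it. Real drivers issue this as soon as the RX descriptor has
         * been read, well before building the skb. */
        prefetch(page_address(page) + offset);

        skb_fill_page_desc(skb, 0, page, offset, len);
        skb->len += len;
        skb->data_len += len;
        skb->truesize += len;

        napi_gro_frags(napi);
}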

Sure, the GRO stack is a bit more complex because it now handles
encapsulation.

So is the GSO stack, and we are OK with that because it improves the
general use cases in the DC/virtualization world.

Make sure everything is set up properly (irq smp_affinity), try using
multiple TCP flows, and make sure this is not a sender issue.


