All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag
@ 2012-04-27 10:37 Eric Dumazet
  2012-04-30 17:54 ` Eric Dumazet
  2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-04-27 10:37 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself

This avoids the frag_list fallback, and permits to build true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces number of cache misses when user makes its copy, since a
single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
---
 include/linux/netdevice.h |    2 ++
 include/linux/skbuff.h    |    1 +
 net/core/dev.c            |    5 ++++-
 net/core/skbuff.c         |   27 ++++++++++++++++++++++++++-

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e0b70e9..7f377fb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1509,6 +1509,8 @@ struct napi_gro_cb {
 
 	/* Free the skb? */
 	int free;
+#define NAPI_GRO_FREE		  1
+#define NAPI_GRO_FREE_STOLEN_HEAD 2
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9d28a22..2c75e98 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -561,6 +561,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
+extern struct kmem_cache *skbuff_head_cache;
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
 extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
diff --git a/net/core/dev.c b/net/core/dev.c
index 501f3cc..a2be59f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3546,7 +3546,10 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 		break;
 
 	case GRO_MERGED_FREE:
-		consume_skb(skb);
+		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+			kmem_cache_free(skbuff_head_cache, skb);
+		else
+			__kfree_skb(skb);
 		break;
 
 	case GRO_HELD:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index effa75d..2ad1ee7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,7 +69,7 @@
 #include <trace/events/skb.h>
 #include <linux/highmem.h>
 
-static struct kmem_cache *skbuff_head_cache __read_mostly;
+struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
@@ -2901,6 +2901,31 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 		NAPI_GRO_CB(skb)->free = 1;
 		goto done;
+	} else if (skb->head_frag) {
+		int nr_frags = pinfo->nr_frags;
+		skb_frag_t *frag = pinfo->frags + nr_frags;
+		struct page *page = virt_to_head_page(skb->head);
+		unsigned int first_size = headlen - offset;
+		unsigned int first_offset;
+
+		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+			return -E2BIG;
+
+		first_offset = skb->head -
+			       (unsigned char *)page_address(page) +
+			       offset;
+
+		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
+
+		frag->page.p	  = page;
+		frag->page_offset = first_offset;
+		skb_frag_size_set(frag, first_size);
+
+		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
+		/* We dont need to clear skbinfo->nr_frags here */
+
+		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
+		goto done;
 	} else if (skb_gro_len(p) != pinfo->gso_size)
 		return -E2BIG;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag
  2012-04-27 10:37 [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag Eric Dumazet
@ 2012-04-30 17:54 ` Eric Dumazet
  2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-04-30 17:54 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Fri, 2012-04-27 at 12:37 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
...

> +		first_offset = skb->head -
> +			       (unsigned char *)page_address(page) +
> +			       offset;
> +
> +		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
> +

Oops. I realize this is a previous version of this patch.  

skb->head should be changed by skb->data here, I'll post a v2 for this
GRO change.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-04-27 10:37 [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag Eric Dumazet
  2012-04-30 17:54 ` Eric Dumazet
@ 2012-04-30 18:10 ` Eric Dumazet
  2012-04-30 23:36   ` Alexander Duyck
  2012-05-01  1:48   ` [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag David Miller
  1 sibling, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-04-30 18:10 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

GRO can check if skb to be merged has its skb->head mapped to a page
fragment, instead of a kmalloc() area.

We 'upgrade' skb->head as a fragment in itself

This avoids the frag_list fallback, and permits to build true GRO skb
(one sk_buff and up to 16 fragments), using less memory.

This reduces number of cache misses when user makes its copy, since a
single sk_buff is fetched.

This is a followup of patch "net: allow skb->head to be a page fragment"

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Matt Carlson <mcarlson@broadcom.com>
Cc: Michael Chan <mchan@broadcom.com>
---

v2: change skb->head by skb->data to compute correct first_offset in
frag

 include/linux/netdevice.h |    2 ++
 include/linux/skbuff.h    |    1 +
 net/core/dev.c            |    5 ++++-
 net/core/skbuff.c         |   27 ++++++++++++++++++++++++++-

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e0b70e9..7f377fb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1509,6 +1509,8 @@ struct napi_gro_cb {
 
 	/* Free the skb? */
 	int free;
+#define NAPI_GRO_FREE		  1
+#define NAPI_GRO_FREE_STOLEN_HEAD 2
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9d28a22..2c75e98 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -561,6 +561,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
+extern struct kmem_cache *skbuff_head_cache;
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
 extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
diff --git a/net/core/dev.c b/net/core/dev.c
index 501f3cc..a2be59f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3546,7 +3546,10 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 		break;
 
 	case GRO_MERGED_FREE:
-		consume_skb(skb);
+		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+			kmem_cache_free(skbuff_head_cache, skb);
+		else
+			__kfree_skb(skb);
 		break;
 
 	case GRO_HELD:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index effa75d..2ad1ee7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,7 +69,7 @@
 #include <trace/events/skb.h>
 #include <linux/highmem.h>
 
-static struct kmem_cache *skbuff_head_cache __read_mostly;
+struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
@@ -2901,6 +2901,31 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 
 		NAPI_GRO_CB(skb)->free = 1;
 		goto done;
+	} else if (skb->head_frag) {
+		int nr_frags = pinfo->nr_frags;
+		skb_frag_t *frag = pinfo->frags + nr_frags;
+		struct page *page = virt_to_head_page(skb->head);
+		unsigned int first_size = headlen - offset;
+		unsigned int first_offset;
+
+		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
+			return -E2BIG;
+
+		first_offset = skb->data -
+			       (unsigned char *)page_address(page) +
+			       offset;
+
+		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
+
+		frag->page.p	  = page;
+		frag->page_offset = first_offset;
+		skb_frag_size_set(frag, first_size);
+
+		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
+		/* We dont need to clear skbinfo->nr_frags here */
+
+		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
+		goto done;
 	} else if (skb_gro_len(p) != pinfo->gso_size)
 		return -E2BIG;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
@ 2012-04-30 23:36   ` Alexander Duyck
  2012-05-01  1:27     ` Eric Dumazet
  2012-05-01  1:48   ` [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag David Miller
  1 sibling, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-04-30 23:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Tom Herbert, Jeff Kirsher,
	Michael Chan, Matt Carlson, Herbert Xu, Ben Hutchings,
	Ilpo Järvinen, Maciej Żenczykowski

On 04/30/2012 11:10 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> GRO can check if skb to be merged has its skb->head mapped to a page
> fragment, instead of a kmalloc() area.
>
> We 'upgrade' skb->head as a fragment in itself
>
> This avoids the frag_list fallback, and permits to build true GRO skb
> (one sk_buff and up to 16 fragments), using less memory.
>
> This reduces number of cache misses when user makes its copy, since a
> single sk_buff is fetched.
>
> This is a followup of patch "net: allow skb->head to be a page fragment"
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: Ben Hutchings <bhutchings@solarflare.com>
> Cc: Matt Carlson <mcarlson@broadcom.com>
> Cc: Michael Chan <mchan@broadcom.com>
> ---
>
> v2: change skb->head by skb->data to compute correct first_offset in
> frag
>
>  include/linux/netdevice.h |    2 ++
>  include/linux/skbuff.h    |    1 +
>  net/core/dev.c            |    5 ++++-
>  net/core/skbuff.c         |   27 ++++++++++++++++++++++++++-
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index e0b70e9..7f377fb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1509,6 +1509,8 @@ struct napi_gro_cb {
>  
>  	/* Free the skb? */
>  	int free;
> +#define NAPI_GRO_FREE		  1
> +#define NAPI_GRO_FREE_STOLEN_HEAD 2
>  };
>  
>  #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 9d28a22..2c75e98 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -561,6 +561,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
>  extern void kfree_skb(struct sk_buff *skb);
>  extern void consume_skb(struct sk_buff *skb);
>  extern void	       __kfree_skb(struct sk_buff *skb);
> +extern struct kmem_cache *skbuff_head_cache;
>  extern struct sk_buff *__alloc_skb(unsigned int size,
>  				   gfp_t priority, int fclone, int node);
>  extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 501f3cc..a2be59f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3546,7 +3546,10 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
>  		break;
>  
>  	case GRO_MERGED_FREE:
> -		consume_skb(skb);
> +		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
> +			kmem_cache_free(skbuff_head_cache, skb);
> +		else
> +			__kfree_skb(skb);
>  		break;
>  
>  	case GRO_HELD:
How can you go from consume_skb() which checks for skb->users to a case
that doesn't, or are you guaranteed that there is no way to get here
with skb->users greater than 1?

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index effa75d..2ad1ee7 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -69,7 +69,7 @@
>  #include <trace/events/skb.h>
>  #include <linux/highmem.h>
>  
> -static struct kmem_cache *skbuff_head_cache __read_mostly;
> +struct kmem_cache *skbuff_head_cache __read_mostly;
>  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>  
>  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> @@ -2901,6 +2901,31 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>  
>  		NAPI_GRO_CB(skb)->free = 1;
>  		goto done;
> +	} else if (skb->head_frag) {
> +		int nr_frags = pinfo->nr_frags;
> +		skb_frag_t *frag = pinfo->frags + nr_frags;
> +		struct page *page = virt_to_head_page(skb->head);
> +		unsigned int first_size = headlen - offset;
> +		unsigned int first_offset;
> +
> +		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
> +			return -E2BIG;
> +
> +		first_offset = skb->data -
> +			       (unsigned char *)page_address(page) +
> +			       offset;
> +
> +		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
> +
> +		frag->page.p	  = page;
> +		frag->page_offset = first_offset;
> +		skb_frag_size_set(frag, first_size);
> +
> +		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
> +		/* We dont need to clear skbinfo->nr_frags here */
> +
> +		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
> +		goto done;
>  	} else if (skb_gro_len(p) != pinfo->gso_size)
>  		return -E2BIG;
>  
>
>
Maybe I missed something, but shouldn't you be checking skb->cloned, and
skb_shinfo()->dataref before you can consider just dropping the
sk_buff?  It seems like if you are sharing the frag with a clone you
would have to retain the skb->head so that you can track the dataref. 
Otherwise you will likely cause issues because the stack could end up
freeing the sk_buff, or the GRO frame will be capable of calling
put_page and freeing the page out from under the clone.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-04-30 23:36   ` Alexander Duyck
@ 2012-05-01  1:27     ` Eric Dumazet
  2012-05-01  5:33       ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-01  1:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, netdev, Neal Cardwell, Tom Herbert, Jeff Kirsher,
	Michael Chan, Matt Carlson, Herbert Xu, Ben Hutchings,
	Ilpo Järvinen, Maciej Żenczykowski

On Mon, 2012-04-30 at 16:36 -0700, Alexander Duyck wrote:
> On 04/30/2012 11:10 AM, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > GRO can check if skb to be merged has its skb->head mapped to a page
> > fragment, instead of a kmalloc() area.
> >
> > We 'upgrade' skb->head as a fragment in itself
> >
> > This avoids the frag_list fallback, and permits to build true GRO skb
> > (one sk_buff and up to 16 fragments), using less memory.
> >
> > This reduces number of cache misses when user makes its copy, since a
> > single sk_buff is fetched.
> >
> > This is a followup of patch "net: allow skb->head to be a page fragment"
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Maciej Żenczykowski <maze@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Tom Herbert <therbert@google.com>
> > Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > Cc: Ben Hutchings <bhutchings@solarflare.com>
> > Cc: Matt Carlson <mcarlson@broadcom.com>
> > Cc: Michael Chan <mchan@broadcom.com>
> > ---
> >
> > v2: change skb->head by skb->data to compute correct first_offset in
> > frag
> >
> >  include/linux/netdevice.h |    2 ++
> >  include/linux/skbuff.h    |    1 +
> >  net/core/dev.c            |    5 ++++-
> >  net/core/skbuff.c         |   27 ++++++++++++++++++++++++++-
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index e0b70e9..7f377fb 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1509,6 +1509,8 @@ struct napi_gro_cb {
> >  
> >  	/* Free the skb? */
> >  	int free;
> > +#define NAPI_GRO_FREE		  1
> > +#define NAPI_GRO_FREE_STOLEN_HEAD 2
> >  };
> >  
> >  #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 9d28a22..2c75e98 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -561,6 +561,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
> >  extern void kfree_skb(struct sk_buff *skb);
> >  extern void consume_skb(struct sk_buff *skb);
> >  extern void	       __kfree_skb(struct sk_buff *skb);
> > +extern struct kmem_cache *skbuff_head_cache;
> >  extern struct sk_buff *__alloc_skb(unsigned int size,
> >  				   gfp_t priority, int fclone, int node);
> >  extern struct sk_buff *build_skb(void *data, unsigned int frag_size);
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 501f3cc..a2be59f 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3546,7 +3546,10 @@ gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
> >  		break;
> >  
> >  	case GRO_MERGED_FREE:
> > -		consume_skb(skb);
> > +		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
> > +			kmem_cache_free(skbuff_head_cache, skb);
> > +		else
> > +			__kfree_skb(skb);
> >  		break;
> >  
> >  	case GRO_HELD:
> How can you go from consume_skb() which checks for skb->users to a case
> that doesn't, or are you guaranteed that there is no way to get here
> with skb->users greater than 1?
> 

GRO runs with skbs delivered from NIC rx handlers, and they MUST have
skb->users == 1.

Thinks GRO is the first layer above drivers. We own each skb.

> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index effa75d..2ad1ee7 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -69,7 +69,7 @@
> >  #include <trace/events/skb.h>
> >  #include <linux/highmem.h>
> >  
> > -static struct kmem_cache *skbuff_head_cache __read_mostly;
> > +struct kmem_cache *skbuff_head_cache __read_mostly;
> >  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> >  
> >  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> > @@ -2901,6 +2901,31 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
> >  
> >  		NAPI_GRO_CB(skb)->free = 1;
> >  		goto done;
> > +	} else if (skb->head_frag) {
> > +		int nr_frags = pinfo->nr_frags;
> > +		skb_frag_t *frag = pinfo->frags + nr_frags;
> > +		struct page *page = virt_to_head_page(skb->head);
> > +		unsigned int first_size = headlen - offset;
> > +		unsigned int first_offset;
> > +
> > +		if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
> > +			return -E2BIG;
> > +
> > +		first_offset = skb->data -
> > +			       (unsigned char *)page_address(page) +
> > +			       offset;
> > +
> > +		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
> > +
> > +		frag->page.p	  = page;
> > +		frag->page_offset = first_offset;
> > +		skb_frag_size_set(frag, first_size);
> > +
> > +		memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
> > +		/* We dont need to clear skbinfo->nr_frags here */
> > +
> > +		NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
> > +		goto done;
> >  	} else if (skb_gro_len(p) != pinfo->gso_size)
> >  		return -E2BIG;
> >  
> >
> >
> Maybe I missed something, but shouldn't you be checking skb->cloned, and
> skb_shinfo()->dataref before you can consider just dropping the
> sk_buff?  It seems like if you are sharing the frag with a clone you
> would have to retain the skb->head so that you can track the dataref. 
> Otherwise you will likely cause issues because the stack could end up
> freeing the sk_buff, or the GRO frame will be capable of calling
> put_page and freeing the page out from under the clone.
> 

Is it a general question, or specific to this patch ?

If its a general problem, we already check dataref where appropriate.
Fact that skb->head is a kmalloc() or frag doesnt matter.

If specific to GRO, see my first answer. GRO owns each skb.

adding a BUG() would make no sense here.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
  2012-04-30 23:36   ` Alexander Duyck
@ 2012-05-01  1:48   ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2012-05-01  1:48 UTC (permalink / raw)
  To: eric.dumazet
  Cc: netdev, ncardwell, therbert, jeffrey.t.kirsher, mchan, mcarlson,
	herbert, bhutchings, ilpo.jarvinen, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 30 Apr 2012 20:10:34 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> GRO can check if skb to be merged has its skb->head mapped to a page
> fragment, instead of a kmalloc() area.
> 
> We 'upgrade' skb->head as a fragment in itself
> 
> This avoids the frag_list fallback, and permits to build true GRO skb
> (one sk_buff and up to 16 fragments), using less memory.
> 
> This reduces number of cache misses when user makes its copy, since a
> single sk_buff is fetched.
> 
> This is a followup of patch "net: allow skb->head to be a page fragment"
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01  1:27     ` Eric Dumazet
@ 2012-05-01  5:33       ` Alexander Duyck
  2012-05-01  6:39         ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-01  5:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Mon, Apr 30, 2012 at 6:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2012-04-30 at 16:36 -0700, Alexander Duyck wrote:
>> On 04/30/2012 11:10 AM, Eric Dumazet wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > GRO can check if skb to be merged has its skb->head mapped to a page
>> > fragment, instead of a kmalloc() area.
>> >
>> > We 'upgrade' skb->head as a fragment in itself
>> >
>> > This avoids the frag_list fallback, and permits to build true GRO skb
>> > (one sk_buff and up to 16 fragments), using less memory.
>> >
>> > This reduces number of cache misses when user makes its copy, since a
>> > single sk_buff is fetched.
>> >
>> > This is a followup of patch "net: allow skb->head to be a page fragment"
>> >

[...]

>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index effa75d..2ad1ee7 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -69,7 +69,7 @@
>> >  #include <trace/events/skb.h>
>> >  #include <linux/highmem.h>
>> >
>> > -static struct kmem_cache *skbuff_head_cache __read_mostly;
>> > +struct kmem_cache *skbuff_head_cache __read_mostly;
>> >  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>> >
>> >  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
>> > @@ -2901,6 +2901,31 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>> >
>> >             NAPI_GRO_CB(skb)->free = 1;
>> >             goto done;
>> > +   } else if (skb->head_frag) {
>> > +           int nr_frags = pinfo->nr_frags;
>> > +           skb_frag_t *frag = pinfo->frags + nr_frags;
>> > +           struct page *page = virt_to_head_page(skb->head);
>> > +           unsigned int first_size = headlen - offset;
>> > +           unsigned int first_offset;
>> > +
>> > +           if (nr_frags + 1 + skbinfo->nr_frags > MAX_SKB_FRAGS)
>> > +                   return -E2BIG;
>> > +
>> > +           first_offset = skb->data -
>> > +                          (unsigned char *)page_address(page) +
>> > +                          offset;
>> > +
>> > +           pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
>> > +
>> > +           frag->page.p      = page;
>> > +           frag->page_offset = first_offset;
>> > +           skb_frag_size_set(frag, first_size);
>> > +
>> > +           memcpy(frag + 1, skbinfo->frags, sizeof(*frag) * skbinfo->nr_frags);
>> > +           /* We dont need to clear skbinfo->nr_frags here */
>> > +
>> > +           NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
>> > +           goto done;
>> >     } else if (skb_gro_len(p) != pinfo->gso_size)
>> >             return -E2BIG;
>> >
>> >
>> >
>> Maybe I missed something, but shouldn't you be checking skb->cloned, and
>> skb_shinfo()->dataref before you can consider just dropping the
>> sk_buff?  It seems like if you are sharing the frag with a clone you
>> would have to retain the skb->head so that you can track the dataref.
>> Otherwise you will likely cause issues because the stack could end up
>> freeing the sk_buff, or the GRO frame will be capable of calling
>> put_page and freeing the page out from under the clone.
>>
>
> Is it a general question, or specific to this patch ?
>
> If its a general problem, we already check dataref where appropriate.
> Fact that skb->head is a kmalloc() or frag doesnt matter.
>
> If specific to GRO, see my first answer. GRO owns each skb.
>
> adding a BUG() would make no sense here.

The question I had was more specific to GRO.  As long as we have
skb->users == 1 and the skb isn't cloned we should be fine.   It just
hadn't occurred to me before that napi_gro_receive had the extra
requirement that the skb couldn't be cloned.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01  5:33       ` Alexander Duyck
@ 2012-05-01  6:39         ` Eric Dumazet
  2012-05-01 16:17           ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-01  6:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:

> The question I had was more specific to GRO.  As long as we have
> skb->users == 1 and the skb isn't cloned we should be fine.   It just
> hadn't occurred to me before that napi_gro_receive had the extra
> requirement that the skb couldn't be cloned.
> 

OK

By the way, even if skb was cloned, we would be allowed to steal
skb->head.

When we clone an oskb we :

1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
2) increment dataref
3) link skb->head

After cloning()  nskb->users == 1, and oskb->users is unchanged

I believe you are referring to skb_get(oskb), that ones increment
oskb->users and skb_shared(oskb) then returns true.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01  6:39         ` Eric Dumazet
@ 2012-05-01 16:17           ` Alexander Duyck
  2012-05-01 17:04             ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-01 16:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 04/30/2012 11:39 PM, Eric Dumazet wrote:
> On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:
>
>> The question I had was more specific to GRO.  As long as we have
>> skb->users == 1 and the skb isn't cloned we should be fine.   It just
>> hadn't occurred to me before that napi_gro_receive had the extra
>> requirement that the skb couldn't be cloned.
>>
> OK
>
> By the way, even if skb was cloned, we would be allowed to steal
> skb->head.
>
> When we clone an oskb we :
>
> 1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
> 2) increment dataref
The problem I have is with this piece right here.  So you increment
dataref.  Now you have an skb that is still pointing to the shared info
on this page and dataref is 2.  What about the side that is stealing the
head?  Is it going to be tracking the dataref as well and decrementing
it before put_page or does it just assume that dataref is 1 and call
put_page directly?  I am guessing the latter since I didn't see anything
that allowed for tracking the dataref of stolen heads.

> 3) link skb->head
>
> After cloning()  nskb->users == 1, and oskb->users is unchanged
>
> I believe you are referring to skb_get(oskb), that ones increment
> oskb->users and skb_shared(oskb) then returns true.
skb_clone and skb_get are two completely separate things.  My concern
with the skb_get portion is if skb->users is greater than 1 we should
not be freeing the skbuff back to the head cache.  This was addressed
with the fact that GRO is requiring that skb->users be 1 before it is
handed the skb, although I didn't look too closely at the other patches
that are also stealing skb->head.  My concern with skb_clone, as I
mentioned above, is that I am not sure how the dataref tracking will
still work.

It looks like Dave applied the patches last night so I will pull the
latest net-next and do some poking around.  It is possible I am just
being dense and not getting this, but I just feel like there is
something that got missed or overlooked.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 16:17           ` Alexander Duyck
@ 2012-05-01 17:04             ` Eric Dumazet
  2012-05-01 19:45               ` Alexander Duyck
  2012-05-01 22:58               ` Alexander Duyck
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-01 17:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Tue, 2012-05-01 at 09:17 -0700, Alexander Duyck wrote:
> On 04/30/2012 11:39 PM, Eric Dumazet wrote:
> > On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:
> >
> >> The question I had was more specific to GRO.  As long as we have
> >> skb->users == 1 and the skb isn't cloned we should be fine.   It just
> >> hadn't occurred to me before that napi_gro_receive had the extra
> >> requirement that the skb couldn't be cloned.
> >>
> > OK
> >
> > By the way, even if skb was cloned, we would be allowed to steal
> > skb->head.
> >
> > When we clone an oskb we :
> >
> > 1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
> > 2) increment dataref
> The problem I have is with this piece right here.  So you increment
> dataref.  Now you have an skb that is still pointing to the shared info
> on this page and dataref is 2.  What about the side that is stealing the
> head?  Is it going to be tracking the dataref as well and decrementing
> it before put_page or does it just assume that dataref is 1 and call
> put_page directly?  I am guessing the latter since I didn't see anything
> that allowed for tracking the dataref of stolen heads.

The only changed thing is the kfree() replaced by put_page()

This kfree() was done when last reference to dataref was released.

If we had a problem before, we have same problem after my patch.

Truth is : In TCP (coalesce and splice()) and GRO, we owns skbs.

(See the various __kfree_skb(skb) calls in net/ipv4/tcp_input.c
There is one exception in ipv6 / treq->pktopts ) but its for SYN packet
and this wont be merged with a previous packet.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 17:04             ` Eric Dumazet
@ 2012-05-01 19:45               ` Alexander Duyck
  2012-05-02  2:45                 ` Eric Dumazet
  2012-05-02  8:24                 ` Eric Dumazet
  2012-05-01 22:58               ` Alexander Duyck
  1 sibling, 2 replies; 45+ messages in thread
From: Alexander Duyck @ 2012-05-01 19:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/01/2012 10:04 AM, Eric Dumazet wrote:
> On Tue, 2012-05-01 at 09:17 -0700, Alexander Duyck wrote:
>> On 04/30/2012 11:39 PM, Eric Dumazet wrote:
>>> On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:
>>>
>>>> The question I had was more specific to GRO.  As long as we have
>>>> skb->users == 1 and the skb isn't cloned we should be fine.   It just
>>>> hadn't occurred to me before that napi_gro_receive had the extra
>>>> requirement that the skb couldn't be cloned.
>>>>
>>> OK
>>>
>>> By the way, even if skb was cloned, we would be allowed to steal
>>> skb->head.
>>>
>>> When we clone an oskb we :
>>>
>>> 1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
>>> 2) increment dataref
>> The problem I have is with this piece right here.  So you increment
>> dataref.  Now you have an skb that is still pointing to the shared info
>> on this page and dataref is 2.  What about the side that is stealing the
>> head?  Is it going to be tracking the dataref as well and decrementing
>> it before put_page or does it just assume that dataref is 1 and call
>> put_page directly?  I am guessing the latter since I didn't see anything
>> that allowed for tracking the dataref of stolen heads.
> The only changed thing is the kfree() replaced by put_page()
>
> This kfree() was done when last reference to dataref was released.
>
> If we had a problem before, we have same problem after my patch.
>
> Truth is : In TCP (coalesce and splice()) and GRO, we owns skbs.
>
> (See the various __kfree_skb(skb) calls in net/ipv4/tcp_input.c
> There is one exception in ipv6 / treq->pktopts ) but its for SYN packet
> and this wont be merged with a previous packet.
I wasn't worried about the kfree vs put_page, I was worried about the
coalesce case.  However, it looks like you are correct and I am not
seeing any issues so everything seems to be working fine.

I have a hacked together ixgbe up and running now with the new build_skb
logic and RSC/LRO disabled.  It looks like it is giving me a 5%
performance boost for small packet routing, but I am using more CPU for
netperf TCP receive tests and I was wondering if you had seen anything
similar on the tg3 driver?

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 17:04             ` Eric Dumazet
  2012-05-01 19:45               ` Alexander Duyck
@ 2012-05-01 22:58               ` Alexander Duyck
  2012-05-01 23:10                 ` Alexander Duyck
  1 sibling, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-01 22:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/01/2012 10:04 AM, Eric Dumazet wrote:
> On Tue, 2012-05-01 at 09:17 -0700, Alexander Duyck wrote:
>> On 04/30/2012 11:39 PM, Eric Dumazet wrote:
>>> On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:
>>>
>>>> The question I had was more specific to GRO.  As long as we have
>>>> skb->users == 1 and the skb isn't cloned we should be fine.   It just
>>>> hadn't occurred to me before that napi_gro_receive had the extra
>>>> requirement that the skb couldn't be cloned.
>>>>
>>> OK
>>>
>>> By the way, even if skb was cloned, we would be allowed to steal
>>> skb->head.
>>>
>>> When we clone an oskb we :
>>>
>>> 1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
>>> 2) increment dataref
>> The problem I have is with this piece right here.  So you increment
>> dataref.  Now you have an skb that is still pointing to the shared info
>> on this page and dataref is 2.  What about the side that is stealing the
>> head?  Is it going to be tracking the dataref as well and decrementing
>> it before put_page or does it just assume that dataref is 1 and call
>> put_page directly?  I am guessing the latter since I didn't see anything
>> that allowed for tracking the dataref of stolen heads.
> The only changed thing is the kfree() replaced by put_page()
>
> This kfree() was done when last reference to dataref was released.
>
> If we had a problem before, we have same problem after my patch.
>
> Truth is : In TCP (coalesce and splice()) and GRO, we owns skbs.
>
> (See the various __kfree_skb(skb) calls in net/ipv4/tcp_input.c
> There is one exception in ipv6 / treq->pktopts ) but its for SYN packet
> and this wont be merged with a previous packet.
Eric,

I think I have found a bug, although it is not specific to this patch
but it is related.  It looks like the TCP coalesce code is causing
tcpdump to fail when using frags.  Based on the comments in the patch I
am assuming you have an ixgbe adapter to test with as that is what I
reproduced this on.  To reproduce the issue all you need to do is run
"tcpdump -i ethX > /dev/null" on one console, and on a second console
run a netperf TCP_MAERTS test to some other server.  Tcpdump will exit
out with a message about bad address like this:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
tcpdump: pcap_loop: recvfrom: Bad address
682 packets captured
2357 packets received by filter
1543 packets dropped by kernel


A bisect of the issue tracked it down to:

1402d366019fedaa2b024f2bac06b7cc9a8782e1 is first bad commit
commit 1402d366019fedaa2b024f2bac06b7cc9a8782e1
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Apr 23 07:11:42 2012 +0000

    tcp: introduce tcp_try_coalesce
    
    commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
    coalescing tcp segments provided by legacy devices (linear skbs)
    
    We extend this idea to fragged skbs, as their truesize can be heavy.
    
    ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
    
    Use this coalescing strategy for receive queue too.
    
    This contributes to reduce number of tcp collapses, at minimal cost, and
    reduces memory overhead and packets drops.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Cc: Tom Herbert <therbert@google.com>
    Cc: Maciej Żenczykowski <maze@google.com>
    Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 8ca3e0b4e6c6a8f375fd800069d24203880623f3 2576d34c5c9cfc717a11e2ebe054143956716b93 M	net

I suspect we are dealing with either a shared or cloned skb in this
case, though I haven't verified which it is yet.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 22:58               ` Alexander Duyck
@ 2012-05-01 23:10                 ` Alexander Duyck
  2012-05-02  2:47                   ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-01 23:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/01/2012 03:58 PM, Alexander Duyck wrote:
> On 05/01/2012 10:04 AM, Eric Dumazet wrote:
>> On Tue, 2012-05-01 at 09:17 -0700, Alexander Duyck wrote:
>>> On 04/30/2012 11:39 PM, Eric Dumazet wrote:
>>>> On Mon, 2012-04-30 at 22:33 -0700, Alexander Duyck wrote:
>>>>
>>>>> The question I had was more specific to GRO.  As long as we have
>>>>> skb->users == 1 and the skb isn't cloned we should be fine.   It just
>>>>> hadn't occurred to me before that napi_gro_receive had the extra
>>>>> requirement that the skb couldn't be cloned.
>>>>>
>>>> OK
>>>>
>>>> By the way, even if skb was cloned, we would be allowed to steal
>>>> skb->head.
>>>>
>>>> When we clone an oskb we :
>>>>
>>>> 1) allocate a struct nskb sk_buff (or use the shadow in case of TCP)
>>>> 2) increment dataref
>>> The problem I have is with this piece right here.  So you increment
>>> dataref.  Now you have an skb that is still pointing to the shared info
>>> on this page and dataref is 2.  What about the side that is stealing the
>>> head?  Is it going to be tracking the dataref as well and decrementing
>>> it before put_page or does it just assume that dataref is 1 and call
>>> put_page directly?  I am guessing the latter since I didn't see anything
>>> that allowed for tracking the dataref of stolen heads.
>> The only changed thing is the kfree() replaced by put_page()
>>
>> This kfree() was done when last reference to dataref was released.
>>
>> If we had a problem before, we have same problem after my patch.
>>
>> Truth is : In TCP (coalesce and splice()) and GRO, we owns skbs.
>>
>> (See the various __kfree_skb(skb) calls in net/ipv4/tcp_input.c
>> There is one exception in ipv6 / treq->pktopts ) but its for SYN packet
>> and this wont be merged with a previous packet.
> Eric,
>
> I think I have found a bug, although it is not specific to this patch
> but it is related.  It looks like the TCP coalesce code is causing
> tcpdump to fail when using frags.  Based on the comments in the patch I
> am assuming you have an ixgbe adapter to test with as that is what I
> reproduced this on.  To reproduce the issue all you need to do is run
> "tcpdump -i ethX > /dev/null" on one console, and on a second console
> run a netperf TCP_MAERTS test to some other server.  Tcpdump will exit
> out with a message about bad address like this:
>
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
> tcpdump: pcap_loop: recvfrom: Bad address
> 682 packets captured
> 2357 packets received by filter
> 1543 packets dropped by kernel
>
>
> A bisect of the issue tracked it down to:
>
> 1402d366019fedaa2b024f2bac06b7cc9a8782e1 is first bad commit
> commit 1402d366019fedaa2b024f2bac06b7cc9a8782e1
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Mon Apr 23 07:11:42 2012 +0000
>
>     tcp: introduce tcp_try_coalesce
>     
>     commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
>     coalescing tcp segments provided by legacy devices (linear skbs)
>     
>     We extend this idea to fragged skbs, as their truesize can be heavy.
>     
>     ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
>     
>     Use this coalescing strategy for receive queue too.
>     
>     This contributes to reduce number of tcp collapses, at minimal cost, and
>     reduces memory overhead and packets drops.
>     
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Cc: Neal Cardwell <ncardwell@google.com>
>     Cc: Tom Herbert <therbert@google.com>
>     Cc: Maciej Żenczykowski <maze@google.com>
>     Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>     Acked-by: Neal Cardwell <ncardwell@google.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
> :040000 040000 8ca3e0b4e6c6a8f375fd800069d24203880623f3 2576d34c5c9cfc717a11e2ebe054143956716b93 M	net
>
> I suspect we are dealing with either a shared or cloned skb in this
> case, though I haven't verified which it is yet.
>
> Thanks,
>
> Alex
>
One additional note.  It looks like LRO and GRO need to be disabled to
trigger the bug.  If either of them are enabled it doesn't seem to
occur.  Likely due to the fact that they are doing the coalescing before
it gets up to the tcp_try_coalesce call.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 19:45               ` Alexander Duyck
@ 2012-05-02  2:45                 ` Eric Dumazet
  2012-05-02  8:24                 ` Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02  2:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Tue, 2012-05-01 at 12:45 -0700, Alexander Duyck wrote:

> I wasn't worried about the kfree vs put_page, I was worried about the
> coalesce case.  However, it looks like you are correct and I am not
> seeing any issues so everything seems to be working fine.
> 
> I have a hacked together ixgbe up and running now with the new build_skb
> logic and RSC/LRO disabled.  It looks like it is giving me a 5%
> performance boost for small packet routing, but I am using more CPU for
> netperf TCP receive tests and I was wondering if you had seen anything
> similar on the tg3 driver?
> 



Hmm, tg3 perf for small packets is not changed, because tg3 uses
copybreak anyway.

#define TG3_RX_COPY_THRESHOLD       256

For big packets, I have no changes in netperf because GRO cannot really
provide big enough skbs at 1Gb speed.

The changes are noticeable when BDP (Bandwidth*Delay Product) is large
and some packet losses happen : You notice lower latencies in TCP stack,
because of less collapses (big copies)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 23:10                 ` Alexander Duyck
@ 2012-05-02  2:47                   ` Eric Dumazet
  2012-05-02  3:54                     ` Eric Dumazet
  2012-05-02  8:13                     ` [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce() Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02  2:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Tue, 2012-05-01 at 16:10 -0700, Alexander Duyck wrote:
> On 05/01/2012 03:58 PM, Alexander Duyck wrote:

> > Eric,
> >
> > I think I have found a bug, although it is not specific to this patch
> > but it is related.  It looks like the TCP coalesce code is causing
> > tcpdump to fail when using frags.  Based on the comments in the patch I
> > am assuming you have an ixgbe adapter to test with as that is what I
> > reproduced this on.  To reproduce the issue all you need to do is run
> > "tcpdump -i ethX > /dev/null" on one console, and on a second console
> > run a netperf TCP_MAERTS test to some other server.  Tcpdump will exit
> > out with a message about bad address like this:
> >
> > tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> > listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
> > tcpdump: pcap_loop: recvfrom: Bad address
> > 682 packets captured
> > 2357 packets received by filter
> > 1543 packets dropped by kernel
> >
> >
> > A bisect of the issue tracked it down to:
> >
> > 1402d366019fedaa2b024f2bac06b7cc9a8782e1 is first bad commit
> > commit 1402d366019fedaa2b024f2bac06b7cc9a8782e1
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Mon Apr 23 07:11:42 2012 +0000
> >
> >     tcp: introduce tcp_try_coalesce
> >     
> >     commit c8628155ece3 (tcp: reduce out_of_order memory use) took care of
> >     coalescing tcp segments provided by legacy devices (linear skbs)
> >     
> >     We extend this idea to fragged skbs, as their truesize can be heavy.
> >     
> >     ixgbe for example uses 256+1024+PAGE_SIZE/2 = 3328 bytes per segment.
> >     
> >     Use this coalescing strategy for receive queue too.
> >     
> >     This contributes to reduce number of tcp collapses, at minimal cost, and
> >     reduces memory overhead and packets drops.
> >     
> >     Signed-off-by: Eric Dumazet <edumazet@google.com>
> >     Cc: Neal Cardwell <ncardwell@google.com>
> >     Cc: Tom Herbert <therbert@google.com>
> >     Cc: Maciej Żenczykowski <maze@google.com>
> >     Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> >     Acked-by: Neal Cardwell <ncardwell@google.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > :040000 040000 8ca3e0b4e6c6a8f375fd800069d24203880623f3 2576d34c5c9cfc717a11e2ebe054143956716b93 M	net
> >
> > I suspect we are dealing with either a shared or cloned skb in this
> > case, though I haven't verified which it is yet.
> >
> > Thanks,
> >
> > Alex
> >
> One additional note.  It looks like LRO and GRO need to be disabled to
> trigger the bug.  If either of them are enabled it doesn't seem to
> occur.  Likely due to the fact that they are doing the coalescing before
> it gets up to the tcp_try_coalesce call.
> 

Thanks Alex, I'll take a look.

It seems my tcpdump is different than yours.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02  2:47                   ` Eric Dumazet
@ 2012-05-02  3:54                     ` Eric Dumazet
  2012-05-02  8:13                     ` [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce() Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02  3:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 04:47 +0200, Eric Dumazet wrote:

> Thanks Alex, I'll take a look.
> 
> It seems my tcpdump is different than yours.
> 
> 

I'll test following patch in a couple of hours :

Thanks !

 net/ipv4/tcp_input.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a631d..910a794 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4467,7 +4467,8 @@ static bool tcp_try_coalesce(struct sock *sk,
 			     struct sk_buff *from,
 			     bool *fragstolen)
 {
-	int delta, len = from->len;
+	int i, delta, len = from->len;
+	bool fastpath;
 
 	*fragstolen = false;
 	if (tcp_hdr(from)->fin)
@@ -4484,6 +4485,7 @@ merge:
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
 
+	fastpath = !from->cloned || atomic_read(&skb_shinfo(from)->dataref) == 1;
 	if (skb_headlen(from) == 0 &&
 	    (skb_shinfo(to)->nr_frags +
 	     skb_shinfo(from)->nr_frags <= MAX_SKB_FRAGS)) {
@@ -4497,7 +4499,13 @@ copyfrags:
 		       skb_shinfo(from)->frags,
 		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
 		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
-		skb_shinfo(from)->nr_frags = 0;
+
+		if (fastpath)
+			skb_shinfo(from)->nr_frags = 0;
+		else
+			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
+				skb_frag_ref(from, i);
+
 		to->truesize += delta;
 		atomic_add(delta, &sk->sk_rmem_alloc);
 		sk_mem_charge(sk, delta);
@@ -4515,7 +4523,9 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-		*fragstolen = true;
+		*fragstolen = fastpath;
+		if (!fastpath)
+			get_page(page);
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02  2:47                   ` Eric Dumazet
  2012-05-02  3:54                     ` Eric Dumazet
@ 2012-05-02  8:13                     ` Eric Dumazet
  2012-05-02 15:52                       ` Alexander Duyck
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02  8:13 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

Before stealing fragments or skb head, we must make sure skb is not
cloned.

If skb is cloned, we must take references on pages instead.

Bug happened using tcpdump (if not using mmap())

Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a631d..7686d7f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4467,7 +4467,7 @@ static bool tcp_try_coalesce(struct sock *sk,
 			     struct sk_buff *from,
 			     bool *fragstolen)
 {
-	int delta, len = from->len;
+	int i, delta, len = from->len;
 
 	*fragstolen = false;
 	if (tcp_hdr(from)->fin)
@@ -4497,7 +4497,13 @@ copyfrags:
 		       skb_shinfo(from)->frags,
 		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
 		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
-		skb_shinfo(from)->nr_frags = 0;
+
+		if (skb_cloned(from))
+			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
+				skb_frag_ref(from, i);
+		else
+			skb_shinfo(from)->nr_frags = 0;
+
 		to->truesize += delta;
 		atomic_add(delta, &sk->sk_rmem_alloc);
 		sk_mem_charge(sk, delta);
@@ -4515,7 +4521,12 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-		*fragstolen = true;
+
+		if (skb_cloned(from))
+			get_page(page);
+		else
+			*fragstolen = true;
+
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-01 19:45               ` Alexander Duyck
  2012-05-02  2:45                 ` Eric Dumazet
@ 2012-05-02  8:24                 ` Eric Dumazet
  2012-05-02 16:16                   ` Alexander Duyck
  2012-05-02 17:16                   ` Rick Jones
  1 sibling, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02  8:24 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Tue, 2012-05-01 at 12:45 -0700, Alexander Duyck wrote:

> I have a hacked together ixgbe up and running now with the new build_skb
> logic and RSC/LRO disabled.  It looks like it is giving me a 5%
> performance boost for small packet routing, but I am using more CPU for
> netperf TCP receive tests and I was wondering if you had seen anything
> similar on the tg3 driver?

Really hard to say, numbers are so small on Gb link :

what do you use to make your numbers ?

netperf -H 172.30.42.23 -t OMNI -C -c 
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.23 (172.30.42.23) port 0 AF_INET
Local       Local       Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Send Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                             %     Method %      Method                          
1700840     1700840     16384  10.01   931.60     10^6bits/s  4.50  S      1.32   S      1.582   2.783   usec/KB  

About ixgbe, feel free to send your patch ;)

Thanks !

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02  8:13                     ` [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce() Eric Dumazet
@ 2012-05-02 15:52                       ` Alexander Duyck
  2012-05-02 16:12                         ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 15:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 01:13 AM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Before stealing fragments or skb head, we must make sure skb is not
> cloned.
>
> If skb is cloned, we must take references on pages instead.
>
> Bug happened using tcpdump (if not using mmap())
>
> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp_input.c |   17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 96a631d..7686d7f 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4467,7 +4467,7 @@ static bool tcp_try_coalesce(struct sock *sk,
>  			     struct sk_buff *from,
>  			     bool *fragstolen)
>  {
> -	int delta, len = from->len;
> +	int i, delta, len = from->len;
>  
>  	*fragstolen = false;
>  	if (tcp_hdr(from)->fin)
> @@ -4497,7 +4497,13 @@ copyfrags:
>  		       skb_shinfo(from)->frags,
>  		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
>  		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
> -		skb_shinfo(from)->nr_frags = 0;
> +
> +		if (skb_cloned(from))
> +			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
> +				skb_frag_ref(from, i);
> +		else
> +			skb_shinfo(from)->nr_frags = 0;
> +
>  		to->truesize += delta;
>  		atomic_add(delta, &sk->sk_rmem_alloc);
>  		sk_mem_charge(sk, delta);
I am fairly certain the bug I saw is only masked over by this change. 
The underlying problem is that we shouldn't be messing with nr_frags on
the from or the to if either one is clone.  You now have a check in
place for the from, but what about the to?  This function should
probably be calling a pskb_expand_head on the to skb in order to
guarantee that the skb->head isn't shared.  Otherwise this is going to
cause other issues for any functions that are sharing these skbs that
just walk through frags without checking skb->len or skb->data_len first. 

> @@ -4515,7 +4521,12 @@ copyfrags:
>  		offset = from->data - (unsigned char *)page_address(page);
>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>  				   page, offset, skb_headlen(from));
> -		*fragstolen = true;
> +
> +		if (skb_cloned(from))
> +			get_page(page);
> +		else
> +			*fragstolen = true;
> +
>  		delta = len; /* we dont know real truesize... */
>  		goto copyfrags;
>  	}
>
>
I don't see where we are now addressing the put_page call to release the
page afterwards.  By calling get_page you are incrementing the page
count by one, but where are you decrementing dataref in the shared
info?  Without that we are looking at a memory leak because __kfree_skb
will decrement the dataref but it will never reach 0 so it will never
call put_page on the head frag.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 15:52                       ` Alexander Duyck
@ 2012-05-02 16:12                         ` Eric Dumazet
  2012-05-02 16:27                           ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 16:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 08:52 -0700, Alexander Duyck wrote:
> On 05/02/2012 01:13 AM, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Before stealing fragments or skb head, we must make sure skb is not
> > cloned.
> >
> > If skb is cloned, we must take references on pages instead.
> >
> > Bug happened using tcpdump (if not using mmap())
> >
> > Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/ipv4/tcp_input.c |   17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 96a631d..7686d7f 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4467,7 +4467,7 @@ static bool tcp_try_coalesce(struct sock *sk,
> >  			     struct sk_buff *from,
> >  			     bool *fragstolen)
> >  {
> > -	int delta, len = from->len;
> > +	int i, delta, len = from->len;
> >  
> >  	*fragstolen = false;
> >  	if (tcp_hdr(from)->fin)
> > @@ -4497,7 +4497,13 @@ copyfrags:
> >  		       skb_shinfo(from)->frags,
> >  		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
> >  		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
> > -		skb_shinfo(from)->nr_frags = 0;
> > +
> > +		if (skb_cloned(from))
> > +			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
> > +				skb_frag_ref(from, i);
> > +		else
> > +			skb_shinfo(from)->nr_frags = 0;
> > +
> >  		to->truesize += delta;
> >  		atomic_add(delta, &sk->sk_rmem_alloc);
> >  		sk_mem_charge(sk, delta);
> I am fairly certain the bug I saw is only masked over by this change. 
> The underlying problem is that we shouldn't be messing with nr_frags on
> the from or the to if either one is clone.  You now have a check in
> place for the from, but what about the to?  This function should
> probably be calling a pskb_expand_head on the to skb in order to
> guarantee that the skb->head isn't shared.  Otherwise this is going to
> cause other issues for any functions that are sharing these skbs that
> just walk through frags without checking skb->len or skb->data_len first. 

Its safe to increase to->len and increase nr_frags in this context,
because we hold a reference to dataref : It cannot disappear under us.

clones will still have their skb->len at skb_clone() time and wont care
we expanded the frags.

> 
> > @@ -4515,7 +4521,12 @@ copyfrags:
> >  		offset = from->data - (unsigned char *)page_address(page);
> >  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
> >  				   page, offset, skb_headlen(from));
> > -		*fragstolen = true;
> > +
> > +		if (skb_cloned(from))
> > +			get_page(page);
> > +		else
> > +			*fragstolen = true;
> > +
> >  		delta = len; /* we dont know real truesize... */
> >  		goto copyfrags;
> >  	}
> >
> >
> I don't see where we are now addressing the put_page call to release the
> page afterwards.  By calling get_page you are incrementing the page
> count by one, but where are you decrementing dataref in the shared
> info?  Without that we are looking at a memory leak because __kfree_skb
> will decrement the dataref but it will never reach 0 so it will never
> call put_page on the head frag.

really the dataref was already incremented at skb_clone() time

It will be properly decremented since we call __kfree_skb()

Only the last decrement will perform the put_page()

Think about splice() is doing, its the same get_page() game.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02  8:24                 ` Eric Dumazet
@ 2012-05-02 16:16                   ` Alexander Duyck
  2012-05-02 16:19                     ` Eric Dumazet
  2012-05-02 17:16                   ` Rick Jones
  1 sibling, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 16:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 01:24 AM, Eric Dumazet wrote:
> On Tue, 2012-05-01 at 12:45 -0700, Alexander Duyck wrote:
>
>> I have a hacked together ixgbe up and running now with the new build_skb
>> logic and RSC/LRO disabled.  It looks like it is giving me a 5%
>> performance boost for small packet routing, but I am using more CPU for
>> netperf TCP receive tests and I was wondering if you had seen anything
>> similar on the tg3 driver?
> Really hard to say, numbers are so small on Gb link :
>
> what do you use to make your numbers ?
>
> netperf -H 172.30.42.23 -t OMNI -C -c 
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.23 (172.30.42.23) port 0 AF_INET
> Local       Local       Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
> Send Socket Send Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
> Final       Final                                             %     Method %      Method                          
> 1700840     1700840     16384  10.01   931.60     10^6bits/s  4.50  S      1.32   S      1.582   2.783   usec/KB  
>
> About ixgbe, feel free to send your patch ;)
>
> Thanks !
>
>
I was working with the out-of-tree ixgbe because I have the option there
of stripping out FCoE and RSC via a couple of build flags.  The problem
is I don't know if the head frag stuff will work out very well with
ixgbe because RSC and FCoE require that we have to use 1K aligned
receive buffers.  It would require us to make us have to bump up our
allocation size by NET_SKB_PAD plus skb_shared_info which would likely
force us up to order 1 pages on most platforms.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02 16:16                   ` Alexander Duyck
@ 2012-05-02 16:19                     ` Eric Dumazet
  2012-05-02 16:27                       ` Eric Dumazet
  2012-05-02 17:02                       ` Alexander Duyck
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 16:19 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 09:16 -0700, Alexander Duyck wrote:

> I was working with the out-of-tree ixgbe because I have the option there
> of stripping out FCoE and RSC via a couple of build flags.  The problem
> is I don't know if the head frag stuff will work out very well with
> ixgbe because RSC and FCoE require that we have to use 1K aligned
> receive buffers.  It would require us to make us have to bump up our
> allocation size by NET_SKB_PAD plus skb_shared_info which would likely
> force us up to order 1 pages on most platforms.

What is RSC exactly, and why RSC is used in the build_skb() context ?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 16:12                         ` Eric Dumazet
@ 2012-05-02 16:27                           ` Alexander Duyck
  2012-05-02 16:46                             ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 16:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 09:12 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 08:52 -0700, Alexander Duyck wrote:
>> On 05/02/2012 01:13 AM, Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Before stealing fragments or skb head, we must make sure skb is not
>>> cloned.
>>>
>>> If skb is cloned, we must take references on pages instead.
>>>
>>> Bug happened using tcpdump (if not using mmap())
>>>
>>> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> ---
>>>  net/ipv4/tcp_input.c |   17 ++++++++++++++---
>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 96a631d..7686d7f 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -4467,7 +4467,7 @@ static bool tcp_try_coalesce(struct sock *sk,
>>>  			     struct sk_buff *from,
>>>  			     bool *fragstolen)
>>>  {
>>> -	int delta, len = from->len;
>>> +	int i, delta, len = from->len;
>>>  
>>>  	*fragstolen = false;
>>>  	if (tcp_hdr(from)->fin)
>>> @@ -4497,7 +4497,13 @@ copyfrags:
>>>  		       skb_shinfo(from)->frags,
>>>  		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
>>>  		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
>>> -		skb_shinfo(from)->nr_frags = 0;
>>> +
>>> +		if (skb_cloned(from))
>>> +			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
>>> +				skb_frag_ref(from, i);
>>> +		else
>>> +			skb_shinfo(from)->nr_frags = 0;
>>> +
>>>  		to->truesize += delta;
>>>  		atomic_add(delta, &sk->sk_rmem_alloc);
>>>  		sk_mem_charge(sk, delta);
>> I am fairly certain the bug I saw is only masked over by this change. 
>> The underlying problem is that we shouldn't be messing with nr_frags on
>> the from or the to if either one is clone.  You now have a check in
>> place for the from, but what about the to?  This function should
>> probably be calling a pskb_expand_head on the to skb in order to
>> guarantee that the skb->head isn't shared.  Otherwise this is going to
>> cause other issues for any functions that are sharing these skbs that
>> just walk through frags without checking skb->len or skb->data_len first. 
> Its safe to increase to->len and increase nr_frags in this context,
> because we hold a reference to dataref : It cannot disappear under us.
>
> clones will still have their skb->len at skb_clone() time and wont care
> we expanded the frags.
Are you sure about that?  I think this may blow up if a bridge is
brought into play.  In that case you will have clones that will be going
through the xmit path of network drivers and I know in the case of the
older e1000 driver it didn't stop to look at the length but would
instead just go through and start mapping all frags to the device.  I am
fairly certain you are risking a data corruption any time you modify
nr_frags and dataref is != 1.

I really think what we should be doing is either not merge period, or we
have to go through slow paths if either the to or the from is cloned.

>>> @@ -4515,7 +4521,12 @@ copyfrags:
>>>  		offset = from->data - (unsigned char *)page_address(page);
>>>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>>>  				   page, offset, skb_headlen(from));
>>> -		*fragstolen = true;
>>> +
>>> +		if (skb_cloned(from))
>>> +			get_page(page);
>>> +		else
>>> +			*fragstolen = true;
>>> +
>>>  		delta = len; /* we dont know real truesize... */
>>>  		goto copyfrags;
>>>  	}
>>>
>>>
>> I don't see where we are now addressing the put_page call to release the
>> page afterwards.  By calling get_page you are incrementing the page
>> count by one, but where are you decrementing dataref in the shared
>> info?  Without that we are looking at a memory leak because __kfree_skb
>> will decrement the dataref but it will never reach 0 so it will never
>> call put_page on the head frag.
> really the dataref was already incremented at skb_clone() time
>
> It will be properly decremented since we call __kfree_skb()
>
> Only the last decrement will perform the put_page()
>
> Think about splice() is doing, its the same get_page() game.
I think you are missing the point.  So skb_clone will bump the dataref
to 2, calling get_page will bump the page count to 2.  After this
function you don't call __kfree_skb(skb) instead you call
kmem_cache_free(skbuff_head_cache, skb).  This will free the sk_buff,
but not decrement dataref leaving it at 2.  Eventually the raw socket
will call kfree_skb(skb) on the clone dropping the dataref to 1 and you
will call put_page dropping the page count to 1, but I don't see where
the last __kfree_skb call will come from that will drop dataref and the
page count to 0.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02 16:19                     ` Eric Dumazet
@ 2012-05-02 16:27                       ` Eric Dumazet
  2012-05-02 17:04                         ` Alexander Duyck
  2012-05-02 17:02                       ` Alexander Duyck
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 16:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 18:19 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 09:16 -0700, Alexander Duyck wrote:
> 
> > I was working with the out-of-tree ixgbe because I have the option there
> > of stripping out FCoE and RSC via a couple of build flags.  The problem
> > is I don't know if the head frag stuff will work out very well with
> > ixgbe because RSC and FCoE require that we have to use 1K aligned
> > receive buffers.  It would require us to make us have to bump up our
> > allocation size by NET_SKB_PAD plus skb_shared_info which would likely
> > force us up to order 1 pages on most platforms.
> 
> What is RSC exactly, and why RSC is used in the build_skb() context ?
> 
> 

It looks like e1000e would be a good candidate for build_skb()
(without packet split)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 16:27                           ` Alexander Duyck
@ 2012-05-02 16:46                             ` Eric Dumazet
  2012-05-02 17:55                               ` [PATCH v2 " Eric Dumazet
  2012-05-02 18:05                               ` [PATCH " Alexander Duyck
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 16:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 09:27 -0700, Alexander Duyck wrote:

> Are you sure about that?  I think this may blow up if a bridge is
> brought into play.  In that case you will have clones that will be going
> through the xmit path of network drivers and I know in the case of the
> older e1000 driver it didn't stop to look at the length but would
> instead just go through and start mapping all frags to the device.  I am
> fairly certain you are risking a data corruption any time you modify
> nr_frags and dataref is != 1.
> 


Hmm...

A driver should not map more fragments than len/data_len permits.
But point taken.

Frankly we can add the test, but it means that any sniffer running will
disable tcp coalescing, while net/packet/af_packet.c does the right
thing.

I'll check how I can do...

> I really think what we should be doing is either not merge period, or we
> have to go through slow paths if either the to or the from is cloned.
> 
> >>> @@ -4515,7 +4521,12 @@ copyfrags:
> >>>  		offset = from->data - (unsigned char *)page_address(page);
> >>>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
> >>>  				   page, offset, skb_headlen(from));
> >>> -		*fragstolen = true;
> >>> +
> >>> +		if (skb_cloned(from))
> >>> +			get_page(page);
> >>> +		else
> >>> +			*fragstolen = true;
> >>> +
> >>>  		delta = len; /* we dont know real truesize... */
> >>>  		goto copyfrags;
> >>>  	}
> >>>
> >>>
> >> I don't see where we are now addressing the put_page call to release the
> >> page afterwards.  By calling get_page you are incrementing the page
> >> count by one, but where are you decrementing dataref in the shared
> >> info?  Without that we are looking at a memory leak because __kfree_skb
> >> will decrement the dataref but it will never reach 0 so it will never
> >> call put_page on the head frag.
> > really the dataref was already incremented at skb_clone() time
> >
> > It will be properly decremented since we call __kfree_skb()
> >
> > Only the last decrement will perform the put_page()
> >
> > Think about splice() is doing, its the same get_page() game.
> I think you are missing the point.  So skb_clone will bump the dataref
> to 2, calling get_page will bump the page count to 2.  After this
> function you don't call __kfree_skb(skb) instead you call
> kmem_cache_free(skbuff_head_cache, skb).  This will free the sk_buff,
> but not decrement dataref leaving it at 2.  Eventually the raw socket
> will call kfree_skb(skb) on the clone dropping the dataref to 1 and you
> will call put_page dropping the page count to 1, but I don't see where
> the last __kfree_skb call will come from that will drop dataref and the
> page count to 0.

No, you miss that _if_ we added one to page count, then we wont call
kmem_cache_free(skbuff_head_cache, skb) but call __kfree_skb(skb)
instead because fragstolen will be false.

if (fragstolen)
	kmem_cache_free(...)
else
	__kfree_skb(...)

In future patch (addressing tcp coalescing in tcp_queue_rcv() as well),
I'll add a helper to make this more clear.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02 16:19                     ` Eric Dumazet
  2012-05-02 16:27                       ` Eric Dumazet
@ 2012-05-02 17:02                       ` Alexander Duyck
  1 sibling, 0 replies; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 17:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 09:19 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 09:16 -0700, Alexander Duyck wrote:
>
>> I was working with the out-of-tree ixgbe because I have the option there
>> of stripping out FCoE and RSC via a couple of build flags.  The problem
>> is I don't know if the head frag stuff will work out very well with
>> ixgbe because RSC and FCoE require that we have to use 1K aligned
>> receive buffers.  It would require us to make us have to bump up our
>> allocation size by NET_SKB_PAD plus skb_shared_info which would likely
>> force us up to order 1 pages on most platforms.
> What is RSC exactly, and why RSC is used in the build_skb() context ?
RSC is your in-hardware LRO.  Basically it aggregates the TCP flows in
hardware instead of software.  As a result we have to be able to receive
jumbo frames any time it is enabled.  This means we can end up using the
full data buffer which we can only set in 1K increments.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02 16:27                       ` Eric Dumazet
@ 2012-05-02 17:04                         ` Alexander Duyck
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 17:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 09:27 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 18:19 +0200, Eric Dumazet wrote:
>> On Wed, 2012-05-02 at 09:16 -0700, Alexander Duyck wrote:
>>
>>> I was working with the out-of-tree ixgbe because I have the option there
>>> of stripping out FCoE and RSC via a couple of build flags.  The problem
>>> is I don't know if the head frag stuff will work out very well with
>>> ixgbe because RSC and FCoE require that we have to use 1K aligned
>>> receive buffers.  It would require us to make us have to bump up our
>>> allocation size by NET_SKB_PAD plus skb_shared_info which would likely
>>> force us up to order 1 pages on most platforms.
>> What is RSC exactly, and why RSC is used in the build_skb() context ?
>>
>>
> It looks like e1000e would be a good candidate for build_skb()
> (without packet split)

Yes, e1000e and e1000 would be good candidates since they have separate
flows for jumbo flows.  Odds are they could probably also take advantage
of the page reuse code I have in igb and ixgbe, but I just haven't had
time to get around to updating them.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag
  2012-05-02  8:24                 ` Eric Dumazet
  2012-05-02 16:16                   ` Alexander Duyck
@ 2012-05-02 17:16                   ` Rick Jones
  1 sibling, 0 replies; 45+ messages in thread
From: Rick Jones @ 2012-05-02 17:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Alexander Duyck, David Miller, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 01:24 AM, Eric Dumazet wrote:
> On Tue, 2012-05-01 at 12:45 -0700, Alexander Duyck wrote:
>
>> I have a hacked together ixgbe up and running now with the new build_skb
>> logic and RSC/LRO disabled.  It looks like it is giving me a 5%
>> performance boost for small packet routing, but I am using more CPU for
>> netperf TCP receive tests and I was wondering if you had seen anything
>> similar on the tg3 driver?
>
> Really hard to say, numbers are so small on Gb link :
>
> what do you use to make your numbers ?
>
> netperf -H 172.30.42.23 -t OMNI -C -c
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.23 (172.30.42.23) port 0 AF_INET
> Local       Local       Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Send Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1700840     1700840     16384  10.01   931.60     10^6bits/s  4.50  S      1.32   S      1.582   2.783   usec/KB

If there is so little CPU consumed, I'm a bit surprised the throughput 
wasn't 940 Gbit/s.

It might be a good idea to fix the local and remote socket buffer sizes 
for these sorts of A-B comparisons to take the variability of the 
autotuning out.

And then, to see if the small differences are "real" one can light-up 
the confidence intervals.  For example (using kernels unrelated to the 
patch discussion):

raj@tardy:~/netperf2_trunk/src$ ./netperf -H 192.168.1.3 -t omni -c -C 
-I 99,1 -i 30,3 -- -s 256K -S 256K -m 16K -O 
throughput,local_cpu_util,local_sd,remote_cpu_util,remote_sd,throughput_confid,local_cpu_confid,remote_cpu_confid,confidence_iteration
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.3 () 
port 0 AF_INET : +/-0.500% @ 99% conf.  : interval : demo
Throughput Local Local   Remote Remote  Throughput Local      Remote 
  Confidence
            CPU   Service CPU    Service Confidence CPU        CPU 
   Iterations
            Util  Demand  Util   Demand  Width (%)  Confidence 
Confidence Run
            %             %                         Width (%)  Width (%) 

941.36     8.70  3.030   45.36  7.895   0.006      18.836     0.209 
  30

In this instance, I asked to be 99% confident the throughput and CPU 
util were within +/- 0.5% of the "real" mean.  The confidence intervals 
were hit for throughput and remote CPU util, but not for local CPU util 
- netperf was running on my personal workstation, which also receives 
email etc.  Presumably a more isolated and idle system would have hit 
the confidence intervals.

Other sources of variation to consider eliminating when looking for 
small differences in CPU utilization might be the multiqueue support in 
the NIC.  I'll often just terminate irqbalance and set all the IRQs to a 
single CPU (when doing single stream tests).  Or, one can fully specify 
the four-tuple for the netperf data connection.

rick jones
of course there is also the whole question of the effect of HW threading 
on the meaningfulness of OS-determined utilization...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 16:46                             ` Eric Dumazet
@ 2012-05-02 17:55                               ` Eric Dumazet
  2012-05-02 19:58                                 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
  2012-05-03  1:11                                 ` [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce() David Miller
  2012-05-02 18:05                               ` [PATCH " Alexander Duyck
  1 sibling, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 17:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

Before stealing fragments or skb head, we must make sure skbs are not
cloned.

Alexander was worried about destination skb being cloned : In bridge
setups, a driver could be fooled if skb->data_len would not match skb
nr_frags.

If source skb is cloned, we must take references on pages instead.

Bug happened using tcpdump (if not using mmap())

Introduce kfree_skb_partial() helper to cleanup code.

Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c |   42 +++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a631d..f891a5e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4455,6 +4455,7 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
  * @sk: socket
  * @to: prior buffer
  * @from: buffer to add in queue
+ * @fragstolen: pointer to boolean
  *
  * Before queueing skb @from after @to, try to merge them
  * to reduce overall memory use and queue lengths, if cost is small.
@@ -4467,10 +4468,10 @@ static bool tcp_try_coalesce(struct sock *sk,
 			     struct sk_buff *from,
 			     bool *fragstolen)
 {
-	int delta, len = from->len;
+	int i, delta, len = from->len;
 
 	*fragstolen = false;
-	if (tcp_hdr(from)->fin)
+	if (tcp_hdr(from)->fin || skb_cloned(to))
 		return false;
 	if (len <= skb_tailroom(to)) {
 		BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
@@ -4497,7 +4498,13 @@ copyfrags:
 		       skb_shinfo(from)->frags,
 		       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
 		skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
-		skb_shinfo(from)->nr_frags = 0;
+
+		if (skb_cloned(from))
+			for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
+				skb_frag_ref(from, i);
+		else
+			skb_shinfo(from)->nr_frags = 0;
+
 		to->truesize += delta;
 		atomic_add(delta, &sk->sk_rmem_alloc);
 		sk_mem_charge(sk, delta);
@@ -4515,13 +4522,26 @@ copyfrags:
 		offset = from->data - (unsigned char *)page_address(page);
 		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
 				   page, offset, skb_headlen(from));
-		*fragstolen = true;
+
+		if (skb_cloned(from))
+			get_page(page);
+		else
+			*fragstolen = true;
+
 		delta = len; /* we dont know real truesize... */
 		goto copyfrags;
 	}
 	return false;
 }
 
+static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
+{
+	if (head_stolen)
+		kmem_cache_free(skbuff_head_cache, skb);
+	else
+		__kfree_skb(skb);
+}
+
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4565,10 +4585,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
 			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
 		} else {
-			if (fragstolen)
-				kmem_cache_free(skbuff_head_cache, skb);
-			else
-				__kfree_skb(skb);
+			kfree_skb_partial(skb, fragstolen);
 			skb = NULL;
 		}
 
@@ -4727,12 +4744,9 @@ queue_and_out:
 
 		tcp_fast_path_check(sk);
 
-		if (eaten > 0) {
-			if (fragstolen)
-				kmem_cache_free(skbuff_head_cache, skb);
-			else
-				__kfree_skb(skb);
-		} else if (!sock_flag(sk, SOCK_DEAD))
+		if (eaten > 0)
+			kfree_skb_partial(skb, fragstolen);
+		else if (!sock_flag(sk, SOCK_DEAD))
 			sk->sk_data_ready(sk, 0);
 		return;
 	}

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 16:46                             ` Eric Dumazet
  2012-05-02 17:55                               ` [PATCH v2 " Eric Dumazet
@ 2012-05-02 18:05                               ` Alexander Duyck
  2012-05-02 18:15                                 ` Eric Dumazet
  1 sibling, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 18:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 09:46 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 09:27 -0700, Alexander Duyck wrote:
[...]
>>>>> @@ -4515,7 +4521,12 @@ copyfrags:
>>>>>  		offset = from->data - (unsigned char *)page_address(page);
>>>>>  		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
>>>>>  				   page, offset, skb_headlen(from));
>>>>> -		*fragstolen = true;
>>>>> +
>>>>> +		if (skb_cloned(from))
>>>>> +			get_page(page);
>>>>> +		else
>>>>> +			*fragstolen = true;
>>>>> +
>>>>>  		delta = len; /* we dont know real truesize... */
>>>>>  		goto copyfrags;
>>>>>  	}
>>>>>
>>>>>
>>>> I don't see where we are now addressing the put_page call to release the
>>>> page afterwards.  By calling get_page you are incrementing the page
>>>> count by one, but where are you decrementing dataref in the shared
>>>> info?  Without that we are looking at a memory leak because __kfree_skb
>>>> will decrement the dataref but it will never reach 0 so it will never
>>>> call put_page on the head frag.
>>> really the dataref was already incremented at skb_clone() time
>>>
>>> It will be properly decremented since we call __kfree_skb()
>>>
>>> Only the last decrement will perform the put_page()
>>>
>>> Think about splice() is doing, its the same get_page() game.
>> I think you are missing the point.  So skb_clone will bump the dataref
>> to 2, calling get_page will bump the page count to 2.  After this
>> function you don't call __kfree_skb(skb) instead you call
>> kmem_cache_free(skbuff_head_cache, skb).  This will free the sk_buff,
>> but not decrement dataref leaving it at 2.  Eventually the raw socket
>> will call kfree_skb(skb) on the clone dropping the dataref to 1 and you
>> will call put_page dropping the page count to 1, but I don't see where
>> the last __kfree_skb call will come from that will drop dataref and the
>> page count to 0.
> No, you miss that _if_ we added one to page count, then we wont call
> kmem_cache_free(skbuff_head_cache, skb) but call __kfree_skb(skb)
> instead because fragstolen will be false.
>
> if (fragstolen)
> 	kmem_cache_free(...)
> else
> 	__kfree_skb(...)
>
> In future patch (addressing tcp coalescing in tcp_queue_rcv() as well),
> I'll add a helper to make this more clear.
You're correct about the fragstolen case, I actually was thinking of the
first patch you sent, not this second one.

However we still have a problem.  What we end up with now is a case of
sharing in which the clone skb no longer knows that it is sharing the
head with another skb.  The dataref will drop to 1 when we call
__kfree_skb.  This means that any other function out there that tries to
see if the skb is shared would return false.  This could lead to issues
if there is anything out there that manipulates the data in head based
on the false assumption that it is not cloned.  What we would probably
need to do in this case is tweak the logic for skb_cloned.  If you are
using a head_frag you should probably add a check that returns true if
cloned is true and page_count is greater than 1.  We should be safe in
the case of skb_header_cloned since we already dropped are dataref when
we stole the page and freed the skb.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 18:05                               ` [PATCH " Alexander Duyck
@ 2012-05-02 18:15                                 ` Eric Dumazet
  2012-05-02 20:55                                   ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 18:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:

> You're correct about the fragstolen case, I actually was thinking of the
> first patch you sent, not this second one.
> 
> However we still have a problem.  What we end up with now is a case of
> sharing in which the clone skb no longer knows that it is sharing the
> head with another skb.  The dataref will drop to 1 when we call
> __kfree_skb.  This means that any other function out there that tries to
> see if the skb is shared would return false.  This could lead to issues
> if there is anything out there that manipulates the data in head based
> on the false assumption that it is not cloned.  What we would probably
> need to do in this case is tweak the logic for skb_cloned.  If you are
> using a head_frag you should probably add a check that returns true if
> cloned is true and page_count is greater than 1.  We should be safe in
> the case of skb_header_cloned since we already dropped are dataref when
> we stole the page and freed the skb.

I really dont understand this concern.

When skb is cloned, we copy in head_frag __skb_clone()

So both skbs have the bit set, and dataref = 2.

first skb is freed, dataref becomes 1 and nothing special  happen

>From this point, skb->head is not 'shared' anymore (taken your own
words). And we are free to do whatever we want.

second skb is freed, dataref becomes 0 and we call the right destructor.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 17:55                               ` [PATCH v2 " Eric Dumazet
@ 2012-05-02 19:58                                 ` Eric Dumazet
  2012-05-02 20:11                                   ` Joe Perches
  2012-05-03  1:11                                   ` David Miller
  2012-05-03  1:11                                 ` [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce() David Miller
  1 sibling, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 19:58 UTC (permalink / raw)
  To: David Miller
  Cc: Alexander Duyck, Alexander Duyck, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
receiver function when application is not blocked in recvmsg().

Function tcp_queue_rcv() is moved a bit to allow its call from
tcp_data_queue()

This gives good results especially if GRO could not kick, and if skb
head is a fragment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
---
To be applied after "[PATCH v2 net-next] net: take care of cloned skbs
in tcp_try_coalesce()"

 include/net/tcp.h    |    3 ++-
 net/ipv4/tcp.c       |   10 +++++-----
 net/ipv4/tcp_input.c |   40 +++++++++++++++++++++-------------------
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..a9d2fb8 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -438,7 +438,8 @@ extern int tcp_disconnect(struct sock *sk, int flags);
 
 void tcp_connect_init(struct sock *sk);
 void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
-void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen);
+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
+			       int hdrlen, bool *fragstolen);
 
 /* From syncookies.c */
 extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..bd5deff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -980,8 +980,8 @@ static inline int select_size(const struct sock *sk, bool sg)
 static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct sk_buff *skb;
-	struct tcp_skb_cb *cb;
 	struct tcphdr *th;
+	bool fragstolen;
 
 	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
 	if (!skb)
@@ -994,14 +994,14 @@ static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 	if (memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size))
 		goto err_free;
 
-	cb = TCP_SKB_CB(skb);
-
 	TCP_SKB_CB(skb)->seq = tcp_sk(sk)->rcv_nxt;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
 	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
 
-	tcp_queue_rcv(sk, skb, sizeof(*th));
-
+	if (tcp_queue_rcv(sk, skb, sizeof(*th), &fragstolen)) {
+		WARN_ON_ONCE(fragstolen); /* should not happen */
+		__kfree_skb(skb);
+	}
 	return size;
 
 err_free:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f891a5e..2233468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4662,6 +4662,22 @@ end:
 		skb_set_owner_r(skb, sk);
 }
 
+int tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+		  bool *fragstolen)
+{
+	int eaten;
+	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+
+	__skb_pull(skb, hdrlen);
+	eaten = (tail &&
+		 tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0;
+	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+	if (!eaten) {
+		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		skb_set_owner_r(skb, sk);
+	}
+	return eaten;
+}
 
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
@@ -4708,20 +4724,12 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		}
 
 		if (eaten <= 0) {
-			struct sk_buff *tail;
 queue_and_out:
 			if (eaten < 0 &&
 			    tcp_try_rmem_schedule(sk, skb->truesize))
 				goto drop;
 
-			tail = skb_peek_tail(&sk->sk_receive_queue);
-			eaten = (tail &&
-				 tcp_try_coalesce(sk, tail, skb,
-						  &fragstolen)) ? 1 : 0;
-			if (eaten <= 0) {
-				skb_set_owner_r(skb, sk);
-				__skb_queue_tail(&sk->sk_receive_queue, skb);
-			}
+			eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
 		}
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if (skb->len)
@@ -5416,14 +5424,6 @@ discard:
 	return 0;
 }
 
-void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen)
-{
-	__skb_pull(skb, hdrlen);
-	__skb_queue_tail(&sk->sk_receive_queue, skb);
-	skb_set_owner_r(skb, sk);
-	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-}
-
 /*
  *	TCP receive function for the ESTABLISHED state.
  *
@@ -5532,6 +5532,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 		} else {
 			int eaten = 0;
 			int copied_early = 0;
+			bool fragstolen = false;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
 			    len - tcp_header_len <= tp->ucopy.len) {
@@ -5589,7 +5590,8 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 				/* Bulk data transfer: receiver */
-				tcp_queue_rcv(sk, skb, tcp_header_len);
+				eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
+						      &fragstolen);
 			}
 
 			tcp_event_data_recv(sk, skb);
@@ -5611,7 +5613,7 @@ no_ack:
 			else
 #endif
 			if (eaten)
-				__kfree_skb(skb);
+				kfree_skb_partial(skb, fragstolen);
 			else
 				sk->sk_data_ready(sk, 0);
 			return 0;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 19:58                                 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
@ 2012-05-02 20:11                                   ` Joe Perches
  2012-05-02 20:23                                     ` Eric Dumazet
  2012-05-03  1:11                                   ` David Miller
  1 sibling, 1 reply; 45+ messages in thread
From: Joe Perches @ 2012-05-02 20:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 21:58 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
> receiver function when application is not blocked in recvmsg().
> 
> Function tcp_queue_rcv() is moved a bit to allow its call from
> tcp_data_queue()
> 
> This gives good results especially if GRO could not kick, and if skb
> head is a fragment.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Duyck <alexander.h.duyck@intel.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> To be applied after "[PATCH v2 net-next] net: take care of cloned skbs
> in tcp_try_coalesce()"
> 
>  include/net/tcp.h    |    3 ++-
>  net/ipv4/tcp.c       |   10 +++++-----
>  net/ipv4/tcp_input.c |   40 +++++++++++++++++++++-------------------
>  3 files changed, 28 insertions(+), 25 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 0fb84de..a9d2fb8 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -438,7 +438,8 @@ extern int tcp_disconnect(struct sock *sk, int flags);
>  
>  void tcp_connect_init(struct sock *sk);
>  void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
> -void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen);
> +int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
> +			       int hdrlen, bool *fragstolen);
>  
>  /* From syncookies.c */
>  extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9670af3..bd5deff 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -980,8 +980,8 @@ static inline int select_size(const struct sock *sk, bool sg)
>  static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>  {
>  	struct sk_buff *skb;
> -	struct tcp_skb_cb *cb;
>  	struct tcphdr *th;
> +	bool fragstolen;

It might be useful to comment that this is/can be initialized to
false in tcp_try_coalesce via tcp_recv_queue.

>  	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
>  	if (!skb)
> @@ -994,14 +994,14 @@ static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>  	if (memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size))
>  		goto err_free;
>  
> -	cb = TCP_SKB_CB(skb);
> -
>  	TCP_SKB_CB(skb)->seq = tcp_sk(sk)->rcv_nxt;
>  	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
>  	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
>  
> -	tcp_queue_rcv(sk, skb, sizeof(*th));
> -
> +	if (tcp_queue_rcv(sk, skb, sizeof(*th), &fragstolen)) {
> +		WARN_ON_ONCE(fragstolen); /* should not happen */

Otherwise this looks like a possibly uninitialized test
of fragstolen.  Maybe there's a path where tail is null
in tcp_recv_queue and it's an uninitialized test anyway.

> +		__kfree_skb(skb);
> +	}
>  	return size;
>  
>  err_free:
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index f891a5e..2233468 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4662,6 +4662,22 @@ end:
>  		skb_set_owner_r(skb, sk);
>  }
>  
> +int tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
> +		  bool *fragstolen)
> +{
> +	int eaten;
> +	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
> +
> +	__skb_pull(skb, hdrlen);
> +	eaten = (tail &&
> +		 tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0;
> +	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
> +	if (!eaten) {
> +		__skb_queue_tail(&sk->sk_receive_queue, skb);
> +		skb_set_owner_r(skb, sk);
> +	}
> +	return eaten;
> +}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 20:11                                   ` Joe Perches
@ 2012-05-02 20:23                                     ` Eric Dumazet
  2012-05-02 20:34                                       ` Joe Perches
  2012-05-03  0:32                                       ` David Miller
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-05-02 20:23 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 13:11 -0700, Joe Perches wrote:

> It might be useful to comment that this is/can be initialized to
> false in tcp_try_coalesce via tcp_recv_queue.


> Otherwise this looks like a possibly uninitialized test
> of fragstolen.  Maybe there's a path where tail is null
> in tcp_recv_queue and it's an uninitialized test anyway.

If tcp_queue_rcv() returns 1, fragstolen is initialized in
tcp_try_coalesce().

If tcp_queue_rcv() returns 0, fragstolen content is undefined and we
dont care.

If a compiler or static checker complains, its only their problem.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 20:23                                     ` Eric Dumazet
@ 2012-05-02 20:34                                       ` Joe Perches
  2012-05-03  0:32                                       ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: Joe Perches @ 2012-05-02 20:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexander Duyck, Alexander Duyck, netdev,
	Neal Cardwell, Tom Herbert, Jeff Kirsher, Michael Chan,
	Matt Carlson, Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 22:23 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 13:11 -0700, Joe Perches wrote:
> > It might be useful to comment that this is/can be initialized to
> > false in tcp_try_coalesce via tcp_recv_queue.
> > Otherwise this looks like a possibly uninitialized test
> > of fragstolen.  Maybe there's a path where tail is null
> > in tcp_recv_queue and it's an uninitialized test anyway.
> 
> If tcp_queue_rcv() returns 1, fragstolen is initialized in
> tcp_try_coalesce().
> 
> If tcp_queue_rcv() returns 0, fragstolen content is undefined and we
> dont care.
> 
> If a compiler or static checker complains, its only their problem.

True, but it's code that's a bit fragile and
I think as such it could be improved for any
human reader by a descriptive comment.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 18:15                                 ` Eric Dumazet
@ 2012-05-02 20:55                                   ` Alexander Duyck
  2012-05-03  1:52                                     ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-02 20:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 05/02/2012 11:15 AM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
>
>> You're correct about the fragstolen case, I actually was thinking of the
>> first patch you sent, not this second one.
>>
>> However we still have a problem.  What we end up with now is a case of
>> sharing in which the clone skb no longer knows that it is sharing the
>> head with another skb.  The dataref will drop to 1 when we call
>> __kfree_skb.  This means that any other function out there that tries to
>> see if the skb is shared would return false.  This could lead to issues
>> if there is anything out there that manipulates the data in head based
>> on the false assumption that it is not cloned.  What we would probably
>> need to do in this case is tweak the logic for skb_cloned.  If you are
>> using a head_frag you should probably add a check that returns true if
>> cloned is true and page_count is greater than 1.  We should be safe in
>> the case of skb_header_cloned since we already dropped are dataref when
>> we stole the page and freed the skb.
> I really dont understand this concern.
>
> When skb is cloned, we copy in head_frag __skb_clone()
>
> So both skbs have the bit set, and dataref = 2.
>
> first skb is freed, dataref becomes 1 and nothing special  happen
>
> >From this point, skb->head is not 'shared' anymore (taken your own
> words). And we are free to do whatever we want.
>
> second skb is freed, dataref becomes 0 and we call the right destructor.
The problem is that the stack will not be able to detect sharing.  As
long as page_count is greater than 2 and skb->cloned is set we should be
telling any callers to skb_cloned that the head is cloned.  Otherwise we
can run into issues elsewhere with well meaning code checking and not
detecting sharing, and then mangling the header.

Also I am not sure if the big monolithic changes are really the best way
to approach this.  It would be nice if we could fix this incrementally
instead of trying to do it all at once since there are multiple issues
that need to be addressed.

I will try to submit a few patches from my end later today.  I still
need to look over all of the changes from the past couple of weeks that
were based on the assumption that the IP stack completely owned the skb.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 20:23                                     ` Eric Dumazet
  2012-05-02 20:34                                       ` Joe Perches
@ 2012-05-03  0:32                                       ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2012-05-03  0:32 UTC (permalink / raw)
  To: eric.dumazet
  Cc: joe, alexander.duyck, alexander.h.duyck, netdev, ncardwell,
	therbert, jeffrey.t.kirsher, mchan, mcarlson, herbert,
	bhutchings, ilpo.jarvinen, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 May 2012 22:23:07 +0200

> If a compiler or static checker complains, its only their problem.

It's also my problem, since it makes it even harder for me to see
any meaningful new warnings that crop up.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 17:55                               ` [PATCH v2 " Eric Dumazet
  2012-05-02 19:58                                 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
@ 2012-05-03  1:11                                 ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2012-05-03  1:11 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.h.duyck, alexander.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 May 2012 19:55:58 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Before stealing fragments or skb head, we must make sure skbs are not
> cloned.
> 
> Alexander was worried about destination skb being cloned : In bridge
> setups, a driver could be fooled if skb->data_len would not match skb
> nr_frags.
> 
> If source skb is cloned, we must take references on pages instead.
> 
> Bug happened using tcpdump (if not using mmap())
> 
> Introduce kfree_skb_partial() helper to cleanup code.
> 
> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-02 19:58                                 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
  2012-05-02 20:11                                   ` Joe Perches
@ 2012-05-03  1:11                                   ` David Miller
  2012-05-03  2:14                                     ` Eric Dumazet
  1 sibling, 1 reply; 45+ messages in thread
From: David Miller @ 2012-05-03  1:11 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.duyck, alexander.h.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 May 2012 21:58:29 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
> receiver function when application is not blocked in recvmsg().
> 
> Function tcp_queue_rcv() is moved a bit to allow its call from
> tcp_data_queue()
> 
> This gives good results especially if GRO could not kick, and if skb
> head is a fragment.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-02 20:55                                   ` Alexander Duyck
@ 2012-05-03  1:52                                     ` Eric Dumazet
  2012-05-03  3:00                                       ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-03  1:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 13:55 -0700, Alexander Duyck wrote:
> On 05/02/2012 11:15 AM, Eric Dumazet wrote:
> > On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
> >
> >> You're correct about the fragstolen case, I actually was thinking of the
> >> first patch you sent, not this second one.
> >>
> >> However we still have a problem.  What we end up with now is a case of
> >> sharing in which the clone skb no longer knows that it is sharing the
> >> head with another skb.  The dataref will drop to 1 when we call
> >> __kfree_skb.  This means that any other function out there that tries to
> >> see if the skb is shared would return false.  This could lead to issues
> >> if there is anything out there that manipulates the data in head based
> >> on the false assumption that it is not cloned.  What we would probably
> >> need to do in this case is tweak the logic for skb_cloned.  If you are
> >> using a head_frag you should probably add a check that returns true if
> >> cloned is true and page_count is greater than 1.  We should be safe in
> >> the case of skb_header_cloned since we already dropped are dataref when
> >> we stole the page and freed the skb.
> > I really dont understand this concern.
> >
> > When skb is cloned, we copy in head_frag __skb_clone()
> >
> > So both skbs have the bit set, and dataref = 2.
> >
> > first skb is freed, dataref becomes 1 and nothing special  happen
> >
> > >From this point, skb->head is not 'shared' anymore (taken your own
> > words). And we are free to do whatever we want.
> >
> > second skb is freed, dataref becomes 0 and we call the right destructor.
> The problem is that the stack will not be able to detect sharing.  As
> long as page_count is greater than 2 and skb->cloned is set we should be
> telling any callers to skb_cloned that the head is cloned.  Otherwise we
> can run into issues elsewhere with well meaning code checking and not
> detecting sharing, and then mangling the header.
> 

page count is irrelevant, since if PAGE_SIZE=65536, you can have 32
fragments (of 2048 bytes) per page. Still we can call put_page() for
each individual frag that must be freed.

You forgot to give an example of path that would be failing. Since the
skb_cloned() check is still valid.

Head is cloned if : skb->cloned is set and dataref value is not 1

(minus the skb_header_release() tweaks done on output path for tcp)

Every time a 'caller' is going to modify/mangle its skb head, it must
first call pskb_expand_head() (or various helpers around it) to :

- allocate a new skb->head
- copy old content to new head
- release a reference on old head dataref
- if old dataref reaches 0, 'free' old head (might be a kfree() or
put_page())

> Also I am not sure if the big monolithic changes are really the best way
> to approach this.  It would be nice if we could fix this incrementally
> instead of trying to do it all at once since there are multiple issues
> that need to be addressed.
> 
> I will try to submit a few patches from my end later today.  I still
> need to look over all of the changes from the past couple of weeks that
> were based on the assumption that the IP stack completely owned the skb.

I did my best to provide small changes.

Plus TCP coalescing is done after IP processing.

Owning skb is a vague concept anyway. IP borrows skb but not owns them.
They already could be cloned skb.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-03  1:11                                   ` David Miller
@ 2012-05-03  2:14                                     ` Eric Dumazet
  2012-05-03  2:21                                       ` David Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-03  2:14 UTC (permalink / raw)
  To: David Miller
  Cc: alexander.duyck, alexander.h.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze

On Wed, 2012-05-02 at 21:11 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 02 May 2012 21:58:29 +0200
> 
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
> > receiver function when application is not blocked in recvmsg().
> > 
> > Function tcp_queue_rcv() is moved a bit to allow its call from
> > tcp_data_queue()
> > 
> > This gives good results especially if GRO could not kick, and if skb
> > head is a fragment.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Applied.

Thanks David

My next step is to provide a common helper to NAPI drivers to replace 
netdev_alloc_skb_ip_align(struct net_device *dev, unsigned int len)

by :

struct sk_buff *napi_alloc_rxskb(struct napi_struct *napi, unsigned int len)

That will manage a cache of one page, splitted in fragments as needed.
(roughly the code we added in tg3 as POC)

Because converting drivers to build_skb() is a too slow process.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv()
  2012-05-03  2:14                                     ` Eric Dumazet
@ 2012-05-03  2:21                                       ` David Miller
  0 siblings, 0 replies; 45+ messages in thread
From: David Miller @ 2012-05-03  2:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.duyck, alexander.h.duyck, netdev, ncardwell, therbert,
	jeffrey.t.kirsher, mchan, mcarlson, herbert, bhutchings,
	ilpo.jarvinen, maze

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 May 2012 04:14:20 +0200

> My next step is to provide a common helper to NAPI drivers to replace 
> netdev_alloc_skb_ip_align(struct net_device *dev, unsigned int len)
> 
> by :
> 
> struct sk_buff *napi_alloc_rxskb(struct napi_struct *napi, unsigned int len)
> 
> That will manage a cache of one page, splitted in fragments as needed.
> (roughly the code we added in tg3 as POC)
> 
> Because converting drivers to build_skb() is a too slow process.

Sounds good.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-03  1:52                                     ` Eric Dumazet
@ 2012-05-03  3:00                                       ` Alexander Duyck
  2012-05-03  3:14                                         ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2012-05-03  3:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 5/2/2012 6:52 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 13:55 -0700, Alexander Duyck wrote:
>> On 05/02/2012 11:15 AM, Eric Dumazet wrote:
>>> On Wed, 2012-05-02 at 11:05 -0700, Alexander Duyck wrote:
>>>> You're correct about the fragstolen case, I actually was thinking of the
>>>> first patch you sent, not this second one.
>>>>
>>>> However we still have a problem.  What we end up with now is a case of
>>>> sharing in which the clone skb no longer knows that it is sharing the
>>>> head with another skb.  The dataref will drop to 1 when we call
>>>> __kfree_skb.  This means that any other function out there that tries to
>>>> see if the skb is shared would return false.  This could lead to issues
>>>> if there is anything out there that manipulates the data in head based
>>>> on the false assumption that it is not cloned.  What we would probably
>>>> need to do in this case is tweak the logic for skb_cloned.  If you are
>>>> using a head_frag you should probably add a check that returns true if
>>>> cloned is true and page_count is greater than 1.  We should be safe in
>>>> the case of skb_header_cloned since we already dropped are dataref when
>>>> we stole the page and freed the skb.
>>> I really dont understand this concern.
>>>
>>> When skb is cloned, we copy in head_frag __skb_clone()
>>>
>>> So both skbs have the bit set, and dataref = 2.
>>>
>>> first skb is freed, dataref becomes 1 and nothing special  happen
>>>
>>> > From this point, skb->head is not 'shared' anymore (taken your own
>>> words). And we are free to do whatever we want.
>>>
>>> second skb is freed, dataref becomes 0 and we call the right destructor.
>> The problem is that the stack will not be able to detect sharing.  As
>> long as page_count is greater than 2 and skb->cloned is set we should be
>> telling any callers to skb_cloned that the head is cloned.  Otherwise we
>> can run into issues elsewhere with well meaning code checking and not
>> detecting sharing, and then mangling the header.
> page count is irrelevant, since if PAGE_SIZE=65536, you can have 32
> fragments (of 2048 bytes) per page. Still we can call put_page() for
> each individual frag that must be freed.
>
> You forgot to give an example of path that would be failing. Since the
> skb_cloned() check is still valid.
>
> Head is cloned if : skb->cloned is set and dataref value is not 1
>
> (minus the skb_header_release() tweaks done on output path for tcp)
>
> Every time a 'caller' is going to modify/mangle its skb head, it must
> first call pskb_expand_head() (or various helpers around it) to :
>
> - allocate a new skb->head
> - copy old content to new head
> - release a reference on old head dataref
> - if old dataref reaches 0, 'free' old head (might be a kfree() or
> put_page())
This is exactly my point.  The way your current code works the check in 
pskb_expand_head will not detect the header is cloned.

So for example lets say you have one of your skbs that goes through and 
is merged.

1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;
3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;
4   At this point if the holder of the clone decides to modify the skb-> 
head there is nothing to stop it from doing so.

Odds are you would have to have a pretty extreme corner case to really 
see any issues with this since I know most raw sockets won't mangle the 
data portion of the frame.  I just figure we could save ourselves a lot 
of trouble by just not coalescing the head_frag in the case that the skb 
is cloned.

>> Also I am not sure if the big monolithic changes are really the best way
>> to approach this.  It would be nice if we could fix this incrementally
>> instead of trying to do it all at once since there are multiple issues
>> that need to be addressed.
>>
>> I will try to submit a few patches from my end later today.  I still
>> need to look over all of the changes from the past couple of weeks that
>> were based on the assumption that the IP stack completely owned the skb.
> I did my best to provide small changes.
I just meant you didn't need to do things like add the kfree_skb_partial 
helper in the same patch.

> Plus TCP coalescing is done after IP processing.
>
> Owning skb is a vague concept anyway. IP borrows skb but not owns them.
> They already could be cloned skb.
I know it is a vague concept.  However it is one of those concepts where 
if we don't get it right we start to see random panics and call traces 
due to memory corruption.  That is why I prefer to avoid it if at all 
possible.  All I am really asking for is that we don't just use the 
head_frag as a page unless we know it is not cloned.  All those checks 
that just look for skb->head_frag could check for skb->head_frag && 
!skb_cloned(skb) and I would be a much happier person since then I know 
all owners of the clones will be using the same variable for reference 
count tracking.

I'm still working on those patches.  I thought I had sent them earlier 
but it looks like I had an email issue, and since then Dave pulled your 
patches so I am rebasing the patches on the updated tree and should have 
something in the next hour or so.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-03  3:00                                       ` Alexander Duyck
@ 2012-05-03  3:14                                         ` Eric Dumazet
  2012-05-03  3:28                                           ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-05-03  3:14 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On Wed, 2012-05-02 at 20:00 -0700, Alexander Duyck wrote:

> This is exactly my point.  The way your current code works the check in 
> pskb_expand_head will not detect the header is cloned.
> 
> So for example lets say you have one of your skbs that goes through and 
> is merged.
> 
> 1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
> 2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;
> 3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;


If page count was incremented in 2., this is because we stole the head.

But we could not stole the head since dataref = 2

So try to find a real example, based on current state, not on the first
patch I sent, since we made progress since this time.

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce()
  2012-05-03  3:14                                         ` Eric Dumazet
@ 2012-05-03  3:28                                           ` Alexander Duyck
  0 siblings, 0 replies; 45+ messages in thread
From: Alexander Duyck @ 2012-05-03  3:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, David Miller, netdev, Neal Cardwell,
	Tom Herbert, Jeff Kirsher, Michael Chan, Matt Carlson,
	Herbert Xu, Ben Hutchings, Ilpo Järvinen,
	Maciej Żenczykowski

On 5/2/2012 8:14 PM, Eric Dumazet wrote:
> On Wed, 2012-05-02 at 20:00 -0700, Alexander Duyck wrote:
>> This is exactly my point.  The way your current code works the check in
>> pskb_expand_head will not detect the header is cloned.
>>
>> So for example lets say you have one of your skbs that goes through and
>> is merged.
>>
>> 1.  You start with a cloned skb.  dataref = 2, head_frag page count = 1;
>> 2.  You go through tcp_try_coalesce.  dataref = 2, head_frag page count = 2;

4592         if (from->head_frag) {
4593                 struct page *page;
4594                 unsigned int offset;
4595
4596                 if (skb_shinfo(to)->nr_frags + skb_shinfo(from)->nr_frags>= MAX_SKB_FRAGS)
4597                         return false;
4598                 page = virt_to_head_page(from->head);
4599                 offset = from->data - (unsigned char *)page_address(page);
4600                 skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
4601                                    page, offset, skb_headlen(from));
4602
4603                 if (skb_cloned(from))
4604                         get_page(page);
4605                 else
4606                         *fragstolen = true;
4607
4608                 delta = len; /* we dont know real truesize... */
4609                 goto copyfrags;
4610         }


>> 3.  You call __kfree_skb on the skb.  dataref = 1, head_frag page count = 2;
4614 static void kfree_skb_partial(struct sk_buff *skb, bool head_stolen)
4615 {
4616         if (head_stolen)
4617                 kmem_cache_free(skbuff_head_cache, skb);
4618         else
4619                 __kfree_skb(skb);
4620 }
4621

>
> If page count was incremented in 2., this is because we stole the head.
>
> But we could not stole the head since dataref = 2
I don't see how that is the case.  It looks pretty clear to me that if 
dataref ==2, as represented by skb_cloned(from) being true above, we are 
calling get_page which will bump the page count and we are still 
stealing the head and then calling __kfree_skb which will decrement dataref.
> So try to find a real example, based on current state, not on the first
> patch I sent, since we made progress since this time.
>
> Thanks

I included a couple of grabs from blobs on git.kernel.org for Dave's 
current tree.  I should have patches out in about 10 minutes and then we 
can discuss those.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-05-03  3:28 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-27 10:37 [PATCH 3/4 net-next] net: make GRO aware of skb->head_frag Eric Dumazet
2012-04-30 17:54 ` Eric Dumazet
2012-04-30 18:10 ` [PATCH 3/4 v2 " Eric Dumazet
2012-04-30 23:36   ` Alexander Duyck
2012-05-01  1:27     ` Eric Dumazet
2012-05-01  5:33       ` Alexander Duyck
2012-05-01  6:39         ` Eric Dumazet
2012-05-01 16:17           ` Alexander Duyck
2012-05-01 17:04             ` Eric Dumazet
2012-05-01 19:45               ` Alexander Duyck
2012-05-02  2:45                 ` Eric Dumazet
2012-05-02  8:24                 ` Eric Dumazet
2012-05-02 16:16                   ` Alexander Duyck
2012-05-02 16:19                     ` Eric Dumazet
2012-05-02 16:27                       ` Eric Dumazet
2012-05-02 17:04                         ` Alexander Duyck
2012-05-02 17:02                       ` Alexander Duyck
2012-05-02 17:16                   ` Rick Jones
2012-05-01 22:58               ` Alexander Duyck
2012-05-01 23:10                 ` Alexander Duyck
2012-05-02  2:47                   ` Eric Dumazet
2012-05-02  3:54                     ` Eric Dumazet
2012-05-02  8:13                     ` [PATCH net-next] net: take care of cloned skbs in tcp_try_coalesce() Eric Dumazet
2012-05-02 15:52                       ` Alexander Duyck
2012-05-02 16:12                         ` Eric Dumazet
2012-05-02 16:27                           ` Alexander Duyck
2012-05-02 16:46                             ` Eric Dumazet
2012-05-02 17:55                               ` [PATCH v2 " Eric Dumazet
2012-05-02 19:58                                 ` [PATCH net-next] net: implement tcp coalescing in tcp_queue_rcv() Eric Dumazet
2012-05-02 20:11                                   ` Joe Perches
2012-05-02 20:23                                     ` Eric Dumazet
2012-05-02 20:34                                       ` Joe Perches
2012-05-03  0:32                                       ` David Miller
2012-05-03  1:11                                   ` David Miller
2012-05-03  2:14                                     ` Eric Dumazet
2012-05-03  2:21                                       ` David Miller
2012-05-03  1:11                                 ` [PATCH v2 net-next] net: take care of cloned skbs in tcp_try_coalesce() David Miller
2012-05-02 18:05                               ` [PATCH " Alexander Duyck
2012-05-02 18:15                                 ` Eric Dumazet
2012-05-02 20:55                                   ` Alexander Duyck
2012-05-03  1:52                                     ` Eric Dumazet
2012-05-03  3:00                                       ` Alexander Duyck
2012-05-03  3:14                                         ` Eric Dumazet
2012-05-03  3:28                                           ` Alexander Duyck
2012-05-01  1:48   ` [PATCH 3/4 v2 net-next] net: make GRO aware of skb->head_frag David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.