All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net: Frag list lost on head expansion.
@ 2010-09-03  3:43 David Miller
  2010-09-03  5:48 ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-03  3:43 UTC (permalink / raw)
  To: netdev


When pskb_expand_head() releases the data, with skb_release_data(), it
tries to properly preserve any fragment list using
skb_clone_fraglist().

Although skb_clone_fraglist() will properly grab a reference to all of
the fragment list SKBs, it will not block skb_release_data() from
NULL'ing out the ->frag_list pointer when it calls skb_drop_list() via
skb_drop_fraglist().

As a result we lose the fragment SKBs and they are leaked forever.

Instead, hide the fragment list pointer around the skb_release_data()
call and restore it afterwards.  This fixes the bug and also makes
it cheaper since we won't grab and release every single fragment
list SKB reference.

Signed-off-by: David S. Miller <davem@davemloft.net>
---

I found this via pure code inspection, please review.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 26396ff..def2e49 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -789,6 +789,7 @@ EXPORT_SYMBOL(pskb_copy);
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		     gfp_t gfp_mask)
 {
+	struct sk_buff *frag_list;
 	int i;
 	u8 *data;
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
@@ -822,11 +823,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 		get_page(skb_shinfo(skb)->frags[i].page);
 
-	if (skb_has_frags(skb))
-		skb_clone_fraglist(skb);
+	frag_list = skb_shinfo(skb)->frag_list;
+	skb_shinfo(skb)->frag_list = NULL;
 
 	skb_release_data(skb);
 
+	skb_shinfo(skb)->frag_list = frag_list;
+
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] net: Frag list lost on head expansion.
  2010-09-03  3:43 [PATCH] net: Frag list lost on head expansion David Miller
@ 2010-09-03  5:48 ` Eric Dumazet
  2010-09-03  6:19   ` Eric Dumazet
  2010-09-03  9:09   ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
  0 siblings, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2010-09-03  5:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Le jeudi 02 septembre 2010 à 20:43 -0700, David Miller a écrit :
> When pskb_expand_head() releases the data, with skb_release_data(), it
> tries to properly preserve any fragment list using
> skb_clone_fraglist().
> 
> Although skb_clone_fraglist() will properly grab a reference to all of
> the fragment list SKBs, it will not block skb_release_data() from
> NULL'ing out the ->frag_list pointer when it calls skb_drop_list() via
> skb_drop_fraglist().
> 
> As a result we lose the fragment SKBs and they are leaked forever.
> 
> Instead, hide the fragment list pointer around the skb_release_data()
> call and restore it afterwards.  This fixes the bug and also makes
> it cheaper since we won't grab and release every single fragment
> list SKB reference.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> 
> I found this via pure code inspection, please review.
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 26396ff..def2e49 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -789,6 +789,7 @@ EXPORT_SYMBOL(pskb_copy);
>  int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		     gfp_t gfp_mask)
>  {
> +	struct sk_buff *frag_list;
>  	int i;
>  	u8 *data;
>  #ifdef NET_SKBUFF_DATA_USES_OFFSET
> @@ -822,11 +823,13 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  		get_page(skb_shinfo(skb)->frags[i].page);
>  
> -	if (skb_has_frags(skb))
> -		skb_clone_fraglist(skb);
> +	frag_list = skb_shinfo(skb)->frag_list;
> +	skb_shinfo(skb)->frag_list = NULL;
>  
>  	skb_release_data(skb);
>  
> +	skb_shinfo(skb)->frag_list = frag_list;
> +
>  	off = (data + nhead) - skb->head;
>  
>  	skb->head     = data;

David, I had this same idea some days ago when reviewing this code,
but when I came to conclusion we could not avoid the get_page /
put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth
trying to avoid the frag_list grab/release operation.

But we are the only user of this skb and skb_shinfo(skb)->frag_list, so
your patch seems good to me.

Please note you are not fixing a bug, because the new frag_list pointer
was correctly copied in the

memcpy(data + size, skb_end_pointer(skb), 
	offsetof(struct skb_shared_info,
		frags[skb_shinfo(skb)->nr_frags]));

Please rewrite the changelog to say this patch is an optimization, to
avoid the atomic ops on each skb found in frag_list ?

Thanks !



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] net: Frag list lost on head expansion.
  2010-09-03  5:48 ` Eric Dumazet
@ 2010-09-03  6:19   ` Eric Dumazet
  2010-09-03  9:09   ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
  1 sibling, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2010-09-03  6:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit :

> David, I had this same idea some days ago when reviewing this code,
> but when I came to conclusion we could not avoid the get_page /
> put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth
> trying to avoid the frag_list grab/release operation.
> 
> But we are the only user of this skb and skb_shinfo(skb)->frag_list, so
> your patch seems good to me.
> 
> Please note you are not fixing a bug, because the new frag_list pointer
> was correctly copied in the
> 
> memcpy(data + size, skb_end_pointer(skb), 
> 	offsetof(struct skb_shared_info,
> 		frags[skb_shinfo(skb)->nr_frags]));
> 
> Please rewrite the changelog to say this patch is an optimization, to
> avoid the atomic ops on each skb found in frag_list ?
> 

Well, skb is not shared, but we have no guarantee skb_shinfo(skb) is
not.

To optimize this thing, you'll need to add a new parameter to
skb_release_data() ?



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-03  5:48 ` Eric Dumazet
  2010-09-03  6:19   ` Eric Dumazet
@ 2010-09-03  9:09   ` Eric Dumazet
  2010-09-03 13:46     ` David Miller
  2010-09-07  1:25     ` David Miller
  1 sibling, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2010-09-03  9:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit :

> David, I had this same idea some days ago when reviewing this code,
> but when I came to conclusion we could not avoid the get_page /
> put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth
> trying to avoid the frag_list grab/release operation.
> 

Here is the patch I cooked, for net-next-2.6

[PATCH net-next-2.6] net: pskb_expand_head() optimization

pskb_expand_head() blindly takes references on fragments before calling
skb_release_data(), potentially releasing these references.

We can add a fast path, avoiding these atomic operations, if we own the
last reference on skb->head.

Based on a previous patch from David

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/skbuff.c |   25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 231dff0..59b96fe 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -781,6 +781,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	u8 *data;
 	int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail;
 	long off;
+	bool fastpath;
 
 	BUG_ON(nhead < 0);
 
@@ -802,14 +803,28 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	       skb_shinfo(skb),
 	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
 
-	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
-		get_page(skb_shinfo(skb)->frags[i].page);
+	/* Check if we can avoid taking references on fragments if we own
+	 * the last reference on skb->head. (see skb_release_data())
+	 */
+	if (!skb->cloned)
+		fastpath = true;
+	else {
+		int delta = skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1;
 
-	if (skb_has_frag_list(skb))
-		skb_clone_fraglist(skb);
+		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
+	}
 
-	skb_release_data(skb);
+	if (fastpath) {
+		kfree(skb->head);
+	} else {
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+			get_page(skb_shinfo(skb)->frags[i].page);
 
+		if (skb_has_frag_list(skb))
+			skb_clone_fraglist(skb);
+
+		skb_release_data(skb);
+	}
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-03  9:09   ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
@ 2010-09-03 13:46     ` David Miller
  2010-09-07  2:20       ` David Miller
  2010-09-07  1:25     ` David Miller
  1 sibling, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-03 13:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 03 Sep 2010 11:09:32 +0200

> Le vendredi 03 septembre 2010 à 07:48 +0200, Eric Dumazet a écrit :
> 
>> David, I had this same idea some days ago when reviewing this code,
>> but when I came to conclusion we could not avoid the get_page /
>> put_page() on skb_shinfo(skb)->frags[i].page. I thought it was not worth
>> trying to avoid the frag_list grab/release operation.
>> 
> 
> Here is the patch I cooked, for net-next-2.6
> 
> [PATCH net-next-2.6] net: pskb_expand_head() optimization
> 
> pskb_expand_head() blindly takes references on fragments before calling
> skb_release_data(), potentially releasing these references.
> 
> We can add a fast path, avoiding these atomic operations, if we own the
> last reference on skb->head.
> 
> Based on a previous patch from David
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Ok, I see how this works now, thanks Eric.  But this looks to be a bit
too complicated to be worthwhile, I think, so let's hold off on this
optimization for a while.

Let me tell you why I was swimming around in this area and what my
ultimate motivation is :-)

In trying to eventually convert sk_buff to list_head one of the
troubling areas is this frag_list business.

Amusingly, if you look, virtually all of the code which constructs
frag_list SKBs wants a doubly linked list so it can insert at the tail
(all LRO gathering, GRO, you name it).

There are only two operations that make a double-linked list currently
impossible.  This pskb_expand_head() thing, and pskb_copy().

They cause the situation where the list is shared between multiple SKB
data shared areas.

And because of this a doubly-linked list like list_head won't work at
all.

For the optimized case of pskb_expand_head() you discovered we can avoid
this sharing.  And at this point I imagine that for the remaining cases
we can simply copy the full SKBs in the frag list to avoid this list
sharing.

So that's where I'm trying to go in all of this.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-03  9:09   ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
  2010-09-03 13:46     ` David Miller
@ 2010-09-07  1:25     ` David Miller
  1 sibling, 0 replies; 33+ messages in thread
From: David Miller @ 2010-09-07  1:25 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 03 Sep 2010 11:09:32 +0200

> [PATCH net-next-2.6] net: pskb_expand_head() optimization
> 
> pskb_expand_head() blindly takes references on fragments before calling
> skb_release_data(), potentially releasing these references.
> 
> We can add a fast path, avoiding these atomic operations, if we own the
> last reference on skb->head.
> 
> Based on a previous patch from David
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Eric, I ended up applying this patch after all, thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-03 13:46     ` David Miller
@ 2010-09-07  2:20       ` David Miller
  2010-09-07  5:02         ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-07  2:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: David Miller <davem@davemloft.net>
Date: Fri, 03 Sep 2010 06:46:33 -0700 (PDT)

> There are only two operations that make a double-linked list currently
> impossible.  This pskb_expand_head() thing, and pskb_copy().
> 
> They cause the situation where the list is shared between multiple SKB
> data shared areas.
> 
> And because of this a doubly-linked list like list_head won't work at
> all.
> 
> For the optimized case of pskb_expand_head() you discovered we can avoid
> this sharing.  And at this point I imagine that for the remaining cases
> we can simply copy the full SKBs in the frag list to avoid this list
> sharing.

Eric, this goes on top of your patch and demonstrates the idea.

Please review if you have a chance:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2d1bc76..aeb56af 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }
 
+static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent,
+					 gfp_t gfp_mask)
+{
+	struct sk_buff *first_skb = NULL;
+	struct sk_buff *prev_skb = NULL;
+	struct sk_buff *skb;
+
+	skb_walk_frags(parent, skb) {
+		struct sk_buff *nskb = pskb_copy(skb, gfp_mask);
+
+		if (!nskb)
+			goto fail;
+		if (!first_skb)
+			first_skb = skb;
+		else
+			prev_skb->next = skb;
+		prev_skb = skb;
+	}
+
+	return first_skb;
+
+fail:
+	skb_drop_list(&first_skb);
+	return NULL;
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
@@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
-	if (fastpath) {
-		kfree(skb->head);
-	} else {
+	if (!fastpath) {
+		if (skb_has_frag_list(skb)) {
+			struct sk_buff *new_list;
+
+			new_list = skb_copy_fraglist(skb, gfp_mask);
+			if (!new_list)
+				goto free_data;
+			skb_shinfo(skb)->frag_list = new_list;
+		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			get_page(skb_shinfo(skb)->frags[i].page);
 
-		if (skb_has_frag_list(skb))
-			skb_clone_fraglist(skb);
-
-		skb_release_data(skb);
 	}
+
+	kfree(skb->head);
+
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;
@@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;
 
+free_data:
+	kfree(data);
 nodata:
 	return -ENOMEM;
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-07  2:20       ` David Miller
@ 2010-09-07  5:02         ` Eric Dumazet
  2010-09-07  5:05           ` David Miller
  2010-09-07  9:16           ` Jarek Poplawski
  0 siblings, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2010-09-07  5:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Le lundi 06 septembre 2010 à 19:20 -0700, David Miller a écrit :

> Eric, this goes on top of your patch and demonstrates the idea.
> 
> Please review if you have a chance:
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2d1bc76..aeb56af 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>  		skb_get(list);
>  }
>  
> +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent,
> +					 gfp_t gfp_mask)
> +{
> +	struct sk_buff *first_skb = NULL;
> +	struct sk_buff *prev_skb = NULL;
> +	struct sk_buff *skb;
> +
> +	skb_walk_frags(parent, skb) {
> +		struct sk_buff *nskb = pskb_copy(skb, gfp_mask);
> +
> +		if (!nskb)
> +			goto fail;
> +		if (!first_skb)
> +			first_skb = skb;
> +		else
> +			prev_skb->next = skb;
> +		prev_skb = skb;
> +	}
> +
> +	return first_skb;
> +
> +fail:
> +	skb_drop_list(&first_skb);
> +	return NULL;
> +}
> +
>  static void skb_release_data(struct sk_buff *skb)
>  {
>  	if (!skb->cloned ||
> @@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>  	}
>  
> -	if (fastpath) {
> -		kfree(skb->head);
> -	} else {
> +	if (!fastpath) {
> +		if (skb_has_frag_list(skb)) {
> +			struct sk_buff *new_list;
> +
> +			new_list = skb_copy_fraglist(skb, gfp_mask);
> +			if (!new_list)
> +				goto free_data;
> +			skb_shinfo(skb)->frag_list = new_list;

Here, skb_shinfo(skb) still points to old shinfo, you should not touch
it. An other user might need it :)

> +		}
>  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  			get_page(skb_shinfo(skb)->frags[i].page);
>  
> -		if (skb_has_frag_list(skb))
> -			skb_clone_fraglist(skb);
> -
> -		skb_release_data(skb);
>  	}

I believe you cannot remove skb_release_data() call, we really need to
perform the atomic operation, and test the result on it, or a double
free could happen.

> +
> +	kfree(skb->head);
> +
>  	off = (data + nhead) - skb->head;
>  
>  	skb->head     = data;
> @@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>  	return 0;
>  
> +free_data:
> +	kfree(data);

is it a leftover ?

>  nodata:
>  	return -ENOMEM;
>  }

I understand what you want to do, but problem is we need to perform a
CAS2 operation : atomically changes two values (dataref and frag_list)




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-07  5:02         ` Eric Dumazet
@ 2010-09-07  5:05           ` David Miller
  2010-09-07  9:16           ` Jarek Poplawski
  1 sibling, 0 replies; 33+ messages in thread
From: David Miller @ 2010-09-07  5:05 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Sep 2010 07:02:09 +0200

> I understand what you want to do, but problem is we need to perform a
> CAS2 operation : atomically changes two values (dataref and frag_list)

Ok, thanks for the review, I'll think some more about this.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-07  5:02         ` Eric Dumazet
  2010-09-07  5:05           ` David Miller
@ 2010-09-07  9:16           ` Jarek Poplawski
  2010-09-07  9:37             ` Eric Dumazet
  1 sibling, 1 reply; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-07  9:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On 2010-09-07 07:02, Eric Dumazet wrote:
> Le lundi 06 septembre 2010 Ă  19:20 -0700, David Miller a ĂŠcrit :
> 
>> Eric, this goes on top of your patch and demonstrates the idea.
>>
>> Please review if you have a chance:
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 2d1bc76..aeb56af 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>>  		skb_get(list);
>>  }
>>  
>> +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent,
>> +					 gfp_t gfp_mask)
>> +{
>> +	struct sk_buff *first_skb = NULL;
>> +	struct sk_buff *prev_skb = NULL;
>> +	struct sk_buff *skb;
>> +
>> +	skb_walk_frags(parent, skb) {
>> +		struct sk_buff *nskb = pskb_copy(skb, gfp_mask);
>> +
>> +		if (!nskb)
>> +			goto fail;
>> +		if (!first_skb)
>> +			first_skb = skb;
>> +		else
>> +			prev_skb->next = skb;
>> +		prev_skb = skb;
>> +	}
>> +
>> +	return first_skb;
>> +
>> +fail:
>> +	skb_drop_list(&first_skb);
>> +	return NULL;
>> +}
>> +
>>  static void skb_release_data(struct sk_buff *skb)
>>  {
>>  	if (!skb->cloned ||
>> @@ -812,17 +838,22 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>>  	}
>>  
>> -	if (fastpath) {
>> -		kfree(skb->head);
>> -	} else {
>> +	if (!fastpath) {
>> +		if (skb_has_frag_list(skb)) {
>> +			struct sk_buff *new_list;
>> +
>> +			new_list = skb_copy_fraglist(skb, gfp_mask);
>> +			if (!new_list)
>> +				goto free_data;
>> +			skb_shinfo(skb)->frag_list = new_list;
> 
> Here, skb_shinfo(skb) still points to old shinfo, you should not touch
> it. An other user might need it :)

Even if there were no users this is written to the area freed with
kfree(skb->head) a few lines later, isn't it?

> 
>> +		}
>>  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>>  			get_page(skb_shinfo(skb)->frags[i].page);
>>  
>> -		if (skb_has_frag_list(skb))
>> -			skb_clone_fraglist(skb);
>> -
>> -		skb_release_data(skb);
>>  	}
> 
> I believe you cannot remove skb_release_data() call, we really need to
> perform the atomic operation, and test the result on it, or a double
> free could happen.
> 
>> +
>> +	kfree(skb->head);
>> +
>>  	off = (data + nhead) - skb->head;
>>  
>>  	skb->head     = data;
>> @@ -848,6 +879,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>>  	return 0;
>>  
>> +free_data:
>> +	kfree(data);
> 
> is it a leftover ?
> 
>>  nodata:
>>  	return -ENOMEM;
>>  }
> 
> I understand what you want to do, but problem is we need to perform a
> CAS2 operation : atomically changes two values (dataref and frag_list)

Alas I can't understand why do you think these clone and atomic tests
in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough.

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-07  9:16           ` Jarek Poplawski
@ 2010-09-07  9:37             ` Eric Dumazet
  2010-09-10 19:54               ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2010-09-07  9:37 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, netdev

Le mardi 07 septembre 2010 à 09:16 +0000, Jarek Poplawski a écrit :
> On 2010-09-07 07:02, Eric Dumazet wrote:

> > 
> > I understand what you want to do, but problem is we need to perform a
> > CAS2 operation : atomically changes two values (dataref and frag_list)
> 
> Alas I can't understand why do you think these clone and atomic tests
> in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough.
> 

It was early in the morning, before a cup of tea.

David only had to set frag_list in the new shinfo, not the old.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-07  9:37             ` Eric Dumazet
@ 2010-09-10 19:54               ` David Miller
  2010-09-11 12:31                 ` Jarek Poplawski
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-10 19:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jarkao2, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Sep 2010 11:37:28 +0200

> Le mardi 07 septembre 2010 à 09:16 +0000, Jarek Poplawski a écrit :
>> On 2010-09-07 07:02, Eric Dumazet wrote:
> 
>> > 
>> > I understand what you want to do, but problem is we need to perform a
>> > CAS2 operation : atomically changes two values (dataref and frag_list)
>> 
>> Alas I can't understand why do you think these clone and atomic tests
>> in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough.
>> 
> 
> It was early in the morning, before a cup of tea.
> 
> David only had to set frag_list in the new shinfo, not the old.

Ok, how does this look?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..aaa9750 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }
 
+static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent,
+					 gfp_t gfp_mask)
+{
+	struct sk_buff *first_skb = NULL;
+	struct sk_buff *prev_skb = NULL;
+	struct sk_buff *skb;
+
+	skb_walk_frags(parent, skb) {
+		struct sk_buff *nskb = pskb_copy(skb, gfp_mask);
+
+		if (!nskb)
+			goto fail;
+		if (!first_skb)
+			first_skb = skb;
+		else
+			prev_skb->next = skb;
+		prev_skb = skb;
+	}
+
+	return first_skb;
+
+fail:
+	skb_drop_list(&first_skb);
+	return NULL;
+}
+
 static void skb_release_data(struct sk_buff *skb)
 {
 	if (!skb->cloned ||
@@ -775,11 +801,12 @@ EXPORT_SYMBOL(pskb_copy);
 int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		     gfp_t gfp_mask)
 {
-	int i;
-	u8 *data;
 	int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail;
-	long off;
+	struct skb_shared_info *new_shinfo;
 	bool fastpath;
+	u8 *data;
+	long off;
+	int i;
 
 	BUG_ON(nhead < 0);
 
@@ -797,8 +824,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 */
 	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
 
-	memcpy((struct skb_shared_info *)(data + size),
-	       skb_shinfo(skb),
+	new_shinfo = (struct skb_shared_info *)(data + size);
+	memcpy(new_shinfo, skb_shinfo(skb),
 	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
 
 	/* Check if we can avoid taking references on fragments if we own
@@ -815,14 +842,20 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	if (fastpath) {
 		kfree(skb->head);
 	} else {
+		if (skb_has_frag_list(skb)) {
+			struct sk_buff *new_list;
+
+			new_list = skb_copy_fraglist(skb, gfp_mask);
+			if (!new_list)
+				goto free_data;
+			new_shinfo->frag_list = new_list;
+		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			get_page(skb_shinfo(skb)->frags[i].page);
 
-		if (skb_has_frag_list(skb))
-			skb_clone_fraglist(skb);
-
 		skb_release_data(skb);
 	}
+
 	off = (data + nhead) - skb->head;
 
 	skb->head     = data;
@@ -848,6 +881,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 	return 0;
 
+free_data:
+	kfree(data);
 nodata:
 	return -ENOMEM;
 }

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-10 19:54               ` David Miller
@ 2010-09-11 12:31                 ` Jarek Poplawski
  2010-09-12  3:30                   ` David Miller
  2010-09-20  0:17                   ` David Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-11 12:31 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Fri, Sep 10, 2010 at 12:54:49PM -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 07 Sep 2010 11:37:28 +0200
> 
> > Le mardi 07 septembre 2010 ? 09:16 +0000, Jarek Poplawski a écrit :
> >> On 2010-09-07 07:02, Eric Dumazet wrote:
> > 
> >> > 
> >> > I understand what you want to do, but problem is we need to perform a
> >> > CAS2 operation : atomically changes two values (dataref and frag_list)
> >> 
> >> Alas I can't understand why do you think these clone and atomic tests
> >> in skb_release_data() don't protect skb_shinfo(skb)->frag_list enough.
> >> 
> > 
> > It was early in the morning, before a cup of tea.
> > 
> > David only had to set frag_list in the new shinfo, not the old.
> 
> Ok, how does this look?
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 752c197..aaa9750 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -327,6 +327,32 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>  		skb_get(list);
>  }
>  
> +static struct sk_buff *skb_copy_fraglist(struct sk_buff *parent,
> +					 gfp_t gfp_mask)
> +{
> +	struct sk_buff *first_skb = NULL;
> +	struct sk_buff *prev_skb = NULL;
> +	struct sk_buff *skb;
> +
> +	skb_walk_frags(parent, skb) {
> +		struct sk_buff *nskb = pskb_copy(skb, gfp_mask);
> +
> +		if (!nskb)
> +			goto fail;
> +		if (!first_skb)
> +			first_skb = skb;

Probably here and below: "= nskb"

> +		else
> +			prev_skb->next = skb;
> +		prev_skb = skb;
> +	}
> +
> +	return first_skb;
> +
> +fail:

With "if (first_skb)" here it would look better to me even if it
currently doesn't matter.

Otherwise seems OK, but I still would like to know the scenario
demanding this change.

Jarek P.

> +	skb_drop_list(&first_skb);
> +	return NULL;
> +}
> +
>  static void skb_release_data(struct sk_buff *skb)
>  {
>  	if (!skb->cloned ||
> @@ -775,11 +801,12 @@ EXPORT_SYMBOL(pskb_copy);
>  int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		     gfp_t gfp_mask)
>  {
> -	int i;
> -	u8 *data;
>  	int size = nhead + (skb_end_pointer(skb) - skb->head) + ntail;
> -	long off;
> +	struct skb_shared_info *new_shinfo;
>  	bool fastpath;
> +	u8 *data;
> +	long off;
> +	int i;
>  
>  	BUG_ON(nhead < 0);
>  
> @@ -797,8 +824,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	 */
>  	memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
>  
> -	memcpy((struct skb_shared_info *)(data + size),
> -	       skb_shinfo(skb),
> +	new_shinfo = (struct skb_shared_info *)(data + size);
> +	memcpy(new_shinfo, skb_shinfo(skb),
>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>  
>  	/* Check if we can avoid taking references on fragments if we own
> @@ -815,14 +842,20 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	if (fastpath) {
>  		kfree(skb->head);
>  	} else {
> +		if (skb_has_frag_list(skb)) {
> +			struct sk_buff *new_list;
> +
> +			new_list = skb_copy_fraglist(skb, gfp_mask);
> +			if (!new_list)
> +				goto free_data;
> +			new_shinfo->frag_list = new_list;
> +		}
>  		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>  			get_page(skb_shinfo(skb)->frags[i].page);
>  
> -		if (skb_has_frag_list(skb))
> -			skb_clone_fraglist(skb);
> -
>  		skb_release_data(skb);
>  	}
> +
>  	off = (data + nhead) - skb->head;
>  
>  	skb->head     = data;
> @@ -848,6 +881,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	atomic_set(&skb_shinfo(skb)->dataref, 1);
>  	return 0;
>  
> +free_data:
> +	kfree(data);
>  nodata:
>  	return -ENOMEM;
>  }

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-11 12:31                 ` Jarek Poplawski
@ 2010-09-12  3:30                   ` David Miller
  2010-09-12 10:45                     ` Jarek Poplawski
  2010-09-20  0:17                   ` David Miller
  1 sibling, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-12  3:30 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sat, 11 Sep 2010 14:31:40 +0200

> Otherwise seems OK, but I still would like to know the scenario
> demanding this change.

I want to make sk_buff use list_head, including all uses such as
frag_list et al.

If the frag_list chain can be shared, a doubly linked list cannot be
used.

This is someting I've been gradually working on now for more than 2
years :)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12  3:30                   ` David Miller
@ 2010-09-12 10:45                     ` Jarek Poplawski
  2010-09-12 10:58                       ` Jarek Poplawski
  2010-09-12 15:58                       ` David Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-12 10:45 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Sat, Sep 11, 2010 at 08:30:02PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sat, 11 Sep 2010 14:31:40 +0200
> 
> > Otherwise seems OK, but I still would like to know the scenario
> > demanding this change.
> 
> I want to make sk_buff use list_head, including all uses such as
> frag_list et al.
> 
> If the frag_list chain can be shared, a doubly linked list cannot be
> used.
> 
> This is someting I've been gradually working on now for more than 2
> years :)

Hmm... Then the first message/changelog in this thread seems to
describe the future bug, only with doubly linked lists. If so, it was
a bit misleading to me ;-)

Then a few more questions:
1) if doubly linked lists really require such pskb_copying, isn't it
   all too costly?
2) why skb_clone isn't enough instead of pskb_copy?
3) since skb_clone has some cost too, why e.g. saving only the pointer
   to the tail of the list in skb_shared_info isn't enough?

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 10:45                     ` Jarek Poplawski
@ 2010-09-12 10:58                       ` Jarek Poplawski
  2010-09-12 15:58                       ` David Miller
  1 sibling, 0 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-12 10:58 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Sun, Sep 12, 2010 at 12:45:34PM +0200, Jarek Poplawski wrote:
> On Sat, Sep 11, 2010 at 08:30:02PM -0700, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Sat, 11 Sep 2010 14:31:40 +0200
> > 
> > > Otherwise seems OK, but I still would like to know the scenario
> > > demanding this change.
> > 
> > I want to make sk_buff use list_head, including all uses such as
> > frag_list et al.
> > 
> > If the frag_list chain can be shared, a doubly linked list cannot be
> > used.
> > 
> > This is someting I've been gradually working on now for more than 2
> > years :)
> 
> Hmm... Then the first message/changelog in this thread seems to
> describe the future bug, only with doubly linked lists. If so, it was
> a bit misleading to me ;-)
> 
> Then a few more questions:
> 1) if doubly linked lists really require such pskb_copying, isn't it
>    all too costly?
> 2) why skb_clone isn't enough instead of pskb_copy?
> 3) since skb_clone has some cost too, why e.g. saving only the pointer
>    to the tail of the list in skb_shared_info isn't enough?
...3a) IOW, do we really need this double linking...

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 10:45                     ` Jarek Poplawski
  2010-09-12 10:58                       ` Jarek Poplawski
@ 2010-09-12 15:58                       ` David Miller
  2010-09-12 16:13                         ` David Miller
                                           ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: David Miller @ 2010-09-12 15:58 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 12 Sep 2010 12:45:34 +0200

> Then a few more questions:
> 1) if doubly linked lists really require such pskb_copying, isn't it
>    all too costly?

In the common case the data reference will be one, so we will not
copy.

> 2) why skb_clone isn't enough instead of pskb_copy?

Can't share the metadata.

> 3) since skb_clone has some cost too, why e.g. saving only the pointer
>    to the tail of the list in skb_shared_info isn't enough?

Then we won't get the rest of the advantages of using list_head such
as prefetching during traversals, automatic debugging facilities, et al.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 15:58                       ` David Miller
@ 2010-09-12 16:13                         ` David Miller
  2010-09-12 20:57                           ` Jarek Poplawski
  2010-09-12 19:55                         ` Ben Pfaff
  2010-09-12 20:45                         ` Jarek Poplawski
  2 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-12 16:13 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev


BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how
will you sync that tail pointer in all of the shinfo instances
referencing the frag list?

It simply can't work, we have to copy.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 15:58                       ` David Miller
  2010-09-12 16:13                         ` David Miller
@ 2010-09-12 19:55                         ` Ben Pfaff
  2010-09-12 20:24                           ` David Miller
  2010-09-12 20:45                         ` Jarek Poplawski
  2 siblings, 1 reply; 33+ messages in thread
From: Ben Pfaff @ 2010-09-12 19:55 UTC (permalink / raw)
  To: David Miller; +Cc: jarkao2, eric.dumazet, netdev

David Miller <davem@davemloft.net> writes:

>> 3) since skb_clone has some cost too, why e.g. saving only the pointer
>>    to the tail of the list in skb_shared_info isn't enough?
>
> Then we won't get the rest of the advantages of using list_head such
> as prefetching during traversals, automatic debugging facilities, et al.

Did you see the recent patch from Andi Kleen where he proposes
removing this prefetching in most situations because the costs
outweigh the benefits on most modern architectures?
        http://permalink.gmane.org/gmane.linux.kernel/1033281

I'm not saying that list_head doesn't have other advantages
though.
-- 
Ben Pfaff 
http://benpfaff.org

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 19:55                         ` Ben Pfaff
@ 2010-09-12 20:24                           ` David Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2010-09-12 20:24 UTC (permalink / raw)
  To: blp; +Cc: jarkao2, eric.dumazet, netdev

From: Ben Pfaff <blp@cs.stanford.edu>
Date: Sun, 12 Sep 2010 12:55:54 -0700

> David Miller <davem@davemloft.net> writes:
> 
>>> 3) since skb_clone has some cost too, why e.g. saving only the pointer
>>>    to the tail of the list in skb_shared_info isn't enough?
>>
>> Then we won't get the rest of the advantages of using list_head such
>> as prefetching during traversals, automatic debugging facilities, et al.
> 
> Did you see the recent patch from Andi Kleen where he proposes
> removing this prefetching in most situations because the costs
> outweigh the benefits on most modern architectures?
>         http://permalink.gmane.org/gmane.linux.kernel/1033281
> 
> I'm not saying that list_head doesn't have other advantages
> though.

Yes I saw it and I somewhat disagree with him, but don't care
enough to argue with him about it.  There are much more important
things to apply my mind and time to at the moment :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 15:58                       ` David Miller
  2010-09-12 16:13                         ` David Miller
  2010-09-12 19:55                         ` Ben Pfaff
@ 2010-09-12 20:45                         ` Jarek Poplawski
  2 siblings, 0 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-12 20:45 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Sun, Sep 12, 2010 at 08:58:33AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 12 Sep 2010 12:45:34 +0200
> 
> > Then a few more questions:
> > 1) if doubly linked lists really require such pskb_copying, isn't it
> >    all too costly?
> 
> In the common case the data reference will be one, so we will not
> copy.

Even if so, one such a case on the fast path should hit performance,
so it would need special reviewing.

> 
> > 2) why skb_clone isn't enough instead of pskb_copy?
> 
> Can't share the metadata.

I'd really like to understand why the change in handling next/prev
should affect more than skb pointers wrt. current sharing.

> 
> > 3) since skb_clone has some cost too, why e.g. saving only the pointer
> >    to the tail of the list in skb_shared_info isn't enough?
> 
> Then we won't get the rest of the advantages of using list_head such
> as prefetching during traversals, automatic debugging facilities, et al.

Right, we need to sum pros and cons. So, what's the pros? ;-)

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 16:13                         ` David Miller
@ 2010-09-12 20:57                           ` Jarek Poplawski
  2010-09-12 22:08                             ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-12 20:57 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote:
> 
> BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how
> will you sync that tail pointer in all of the shinfo instances
> referencing the frag list?
> 
> It simply can't work, we have to copy.

The question is if we need to sync at all? This is shared data at the
moment, so I can't imagine how the list (especialy doubly linked)
could be changed without locking? And even if it's possible, I doubt
copying e.g. like in your current patch can help when an skb is added
at the tail later.

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 20:57                           ` Jarek Poplawski
@ 2010-09-12 22:08                             ` David Miller
  2010-09-13  7:49                               ` Jarek Poplawski
  0 siblings, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-12 22:08 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 12 Sep 2010 22:57:22 +0200

> On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote:
>> 
>> BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how
>> will you sync that tail pointer in all of the shinfo instances
>> referencing the frag list?
>> 
>> It simply can't work, we have to copy.
> 
> The question is if we need to sync at all? This is shared data at the
> moment, so I can't imagine how the list (especialy doubly linked)
> could be changed without locking? And even if it's possible, I doubt
> copying e.g. like in your current patch can help when an skb is added
> at the tail later.

That's the fundamental issue.

If you look, everywhere we curently do that trick of "use the
skb->prev pointer to remmeber the frag_list tail" the code knows
it has exclusive access to both the skb metadata and the
underlying data.

But for modifications of the frag list during the SKBs lifetime
that's another issue, entirely.  All of these functions trimming
the head or tail of the SKB data which can modify the frag
list elements, they can be called from all kinds of contexts.
Look for Alexey Kuznetsov's comments in skbuff.c that read
"mincing fragments" and similar.

The real win with my work is complete unification of all list
handling, and making our packet handling code much more "hackable"
by non-networking kernel hackers.

Really we have the last major core datastructures that do not
use standard lists, and I'm going to convert it so we can
be sane like the rest of the kernel. :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-12 22:08                             ` David Miller
@ 2010-09-13  7:49                               ` Jarek Poplawski
  0 siblings, 0 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-13  7:49 UTC (permalink / raw)
  To: David Miller; +Cc: jarkao2, eric.dumazet, netdev

On 2010-09-13 00:08, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 12 Sep 2010 22:57:22 +0200
> 
>> On Sun, Sep 12, 2010 at 09:13:53AM -0700, David Miller wrote:
>>>
>>> BTW, Jarek, as to your idea to store a tail pointer in the shinfo, how
>>> will you sync that tail pointer in all of the shinfo instances
>>> referencing the frag list?
>>>
>>> It simply can't work, we have to copy.
>>
>> The question is if we need to sync at all? This is shared data at the
>> moment, so I can't imagine how the list (especialy doubly linked)
>> could be changed without locking? And even if it's possible, I doubt
>> copying e.g. like in your current patch can help when an skb is added
>> at the tail later.
> 
> That's the fundamental issue.
> 
> If you look, everywhere we curently do that trick of "use the
> skb->prev pointer to remmeber the frag_list tail" the code knows
> it has exclusive access to both the skb metadata and the
> underlying data.
> 
> But for modifications of the frag list during the SKBs lifetime
> that's another issue, entirely.  All of these functions trimming
> the head or tail of the SKB data which can modify the frag
> list elements, they can be called from all kinds of contexts.
> Look for Alexey Kuznetsov's comments in skbuff.c that read
> "mincing fragments" and similar.

OK, I've found the skb_cow_data() comments, but I can see there only
a modification of copied data.

> 
> The real win with my work is complete unification of all list
> handling, and making our packet handling code much more "hackable"
> by non-networking kernel hackers.
> 
> Really we have the last major core datastructures that do not
> use standard lists, and I'm going to convert it so we can
> be sane like the rest of the kernel. :-)

I guess we can stay with different opinions, no problem, at least to
me. IMHO, considering the need for additional copying or even only
cloning data where it's not currently done, doubly linked list is
too costly for the frag_list. It would affect pskb_expand_head(),
pskb_copy() but probably also skb_segment(), and maybe more. So, with
GROwing(!) importance of fragmented packets, I doubt this copying will
be as unlikely as presumed.

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-11 12:31                 ` Jarek Poplawski
  2010-09-12  3:30                   ` David Miller
@ 2010-09-20  0:17                   ` David Miller
  2010-09-20  7:21                     ` Jarek Poplawski
  1 sibling, 1 reply; 33+ messages in thread
From: David Miller @ 2010-09-20  0:17 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sat, 11 Sep 2010 14:31:40 +0200

> Otherwise seems OK, but I still would like to know the scenario
> demanding this change.

Ok Jarek, after some more consideration I agree with you.
Removing this kind of sharing would be unwise, ho hum...

While pondering over this I thought about why we even need
frag lists at all.  We need them for two reasons:

1) Because we have an inability to turn kmalloc data references into
   page based ones easily.

   We've run into this sort of problem with socket splice() too.

2) For recording the segmentation points of a fragmented packet.

We definitely don't use frag lists for performance, if you look
in the GRO history the GRO code specifically has been changed
to prefer accumulating into the page vector whenever possible
because using frag lists is a lot slower.

Anyways, even if we somehow solved #1 we'd still have a lot of
code (even bluetooth) using it for the sake of issue #2.

So for the time being there is no clear path for trying to
eliminate frag lists altogether if we wanted to do that either.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20  0:17                   ` David Miller
@ 2010-09-20  7:21                     ` Jarek Poplawski
  2010-09-20  9:02                       ` Eric Dumazet
  2010-09-20 16:59                       ` David Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-20  7:21 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Sun, Sep 19, 2010 at 05:17:25PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sat, 11 Sep 2010 14:31:40 +0200
> 
> > Otherwise seems OK, but I still would like to know the scenario
> > demanding this change.
> 
> Ok Jarek, after some more consideration I agree with you.
> Removing this kind of sharing would be unwise, ho hum...
> 
> While pondering over this I thought about why we even need
> frag lists at all.  We need them for two reasons:
> 
> 1) Because we have an inability to turn kmalloc data references into
>    page based ones easily.
> 
>    We've run into this sort of problem with socket splice() too.
> 
> 2) For recording the segmentation points of a fragmented packet.
> 
> We definitely don't use frag lists for performance, if you look
> in the GRO history the GRO code specifically has been changed
> to prefer accumulating into the page vector whenever possible
> because using frag lists is a lot slower.

Yes, and GRO made frag lists more popular btw.

> Anyways, even if we somehow solved #1 we'd still have a lot of
> code (even bluetooth) using it for the sake of issue #2.

Since there exist drivers using entirely paged skbs and napi_gro_frags
path I can't see why #1 or #2 could be unsolvable elsewhere.

> So for the time being there is no clear path for trying to
> eliminate frag lists altogether if we wanted to do that either.

Probably we could start from enhacing moving drivers to paged skbs
where possible. And maybe simplifying the skb model by not allowing
frags and frag lists together?

Btw, I wonder what is the exact reason we can't use only
NET_SKBUFF_DATA_USES_OFFSET?

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20  7:21                     ` Jarek Poplawski
@ 2010-09-20  9:02                       ` Eric Dumazet
  2010-09-20  9:14                         ` Jarek Poplawski
  2010-09-20 16:59                       ` David Miller
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2010-09-20  9:02 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, netdev

Le lundi 20 septembre 2010 à 07:21 +0000, Jarek Poplawski a écrit :

> Probably we could start from enhacing moving drivers to paged skbs
> where possible. And maybe simplifying the skb model by not allowing
> frags and frag lists together?
> 

Sure. I believe current model, pre-allocating skb in huge tx rings is a
waste of mem bandwidth anyway. (I am refering to the struct sk_buff
itself, not the payload part)

Of course some drivers are doing it right, using netdev_alloc_skb()
right before feeding this skb to network stack, not an old one.

> Btw, I wonder what is the exact reason we can't use only
> NET_SKBUFF_DATA_USES_OFFSET?

I see no real reason.

On 32bit arches, it might be faster to manipulate pointers, and not
'base+offset' values.




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20  9:02                       ` Eric Dumazet
@ 2010-09-20  9:14                         ` Jarek Poplawski
  2010-09-20 12:12                           ` Jarek Poplawski
  0 siblings, 1 reply; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-20  9:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, Sep 20, 2010 at 11:02:00AM +0200, Eric Dumazet wrote:
> Le lundi 20 septembre 2010 ?? 07:21 +0000, Jarek Poplawski a écrit :
> 
> > Probably we could start from enhacing moving drivers to paged skbs
> > where possible. And maybe simplifying the skb model by not allowing
> > frags and frag lists together?
> > 
> 
> Sure. I believe current model, pre-allocating skb in huge tx rings is a
> waste of mem bandwidth anyway. (I am refering to the struct sk_buff
> itself, not the payload part)
> 
> Of course some drivers are doing it right, using netdev_alloc_skb()
> right before feeding this skb to network stack, not an old one.
> 
> > Btw, I wonder what is the exact reason we can't use only
> > NET_SKBUFF_DATA_USES_OFFSET?
> 
> I see no real reason.
> 
> On 32bit arches, it might be faster to manipulate pointers, and not
> 'base+offset' values.

The only reason is code readability and maintenance. Did anybody check
how much faster?

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20  9:14                         ` Jarek Poplawski
@ 2010-09-20 12:12                           ` Jarek Poplawski
  2010-09-20 12:40                             ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: Jarek Poplawski @ 2010-09-20 12:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Mon, Sep 20, 2010 at 09:14:25AM +0000, Jarek Poplawski wrote:
> On Mon, Sep 20, 2010 at 11:02:00AM +0200, Eric Dumazet wrote:
> > Le lundi 20 septembre 2010 ?? 07:21 +0000, Jarek Poplawski a écrit :
> > > Btw, I wonder what is the exact reason we can't use only
> > > NET_SKBUFF_DATA_USES_OFFSET?
> > 
> > I see no real reason.
> > 
> > On 32bit arches, it might be faster to manipulate pointers, and not
> > 'base+offset' values.
> 
> The only reason is code readability and maintenance. Did anybody check
> how much faster?

Hmm... I probably misread your reasoning. So, if no real reason,
really, how about turning this NET_SKBUFF_DATA_USES_OFFSET on
unconditionally in net-next, until somebody spots the difference?

Jarek P.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20 12:12                           ` Jarek Poplawski
@ 2010-09-20 12:40                             ` Eric Dumazet
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Dumazet @ 2010-09-20 12:40 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, netdev

Le lundi 20 septembre 2010 à 12:12 +0000, Jarek Poplawski a écrit :

> Hmm... I probably misread your reasoning. So, if no real reason,
> really, how about turning this NET_SKBUFF_DATA_USES_OFFSET on
> unconditionally in net-next, until somebody spots the difference?

Yes, but most developpers use 64bit kernels anyway, I suspect nobody
will ever notice :(

Here (with a typical config), here is the vmlinux size before and after
this patch :

$ size vmlinux.old
   text	   data	    bss	    dec	    hex	filename
6061748	 640208	7285056	13987012	 d56cc4	vmlinux.old

$ size vmlinux
   text	   data	    bss	    dec	    hex	filename
6070420	 640208	7285056	13995684	 d58ea4	vmlinux


Thats 8672 bytes of text increase.

(1330326 instructions instead of 1328472 -> 1854 more instructions)




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-09-20  7:21                     ` Jarek Poplawski
  2010-09-20  9:02                       ` Eric Dumazet
@ 2010-09-20 16:59                       ` David Miller
  1 sibling, 0 replies; 33+ messages in thread
From: David Miller @ 2010-09-20 16:59 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 20 Sep 2010 07:21:49 +0000

> On Sun, Sep 19, 2010 at 05:17:25PM -0700, David Miller wrote:
>> We definitely don't use frag lists for performance, if you look
>> in the GRO history the GRO code specifically has been changed
>> to prefer accumulating into the page vector whenever possible
>> because using frag lists is a lot slower.
> 
> Yes, and GRO made frag lists more popular btw.

Yes.  However, realize that this is only really a legacy from
inet_lro.

These drivers could restructure themselves to operate on pages.
I am pretty sure the hardware would be amicable to this and it
would likely improve performance just as favoring page vector
accumulation did for GRO.

> Btw, I wonder what is the exact reason we can't use only
> NET_SKBUFF_DATA_USES_OFFSET?

As Eric explained, and then demonstrated, it generates a non-trivial
increased amount of code.

This is almost certainly why Arnaldo didn't make it unconditional
back when he implemented the SKB data offset changes for 64-bit.

In my opinion, this duality of SKB pointer handling never causes
real problems because any change mistakenly accessing the pointers
directly usually gets caught by the first person who builds it on
64-bit :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-07-23  5:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
@ 2010-07-25  4:06   ` David Miller
  0 siblings, 0 replies; 33+ messages in thread
From: David Miller @ 2010-07-25  4:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: andrea, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Jul 2010 07:09:08 +0200

> [PATCH net-next-2.6] net: pskb_expand_head() optimization
> 
> Move frags[] at the end of struct skb_shared_info, and make
> pskb_expand_head() copy only the used part of it instead of whole array.
> 
> This should avoid kmemcheck warnings and speedup pskb_expand_head() as
> well, avoiding a lot of cache misses.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Maybe it's just that people aren't running kmemcheck when a
pskb_expand_head() triggers, who knows.

Anyways, since we skip the ->frag[] array in skb alloc, etc.,
your patch is of course fine.

Applied, thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH net-next-2.6] net: pskb_expand_head() optimization
  2010-07-22 19:12 [PATCH] Fix corruption of skb csum field in pskb_expand_head() of net/core/skbuff.c Andrea Shepard
@ 2010-07-23  5:09 ` Eric Dumazet
  2010-07-25  4:06   ` David Miller
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2010-07-23  5:09 UTC (permalink / raw)
  To: Andrea Shepard, David Miller; +Cc: netdev

Le jeudi 22 juillet 2010 à 12:12 -0700, Andrea Shepard a écrit :
> Make pskb_expand_head() check ip_summed to make sure csum_start is really
> csum_start and not csum before adjusting it.
> 
> This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs.
> On my configuration, the sunhme driver produces skbs with differing amounts
> of headroom on receive depending on the packet size.  See line 2030 of
> drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes
> of headroom but packets larger than that cutoff have only 20 bytes.
> 
> When these packets reach the VLAN driver, vlan_check_reorder_header()
> calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes
> of headroom, uses pskb_expand_head() to make more.
> 
> Then, pskb_expand_head() needs to adjust a lot of offsets into the skb,
> including csum_start.  Since csum_start is a union with csum, if the packet
> has a valid csum value this will corrupt it, which was the effect I observed.
> The sunhme hardware computes receive checksums, so the skbs would be created
> by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and
> then pskb_expand_head() would corrupt the csum field, leading to an "hw csum
> error" message later on, for example in icmp_rcv() for pings larger than the
> sunhme RX_COPY_THRESHOLD.
> 
> On the basis of the comment at the beginning of include/linux/skbuff.h,
> I believe that the csum_start skb field is only meaningful if ip_csummed is
> CSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it only in that
> case to avoid corrupting a valid csum value.
> 
> Please see my more in-depth disucssion of tracking down this bug for
> more details if you like:
> 
> http://puellavulnerata.livejournal.com/112186.html
> http://puellavulnerata.livejournal.com/112567.html
> http://puellavulnerata.livejournal.com/112891.html
> http://puellavulnerata.livejournal.com/113096.html
> http://puellavulnerata.livejournal.com/113591.html
> 
> I am not subscribed to this list, so please CC me on replies.
> 

Excellent Changelog Andrea, thanks !

Reviewing pskb_expand_head(), I see it copy the whole struct
skb_shared_info, even if source contains garbage (uninitialized data).

I wonder why it is not triggering kmemcheck alarms

[PATCH net-next-2.6] net: pskb_expand_head() optimization

Move frags[] at the end of struct skb_shared_info, and make
pskb_expand_head() copy only the used part of it instead of whole array.

This should avoid kmemcheck warnings and speedup pskb_expand_head() as
well, avoiding a lot of cache misses.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |    3 ++-
 net/core/skbuff.c      |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f5aa87e..d89876b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+	/* must be last field, see pskb_expand_head() */
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 76d33ca..7da58a2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -817,7 +817,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	memcpy(data + nhead, skb->head, skb->tail - skb->head);
 #endif
 	memcpy(data + size, skb_end_pointer(skb),
-	       sizeof(struct skb_shared_info));
+	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 		get_page(skb_shinfo(skb)->frags[i].page);



^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2010-09-20 16:59 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-03  3:43 [PATCH] net: Frag list lost on head expansion David Miller
2010-09-03  5:48 ` Eric Dumazet
2010-09-03  6:19   ` Eric Dumazet
2010-09-03  9:09   ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
2010-09-03 13:46     ` David Miller
2010-09-07  2:20       ` David Miller
2010-09-07  5:02         ` Eric Dumazet
2010-09-07  5:05           ` David Miller
2010-09-07  9:16           ` Jarek Poplawski
2010-09-07  9:37             ` Eric Dumazet
2010-09-10 19:54               ` David Miller
2010-09-11 12:31                 ` Jarek Poplawski
2010-09-12  3:30                   ` David Miller
2010-09-12 10:45                     ` Jarek Poplawski
2010-09-12 10:58                       ` Jarek Poplawski
2010-09-12 15:58                       ` David Miller
2010-09-12 16:13                         ` David Miller
2010-09-12 20:57                           ` Jarek Poplawski
2010-09-12 22:08                             ` David Miller
2010-09-13  7:49                               ` Jarek Poplawski
2010-09-12 19:55                         ` Ben Pfaff
2010-09-12 20:24                           ` David Miller
2010-09-12 20:45                         ` Jarek Poplawski
2010-09-20  0:17                   ` David Miller
2010-09-20  7:21                     ` Jarek Poplawski
2010-09-20  9:02                       ` Eric Dumazet
2010-09-20  9:14                         ` Jarek Poplawski
2010-09-20 12:12                           ` Jarek Poplawski
2010-09-20 12:40                             ` Eric Dumazet
2010-09-20 16:59                       ` David Miller
2010-09-07  1:25     ` David Miller
  -- strict thread matches above, loose matches on Subject: below --
2010-07-22 19:12 [PATCH] Fix corruption of skb csum field in pskb_expand_head() of net/core/skbuff.c Andrea Shepard
2010-07-23  5:09 ` [PATCH net-next-2.6] net: pskb_expand_head() optimization Eric Dumazet
2010-07-25  4:06   ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.