* [PATCH] xen-netfront: pull on receive skb may need to happen earlier
       [not found]     ` <20130704150137.GW7483@zion.uk.xensource.com>
@ 2013-07-05  9:32       ` Jan Beulich
  2013-07-05 14:53         ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2013-07-05  9:32 UTC (permalink / raw)
  To: davem; +Cc: Ian Campbell, wei.liu2, Dion Kant, xen-devel, netdev, stable

Due to commit 3683243b ("xen-netfront: use __pskb_pull_tail to ensure
linear area is big enough on RX") xennet_fill_frags() may end up
filling MAX_SKB_FRAGS + 1 fragments in a receive skb, and only reduce
the fragment count subsequently via __pskb_pull_tail(). That's a
result of xennet_get_responses() allowing a maximum of one more slot to
be consumed (and intermediately transformed into a fragment) if the
head slot has a size less than or equal to RX_COPY_THRESHOLD.

Hence we need to adjust xennet_fill_frags() to pull earlier if we
reached the maximum fragment count - due to the described behavior of
xennet_get_responses() this guarantees that at least the first fragment
will get completely consumed, and hence the fragment count reduced.

In order to not needlessly call __pskb_pull_tail() twice, make the
original call conditional upon the pull target not having been reached
yet, and defer the newly added one as much as possible (an alternative
would have been to always call the function right before the call to
xennet_fill_frags(), but that would imply more frequent cases of
needing to call it twice).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@vger.kernel.org (3.6 onwards)

--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
 			RING_GET_RESPONSE(&np->rx, ++cons);
 		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
 
+		if (nr_frags == MAX_SKB_FRAGS) {
+			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
+
+			BUG_ON(pull_to <= skb_headlen(skb));
+			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+			nr_frags = shinfo->nr_frags;
+		}
+		BUG_ON(nr_frags >= MAX_SKB_FRAGS);
+
 		__skb_fill_page_desc(skb, nr_frags,
 				     skb_frag_page(nfrag),
 				     rx->offset, rx->status);
@@ -929,7 +938,8 @@ static int handle_incoming_queue(struct 
 	while ((skb = __skb_dequeue(rxq)) != NULL) {
 		int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
 
-		__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+		if (pull_to > skb_headlen(skb))
+			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
 
 		/* Ethernet work: Delayed to here as it peeks the header. */
 		skb->protocol = eth_type_trans(skb, dev);

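For illustration, the arithmetic from the first paragraph can be played
through in a small stand-alone model (plain userspace C, not the driver
code; MAX_SKB_FRAGS is assumed to be 17 as with 4 KiB pages, and the slot
bound is the "MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD)" expression
used by xennet_get_responses): with a small head slot parked in frags[0],
the fill loop ends up touching one entry past the end of the fragment array.

#include <stdio.h>

#define MAX_SKB_FRAGS     17	/* assumed: usual value with 4 KiB pages */
#define RX_COPY_THRESHOLD 256

int main(void)
{
	unsigned int head_size = 128;	/* head slot <= RX_COPY_THRESHOLD */
	/* total response slots the backend may use for this packet */
	unsigned int slots = MAX_SKB_FRAGS + (head_size <= RX_COPY_THRESHOLD);
	unsigned int nr_frags = 1;	/* head slot already sits in frags[0] */
	unsigned int i, last = 0;

	/* xennet_fill_frags() turns every remaining slot into a fragment */
	for (i = 1; i < slots; i++)
		last = nr_frags++;	/* would write frags[last] */

	printf("fragments filled: %u, last index written: %u, valid: 0..%u\n",
	       nr_frags, last, MAX_SKB_FRAGS - 1);	/* 18, 17, 0..16 */
	return 0;
}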

* Re: [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-05  9:32       ` [PATCH] xen-netfront: pull on receive skb may need to happen earlier Jan Beulich
@ 2013-07-05 14:53         ` Wei Liu
  2013-07-07  1:10           ` David Miller
  2013-07-08  9:59           ` Jan Beulich
  0 siblings, 2 replies; 22+ messages in thread
From: Wei Liu @ 2013-07-05 14:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: davem, Ian Campbell, wei.liu2, Dion Kant, xen-devel, netdev, stable

On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
> Due to commit 3683243b ("xen-netfront: use __pskb_pull_tail to ensure
> linear area is big enough on RX") xennet_fill_frags() may end up
> filling MAX_SKB_FRAGS + 1 fragments in a receive skb, and only reduce
> the fragment count subsequently via __pskb_pull_tail(). That's a
> result of xennet_get_responses() allowing a maximum of one more slot to
> be consumed (and intermediately transformed into a fragment) if the
> head slot has a size less than or equal to RX_COPY_THRESHOLD.
> 
> Hence we need to adjust xennet_fill_frags() to pull earlier if we
> reached the maximum fragment count - due to the described behavior of
> xennet_get_responses() this guarantees that at least the first fragment
> will get completely consumed, and hence the fragment count reduced.
> 
> In order to not needlessly call __pskb_pull_tail() twice, make the
> original call conditional upon the pull target not having been reached
> yet, and defer the newly added one as much as possible (an alternative
> would have been to always call the function right before the call to
> xennet_fill_frags(), but that would imply more frequent cases of
> needing to call it twice).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: stable@vger.kernel.org (3.6 onwards)
> 
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
>  			RING_GET_RESPONSE(&np->rx, ++cons);
>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>  
> +		if (nr_frags == MAX_SKB_FRAGS) {
> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> +
> +			BUG_ON(pull_to <= skb_headlen(skb));
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));

skb_headlen is in fact "skb->len - skb->data_len". Looking at the
caller code:

    while loop {
        skb_shinfo(skb)->frags[0].page_offset = rx->offset;
	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
	skb->data_len = rx->status;

	i = xennet_fill_frags(np, skb, &tmpq);

	/*                                                                                                                                             
	 * Truesize is the actual allocation size, even if the                                                                                         
	 * allocation is only partially used.                                                                                                          
	 */
	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
	skb->len += skb->data_len;
    }

    handle_incoming_packet();

You seem to be altering the behavior of the original code, because in
your patch the skb->len is incremented before use, while in the original
code (which calls skb_headlen in handle_incoming_packet) the skb->len is
correctly set.

> +			nr_frags = shinfo->nr_frags;
> +		}
> +		BUG_ON(nr_frags >= MAX_SKB_FRAGS);
> +
>  		__skb_fill_page_desc(skb, nr_frags,
>  				     skb_frag_page(nfrag),
>  				     rx->offset, rx->status);
> @@ -929,7 +938,8 @@ static int handle_incoming_queue(struct 
>  	while ((skb = __skb_dequeue(rxq)) != NULL) {
>  		int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>  
> -		__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> +		if (pull_to > skb_headlen(skb))
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>  
>  		/* Ethernet work: Delayed to here as it peeks the header. */
>  		skb->protocol = eth_type_trans(skb, dev);
> 
> 


* Re: [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-05 14:53         ` Wei Liu
@ 2013-07-07  1:10           ` David Miller
  2013-07-08  9:59           ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: David Miller @ 2013-07-07  1:10 UTC (permalink / raw)
  To: wei.liu2; +Cc: JBeulich, ian.campbell, g.w.kant, xen-devel, netdev, stable

From: Wei Liu <wei.liu2@citrix.com>
Date: Fri, 5 Jul 2013 15:53:19 +0100

> You seem to be altering the behavior of the original code, because in
> your patch the skb->len is incremented before use, while in the original
> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
> correctly set.

Indeed, I think you're right.

Because __pskb_pull_tail() decrements skb->data_len by the number of
bytes pulled, those bytes won't be accounted for in skb->len.


* Re: [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-05 14:53         ` Wei Liu
  2013-07-07  1:10           ` David Miller
@ 2013-07-08  9:59           ` Jan Beulich
  2013-07-08 12:16             ` Dion Kant
  2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
  1 sibling, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-08  9:59 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
>>  			RING_GET_RESPONSE(&np->rx, ++cons);
>>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>>  
>> +		if (nr_frags == MAX_SKB_FRAGS) {
>> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>> +
>> +			BUG_ON(pull_to <= skb_headlen(skb));
>> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> 
> skb_headlen is in fact "skb->len - skb->data_len". Looking at the
> caller code:
> 
>     while loop {
>         skb_shinfo(skb)->frags[0].page_offset = rx->offset;
> 	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> 	skb->data_len = rx->status;
> 
> 	i = xennet_fill_frags(np, skb, &tmpq);
> 
> 	/*
> 	 * Truesize is the actual allocation size, even if the
> 	 * allocation is only partially used.
> 	 */
> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
> 	skb->len += skb->data_len;
>     }
> 
>     handle_incoming_packet();
> 
> You seem to be altering the behavior of the original code, because in
> your patch the skb->len is incremented before use, while in the original
> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
> correctly set.

Right. So I basically need to keep skb->len up-to-date along with
->data_len. Just handed a patch to Dion with that done; I'll defer
sending a v2 for the upstream code until I know the change works
for our kernel.

Jan


* Re: [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08  9:59           ` Jan Beulich
@ 2013-07-08 12:16             ` Dion Kant
  2013-07-08 12:41               ` Jan Beulich
  2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
  1 sibling, 1 reply; 22+ messages in thread
From: Dion Kant @ 2013-07-08 12:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Ian Campbell, netdev, stable, xen-devel, davem

On 07/08/2013 11:59 AM, Jan Beulich wrote:
>>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
>> On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
>>> --- a/drivers/net/xen-netfront.c
>>> +++ b/drivers/net/xen-netfront.c
>>> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
>>>  			RING_GET_RESPONSE(&np->rx, ++cons);
>>>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>>>  
>>> +		if (nr_frags == MAX_SKB_FRAGS) {
>>> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>>> +
>>> +			BUG_ON(pull_to <= skb_headlen(skb));
>>> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>>
>> skb_headlen is in fact "skb->len - skb->data_len". Looking at the
>> caller code:
>>
>>     while loop {
>>         skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>> 	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
>> 	skb->data_len = rx->status;
>>
>> 	i = xennet_fill_frags(np, skb, &tmpq);
>>
>> 	/*
>> 	 * Truesize is the actual allocation size, even if the
>> 	 * allocation is only partially used.
>> 	 */
>> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
>> 	skb->len += skb->data_len;
>>     }
>>
>>     handle_incoming_packet();
>>
>> You seem to be altering the behavior of the original code, because in
>> your patch the skb->len is incremented before use, while in the original
>> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
>> correctly set.
> 
> Right. So I basically need to keep skb->len up-to-date along with
> ->data_len. Just handed a patch to Dion with that done; I'll defer
> sending a v2 for the upstream code until I know the change works
> for our kernel.
> 
> Jan
> 

Jan,

I was wondering about the following.

In netif_poll(struct napi_struct *napi, int budget) the skb->cb is assigned,
but it may be clipped to RX_COPY_THRESHOLD

		NETFRONT_SKB_CB(skb)->pull_to = rx->status;
		if (NETFRONT_SKB_CB(skb)->pull_to > RX_COPY_THRESHOLD)
		      NETFRONT_SKB_CB(skb)->pull_to = RX_COPY_THRESHOLD;

How does this modification of NETFRONT_SKB_CB(skb)->pull_to propagate,
or is this irrelevant?

Dion


* Re: [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 12:16             ` Dion Kant
@ 2013-07-08 12:41               ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-08 12:41 UTC (permalink / raw)
  To: Dion Kant; +Cc: Ian Campbell, Wei Liu, davem, xen-devel, netdev, stable

>>> On 08.07.13 at 14:16, Dion Kant <g.w.kant@hunenet.nl> wrote:
> I was wondering about the following.
> 
> In netif_poll(struct napi_struct *napi, int budget) the skb->cb is assigned,
> but it may be clipped to RX_COPY_THRESHOLD
> 
> 		NETFRONT_SKB_CB(skb)->pull_to = rx->status;
> 		if (NETFRONT_SKB_CB(skb)->pull_to > RX_COPY_THRESHOLD)
> 		      NETFRONT_SKB_CB(skb)->pull_to = RX_COPY_THRESHOLD;
> 
> How does this modification of NETFRONT_SKB_CB(skb)->pull_to propagate,
> or is this irrelevant?

I'm not sure I understand what you're asking. pull_to is a private
(to netfront) field used to communicate how much of the skb
needs to be pulled. I don't see how the word "propagate" makes
any sense here, but as said I may simply not be getting what
you're asking about.

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08  9:59           ` Jan Beulich
  2013-07-08 12:16             ` Dion Kant
@ 2013-07-08 14:20             ` Jan Beulich
  2013-07-08 15:22               ` Eric Dumazet
                                 ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-08 14:20 UTC (permalink / raw)
  To: Wei Liu, davem; +Cc: Ian Campbell, Dion Kant, xen-devel, netdev, stable

>>> On 08.07.13 at 11:59, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
>> On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
>>> --- a/drivers/net/xen-netfront.c
>>> +++ b/drivers/net/xen-netfront.c
>>> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
>>>  			RING_GET_RESPONSE(&np->rx, ++cons);
>>>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>>>  
>>> +		if (nr_frags == MAX_SKB_FRAGS) {
>>> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>>> +
>>> +			BUG_ON(pull_to <= skb_headlen(skb));
>>> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>> 
>> skb_headlen is in fact "skb->len - skb->data_len". Looking at the
>> caller code:
>> 
>>     while loop {
>>         skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>> 	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
>> 	skb->data_len = rx->status;
>> 
>> 	i = xennet_fill_frags(np, skb, &tmpq);
>> 
>> 	/*
>> 	 * Truesize is the actual allocation size, even if the
>> 	 * allocation is only partially used.
>> 	 */
>> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
>> 	skb->len += skb->data_len;
>>     }
>> 
>>     handle_incoming_packet();
>> 
>> You seem to be altering the behavior of the original code, because in
>> your patch the skb->len is incremented before use, while in the original
>> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
>> correctly set.
> 
> Right. So I basically need to keep skb->len up-to-date along with
> ->data_len. Just handed a patch to Dion with that done; I'll defer
> sending a v2 for the upstream code until I know the change works
> for our kernel.

Okay, so with that done (see below) Dion is now seeing the
WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
course, with it having crashed before, it's hard to tell whether the
triggering now is an effect of the patch, or just got unmasked by it.

Looking over the ->truesize handling, I can't see how the change
here could break things: RX_COPY_THRESHOLD is already
accounted for by how alloc_skb() gets called, and the increment
right after the call to xennet_fill_frags() should now be even more
correct than before (since __pskb_pull_tail() can drop fragments,
which would then have made this an over-estimation afaict).

That all said with me knowing pretty little about the networking
code, so I'd appreciate if you could point out anything wrong with
my idea of how things work. Additionally - is my fundamental (for
this patch) assumption right that multiple __pskb_pull_tail() calls
are cumulative (i.e. calling it twice with a delta of
pull_to - skb_headlen(skb) would indeed end up pulling up to
pull_to, provided there is enough data)?
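
For what it's worth, the byte accounting can be pictured with a toy
stand-alone model (invented numbers, plain userspace C; it only mimics the
counters, not the real fragment handling): a pull moves bytes from
->data_len into the linear area while ->len stays the same, so a later
pull with a delta of pull_to - skb_headlen(skb) is simply a no-op once
pull_to has already been reached.

#include <stdio.h>

struct toy_skb { unsigned int len, data_len; };	/* data_len = paged bytes */

static unsigned int headlen(const struct toy_skb *s)
{
	return s->len - s->data_len;		/* skb_headlen() */
}

/* Rough model of what __pskb_pull_tail() does to the counters: "delta"
 * bytes leave the fragments and become linear data; len is unchanged. */
static void pull_tail(struct toy_skb *s, unsigned int delta)
{
	if (delta > s->data_len)
		delta = s->data_len;
	s->data_len -= delta;
}

int main(void)
{
	struct toy_skb skb = { .len = 4096, .data_len = 4096 };	/* 0 linear */
	unsigned int pull_to = 256;

	pull_tail(&skb, pull_to - headlen(&skb));	/* pull in fill_frags */
	pull_tail(&skb, pull_to - headlen(&skb));	/* later pull: delta 0 */

	printf("headlen=%u data_len=%u len=%u\n",
	       headlen(&skb), skb.data_len, skb.len);	/* 256 3840 4096 */
	return 0;
}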

Jan

--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -831,10 +831,20 @@ static RING_IDX xennet_fill_frags(struct
 			RING_GET_RESPONSE(&np->rx, ++cons);
 		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
 
+		if (nr_frags == MAX_SKB_FRAGS) {
+			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
+
+			BUG_ON(pull_to <= skb_headlen(skb));
+			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+			nr_frags = shinfo->nr_frags;
+		}
+		BUG_ON(nr_frags >= MAX_SKB_FRAGS);
+
 		__skb_fill_page_desc(skb, nr_frags,
 				     skb_frag_page(nfrag),
 				     rx->offset, rx->status);
 
+		skb->len += rx->status;
 		skb->data_len += rx->status;
 
 		skb_shinfo(nskb)->nr_frags = 0;
@@ -929,7 +939,8 @@ static int handle_incoming_queue(struct 
 	while ((skb = __skb_dequeue(rxq)) != NULL) {
 		int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
 
-		__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+		if (pull_to > skb_headlen(skb))
+			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
 
 		/* Ethernet work: Delayed to here as it peeks the header. */
 		skb->protocol = eth_type_trans(skb, dev);
@@ -1014,7 +1025,7 @@ err:
 
 		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
 		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
-		skb->data_len = rx->status;
+		skb->len = skb->data_len = rx->status;
 
 		i = xennet_fill_frags(np, skb, &tmpq);
 
@@ -1023,7 +1034,6 @@ err:
                  * allocation is only partially used.
                  */
 		skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
-		skb->len += skb->data_len;
 
 		if (rx->flags & XEN_NETRXF_csum_blank)
 			skb->ip_summed = CHECKSUM_PARTIAL;


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
@ 2013-07-08 15:22               ` Eric Dumazet
  2013-07-09  7:47                 ` Jan Beulich
  2013-07-08 15:48               ` Wei Liu
  2013-07-12  8:32               ` Wei Liu
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2013-07-08 15:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, davem, Ian Campbell, Dion Kant, xen-devel, netdev, stable

On Mon, 2013-07-08 at 15:20 +0100, Jan Beulich wrote:

> Okay, so with that done (see below) Dion is now seeing the
> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
> course, with it having crashed before, it's hard to tell whether the
> triggering now is an effect of the patch, or just got unmasked by it.

Please note this warning is not the sign of an error.

It can be triggered because network stack (IP + TCP) had to reallocate
skb->head to hold all the headers.

pskb_may_pull() doesn't change skb->truesize, for various reasons, so we
can have a situation where the warning triggers.

For example, it can happen when drivers use a really small skb->head,
like 64 bytes, as it's not big enough to hold ethernet+IP+TCP headers.

For example, I did a patch on a wifi driver (iwl3945: better skb
management in rx path) because of this sub optimal skb->head sizing.


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
  2013-07-08 15:22               ` Eric Dumazet
@ 2013-07-08 15:48               ` Wei Liu
  2013-07-09  6:52                 ` Jan Beulich
  2013-07-12  8:32               ` Wei Liu
  2 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2013-07-08 15:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, davem, Ian Campbell, Dion Kant, xen-devel, netdev, stable

On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
> >>> On 08.07.13 at 11:59, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
> >> On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
> >>> --- a/drivers/net/xen-netfront.c
> >>> +++ b/drivers/net/xen-netfront.c
> >>> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
> >>>  			RING_GET_RESPONSE(&np->rx, ++cons);
> >>>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
> >>>  
> >>> +		if (nr_frags == MAX_SKB_FRAGS) {
> >>> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> >>> +
> >>> +			BUG_ON(pull_to <= skb_headlen(skb));
> >>> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> >> 
> >> skb_headlen is in fact "skb->len - skb->data_len". Looking at the
> >> caller code:
> >> 
> >>     while loop {
> >>         skb_shinfo(skb)->frags[0].page_offset = rx->offset;
> >> 	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> >> 	skb->data_len = rx->status;
> >> 
> >> 	i = xennet_fill_frags(np, skb, &tmpq);
> >> 
> >> 	/*
> >> 	 * Truesize is the actual allocation size, even if the
> >> 	 * allocation is only partially used.
> >> 	 */
> >> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
> >> 	skb->len += skb->data_len;
> >>     }
> >> 
> >>     handle_incoming_packet();
> >> 
> >> You seem to be altering the behavior of the original code, because in
> >> your patch the skb->len is incremented before use, while in the original
> >> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
> >> correctly set.
> > 
> > Right. So I basically need to keep skb->len up-to-date along with
> > ->data_len. Just handed a patch to Dion with that done; I'll defer
> > sending a v2 for the upstream code until I know the change works
> > for our kernel.
> 
> Okay, so with that done (see below) Dion is now seeing the
> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
> course, with it having crashed before, it's hard to tell whether the
> triggering now is an effect of the patch, or just got unmasked by it.
> 

I think the previous crash was due to incorrect accounting.

  BUG_ON(pull_to <= skb_headlen(skb))

  skb_headlen(skb) = skb->len - skb->data_len

In your first patch, skb_headlen(skb) can return a very large number if
skb->len is smaller than skb->data_len (they are both unsigned), because
before calling skb_headlen() skb->data_len is incremented while skb->len
is not.
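
To make the wrap-around concrete, here is a stand-alone snippet (made-up
numbers, plain userspace C) mirroring the unsigned arithmetic of
skb_headlen():

#include <stdio.h>

int main(void)
{
	/* state inside xennet_fill_frags() with the first patch applied:
	 * data_len was already bumped by rx->status, len was not */
	unsigned int len = 0, data_len = 1000;
	unsigned int headlen = len - data_len;	/* skb_headlen() */

	/* prints 4294966296 with a 32-bit unsigned int, not 0 */
	printf("skb_headlen() = %u\n", headlen);
	return 0;
}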

As for the WARN_ON_ONCE(delta < len), I haven't looked that far yet, as I
discovered a problem with your new patch. Please see the comments below.

> Looking over the ->truesize handling, I can't see how the change
> here could break things: RX_COPY_THRESHOLD is already
> accounted for by how alloc_skb() gets called, and the increment
> right after the call to xennet_fill_frags() should now be even more
> correct than before (since __pskb_pull_tail() can drop fragments,
> which would then have made this an over-estimation afaict).
> 
> That all said with me knowing pretty little about the networking
> code, so I'd appreciate if you could point out anything wrong with
> my idea of how things work. Additionally - is my fundamental (for
> this patch) assumption right that multiple __pskb_pull_tail() calls
> are cumulative (i.e. calling it twice with a delta of
> pull_to - skb_headlen(skb) would indeed end up pulling up to
> pull_to, provided there is enough data)?
> 
> Jan
> 
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -831,10 +831,20 @@ static RING_IDX xennet_fill_frags(struct
>  			RING_GET_RESPONSE(&np->rx, ++cons);
>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>  
> +		if (nr_frags == MAX_SKB_FRAGS) {
> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> +
> +			BUG_ON(pull_to <= skb_headlen(skb));
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> +			nr_frags = shinfo->nr_frags;
> +		}
> +		BUG_ON(nr_frags >= MAX_SKB_FRAGS);
> +
>  		__skb_fill_page_desc(skb, nr_frags,
>  				     skb_frag_page(nfrag),
>  				     rx->offset, rx->status);
>  
> +		skb->len += rx->status;
>  		skb->data_len += rx->status;
>  
>  		skb_shinfo(nskb)->nr_frags = 0;
> @@ -929,7 +939,8 @@ static int handle_incoming_queue(struct 
>  	while ((skb = __skb_dequeue(rxq)) != NULL) {
>  		int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>  
> -		__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> +		if (pull_to > skb_headlen(skb))
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>  
>  		/* Ethernet work: Delayed to here as it peeks the header. */
>  		skb->protocol = eth_type_trans(skb, dev);
> @@ -1014,7 +1025,7 @@ err:
>  
>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> -		skb->data_len = rx->status;
> +		skb->len = skb->data_len = rx->status;

This is not correct. You should not be needing this. Now you lose count
of SKB head len. Try to go without the above line and see if it makes a
difference?


Wei.

>  
>  		i = xennet_fill_frags(np, skb, &tmpq);
>  
> @@ -1023,7 +1034,6 @@ err:
>                   * allocation is only partially used.
>                   */
>  		skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
> -		skb->len += skb->data_len;
>  
>  		if (rx->flags & XEN_NETRXF_csum_blank)
>  			skb->ip_summed = CHECKSUM_PARTIAL;


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 15:48               ` Wei Liu
@ 2013-07-09  6:52                 ` Jan Beulich
  2013-07-09 16:51                   ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2013-07-09  6:52 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

>>> On 08.07.13 at 17:48, Wei Liu <wei.liu2@citrix.com> wrote:
> On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
>> @@ -1014,7 +1025,7 @@ err:
>>  
>>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
>> -		skb->data_len = rx->status;
>> +		skb->len = skb->data_len = rx->status;
> 
> This is not correct. You should not be needing this. Now you lose count
> of SKB head len. Try to go without the above line and see if it makes a
> difference?

I don't follow - at this point, there's 0 bytes of head (this only
changes with the first call to __pskb_pull_tail()). Hence ->len ==
->data_len seems correct to me (and afaict pulling would do the
wrong thing if I dropped that change).

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 15:22               ` Eric Dumazet
@ 2013-07-09  7:47                 ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-09  7:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ian Campbell, Wei Liu, davem, Dion Kant, xen-devel, netdev, stable

>>> On 08.07.13 at 17:22, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2013-07-08 at 15:20 +0100, Jan Beulich wrote:
> 
>> Okay, so with that done (see below) Dion is now seeing the
>> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
>> course, with it having crashed before, it's hard to tell whether the
>> triggering now is an effect of the patch, or just got unmasked by it.
> 
> Please note this warning is not the sign of an error.

Thanks for pointing this out.

> It can be triggered because network stack (IP + TCP) had to reallocate
> skb->head to hold all the headers.
> 
> pskb_may_pull() doesn't change skb->truesize, for various reasons, so we
> can have a situation where the warning triggers.
> 
> For example, it can happen when drivers use a really small skb->head,
> like 64 bytes, as it's not big enough to hold ethernet+IP+TCP headers.

We know from the crashes prior to the patch here that this is
exactly what is happening.

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-09  6:52                 ` Jan Beulich
@ 2013-07-09 16:51                   ` Wei Liu
  2013-07-10  6:58                     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2013-07-09 16:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

On Tue, Jul 09, 2013 at 07:52:31AM +0100, Jan Beulich wrote:
> >>> On 08.07.13 at 17:48, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
> >> @@ -1014,7 +1025,7 @@ err:
> >>  
> >>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
> >>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> >> -		skb->data_len = rx->status;
> >> +		skb->len = skb->data_len = rx->status;
> > 
> > This is not correct. You should not be needing this. Now you lose count
> > of SKB head len. Try to go without the above line and see if it makes a
> > difference?
> 
> I don't follow - at this point, there's 0 bytes of head (this only
> changes with the first call to __pskb_pull_tail()). Hence ->len ==
> ->data_len seems correct to me (and afaict pulling would do the
> wrong thing if I dropped that change).
> 

My bad, I suggested the wrong thing. :-(

But I would prefer skb->len += skb->data_len. In the case that skb->len
== 0 it's the same as your line, while for a non-zero skb->len it would also
do the right thing.

As for the warning in skb_try_coalesce, I don't see any direct call to
it in netfront, I will need to think about it. It looks like it's really
something very deep in the stack.


Wei.

> Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-09 16:51                   ` Wei Liu
@ 2013-07-10  6:58                     ` Jan Beulich
  2013-07-10 10:04                       ` Wei Liu
  2013-07-10 13:19                       ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-10  6:58 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

>>> On 09.07.13 at 18:51, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, Jul 09, 2013 at 07:52:31AM +0100, Jan Beulich wrote:
>> >>> On 08.07.13 at 17:48, Wei Liu <wei.liu2@citrix.com> wrote:
>> > On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
>> >> @@ -1014,7 +1025,7 @@ err:
>> >>  
>> >>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>> >>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
>> >> -		skb->data_len = rx->status;
>> >> +		skb->len = skb->data_len = rx->status;
>> > 
>> > This is not correct. You should not be needing this. Now you lose count
>> > of SKB head len. Try to go without the above line and see if it makes a
>> > difference?
>> 
>> I don't follow - at this point, there's 0 bytes of head (this only
>> changes with the first call to __pskb_pull_tail()). Hence ->len ==
>> ->data_len seems correct to me (and afaict pulling would do the
>> wrong thing if I dropped that change).
>> 
> 
> My bad, I suggested the wrong thing. :-(
> 
> But I would prefer skb->len += skb->data_len. In the case that skb->len
> == 0 it's the same as your line, while for a non-zero skb->len it would also
> do the right thing.

I can do that, albeit I don't see how ->len could end up non-zero
here.

> As for the warning in skb_try_coalesce, I don't see any direct call to
> it in netfront, I will need to think about it. It looks like it's really
> something very deep in the stack.

Yes, as the call stack provided by Dion proves. The question
really is whether the patch somehow results in ->truesize being
incorrect, or whether - as Eric points out - this is "normal" for
the sort of special SKBs here (having a rather small headlen). If
what he says is applicable here, it may hint at the pulling we do
still not being sufficient for the full TCP header to be in the linear
part (which iirc is the main [if not the only purpose] of us doing
the pull in the first place).

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10  6:58                     ` Jan Beulich
@ 2013-07-10 10:04                       ` Wei Liu
  2013-07-10 10:46                         ` Jan Beulich
  2013-07-10 13:19                       ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Wei Liu @ 2013-07-10 10:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

On Wed, Jul 10, 2013 at 07:58:06AM +0100, Jan Beulich wrote:
> >>> On 09.07.13 at 18:51, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Tue, Jul 09, 2013 at 07:52:31AM +0100, Jan Beulich wrote:
> >> >>> On 08.07.13 at 17:48, Wei Liu <wei.liu2@citrix.com> wrote:
> >> > On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
> >> >> @@ -1014,7 +1025,7 @@ err:
> >> >>  
> >> >>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
> >> >>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> >> >> -		skb->data_len = rx->status;
> >> >> +		skb->len = skb->data_len = rx->status;
> >> > 
> >> > This is not correct. You should not be needing this. Now you lose count
> >> > of SKB head len. Try to go without the above line and see if it makes a
> >> > difference?
> >> 
> >> I don't follow - at this point, there's 0 bytes of head (this only
> >> changes with the first call to __pskb_pull_tail()). Hence ->len ==
> >> ->data_len seems correct to me (and afaict pulling would do the
> >> wrong thing if I dropped that change).
> >> 
> > 
> > My bad, I suggested the wrong thing. :-(
> > 
> > But I would prefer skb->len += skb->data_len. In the case that skb->len
> > == 0 it's the same as your line, while for a non-zero skb->len it would also
> > do the right thing.
> 
> I can do that, albeit I don't see how ->len could end up non-zero
> here.
> 
> > As for the warning in skb_try_coalesce, I don't see any direct call to
> > it in netfront, I will need to think about it. It looks like it's really
> > something very deep in the stack.
> 
> Yes, as the call stack provided by Dion proves. The question
> really is whether the patch somehow results in ->truesize being
> incorrect, or whether - as Eric points out - this is "normal" for
> the sort of special SKBs here (having a rather small headlen). If
> what he says is applicable here, it may hint at the pulling we do
> still not being sufficient for the full TCP header to be in the linear

__netdev_alloc_skb in netfront is fed with RX_COPY_THRESHOLD+NET_IP_ALIGN.
RX_COPY_THRESHOLD is 256 while MAX_TCP_HEADER can be as large as 256+48
depending on kernel configurations.

> part (which iirc is the main [if not the only purpose] of us doing
> the pull in the first place).
> 

Ian, any comment on this?

Jan, looking at the commit log, the overrun issue in
xennet_get_responses was not introduced by __pskb_pull_tail. The call to
xennet_fill_frags has always been in the same place.

Now I start to wonder about the purpose of "max = MAX_SKB_FRAGS +
(rx->status <= RX_COPY_THRESHOLD)" in xennet_get_responses, which queues
up to MAX_SKB_FRAGS+1 responses. I'm not clear about the rationale
of that line.


Wei.

> Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10 10:04                       ` Wei Liu
@ 2013-07-10 10:46                         ` Jan Beulich
  2013-07-10 12:50                           ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2013-07-10 10:46 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu; +Cc: davem, Dion Kant, xen-devel, netdev, stable

>>> On 10.07.13 at 12:04, Wei Liu <wei.liu2@citrix.com> wrote:
> Jan, looking at the commit log, the overrun issue in
> xennet_get_responses was not introduced by __pskb_pull_tail. The call to
> xennet_fill_frags has always been in the same place.

I'm convinced it was: Prior to that commit, if the first response slot
contained up to RX_COPY_THRESHOLD bytes, it got entirely
consumed into the linear portion of the SKB, leaving the number of
fragments available for filling at MAX_SKB_FRAGS. Said commit
dropped the early copying, leaving the fragment count at 1
unconditionally, and now accumulates all of the response slots into
fragments, only pulling after all of them got filled in. It neglected to
realize - due to the count now always being 1 at the beginning - that
this can lead to MAX_SKB_FRAGS + 1 frags getting filled, corrupting
memory.
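
Spelled out with numbers (a stand-alone sketch; MAX_SKB_FRAGS assumed to be
17 as with 4 KiB pages, and up to MAX_SKB_FRAGS further slots queued beyond
a small head slot), the difference is just where fragment filling starts:

#include <stdio.h>

#define MAX_SKB_FRAGS 17	/* assumed: usual value with 4 KiB pages */

int main(void)
{
	unsigned int extra_slots = MAX_SKB_FRAGS;	/* small head case */

	/* pre-3683243b: small head copied into the linear area,
	 * fragment filling starts at index 0 */
	unsigned int old_last = 0 + extra_slots - 1;	/* 16: in bounds */

	/* post-3683243b: head parked in frags[0], filling starts at 1 */
	unsigned int new_last = 1 + extra_slots - 1;	/* 17: out of bounds */

	printf("old last index %u, new last index %u, valid 0..%u\n",
	       old_last, new_last, MAX_SKB_FRAGS - 1);
	return 0;
}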

Ian - I have to admit that I'm slightly irritated by you so far not
having participated at all in sorting out the fix for this bug that a
change of yours introduced.

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10 10:46                         ` Jan Beulich
@ 2013-07-10 12:50                           ` Ian Campbell
  2013-07-10 12:53                             ` Wei Liu
  2013-07-10 13:58                             ` Jan Beulich
  0 siblings, 2 replies; 22+ messages in thread
From: Ian Campbell @ 2013-07-10 12:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, davem, Dion Kant, xen-devel, netdev, stable

On Wed, 2013-07-10 at 11:46 +0100, Jan Beulich wrote:
> >>> On 10.07.13 at 12:04, Wei Liu <wei.liu2@citrix.com> wrote:
> > Jan, looking at the commit log, the overrun issue in
> > xennet_get_responses was not introduced by __pskb_pull_tail. The call to
> > xennet_fill_frags has always been in the same place.
> 
> I'm convinced it was: Prior to that commit, if the first response slot
> contained up to RX_COPY_THRESHOLD bytes, it got entirely
> consumed into the linear portion of the SKB, leaving the number of
> fragments available for filling at MAX_SKB_FRAGS. Said commit
> dropped the early copying, leaving the fragment count at 1
> unconditionally, and now accumulates all of the response slots into
> fragments, only pulling after all of them got filled in. It neglected to
> realize - due to the count now always being 1 at the beginning - that
> this can lead to MAX_SKB_FRAGS + 1 frags getting filled, corrupting
> memory.

That argument makes sense to me.

Is it possible to hit a scenario where we need to pull more than
RX_COPY_THRESHOLD in order to fit all of the data in MAX_SKB_FRAGS ?

> Ian - I have to admit that I'm slightly irritated by you so far not
> having participated at all in sorting out the fix for this bug that a
> change of yours introduced.

Sorry I've been travelling and not following closely enough to realise
this was related to something I'd done.

Does this relate somehow to the patch Annie has sent out recently too?

Ian.


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10 12:50                           ` Ian Campbell
@ 2013-07-10 12:53                             ` Wei Liu
  2013-07-10 13:58                             ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Wei Liu @ 2013-07-10 12:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Jan Beulich, Wei Liu, davem, Dion Kant, xen-devel, netdev, stable

On Wed, Jul 10, 2013 at 01:50:44PM +0100, Ian Campbell wrote:
> On Wed, 2013-07-10 at 11:46 +0100, Jan Beulich wrote:
> > >>> On 10.07.13 at 12:04, Wei Liu <wei.liu2@citrix.com> wrote:
> > > Jan, looking at the commit log, the overrun issue in
> > > xennet_get_responses was not introduced by __pskb_pull_tail. The call to
> > > xennet_fill_frags has always been in the same place.
> > 
> > I'm convinced it was: Prior to that commit, if the first response slot
> > contained up to RX_COPY_THRESHOLD bytes, it got entirely
> > consumed into the linear portion of the SKB, leaving the number of
> > fragments available for filling at MAX_SKB_FRAGS. Said commit
> > dropped the early copying, leaving the fragment count at 1
> > unconditionally, and now accumulates all of the response slots into
> > fragments, only pulling after all of them got filled in. It neglected to
> > realize - due to the count now always being 1 at the beginning - that
> > this can lead to MAX_SKB_FRAGS + 1 frags getting filled, corrupting
> > memory.
> 
> That argument makes sense to me.
> 
> Is it possible to hit a scenario where we need to pull more than
> RX_COPY_THRESHOLD in order to fit all of the data in MAX_SKB_FRAGS ?
> 
> > Ian - I have to admit that I'm slightly irritated by you so far not
> > having participated at all in sorting out the fix for this bug that a
> > change of yours introduced.
> 
> Sorry I've been travelling and not following closely enough to realise
> this was related to something I'd done.
> 
> Does this relate somehow to the patch Annie has sent out recently too?
> 

No. That's not related.


Wei.


> Ian.


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10  6:58                     ` Jan Beulich
  2013-07-10 10:04                       ` Wei Liu
@ 2013-07-10 13:19                       ` Eric Dumazet
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2013-07-10 13:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

On Wed, 2013-07-10 at 07:58 +0100, Jan Beulich wrote:

> 
> Yes, as the call stack provided by Dion proves. The question
> really is whether the patch somehow results in ->truesize to be
> incorrect, or whether - as Eric points out - this is "normal" for
> the sort of special SKBs here (having a rather small headlen). If
> what he says is applicable here, it may hint at the pulling we do
> still not being sufficient for the full TCP header to be in the linear
> part (which iirc is the main [if not the only purpose] of us doing
> the pull in the first place).

I have not closely followed the thread (it seems it did not begin on
netdev so I do not have the first messages), but if we cook an skb with
no tailroom (headroom is irrelevant), and the IP stack or TCP stack has
to pull more bytes from the frags to skb->head, then skb->head must be
reallocated to get more room before pulling the headers.

This operation is _expensive_, as it involves a copy of struct
skb_shared_info and possibly page refcount games if the original skb
was cloned (sniffer).

We have a one-time WARN_ON_ONCE() in skb_try_coalesce()
to detect such non-optimal behavior, but not in the fast path, only in
the part used for TCP coalescing.

So the right thing would be to cook skbs so that there is enough
tailroom.

But there are 256 bytes of tailroom; that should be enough unless some
huge amount of headers is used.

If you believe it's fine and very unlikely, then ignore the warning,
because using 512 bytes of tailroom for all skbs only to cope with very
unlikely reallocations is not worth it.
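
As a rough stand-alone picture of that size comparison (the 256-byte linear
allocation and the 256+48 worst case are the figures mentioned earlier in
the thread; the check itself is a simplification, not the actual skb code):

#include <stdio.h>

int main(void)
{
	unsigned int linear_alloc = 256;	/* RX_COPY_THRESHOLD-sized head */
	unsigned int want_linear  = 256 + 48;	/* worst-case header need */

	if (want_linear > linear_alloc)
		printf("headers (%u) exceed the linear allocation (%u): "
		       "skb->head has to be reallocated before pulling\n",
		       want_linear, linear_alloc);
	else
		printf("pull fits in place\n");
	return 0;
}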


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-10 12:50                           ` Ian Campbell
  2013-07-10 12:53                             ` Wei Liu
@ 2013-07-10 13:58                             ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2013-07-10 13:58 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, davem, Dion Kant, xen-devel, netdev, stable

>>> On 10.07.13 at 14:50, Ian Campbell <ian.campbell@citrix.com> wrote:
> On Wed, 2013-07-10 at 11:46 +0100, Jan Beulich wrote:
>> >>> On 10.07.13 at 12:04, Wei Liu <wei.liu2@citrix.com> wrote:
>> > Jan, looking at the commit log, the overrun issue in
>> > xennet_get_responses was not introduced by __pskb_pull_tail. The call to
>> > xennet_fill_frags has always been in the same place.
>> 
>> I'm convinced it was: Prior to that commit, if the first response slot
>> contained up to RX_COPY_THRESHOLD bytes, it got entirely
>> consumed into the linear portion of the SKB, leaving the number of
>> fragments available for filling at MAX_SKB_FRAGS. Said commit
>> dropped the early copying, leaving the fragment count at 1
>> unconditionally, and now accumulates all of the response slots into
>> fragments, only pulling after all of them got filled in. It neglected to
>> realize - due to the count now always being 1 at the beginning - that
>> this can lead to MAX_SKB_FRAGS + 1 frags getting filled, corrupting
>> memory.
> 
> That argument makes sense to me.
> 
> Is it possible to hit a scenario where we need to pull more than
> RX_COPY_THRESHOLD in order to fit all of the data in MAX_SKB_FRAGS ?

I'm not aware of any, but I'm no expert here in any way.

> Does this relate somehow to the patch Annie has sent out recently too?

I don't think so.

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
  2013-07-08 15:22               ` Eric Dumazet
  2013-07-08 15:48               ` Wei Liu
@ 2013-07-12  8:32               ` Wei Liu
  2013-07-12  8:56                 ` Jan Beulich
  2 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2013-07-12  8:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, davem, Ian Campbell, Dion Kant, xen-devel, netdev, stable

On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
> >>> On 08.07.13 at 11:59, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
> >> On Fri, Jul 05, 2013 at 10:32:41AM +0100, Jan Beulich wrote:
> >>> --- a/drivers/net/xen-netfront.c
> >>> +++ b/drivers/net/xen-netfront.c
> >>> @@ -831,6 +831,15 @@ static RING_IDX xennet_fill_frags(struct
> >>>  			RING_GET_RESPONSE(&np->rx, ++cons);
> >>>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
> >>>  
> >>> +		if (nr_frags == MAX_SKB_FRAGS) {
> >>> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> >>> +
> >>> +			BUG_ON(pull_to <= skb_headlen(skb));
> >>> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> >> 
> >> skb_headlen is in fact "skb->len - skb->data_len". Looking at the
> >> caller code:
> >> 
> >>     while loop {
> >>         skb_shinfo(skb)->frags[0].page_offset = rx->offset;
> >> 	skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> >> 	skb->data_len = rx->status;
> >> 
> >> 	i = xennet_fill_frags(np, skb, &tmpq);
> >> 
> >> 	/*
> >> 	 * Truesize is the actual allocation size, even if the
> >> 	 * allocation is only partially used.
> >> 	 */
> >> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
> >> 	skb->len += skb->data_len;
> >>     }
> >> 
> >>     handle_incoming_packet();
> >> 
> >> You seem to be altering the behavior of the original code, because in
> >> your patch the skb->len is incremented before use, while in the original
> >> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
> >> correctly set.
> > 
> > Right. So I basically need to keep skb->len up-to-date along with
> > ->data_len. Just handed a patch to Dion with that done; I'll defer
> > sending a v2 for the upstream code until I know the change works
> > for our kernel.
> 
> Okay, so with that done (see below) Dion is now seeing the
> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
> course, with it having crashed before, it's hard to tell whether the
> triggering now is an effect of the patch, or just got unmasked by it.
> 

I just ported your below patch to upstream kernel and I didn't see the
WARN_ON_ONCE. I only did iperf and netperf tests.

If the work load to trigger this bug is simple enough I can give it a
shot...


Wei.

> Looking over the ->truesize handling, I can't see how the change
> here could break things: RX_COPY_THRESHOLD is already
> accounted for by how alloc_skb() gets called, and the increment
> right after the call to xennet_fill_frags() should now be even more
> correct than before (since __pskb_pull_tail() can drop fragments,
> which would then have made this an over-estimation afaict).
> 
> That all said with me knowing pretty little about the networking
> code, so I'd appreciate if you could point out anything wrong with
> my idea of how things work. Additionally - is my fundamental (for
> this patch) assumption right that multiple __pskb_pull_tail() calls
> are cumulative (i.e. calling it twice with a delta of
> pull_to - skb_headlen(skb) would indeed end up pulling up to
> pull_to, provided there is enough data)?
> 
> Jan
> 
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -831,10 +831,20 @@ static RING_IDX xennet_fill_frags(struct
>  			RING_GET_RESPONSE(&np->rx, ++cons);
>  		skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>  
> +		if (nr_frags == MAX_SKB_FRAGS) {
> +			unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
> +
> +			BUG_ON(pull_to <= skb_headlen(skb));
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> +			nr_frags = shinfo->nr_frags;
> +		}
> +		BUG_ON(nr_frags >= MAX_SKB_FRAGS);
> +
>  		__skb_fill_page_desc(skb, nr_frags,
>  				     skb_frag_page(nfrag),
>  				     rx->offset, rx->status);
>  
> +		skb->len += rx->status;
>  		skb->data_len += rx->status;
>  
>  		skb_shinfo(nskb)->nr_frags = 0;
> @@ -929,7 +939,8 @@ static int handle_incoming_queue(struct 
>  	while ((skb = __skb_dequeue(rxq)) != NULL) {
>  		int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>  
> -		__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
> +		if (pull_to > skb_headlen(skb))
> +			__pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>  
>  		/* Ethernet work: Delayed to here as it peeks the header. */
>  		skb->protocol = eth_type_trans(skb, dev);
> @@ -1014,7 +1025,7 @@ err:
>  
>  		skb_shinfo(skb)->frags[0].page_offset = rx->offset;
>  		skb_frag_size_set(&skb_shinfo(skb)->frags[0], rx->status);
> -		skb->data_len = rx->status;
> +		skb->len = skb->data_len = rx->status;
>  
>  		i = xennet_fill_frags(np, skb, &tmpq);
>  
> @@ -1023,7 +1034,6 @@ err:
>                   * allocation is only partially used.
>                   */
>  		skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
> -		skb->len += skb->data_len;
>  
>  		if (rx->flags & XEN_NETRXF_csum_blank)
>  			skb->ip_summed = CHECKSUM_PARTIAL;


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-12  8:32               ` Wei Liu
@ 2013-07-12  8:56                 ` Jan Beulich
  2013-07-13 11:26                   ` Dion Kant
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2013-07-12  8:56 UTC (permalink / raw)
  To: Wei Liu; +Cc: Ian Campbell, davem, Dion Kant, xen-devel, netdev, stable

>>> On 12.07.13 at 10:32, Wei Liu <wei.liu2@citrix.com> wrote:
> On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
>> >>> On 08.07.13 at 11:59, "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
>> >> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
>> >> 	skb->len += skb->data_len;
>> >>     }
>> >> 
>> >>     handle_incoming_packet();
>> >> 
>> >> You seem to be altering the behavior of the original code, because in
>> >> your patch the skb->len is incremented before use, while in the original
>> >> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
>> >> correctly set.
>> > 
>> > Right. So I basically need to keep skb->len up-to-date along with
>> > ->data_len. Just handed a patch to Dion with that done; I'll defer
>> > sending a v2 for the upstream code until I know the change works
>> > for our kernel.
>> 
>> Okay, so with that done (see below) Dion is now seeing the
>> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
>> course, with it having crashed before, it's hard to tell whether the
>> triggering now is an effect of the patch, or just got unmasked by it.
>> 
> 
> I just ported your below patch to upstream kernel and I didn't see the
> WARN_ON_ONCE. I only did iperf and netperf tests.
> 
> If the work load to trigger this bug is simple enough I can give it a
> shot...

I'm meanwhile relatively convinced that the warning isn't an effect
of the patch (final verification pending); I intend to submit v2 as
soon as 3.11-rc1 is out.

Jan


* Re: [Xen-devel] [PATCH] xen-netfront: pull on receive skb may need to happen earlier
  2013-07-12  8:56                 ` Jan Beulich
@ 2013-07-13 11:26                   ` Dion Kant
  0 siblings, 0 replies; 22+ messages in thread
From: Dion Kant @ 2013-07-13 11:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Wei Liu, Ian Campbell, davem, xen-devel, netdev, stable

On 07/12/2013 10:56 AM, Jan Beulich wrote:
>>>> On 12.07.13 at 10:32, Wei Liu <wei.liu2@citrix.com> wrote:
>> On Mon, Jul 08, 2013 at 03:20:26PM +0100, Jan Beulich wrote:
>>>>>> On 08.07.13 at 11:59, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>>>> On 05.07.13 at 16:53, Wei Liu <wei.liu2@citrix.com> wrote:
>>>>> 	skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags;
>>>>> 	skb->len += skb->data_len;
>>>>>     }
>>>>>
>>>>>     handle_incoming_packet();
>>>>>
>>>>> You seem to be altering the behavior of the original code, because in
>>>>> your patch the skb->len is incremented before use, while in the original
>>>>> code (which calls skb_headlen in handle_incoming_packet) the skb->len is
>>>>> correctly set.
>>>>
>>>> Right. So I basically need to keep skb->len up-to-date along with
>>>> ->data_len. Just handed a patch to Dion with that done; I'll defer
>>>> sending a v2 for the upstream code until I know the change works
>>>> for our kernel.
>>>
>>> Okay, so with that done (see below) Dion is now seeing the
>>> WARN_ON_ONCE(delta < len) in skb_try_coalesce() triggering. Of
>>> course, with it having crashed before, it's hard to tell whether the
>>> triggering now is an effect of the patch, or just got unmasked by it.
>>>
>>
>> I just ported your below patch to upstream kernel and I didn't see the
>> WARN_ON_ONCE. I only did iperf and netperf tests.
>>
>> If the work load to trigger this bug is simple enough I can give it a
>> shot...
> 
> I'm meanwhile relatively convinced that the warning isn't an effect
> of the patch (final verification pending); I intend to submit v2 as
> soon as 3.11-rc1 is out.
> 
> Jan
> 

I have been doing some testing, focussing on xennet_fill_frags(). I
enhanced the detection of the skb->data_len > nr_frags * PAGE_SIZE
condition (as suggested by Jan) a bit, by recording rx->status and
nr_frags for each skb dequeued in the while loop.

There is something wrong with this part of the patch:

+if (nr_frags == MAX_SKB_FRAGS) {
+       unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
+
+       BUG_ON(pull_to <= skb_headlen(skb));
+       __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+       nr_frags = shinfo->nr_frags;
+}
+BUG_ON(nr_frags >= MAX_SKB_FRAGS);

When nr_frags reaches MAX_SKB_FRAGS (and this does happen), nr_frags is
"reset" to shinfo->nr_frags. In fact I see nr_frags become 0 the first
time it reaches MAX_SKB_FRAGS.

The first problem with this is of course that the "skb->truesize +=
PAGE_SIZE * skb_shinfo(skb)->nr_frags" calculation following the "i =
xennet_fill_frags(np, skb, &tmpq)" call in netif_poll then gives a wrong
result. In the end the skb has a ->truesize much smaller than its ->len.
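
To make this first problem concrete, here is a minimal user-space model
of the accounting (this is not driver code; names and numbers are purely
illustrative): whatever the reason for the shrunken fragment count, the
truesize line afterwards charges fewer pages than were attached in the
while loop.

#include <stdio.h>

#define PAGE_SIZE      4096u
#define MAX_SKB_FRAGS  17u

int main(void)
{
    unsigned int pages_attached = MAX_SKB_FRAGS; /* frags filled in the while loop */
    unsigned int nr_frags_after = 0;             /* count observed after the "reset" */
    unsigned int truesize = 0;

    /* mirrors "skb->truesize += PAGE_SIZE * skb_shinfo(skb)->nr_frags" */
    truesize += PAGE_SIZE * nr_frags_after;

    printf("pages charged: %u, pages attached: %u\n",
           truesize / PAGE_SIZE, pages_attached);
    return 0;
}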

The second problem is that the BUG_ON(nr_frags >= MAX_SKB_FRAGS) should
never be able to trigger, since "shinfo->nr_frags", afaict, is not
updated in between.

Furthermore, I cannot figure out why shinfo->nr_frags equals 1 when
control enters xennet_fill_frags(), yet is 0 a little later, when
nr_frags reaches MAX_SKB_FRAGS.

Here is the (printk) result of a trial:


>   1 2013-07-13T04:27:30.757241+02:00 doma kernel: [   41.759528] netfront: Too many frags
>   2 2013-07-13T04:27:39.629261+02:00 doma kernel: [   50.628482] 17 == 17 (shinfo->nr_frags=0)
>   3 2013-07-13T04:27:39.629285+02:00 doma kernel: [   50.628486] j=0,rx_trace[j]=48,nr_frags_trace[j]=1
>   4 2013-07-13T04:27:39.629287+02:00 doma kernel: [   50.628488] j=1,rx_trace[j]=1000,nr_frags_trace[j]=2
>   5 2013-07-13T04:27:39.629289+02:00 doma kernel: [   50.628489] j=2,rx_trace[j]=1000,nr_frags_trace[j]=3
>   6 2013-07-13T04:27:39.629291+02:00 doma kernel: [   50.628491] j=3,rx_trace[j]=1000,nr_frags_trace[j]=4
>   7 2013-07-13T04:27:39.629293+02:00 doma kernel: [   50.628492] j=4,rx_trace[j]=1000,nr_frags_trace[j]=5
>   8 2013-07-13T04:27:39.629294+02:00 doma kernel: [   50.628493] j=5,rx_trace[j]=1000,nr_frags_trace[j]=6
>   9 2013-07-13T04:27:39.629296+02:00 doma kernel: [   50.628494] j=6,rx_trace[j]=1000,nr_frags_trace[j]=7
>  10 2013-07-13T04:27:39.629298+02:00 doma kernel: [   50.628496] j=7,rx_trace[j]=1000,nr_frags_trace[j]=8
>  11 2013-07-13T04:27:39.629300+02:00 doma kernel: [   50.628497] j=8,rx_trace[j]=1000,nr_frags_trace[j]=9
>  12 2013-07-13T04:27:39.629302+02:00 doma kernel: [   50.628498] j=9,rx_trace[j]=1000,nr_frags_trace[j]=10
>  13 2013-07-13T04:27:39.629304+02:00 doma kernel: [   50.628499] j=10,rx_trace[j]=1000,nr_frags_trace[j]=11
>  14 2013-07-13T04:27:39.629305+02:00 doma kernel: [   50.628503] j=11,rx_trace[j]=1000,nr_frags_trace[j]=12
>  15 2013-07-13T04:27:39.629307+02:00 doma kernel: [   50.628504] j=12,rx_trace[j]=1000,nr_frags_trace[j]=13
>  16 2013-07-13T04:27:39.629328+02:00 doma kernel: [   50.628506] j=13,rx_trace[j]=1000,nr_frags_trace[j]=14
>  17 2013-07-13T04:27:39.629330+02:00 doma kernel: [   50.628507] j=14,rx_trace[j]=1000,nr_frags_trace[j]=15
>  18 2013-07-13T04:27:39.629332+02:00 doma kernel: [   50.628508] j=15,rx_trace[j]=1000,nr_frags_trace[j]=16
>  19 2013-07-13T04:27:39.629335+02:00 doma kernel: [   50.628510] j=16,rx_trace[j]=e40,nr_frags_trace[j]=0
>  20 2013-07-13T04:27:39.629336+02:00 doma kernel: [   50.628511] xennet::xennet_fill_frags (fe88,feca,500,1,e40,42,42)

Line 2 comes from

if (nr_frags == MAX_SKB_FRAGS) {
  unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;

  BUG_ON(pull_to <= skb_headlen(skb));
  __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
+  printk("%u == %lu (shinfo->nr_frags=%u)\n",//temp
+                     nr_frags,MAX_SKB_FRAGS, shinfo->nr_frags);//tmp
  nr_frags = shinfo->nr_frags;
}

Lines 3 to 19 come from the dumping code I added at the end of the while loop

+if (skb->data_len > nr_frags * PAGE_SIZE) {//tmp
+ for (j=0; j<i; ++j)//tmp
+   printk("j=%u,rx_trace[j]=%x,nr_frags_trace[j]=%u\n", //tmp
+           j,rx_trace[j],nr_frags_trace[j]);//tmp
+ printk("xennet::xennet_fill_frags (%x,%x,%x,%u,%x,%x,%x)\n",//temp
+         skb->data_len, skb->len, skb->truesize, nr_frags,//tmp
+         rx->status, len0, data_len0);//temp
+}

For gathering the data I added a couple of variables on the stack

struct skb_shared_info *shinfo = skb_shinfo(skb);
int nr_frags = shinfo->nr_frags;
RING_IDX cons = np->rx.rsp_cons;
struct sk_buff *nskb;
+unsigned rx_trace[MAX_SKB_FRAGS+1], i=0,j, len0=0, data_len0=0;//tmp
+int nr_frags_trace[MAX_SKB_FRAGS+1];//tmp

and in the while loop

__skb_fill_page_desc(skb, nr_frags,
                     skb_frag_page(skb_shinfo(nskb)->frags),
                     rx->offset, rx->status);

+if (i==0) {//tmp
+  len0=skb->len;//tmp
+  data_len0=skb->data_len;//tmp
+}//tmp

+rx_trace[i] = rx->status;//tmp
skb->len += rx->status;
skb->data_len += rx->status;

skb_shinfo(nskb)->nr_frags = 0;
kfree_skb(nskb);

+nr_frags_trace[i++] = nr_frags;//tmp
nr_frags++;

Incidentally, I noticed that there seems to be a subtle timing
condition, related to the speed of the while loop, that is needed to
reach the state where nr_frags becomes MAX_SKB_FRAGS.

A bit more delay in the loop and it no longer occurs.

E.g. at first I had the gathering ordered like this:

+if (i==0) {//tmp
+  len0=skb->len;//tmp
+  data_len0=skb->data_len;//tmp
+}//tmp

+rx_trace[i] = rx->status;//tmp
+nr_frags_trace[i++] = nr_frags;//tmp
skb->len += rx->status;
skb->data_len += rx->status;

skb_shinfo(nskb)->nr_frags = 0;
kfree_skb(nskb);

nr_frags++;

and I was unable to observe nr_frags reaching MAX_SKB_FRAGS within a
reasonable amount of testing time. I thought this might be worthwhile to
share.

As a workload to trigger this I simply use

dom0-t:~ # scp bigrand.bin doma:/dev/null

and with this, on my system, the transfer rate settles at around 45 MB/s,
limited by a saturated core running the userland ssh process.
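
(bigrand.bin is nothing special - just a large blob of random data. A
similar file can be created with, for example, the hypothetical command
below; the size chosen here is arbitrary.)

dom0-t:~ # dd if=/dev/urandom of=bigrand.bin bs=1M count=2048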

Dion

^ permalink raw reply	[flat|nested] 22+ messages in thread

Thread overview: 22+ messages
     [not found] <8511913.uMAmUdIO30@eistomin.edss.local>
     [not found] ` <20130517085923.GC14401@zion.uk.xensource.com>
     [not found]   ` <51D57C1F.8070909@hunenet.nl>
     [not found]     ` <20130704150137.GW7483@zion.uk.xensource.com>
2013-07-05  9:32       ` [PATCH] xen-netfront: pull on receive skb may need to happen earlier Jan Beulich
2013-07-05 14:53         ` Wei Liu
2013-07-07  1:10           ` David Miller
2013-07-08  9:59           ` Jan Beulich
2013-07-08 12:16             ` Dion Kant
2013-07-08 12:41               ` Jan Beulich
2013-07-08 14:20             ` [Xen-devel] " Jan Beulich
2013-07-08 15:22               ` Eric Dumazet
2013-07-09  7:47                 ` Jan Beulich
2013-07-08 15:48               ` Wei Liu
2013-07-09  6:52                 ` Jan Beulich
2013-07-09 16:51                   ` Wei Liu
2013-07-10  6:58                     ` Jan Beulich
2013-07-10 10:04                       ` Wei Liu
2013-07-10 10:46                         ` Jan Beulich
2013-07-10 12:50                           ` Ian Campbell
2013-07-10 12:53                             ` Wei Liu
2013-07-10 13:58                             ` Jan Beulich
2013-07-10 13:19                       ` Eric Dumazet
2013-07-12  8:32               ` Wei Liu
2013-07-12  8:56                 ` Jan Beulich
2013-07-13 11:26                   ` Dion Kant
