netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* tcp: skb_shift() issues w.r.t. SACK processing
@ 2012-03-01 21:36 Vijay Subramanian
  2012-03-02 21:57 ` Neal Cardwell
  2012-03-05 10:42 ` Ilpo Järvinen
  0 siblings, 2 replies; 3+ messages in thread
From: Vijay Subramanian @ 2012-03-01 21:36 UTC (permalink / raw)
  To: netdev; +Cc: ilpo.jarvinen, ncardwell

Hi,

I was looking at the skb_shift() function that is called by
tcp_shift_skb_data() to shift bytes from a skb to the previous skb.
This is used to collapse multiple
SACK blocks into a single skb. I noticed the following two things.

1: skb_shift() has the following snippet.
 if (!to ||
           !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
                             fragfrom->page_offset)) {
               merge = -1;
       } else {
               merge = to - 1;

               todo -= skb_frag_size(fragfrom);
               if (todo < 0) {
                       if (skb_prepare_for_shift(skb) ||
                           skb_prepare_for_shift(tgt))
                               return 0;

                       /* All previous frag pointers might be stale! */
                       fragfrom = &skb_shinfo(skb)->frags[from];
                       fragto = &skb_shinfo(tgt)->frags[merge];

                       skb_frag_size_add(fragto, shiftlen);
                       skb_frag_size_sub(fragfrom, shiftlen);
                       fragfrom->page_offset += shiftlen;

                       goto onlymerged;
               }

               from++;
       }

The (todo < 0) part is executed when we need to copy less than
skb_frag_size(fragfrom). Essentially, after the bytes have been
shifted, the frag in the original skb will still retain some data.
However, unless I am reading this wrong, the bytes are actually never
shifted. The frag offsets and size are updated and then we jump to
label onlymerged which  updates other counters.
It looks like the data will be lost since it is in neither skb. This
may not matter in normal case since this data has been SACKed by
receiver and will be discarded after cumulative ack.
But in case of receiver SACK reneging, this data may potentially be
resent. A similar case later in the function is treated correctly with
the data actually getting copied.


2: The reason I  was looking at skb_shift() was to understand why some
SACKed skbs were not getting collapsed into the previous skbs by
tcp_shift_skb_data(). This function has the following check.
if (!skb_can_shift(skb))
               goto fallback;
which fails if skb has data in linear part.

Currently skb_shift() does not shift bytes if there is data in linear
part. From skb_shift()
BUG_ON(skb_headlen(skb));       /* Would corrupt stream */

commit f07d960df3 (tcp: avoid frag allocation for small frames) made
it possible for skb to have data in linear part. Such skbs will not be
collapsed into the previous skb. This is not a bug but this behavior
prevents some savings by not allowing collapse of SACKed skbs. Can
anyone clarify why shifting bytes from linear part would corrupt the
stream in current code? Does it make sense to remove this constraint
after
commit f07d960df3?

Thanks,
Vijay

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: tcp: skb_shift() issues w.r.t. SACK processing
  2012-03-01 21:36 tcp: skb_shift() issues w.r.t. SACK processing Vijay Subramanian
@ 2012-03-02 21:57 ` Neal Cardwell
  2012-03-05 10:42 ` Ilpo Järvinen
  1 sibling, 0 replies; 3+ messages in thread
From: Neal Cardwell @ 2012-03-02 21:57 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, ilpo.jarvinen

On Thu, Mar 1, 2012 at 4:36 PM, Vijay Subramanian
<subramanian.vijay@gmail.com> wrote:
> Hi,
>
> I was looking at the skb_shift() function that is called by
> tcp_shift_skb_data() to shift bytes from a skb to the previous skb.
> This is used to collapse multiple
> SACK blocks into a single skb. I noticed the following two things.
>
> 1: skb_shift() has the following snippet.
>  if (!to ||
>            !skb_can_coalesce(tgt, to, skb_frag_page(fragfrom),
>                              fragfrom->page_offset)) {
>                merge = -1;
>        } else {
>                merge = to - 1;
>
>                todo -= skb_frag_size(fragfrom);
>                if (todo < 0) {
>                        if (skb_prepare_for_shift(skb) ||
>                            skb_prepare_for_shift(tgt))
>                                return 0;
>
>                        /* All previous frag pointers might be stale! */
>                        fragfrom = &skb_shinfo(skb)->frags[from];
>                        fragto = &skb_shinfo(tgt)->frags[merge];
>
>                        skb_frag_size_add(fragto, shiftlen);
>                        skb_frag_size_sub(fragfrom, shiftlen);
>                        fragfrom->page_offset += shiftlen;
>
>                        goto onlymerged;
>                }
>
>                from++;
>        }
>
> The (todo < 0) part is executed when we need to copy less than
> skb_frag_size(fragfrom). Essentially, after the bytes have been
> shifted, the frag in the original skb will still retain some data.
> However, unless I am reading this wrong, the bytes are actually never
> shifted. The frag offsets and size are updated and then we jump to
> label onlymerged which  updates other counters.
> It looks like the data will be lost since it is in neither skb. This
> may not matter in normal case since this data has been SACKed by
> receiver and will be discarded after cumulative ack.
> But in case of receiver SACK reneging, this data may potentially be
> resent. A similar case later in the function is treated correctly with
> the data actually getting copied.

I think skb_shift() is actually OK in this regard. The code in
question is guarded by a call to skb_can_coalesce(), which ensures
that the source and destination fragments are actually from the same
page, and are adjacent within the page. So in this case, the code
seems correct in its assumption that all we need to do is increment
the size of the destination fragment and decrement the size of the
source fragment.

neal

ps: I haven't looked into your second question.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: tcp: skb_shift() issues w.r.t. SACK processing
  2012-03-01 21:36 tcp: skb_shift() issues w.r.t. SACK processing Vijay Subramanian
  2012-03-02 21:57 ` Neal Cardwell
@ 2012-03-05 10:42 ` Ilpo Järvinen
  1 sibling, 0 replies; 3+ messages in thread
From: Ilpo Järvinen @ 2012-03-05 10:42 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, ncardwell, Eric Dumazet

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2987 bytes --]

On Thu, 1 Mar 2012, Vijay Subramanian wrote:

> [...snip...]
> 
> The (todo < 0) part is executed when we need to copy less than
> skb_frag_size(fragfrom). Essentially, after the bytes have been
> shifted, the frag in the original skb will still retain some data.
> However, unless I am reading this wrong, the bytes are actually never
> shifted. The frag offsets and size are updated and then we jump to
> label onlymerged which  updates other counters.
> It looks like the data will be lost since it is in neither skb. This
> may not matter in normal case since this data has been SACKed by
> receiver and will be discarded after cumulative ack.
> But in case of receiver SACK reneging, this data may potentially be
> resent. A similar case later in the function is treated correctly with
> the data actually getting copied.

Like Neal already mentioned, the data is moved before goto. Thanks anyway 
for staring at it for a while and bringing up your suspicions.

> 2: The reason I  was looking at skb_shift() was to understand why some
> SACKed skbs were not getting collapsed into the previous skbs by
> tcp_shift_skb_data(). This function has the following check.
> if (!skb_can_shift(skb))
>                goto fallback;
> which fails if skb has data in linear part.
> 
> Currently skb_shift() does not shift bytes if there is data in linear
> part. From skb_shift()
> BUG_ON(skb_headlen(skb));       /* Would corrupt stream */
> 
> commit f07d960df3 (tcp: avoid frag allocation for small frames) made
> it possible for skb to have data in linear part. Such skbs will not be
> collapsed into the previous skb. This is not a bug but this behavior
> prevents some savings by not allowing collapse of SACKed skbs. Can
> anyone clarify why shifting bytes from linear part would corrupt the
> stream in current code?

Linear to linear part is legal only if tgt has no frags, back then frags 
were basically always there when GSO was enabled. The current skb_shift 
only deals (linear+)frags+frags copying, whereas linear+frags+linear+frags 
is unrepresentable afaict with the current skbuff structures. ...The 
corruption would happen if frags would be moved past a linear part.

> Does it make sense to remove this constraint after commit f07d960df3?

Pure linear to linear is probably not going to gain much as two of them 
won't fit anyway to (2048 - MAX_TCP_HEADER - x) with the default mss. And 
if one would be sending smaller than mss, imho that's where the problem 
lies, not in the recombining logic :-). But of course it would be 
technically possible to include also combining for pure linear to linear 
if there's enough linear space in the tgt but I doubt that the benefits 
are large enough to be worth the effort. Or do you have some particular 
case with measurable %?

I suppose the select_size logic could somehow depend on the size sendmsg 
is going to process though.

I've added Eric as Cc if he has some insights on this, it seems like 
trade-off here.


-- 
 i.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2012-03-05 10:42 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-01 21:36 tcp: skb_shift() issues w.r.t. SACK processing Vijay Subramanian
2012-03-02 21:57 ` Neal Cardwell
2012-03-05 10:42 ` Ilpo Järvinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).