All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-05-04 17:25 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-05-04 17:25 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 8261 bytes --]



On 05/02/2017 01:50 PM, Mat Martineau wrote:
> On Fri, 28 Apr 2017, Rao Shoaib wrote:
>
>>
>>
>> On 04/28/2017 03:52 PM, Mat Martineau wrote:
>>> On Wed, 26 Apr 2017, Rao Shoaib wrote:
>>>
>>>> Hi,
>>>>
>>>> Here is an attempt to not use tcp_skb_cb with minimal change to 
>>>> MPTCP code and skb code. I have tested the change by using wget 
>>>> from multipath-tcp.org. pcap file is attached.
>>>>
>>>> The approach that I have taken is very straight forward. On Tx when 
>>>> an skb is allocated and the underlying socket is MPTCP , 32 bytes 
>>>> extra are allocated[1] and reserved as the head room. This space is 
>>>> used for passing what was being passed in tcp_skb_cb. An used bit 
>>>> in sk_buff is used to indicate this special case and when skb is 
>>>> copied the extra data is also copied. clone is not an issue as the 
>>>> data is shared. This work for retransmission also because 
>>>> tcp_transmit_skb clones the original skb.
>>>>
>>>> On the Rx side, options are parsed in tcp at that point we can 
>>>> reuse the space that was used for ether and IP  headers and no 
>>>> extra space is required.
>>>>
>>>> The space used is always right below the skb head pointer because 
>>>> the location of head never changes.
>>>>
>>>> Please take a look and let me know what I have missed.
>>>
>>> Hi Rao -
>>>
>>> It's quite a bit easier to review patches on a mailing list when 
>>> they're plain text (as opposed to attachments), so I'd like ask that 
>>> we stick to the kernel patch guidelines: 
>>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>>>
>>> (I recognize that it's a long read - using 'git send-email' will get 
>>> you most of the way there!)
>> Sorry about that. I have never posted a patch before.
>>>
>>>
>>> I'm wary of using the skb data to hide away an extra control block. 
>>> The number of data bytes consumed isn't really accounted for (other 
>>> than being hard coded for the MPTCP use case), 
>> I don't understand this concern. Why do they have to be accounted for 
>> and what code needs that ? The total size allocated is still billed 
>> to the socket. This is same as allocating maximum size  tcp header 
>> and not using all the space OR one can think of it as a new protocol 
>> header that gets added because we are running ethernet over Infiniband.
>
> If another layer of the networking stack checks skb_headroom() on an 
> skb with a struct tucked away at skb->head, that layer will have no 
> reason to avoid writing to the "free" header space.
>
> It's different from a new protocol header in that real protocol 
> headers use skb_push() and only maintain valid data between skb->data 
> and skb->tail.
>
> The maintainers aren't going to like breaking the rules about where 
> valid data lives in the buffer space because it will impact the 
> maintainability of the code. Unless someone is familiar with the 
> quirks of MPTCP's buffer space usage, they could get tripped up when 
> doing some kind of encapsulation/deencapsulation, for example.
>
>>
>>> and keeping track of the size of the extra control block would take 
>>> some space in the skb, which is what we're trying to avoid. 
>> The way I have coded it, it does not, and that is the whole point of 
>> taking this approach because upstream folks do not want the size of 
>> 'struct sk_buff' increased.
>
> We are in agreement here. I was pointing out that modifying your 
> approach to have a variable size for MPTCP_SEG_INFO() (or a generic 
> version of it) is also not workable because of the sk_buff size 
> constraint.
>
>> Perhaps I do not understand your concern about counting, can you give 
>> me an example of the issue that you have in mind.
>
> I think I got to this above - skb_headroom() reports bytes available, 
> and a lower layer is free to overwrite the MPTCP_SEG_INFO bytes.
>
>>> On the rx side, overwriting the network or mac headers doesn't seem 
>>> right - what if the skb was cloned for monitoring purposes? 
>> This is a valid concern -- let me think about it a little bit. Where 
>> would the buffer be cloned ?
>
> There are so many places it could be cloned it's easier to talk about 
> where it can't. If the skb is newly created or provided by some piece 
> of code that guarantees it hasn't been cloned, then you know it's not 
> cloned. Otherwise it may or may not be a clone, but that's checkable 
> with skb_cloned().
>
>>> We'd have to force an skb_copy before modifying the buffer. It might 
>>> be safer to use space after skb_end_pointer() 
>> Yes that can be done (I think). I choose the head just to test things 
>> out.
>>> - but that's where the skb_shared_info struct is, so maybe what we 
>>> need is a way to optionally put something after skb_shared_info?
>> This can not be done currently based on how the buffer is 
>> constructed.  I would avoid doing that any ways as it would cause 
>> more concern for upstream folks as it is more disruptive since it 
>> changes the location of the end of the structure. In the current 
>> scheme no assumption is broken about where everything lies and all 
>> pointers retain their current interpretation.
>
> My (speculative) suggestion was to add some optional space at skb->end 
> + sizeof(struct skb_shared_info), which changes the location of the 
> end of the allocated space but doesn't change where any current 
> structures end or the values of any existing pointers. The tradeoff is 
> that the extra size information needs to get in to __alloc_skb() somehow.
>
>>>
>>> I think we need to take a step back: what data does MPTCP really 
>>> need to attach to an skb? If we have extra mptcp data with the 
>>> tcp_sock for the subflow, is there a way to keep track of auxiliary 
>>> data outside the skb? I'll think on this some more too.
>> You can not use tcp_sock because the data is per sk_buff and not per 
>> flow. Additionally same buffer could be transmitted on several flows. 
>> We could implement some kind of lookup but that will kill the 
>> performance.
>
> Yeah, "some kind of lookup" is what I was getting at, if it could be 
> made fast enough.
>
>> I am working on a generic solution but I am not optimistic that 
>> upstream folks will accept it because it will require more changes to 
>> the existing code.
>
> If we're adding functionality that's seen as useful or an improvement 
> over what's there now, I think it will get a fair shot so long as the 
> normal structure size is the same or smaller (and the changes to 
> existing code are not too risky, etc).
>
>>
>> If a better scheme is proposed that works I am all for it.
>
> Yes, I definitely agree! We have a tough set of constraints to work 
> within, but we haven't exhausted the possibilities. I'm still trying 
> to come up with other alternatives to consider.

Hi Mat,

You have raised concern about someone using the head room. I would like 
to understand how Linux SKB's work as I have a different understanding.

I assumed that all code in Linux follows the basic principle that header 
space is reserved when the skb is allocated. That is why there is

LL_MAX_HEADER -- /* Compute the worst-case header length according to 
the protocols used */

Plus the SKB usage rules are very clear about references.

/* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.
  *
In this case the dataref (lower 16 bits) will be greater than one and a 
I expect a consumer to call skb_cow_head() to determine if the header 
can be changed or does it need to make a copy. __vlan_insert_tag() for 
example does just that. So what are the condition that allow a consumer 
to disregard these rules and just use the header space ?

As evident from the 2nd patch that I have sent out,  I am not committed 
to any fix. If the team does not agree, we have to keep trying. This is 
a good exercise for me to learn the nuances of Linux networking. I will 
be an expert on SKB's by the time we are done :-).

Rao

> -- 
> Mat Martineau
> Intel OTC


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11341 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-05-05 23:14 Mat Martineau
  0 siblings, 0 replies; 12+ messages in thread
From: Mat Martineau @ 2017-05-05 23:14 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]


Hi Rao,


On Thu, 4 May 2017, Rao Shoaib wrote:

> Hi Mat,
> 
> You have raised concern about someone using the head room. I would like to understand how Linux SKB's work as I have a different
> understanding.
> 
> I assumed that all code in Linux follows the basic principle that header space is reserved when the skb is allocated. That is why there is
> 
> LL_MAX_HEADER -- /* Compute the worst-case header length according to the protocols used */
> 
> Plus the SKB usage rules are very clear about references.
> 
> /* We divide dataref into two halves.  The higher 16 bits hold references
>  * to the payload part of skb->data.  The lower 16 bits hold references to
>  * the entire skb->data.
>  *
> In this case the dataref (lower 16 bits) will be greater than one and a I expect a consumer to call skb_cow_head() to determine if the header
> can be changed or does it need to make a copy. __vlan_insert_tag() for example does just that. So what are the condition that allow a
> consumer to disregard these rules and just use the header space ?

I generally refer to http://vger.kernel.org/~davem/skb_data.html for 
sk_buff basics. It's a little dated (sk_buff has grown since it was 
written), but still helpful.

skb_headroom() will indicate that space is available without regard to the 
reference counts in dataref, or the presence of MPTCP stuff at skb->head. 
I think that function needs to continue to return a trustworthy value.

As for when header space is writable, __skb_header_release() is used by 
the tcp stack before placing packets in the write queue, so clones of 
those skbs will be able to write in the header space without making a copy 
first.

> 
> As evident from the 2nd patch that I have sent out,  I am not committed to any fix. If the team does not agree, we have to keep trying. This
> is a good exercise for me to learn the nuances of Linux networking. I will be an expert on SKB's by the time we are done :-).

Thanks for sticking with it!

I spent some time this afternoon looking at sk_buff->reserved_tailroom 
(and associated functions like skb_availroom() and 
skb_tailroom_reserve()). This does reserve space (at least for some 
purposes) and seems to be used only by IGMP and IPv6 multicast, and avoids 
the ambiguity of writable headers. However, it's still putting the data 
buffer to use for something other than packet data (which I remain 
concerned about).

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-05-02 20:50 Mat Martineau
  0 siblings, 0 replies; 12+ messages in thread
From: Mat Martineau @ 2017-05-02 20:50 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 6733 bytes --]

On Fri, 28 Apr 2017, Rao Shoaib wrote:

>
>
> On 04/28/2017 03:52 PM, Mat Martineau wrote:
>> On Wed, 26 Apr 2017, Rao Shoaib wrote:
>> 
>>> Hi,
>>> 
>>> Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP code 
>>> and skb code. I have tested the change by using wget from 
>>> multipath-tcp.org. pcap file is attached.
>>> 
>>> The approach that I have taken is very straight forward. On Tx when an skb 
>>> is allocated and the underlying socket is MPTCP , 32 bytes extra are 
>>> allocated[1] and reserved as the head room. This space is used for passing 
>>> what was being passed in tcp_skb_cb. An used bit in sk_buff is used to 
>>> indicate this special case and when skb is copied the extra data is also 
>>> copied. clone is not an issue as the data is shared. This work for 
>>> retransmission also because tcp_transmit_skb clones the original skb.
>>> 
>>> On the Rx side, options are parsed in tcp at that point we can reuse the 
>>> space that was used for ether and IP  headers and no extra space is 
>>> required.
>>> 
>>> The space used is always right below the skb head pointer because the 
>>> location of head never changes.
>>> 
>>> Please take a look and let me know what I have missed.
>> 
>> Hi Rao -
>> 
>> It's quite a bit easier to review patches on a mailing list when they're 
>> plain text (as opposed to attachments), so I'd like ask that we stick to 
>> the kernel patch guidelines: 
>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>> 
>> (I recognize that it's a long read - using 'git send-email' will get you 
>> most of the way there!)
> Sorry about that. I have never posted a patch before.
>> 
>> 
>> I'm wary of using the skb data to hide away an extra control block. The 
>> number of data bytes consumed isn't really accounted for (other than being 
>> hard coded for the MPTCP use case), 
> I don't understand this concern. Why do they have to be accounted for and 
> what code needs that ? The total size allocated is still billed to the 
> socket. This is same as allocating maximum size  tcp header and not using all 
> the space OR one can think of it as a new protocol header that gets added 
> because we are running ethernet over Infiniband.

If another layer of the networking stack checks skb_headroom() on an skb 
with a struct tucked away at skb->head, that layer will have no reason to 
avoid writing to the "free" header space.

It's different from a new protocol header in that real protocol headers 
use skb_push() and only maintain valid data between skb->data and 
skb->tail.

The maintainers aren't going to like breaking the rules about where valid 
data lives in the buffer space because it will impact the maintainability 
of the code. Unless someone is familiar with the quirks of MPTCP's buffer 
space usage, they could get tripped up when doing some kind of 
encapsulation/deencapsulation, for example.

>
>> and keeping track of the size of the extra control block would take some 
>> space in the skb, which is what we're trying to avoid. 
> The way I have coded it, it does not, and that is the whole point of taking 
> this approach because upstream folks do not want the size of 'struct sk_buff' 
> increased.

We are in agreement here. I was pointing out that modifying your approach 
to have a variable size for MPTCP_SEG_INFO() (or a generic version of it) 
is also not workable because of the sk_buff size constraint.

> Perhaps I do not understand your concern about counting, can you give me an 
> example of the issue that you have in mind.

I think I got to this above - skb_headroom() reports bytes available, and 
a lower layer is free to overwrite the MPTCP_SEG_INFO bytes.

>> On the rx side, overwriting the network or mac headers doesn't seem right - 
>> what if the skb was cloned for monitoring purposes? 
> This is a valid concern -- let me think about it a little bit. Where would 
> the buffer be cloned ?

There are so many places it could be cloned it's easier to talk about 
where it can't. If the skb is newly created or provided by some piece of 
code that guarantees it hasn't been cloned, then you know it's not cloned. 
Otherwise it may or may not be a clone, but that's checkable with 
skb_cloned().

>> We'd have to force an skb_copy before modifying the buffer. It might be 
>> safer to use space after skb_end_pointer() 
> Yes that can be done (I think). I choose the head just to test things out.
>> - but that's where the skb_shared_info struct is, so maybe what we need is 
>> a way to optionally put something after skb_shared_info?
> This can not be done currently based on how the buffer is constructed.  I 
> would avoid doing that any ways as it would cause more concern for upstream 
> folks as it is more disruptive since it changes the location of the end of 
> the structure. In the current scheme no assumption is broken about where 
> everything lies and all pointers retain their current interpretation.

My (speculative) suggestion was to add some optional space at skb->end + 
sizeof(struct skb_shared_info), which changes the location of the end of 
the allocated space but doesn't change where any current structures end or 
the values of any existing pointers. The tradeoff is that the extra size 
information needs to get in to __alloc_skb() somehow.

>> 
>> I think we need to take a step back: what data does MPTCP really need to 
>> attach to an skb? If we have extra mptcp data with the tcp_sock for the 
>> subflow, is there a way to keep track of auxiliary data outside the skb? 
>> I'll think on this some more too.
> You can not use tcp_sock because the data is per sk_buff and not per flow. 
> Additionally same buffer could be transmitted on several flows. We could 
> implement some kind of lookup but that will kill the performance.

Yeah, "some kind of lookup" is what I was getting at, if it could be made 
fast enough.

> I am working on a generic solution but I am not optimistic that upstream 
> folks will accept it because it will require more changes to the existing 
> code.

If we're adding functionality that's seen as useful or an improvement over 
what's there now, I think it will get a fair shot so long as the normal 
structure size is the same or smaller (and the changes to existing code 
are not too risky, etc).

>
> If a better scheme is proposed that works I am all for it.

Yes, I definitely agree! We have a tough set of constraints to work 
within, but we haven't exhausted the possibilities. I'm still trying to 
come up with other alternatives to consider.

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  8:59 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-29  8:59 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 640 bytes --]



On 04/29/2017 01:39 AM, Christoph Paasch wrote:
>
>> Please take a look at the latest tcp_skb_cb. As I mentioned before, the
>> space has been completely used up. If you can re-arrange something and
>> recover 24 bytes than that is the path we should take.
> Oh, my bad. I was on mptcp_trunk. Sorry about that.
>
> Back to the drawing board :)
>
>
> Christoph
>
:-)
A quick look suggests that all that is needed is the data sequence. The 
rest can be calculated. However we will have to recalculating every time 
the packet is retransmitted. Right now we do not even have space to 
store/pass the data sequence.

Rao

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  8:39 Christoph Paasch
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Paasch @ 2017-04-29  8:39 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 5354 bytes --]

On 29/04/17 - 01:29:13, Rao Shoaib wrote:
> 
> 
> On 04/29/2017 12:36 AM, Christoph Paasch wrote:
> > Hello,
> > 
> > On 28/04/17 - 15:52:53, Mat Martineau wrote:
> > > On Wed, 26 Apr 2017, Rao Shoaib wrote:
> > > 
> > > > Hi,
> > > > 
> > > > Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP
> > > > code and skb code. I have tested the change by using wget from
> > > > multipath-tcp.org. pcap file is attached.
> > > > 
> > > > The approach that I have taken is very straight forward. On Tx when an
> > > > skb is allocated and the underlying socket is MPTCP , 32 bytes extra are
> > > > allocated[1] and reserved as the head room. This space is used for
> > > > passing what was being passed in tcp_skb_cb. An used bit in sk_buff is
> > > > used to indicate this special case and when skb is copied the extra data
> > > > is also copied. clone is not an issue as the data is shared. This work
> > > > for retransmission also because tcp_transmit_skb clones the original
> > > > skb.
> > > > 
> > > > On the Rx side, options are parsed in tcp at that point we can reuse the
> > > > space that was used for ether and IP  headers and no extra space is
> > > > required.
> > > > 
> > > > The space used is always right below the skb head pointer because the
> > > > location of head never changes.
> > > > 
> > > > Please take a look and let me know what I have missed.
> > > Hi Rao -
> > > 
> > > It's quite a bit easier to review patches on a mailing list when they're
> > > plain text (as opposed to attachments), so I'd like ask that we stick to the
> > > kernel patch guidelines:
> > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html
> > > 
> > > (I recognize that it's a long read - using 'git send-email' will get you
> > > most of the way there!)
> > > 
> > > 
> > > I'm wary of using the skb data to hide away an extra control block. The
> > > number of data bytes consumed isn't really accounted for (other than being
> > > hard coded for the MPTCP use case), and keeping track of the size of the
> > > extra control block would take some space in the skb, which is what we're
> > > trying to avoid. On the rx side, overwriting the network or mac headers
> > > doesn't seem right - what if the skb was cloned for monitoring purposes?
> > > We'd have to force an skb_copy before modifying the buffer. It might be
> > > safer to use space after skb_end_pointer() - but that's where the
> > > skb_shared_info struct is, so maybe what we need is a way to optionally put
> > > something after skb_shared_info?
> > > 
> > > I think we need to take a step back: what data does MPTCP really need to
> > > attach to an skb? If we have extra mptcp data with the tcp_sock for the
> > > subflow, is there a way to keep track of auxiliary data outside the skb?
> > > I'll think on this some more too.
> > What we really need is basically the DSS-mapping. Because, every byte that
> > has been scheduled on a subflow is linked to a DSS-mapping that is built of
> > the 32-bit subflow-sequence-number, the data-length (16 bits) and the
> > data-sequence number (32 or 64 bits). That's all we really need.
> > 
> > I still think that tcp_skb_cb is the best place to store it. And currently,
> > we would be fine wrt to not increasing the skb, if we get rid of dss_off
> > (which should be easily doable).
> > Then we are only adding 1 byte to tcp_skb_cb (mptcp_flags), and it fits nicely
> > into the existing hole.
> > The reason why I think it is the best place to store it is because
> > tcp_skb_cb has a meaning only inside the TCP-stack. It's a scratch-buffer
> > that on the other layers can be used for other stuff.
> Please take a look at the latest tcp_skb_cb. As I mentioned before, the
> space has been completely used up. If you can re-arrange something and
> recover 24 bytes than that is the path we should take.

Oh, my bad. I was on mptcp_trunk. Sorry about that.

Back to the drawing board :)


Christoph

> 
> > Another solution would be to have some kind of a hashmap for each subflow,
> > where we can map each byte of the subflow's send-queue to its corresponding
> > DSS-mapping. I don't think that's a path we should go down. It seems quite
> > complex to do.
> I don't want to go down this path either because it will have performance
> implications.
> > 
> > 
> > On the RX-side, I would not worry too much. When we need the data
> > sequence-number we can always just look it up again in the TCP-header. Which
> > is actually what we are doing right now (almost - we still use dss_off to
> > quickly find it, but that can be changed easily).
> That is what I thought we can do also. I was wondering if we could just not
> process MPTCP options till the packet is received by MPTCP. However that
> would mean parsing the options twice. Since I found space I used it. Another
> option would be to allocate the structure and overload a pointer, but as I
> pointed out in my latest email, we have space so why complicate life.
> 
> 
> Rao.
> > 
> > 
> > Christoph
> > 
> > 
> > > --
> > > Mat Martineau
> > > Intel OTC
> > > _______________________________________________
> > > mptcp mailing list
> > > mptcp(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/mptcp
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  8:29 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-29  8:29 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4902 bytes --]



On 04/29/2017 12:36 AM, Christoph Paasch wrote:
> Hello,
>
> On 28/04/17 - 15:52:53, Mat Martineau wrote:
>> On Wed, 26 Apr 2017, Rao Shoaib wrote:
>>
>>> Hi,
>>>
>>> Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP
>>> code and skb code. I have tested the change by using wget from
>>> multipath-tcp.org. pcap file is attached.
>>>
>>> The approach that I have taken is very straight forward. On Tx when an
>>> skb is allocated and the underlying socket is MPTCP , 32 bytes extra are
>>> allocated[1] and reserved as the head room. This space is used for
>>> passing what was being passed in tcp_skb_cb. An used bit in sk_buff is
>>> used to indicate this special case and when skb is copied the extra data
>>> is also copied. clone is not an issue as the data is shared. This work
>>> for retransmission also because tcp_transmit_skb clones the original
>>> skb.
>>>
>>> On the Rx side, options are parsed in tcp at that point we can reuse the
>>> space that was used for ether and IP  headers and no extra space is
>>> required.
>>>
>>> The space used is always right below the skb head pointer because the
>>> location of head never changes.
>>>
>>> Please take a look and let me know what I have missed.
>> Hi Rao -
>>
>> It's quite a bit easier to review patches on a mailing list when they're
>> plain text (as opposed to attachments), so I'd like ask that we stick to the
>> kernel patch guidelines:
>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>>
>> (I recognize that it's a long read - using 'git send-email' will get you
>> most of the way there!)
>>
>>
>> I'm wary of using the skb data to hide away an extra control block. The
>> number of data bytes consumed isn't really accounted for (other than being
>> hard coded for the MPTCP use case), and keeping track of the size of the
>> extra control block would take some space in the skb, which is what we're
>> trying to avoid. On the rx side, overwriting the network or mac headers
>> doesn't seem right - what if the skb was cloned for monitoring purposes?
>> We'd have to force an skb_copy before modifying the buffer. It might be
>> safer to use space after skb_end_pointer() - but that's where the
>> skb_shared_info struct is, so maybe what we need is a way to optionally put
>> something after skb_shared_info?
>>
>> I think we need to take a step back: what data does MPTCP really need to
>> attach to an skb? If we have extra mptcp data with the tcp_sock for the
>> subflow, is there a way to keep track of auxiliary data outside the skb?
>> I'll think on this some more too.
> What we really need is basically the DSS-mapping. Because, every byte that
> has been scheduled on a subflow is linked to a DSS-mapping that is built of
> the 32-bit subflow-sequence-number, the data-length (16 bits) and the
> data-sequence number (32 or 64 bits). That's all we really need.
>
> I still think that tcp_skb_cb is the best place to store it. And currently,
> we would be fine wrt to not increasing the skb, if we get rid of dss_off
> (which should be easily doable).
> Then we are only adding 1 byte to tcp_skb_cb (mptcp_flags), and it fits nicely
> into the existing hole.
> The reason why I think it is the best place to store it is because
> tcp_skb_cb has a meaning only inside the TCP-stack. It's a scratch-buffer
> that on the other layers can be used for other stuff.
Please take a look at the latest tcp_skb_cb. As I mentioned before, the 
space has been completely used up. If you can re-arrange something and 
recover 24 bytes than that is the path we should take.

> Another solution would be to have some kind of a hashmap for each subflow,
> where we can map each byte of the subflow's send-queue to its corresponding
> DSS-mapping. I don't think that's a path we should go down. It seems quite
> complex to do.
I don't want to go down this path either because it will have 
performance implications.
>
>
> On the RX-side, I would not worry too much. When we need the data
> sequence-number we can always just look it up again in the TCP-header. Which
> is actually what we are doing right now (almost - we still use dss_off to
> quickly find it, but that can be changed easily).
That is what I thought we can do also. I was wondering if we could just 
not process MPTCP options till the packet is received by MPTCP. However 
that would mean parsing the options twice. Since I found space I used 
it. Another option would be to allocate the structure and overload a 
pointer, but as I pointed out in my latest email, we have space so why 
complicate life.


Rao.
>
>
> Christoph
>
>
>> --
>> Mat Martineau
>> Intel OTC
>> _______________________________________________
>> mptcp mailing list
>> mptcp(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/mptcp


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  8:10 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-29  8:10 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1835 bytes --]



On 04/28/2017 09:03 PM, Rao Shoaib wrote:
>>> On the rx side, overwriting the network or mac headers doesn't seem 
>>> right - what if the skb was cloned for monitoring purposes? 
>> This is a valid concern -- let me think about it a little bit. Where 
>> would the buffer be cloned ?
> Oh I just realized, I sent out a pcap file while doing mostly Rx 
> traffic. I did not see any corruption so I am not sure this is an 
> issue. To be sure I would have to dig into the code as to what happens 
> when the promiscuous mode is on.
>
> Rao. 

The reason I did not see any corruption on received packets is because 
on Rx the skb allocated has extra head room. NET_IP_ALIGN can be zero 
but NET_SKB_PAD is not. So using space right below head is just fine 
because by the time the packet reaches MPTCP the space is not being 
used. So it turns out that what is being done for MPTCP on the Tx side 
is already happening on the Rx side.

While we look at other better solutions I may send this patch to the 
mptcp-dev mailing list to get more comments as I may be overlooking some 
other issues as well.

Rao.

struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
                                  gfp_t gfp_mask)
{
         struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
         struct sk_buff *skb;
         void *data;

         len += NET_SKB_PAD + NET_IP_ALIGN;

<.............>

skb_success:
         skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
............

}

skbuff.h

  * Various parts of the networking layer expect at least 32 bytes of
  * headroom, you should not reduce this.
  *

* Using max(32, L1_CACHE_BYTES) makes sense (especially with RPS)

#ifndef NET_SKB_PAD
#define NET_SKB_PAD     max(32, L1_CACHE_BYTES)
#endif


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 2734 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  7:36 Christoph Paasch
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Paasch @ 2017-04-29  7:36 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4093 bytes --]

Hello,

On 28/04/17 - 15:52:53, Mat Martineau wrote:
> On Wed, 26 Apr 2017, Rao Shoaib wrote:
> 
> > Hi,
> > 
> > Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP
> > code and skb code. I have tested the change by using wget from
> > multipath-tcp.org. pcap file is attached.
> > 
> > The approach that I have taken is very straight forward. On Tx when an
> > skb is allocated and the underlying socket is MPTCP , 32 bytes extra are
> > allocated[1] and reserved as the head room. This space is used for
> > passing what was being passed in tcp_skb_cb. An used bit in sk_buff is
> > used to indicate this special case and when skb is copied the extra data
> > is also copied. clone is not an issue as the data is shared. This work
> > for retransmission also because tcp_transmit_skb clones the original
> > skb.
> > 
> > On the Rx side, options are parsed in tcp at that point we can reuse the
> > space that was used for ether and IP  headers and no extra space is
> > required.
> > 
> > The space used is always right below the skb head pointer because the
> > location of head never changes.
> > 
> > Please take a look and let me know what I have missed.
> 
> Hi Rao -
> 
> It's quite a bit easier to review patches on a mailing list when they're
> plain text (as opposed to attachments), so I'd like ask that we stick to the
> kernel patch guidelines:
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
> 
> (I recognize that it's a long read - using 'git send-email' will get you
> most of the way there!)
> 
> 
> I'm wary of using the skb data to hide away an extra control block. The
> number of data bytes consumed isn't really accounted for (other than being
> hard coded for the MPTCP use case), and keeping track of the size of the
> extra control block would take some space in the skb, which is what we're
> trying to avoid. On the rx side, overwriting the network or mac headers
> doesn't seem right - what if the skb was cloned for monitoring purposes?
> We'd have to force an skb_copy before modifying the buffer. It might be
> safer to use space after skb_end_pointer() - but that's where the
> skb_shared_info struct is, so maybe what we need is a way to optionally put
> something after skb_shared_info?
> 
> I think we need to take a step back: what data does MPTCP really need to
> attach to an skb? If we have extra mptcp data with the tcp_sock for the
> subflow, is there a way to keep track of auxiliary data outside the skb?
> I'll think on this some more too.

What we really need is basically the DSS-mapping. Because, every byte that
has been scheduled on a subflow is linked to a DSS-mapping that is built of
the 32-bit subflow-sequence-number, the data-length (16 bits) and the
data-sequence number (32 or 64 bits). That's all we really need.

I still think that tcp_skb_cb is the best place to store it. And currently,
we would be fine wrt to not increasing the skb, if we get rid of dss_off
(which should be easily doable).
Then we are only adding 1 byte to tcp_skb_cb (mptcp_flags), and it fits nicely
into the existing hole.
The reason why I think it is the best place to store it is because
tcp_skb_cb has a meaning only inside the TCP-stack. It's a scratch-buffer
that on the other layers can be used for other stuff.

Another solution would be to have some kind of a hashmap for each subflow,
where we can map each byte of the subflow's send-queue to its corresponding
DSS-mapping. I don't think that's a path we should go down. It seems quite
complex to do.


On the RX-side, I would not worry too much. When we need the data
sequence-number we can always just look it up again in the TCP-header. Which
is actually what we are doing right now (almost - we still use dss_off to
quickly find it, but that can be changed easily).


Christoph


> 
> --
> Mat Martineau
> Intel OTC
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  4:03 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-29  4:03 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]



On 04/28/2017 08:23 PM, Rao Shoaib wrote:
>
>
> On 04/28/2017 03:52 PM, Mat Martineau wrote:
>> On Wed, 26 Apr 2017, Rao Shoaib wrote:
>>
>>> Hi,
>>>
>>> Here is an attempt to not use tcp_skb_cb with minimal change to 
>>> MPTCP code and skb code. I have tested the change by using wget from 
>>> multipath-tcp.org. pcap file is attached.
>>>
>>> The approach that I have taken is very straight forward. On Tx when 
>>> an skb is allocated and the underlying socket is MPTCP , 32 bytes 
>>> extra are allocated[1] and reserved as the head room. This space is 
>>> used for passing what was being passed in tcp_skb_cb. An used bit in 
>>> sk_buff is used to indicate this special case and when skb is copied 
>>> the extra data is also copied. clone is not an issue as the data is 
>>> shared. This work for retransmission also because tcp_transmit_skb 
>>> clones the original skb.
>>>
>>> On the Rx side, options are parsed in tcp at that point we can reuse 
>>> the space that was used for ether and IP  headers and no extra space 
>>> is required.
>>>
>>> The space used is always right below the skb head pointer because 
>>> the location of head never changes.
>>>
>>> Please take a look and let me know what I have missed.
>>
>> Hi Rao -
>>
>> It's quite a bit easier to review patches on a mailing list when 
>> they're plain text (as opposed to attachments), so I'd like ask that 
>> we stick to the kernel patch guidelines: 
>> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>>
>> (I recognize that it's a long read - using 'git send-email' will get 
>> you most of the way there!)
> Sorry about that. I have never posted a patch before.
>>
>>
>> I'm wary of using the skb data to hide away an extra control block. 
>> The number of data bytes consumed isn't really accounted for (other 
>> than being hard coded for the MPTCP use case), 
> I don't understand this concern. Why do they have to be accounted for 
> and what code needs that ? The total size allocated is still billed to 
> the socket. This is same as allocating maximum size  tcp header and 
> not using all the space OR one can think of it as a new protocol 
> header that gets added because we are running ethernet over Infiniband.
>
>> and keeping track of the size of the extra control block would take 
>> some space in the skb, which is what we're trying to avoid. 
> The way I have coded it, it does not, and that is the whole point of 
> taking this approach because upstream folks do not want the size of 
> 'struct sk_buff' increased.
>
> Perhaps I do not understand your concern about counting, can you give 
> me an example of the issue that you have in mind.
>
>> On the rx side, overwriting the network or mac headers doesn't seem 
>> right - what if the skb was cloned for monitoring purposes? 
> This is a valid concern -- let me think about it a little bit. Where 
> would the buffer be cloned ?
Oh I just realized, I sent out a pcap file while doing mostly Rx 
traffic. I did not see any corruption so I am not sure this is an issue. 
To be sure I would have to dig into the code as to what happens when the 
promiscuous mode is on.

Rao.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-29  3:23 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-29  3:23 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4333 bytes --]



On 04/28/2017 03:52 PM, Mat Martineau wrote:
> On Wed, 26 Apr 2017, Rao Shoaib wrote:
>
>> Hi,
>>
>> Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP 
>> code and skb code. I have tested the change by using wget from 
>> multipath-tcp.org. pcap file is attached.
>>
>> The approach that I have taken is very straight forward. On Tx when 
>> an skb is allocated and the underlying socket is MPTCP , 32 bytes 
>> extra are allocated[1] and reserved as the head room. This space is 
>> used for passing what was being passed in tcp_skb_cb. An used bit in 
>> sk_buff is used to indicate this special case and when skb is copied 
>> the extra data is also copied. clone is not an issue as the data is 
>> shared. This work for retransmission also because tcp_transmit_skb 
>> clones the original skb.
>>
>> On the Rx side, options are parsed in tcp at that point we can reuse 
>> the space that was used for ether and IP  headers and no extra space 
>> is required.
>>
>> The space used is always right below the skb head pointer because the 
>> location of head never changes.
>>
>> Please take a look and let me know what I have missed.
>
> Hi Rao -
>
> It's quite a bit easier to review patches on a mailing list when 
> they're plain text (as opposed to attachments), so I'd like ask that 
> we stick to the kernel patch guidelines: 
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>
> (I recognize that it's a long read - using 'git send-email' will get 
> you most of the way there!)
Sorry about that. I have never posted a patch before.
>
>
> I'm wary of using the skb data to hide away an extra control block. 
> The number of data bytes consumed isn't really accounted for (other 
> than being hard coded for the MPTCP use case), 
I don't understand this concern. Why do they have to be accounted for 
and what code needs that ? The total size allocated is still billed to 
the socket. This is same as allocating maximum size  tcp header and not 
using all the space OR one can think of it as a new protocol header that 
gets added because we are running ethernet over Infiniband.

> and keeping track of the size of the extra control block would take 
> some space in the skb, which is what we're trying to avoid. 
The way I have coded it, it does not, and that is the whole point of 
taking this approach because upstream folks do not want the size of 
'struct sk_buff' increased.

Perhaps I do not understand your concern about counting, can you give me 
an example of the issue that you have in mind.

> On the rx side, overwriting the network or mac headers doesn't seem 
> right - what if the skb was cloned for monitoring purposes? 
This is a valid concern -- let me think about it a little bit. Where 
would the buffer be cloned ?
> We'd have to force an skb_copy before modifying the buffer. It might 
> be safer to use space after skb_end_pointer() 
Yes that can be done (I think). I choose the head just to test things out.
> - but that's where the skb_shared_info struct is, so maybe what we 
> need is a way to optionally put something after skb_shared_info?
This can not be done currently based on how the buffer is constructed.  
I would avoid doing that any ways as it would cause more concern for 
upstream folks as it is more disruptive since it changes the location of 
the end of the structure. In the current scheme no assumption is broken 
about where everything lies and all pointers retain their current 
interpretation.
>
> I think we need to take a step back: what data does MPTCP really need 
> to attach to an skb? If we have extra mptcp data with the tcp_sock for 
> the subflow, is there a way to keep track of auxiliary data outside 
> the skb? I'll think on this some more too.
You can not use tcp_sock because the data is per sk_buff and not per 
flow. Additionally same buffer could be transmitted on several flows. We 
could implement some kind of lookup but that will kill the performance.

I am working on a generic solution but I am not optimistic that upstream 
folks will accept it because it will require more changes to the 
existing code.

If a better scheme is proposed that works I am all for it.

Rao.
>
> -- 
> Mat Martineau
> Intel OTC


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-28 22:52 Mat Martineau
  0 siblings, 0 replies; 12+ messages in thread
From: Mat Martineau @ 2017-04-28 22:52 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2473 bytes --]

On Wed, 26 Apr 2017, Rao Shoaib wrote:

> Hi,
>
> Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP code 
> and skb code. I have tested the change by using wget from multipath-tcp.org. 
> pcap file is attached.
>
> The approach that I have taken is very straight forward. On Tx when an skb is 
> allocated and the underlying socket is MPTCP , 32 bytes extra are 
> allocated[1] and reserved as the head room. This space is used for passing 
> what was being passed in tcp_skb_cb. An used bit in sk_buff is used to 
> indicate this special case and when skb is copied the extra data is also 
> copied. clone is not an issue as the data is shared. This work for 
> retransmission also because tcp_transmit_skb clones the original skb.
>
> On the Rx side, options are parsed in tcp at that point we can reuse the 
> space that was used for ether and IP  headers and no extra space is required.
>
> The space used is always right below the skb head pointer because the 
> location of head never changes.
>
> Please take a look and let me know what I have missed.

Hi Rao -

It's quite a bit easier to review patches on a mailing list when they're 
plain text (as opposed to attachments), so I'd like ask that we stick to 
the kernel patch guidelines: 
https://www.kernel.org/doc/html/latest/process/submitting-patches.html

(I recognize that it's a long read - using 'git send-email' will get you 
most of the way there!)


I'm wary of using the skb data to hide away an extra control block. The 
number of data bytes consumed isn't really accounted for (other than being 
hard coded for the MPTCP use case), and keeping track of the size of the 
extra control block would take some space in the skb, which is what we're 
trying to avoid. On the rx side, overwriting the network or mac headers 
doesn't seem right - what if the skb was cloned for monitoring purposes? 
We'd have to force an skb_copy before modifying the buffer. It might be 
safer to use space after skb_end_pointer() - but that's where the 
skb_shared_info struct is, so maybe what we need is a way to optionally 
put something after skb_shared_info?

I think we need to take a step back: what data does MPTCP really need to 
attach to an skb? If we have extra mptcp data with the tcp_sock for the 
subflow, is there a way to keep track of auxiliary data outside the skb? 
I'll think on this some more too.

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [MPTCP] Change to remove MPTCP members from tcp_skb_cb
@ 2017-04-27  3:33 Rao Shoaib
  0 siblings, 0 replies; 12+ messages in thread
From: Rao Shoaib @ 2017-04-27  3:33 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1046 bytes --]

Hi,

Here is an attempt to not use tcp_skb_cb with minimal change to MPTCP 
code and skb code. I have tested the change by using wget from 
multipath-tcp.org. pcap file is attached.

The approach that I have taken is very straight forward. On Tx when an 
skb is allocated and the underlying socket is MPTCP , 32 bytes extra are 
allocated[1] and reserved as the head room. This space is used for 
passing what was being passed in tcp_skb_cb. An used bit in sk_buff is 
used to indicate this special case and when skb is copied the extra data 
is also copied. clone is not an issue as the data is shared. This work 
for retransmission also because tcp_transmit_skb clones the original skb.

On the Rx side, options are parsed in tcp at that point we can reuse the 
space that was used for ether and IP  headers and no extra space is 
required.

The space used is always right below the skb head pointer because the 
location of head never changes.

Please take a look and let me know what I have missed.

Regards,

Rao


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-remove-MPTCP-members-from-tcp_skb_cb.patch --]
[-- Type: text/x-patch, Size: 28612 bytes --]

>From a341433ecc476f5acc48871d6e7bc450eb7e2437 Mon Sep 17 00:00:00 2001
From: Rao Shoaib <rao.shoaib@oracle.com>
Date: Wed, 26 Apr 2017 19:51:12 -0700
Subject: [PATCH] remove MPTCP members from tcp_skb_cb

---
 include/linux/skbuff.h      |  6 ++++-
 include/net/mptcp.h         | 36 ++++++++++++++++++++++++-----
 include/net/tcp.h           | 11 ---------
 net/core/skbuff.c           | 11 +++++++++
 net/ipv4/tcp.c              | 11 ++++++++-
 net/ipv4/tcp_ipv4.c         |  4 ----
 net/ipv4/tcp_output.c       |  8 +++++++
 net/ipv6/tcp_ipv6.c         |  4 ----
 net/mptcp/mptcp_input.c     | 46 +++++++++++++++++++++----------------
 net/mptcp/mptcp_ipv4.c      |  4 ++--
 net/mptcp/mptcp_ipv6.c      |  4 ++--
 net/mptcp/mptcp_output.c    | 56 ++++++++++++++++++++++++---------------------
 net/mptcp/mptcp_redundant.c |  6 ++---
 net/mptcp/mptcp_rr.c        |  4 ++--
 net/mptcp/mptcp_sched.c     | 10 ++++----
 15 files changed, 134 insertions(+), 87 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f66cd5e..1ef937f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -135,6 +135,8 @@
 #define CHECKSUM_COMPLETE	2
 #define CHECKSUM_PARTIAL	3
 
+#define MPTCP_SKB_BYTES		32
+
 /* Maximum value in skb->csum_level */
 #define SKB_MAX_CSUM_LEVEL	3
 
@@ -289,6 +291,7 @@ enum {
 
 	/* generate software timestamp on peer data acknowledgment */
 	SKBTX_ACK_TSTAMP = 1 << 7,
+
 };
 
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
@@ -568,7 +571,8 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
-				xmit_more:1;
+				xmit_more:1,
+				copy_head:1;
 	/* one bit hole */
 	kmemcheck_bitfield_end(flags1);
 
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 876f1e6..d510277 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -79,6 +79,17 @@ struct mptcp_rem6 {
 	struct in6_addr	addr;
 };
 
+#define MPTCP_SEG_INFO(__skb)	((struct mptcp_seg_info *)((__skb)->head))	
+
+struct mptcp_seg_info {
+	__u8 	mptcp_flags;
+	__u8	dss_off;
+	union {
+		__u32 	path_mask;
+		__u32 	dss[6];
+	};
+};
+
 struct mptcp_request_sock {
 	struct tcp_request_sock		req;
 	/* hlist-nulls entry to the hash-table. Depending on whether this is a
@@ -1000,12 +1011,17 @@ static inline void mptcp_sub_force_close_all(struct mptcp_cb *mpcb,
 
 static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
 {
-	return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
+	return MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_SEQ;
 }
 
 static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
 {
-	return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
+	return MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_FIN;
+}
+
+static inline bool mptcp_is_infinite(const struct sk_buff *skb)
+{
+	return MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_INF;
 }
 
 /* Is it a data-fin while in infinite mapping mode?
@@ -1038,13 +1054,13 @@ static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
 					    u32 *data_seq,
 					    struct mptcp_cb *mpcb)
 {
-	__u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
+	__u32 *ptr = (__u32 *)(skb_transport_header(skb) + MPTCP_SEG_INFO(skb)->dss_off);
 
-	if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
+	if (MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
 		u64 data_seq64 = get_unaligned_be64(ptr);
 
 		if (mpcb)
-			TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
+			MPTCP_SEG_INFO(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
 
 		*data_seq = (u32)data_seq64;
 		ptr++;
@@ -1142,7 +1158,7 @@ static inline void mptcp_reset_mopt(struct tcp_sock *tp)
 static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
 						 const struct mptcp_cb *mpcb)
 {
-	return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
+	return htonl(mpcb->snd_high_order[(MPTCP_SEG_INFO(skb)->mptcp_flags &
 			MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
 }
 
@@ -1368,6 +1384,10 @@ bool mptcp_prune_ofo_queue(struct sock *sk);
 	do {				\
 	} while(0)
 
+static inline bool mptcp_is_infinite(const struct sk_buff *skb)
+{
+	return false;
+}
 static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
 {
 	return false;
@@ -1380,6 +1400,10 @@ static inline struct sock *mptcp_meta_sk(const struct sock *sk)
 {
 	return NULL;
 }
+static inline bool mptcp_is_inf(const struct sk_buff *skb)
+{
+	return false;
+}
 static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
 {
 	return NULL;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 655ecd4..ad5aa63 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -842,11 +842,6 @@ struct tcp_skb_cb {
 		__u32		tcp_gso_segs;
 	};
 
-#ifdef CONFIG_MPTCP
-	__u8		mptcp_flags;	/* flags for the MPTCP layer    */
-	__u8		dss_off;	/* Number of 4-byte words until
-					 * seq-number */
-#endif
 	__u8		tcp_flags;	/* TCP header flags. (tcp[13])	*/
 
 	__u8		sacked;		/* State flags for SACK/FACK.	*/
@@ -869,12 +864,6 @@ struct tcp_skb_cb {
 			struct inet6_skb_parm	h6;
 #endif
 		} header;	/* For incoming frames		*/
-#ifdef CONFIG_MPTCP
-		union {			/* For MPTCP outgoing frames */
-			__u32 path_mask; /* paths that tried to send this skb */
-			__u32 dss[6];	/* DSS options */
-		};
-#endif
 	};
 };
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8e787a4..0c6611d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -220,6 +220,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 	prefetchw(skb);
 
+	skb->copy_head = 0;
+
 	/* We do our best to align skb_shared_info on a separate cache
 	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
 	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
@@ -1009,6 +1011,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 
+	n->copy_head = skb->copy_head;
 	return __skb_clone(n, skb);
 }
 EXPORT_SYMBOL(skb_clone);
@@ -1143,6 +1146,14 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	}
 
 	copy_skb_header(n, skb);
+	if (skb->copy_head) {
+		/*
+		 * copy bytes from the head
+		 */
+		WARN_ON(skb_headroom(n) < MPTCP_SKB_BYTES);
+		memcpy(n->head, skb->head, min_t(unsigned int, MPTCP_SKB_BYTES, skb_headroom(n)));
+		n->copy_head = 1;
+	}
 out:
 	return n;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 63c107c..7177b40 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -847,13 +847,22 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
 {
 	struct sk_buff *skb;
 
+	struct tcp_sock *tp = tcp_sk(sk);
+	
+	/*
+	 * For mptcp sockets add extra 32 bytes for the seg info structure
+	 * Mayebe we 
+	 */
+	if (mptcp(tp))
+		size += MPTCP_SKB_BYTES;
 	/* The TCP header must be at least 32-bit aligned.  */
 	size = ALIGN(size, 4);
 
 	skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
 	if (skb) {
 		if (sk_wmem_schedule(sk, skb->truesize)) {
-			skb_reserve(skb, sk->sk_prot->max_header);
+			skb_reserve(skb, sk->sk_prot->max_header +
+			    (mptcp(tp) != NULL ? MPTCP_SKB_BYTES : 0));
 			/*
 			 * Make sure that we have exactly size bytes
 			 * available to the caller, no more, no less.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6967a86..d9edbe9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1636,10 +1636,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff * 4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-#ifdef CONFIG_MPTCP
-	TCP_SKB_CB(skb)->mptcp_flags = 0;
-	TCP_SKB_CB(skb)->dss_off = 0;
-#endif
 	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
 	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a635483..141907d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1200,6 +1200,14 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	if (!buff)
 		return -ENOMEM; /* We'll just try again later. */
 
+	if (skb->copy_head) {
+		WARN_ON(skb_headroom(buff) < MPTCP_SKB_BYTES);
+		memcpy(buff->head, skb->head, min_t(unsigned int, MPTCP_SKB_BYTES, skb_headroom(buff)));
+		/*
+		 * indicates that the data is valid
+		 */
+		buff->copy_head = skb->copy_head;
+	}
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
 	nlen = skb->len - len - nsize;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index eba2436..d360128 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1412,10 +1412,6 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 	TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
 				    skb->len - th->doff*4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-#ifdef CONFIG_MPTCP
-	TCP_SKB_CB(skb)->mptcp_flags = 0;
-	TCP_SKB_CB(skb)->dss_off = 0;
-#endif
 	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
 	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
index 51cbb06..3b45ba1 100644
--- a/net/mptcp/mptcp_input.c
+++ b/net/mptcp/mptcp_input.c
@@ -324,8 +324,9 @@ static int mptcp_verif_dss_csum(struct sock *sk)
 			 * in the final csum_partial-call.
 			 */
 			u32 offset = skb_transport_offset(tmp) +
-				     TCP_SKB_CB(tmp)->dss_off;
-			if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
+			    MPTCP_SEG_INFO(tmp)->dss_off;
+
+			if (MPTCP_SEG_INFO(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
 				offset += 4;
 
 			csum_tcp = skb_checksum(tmp, offset,
@@ -750,7 +751,7 @@ static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
 		 */
 		pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
 		       "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u"
-		       "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
+		       " copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
 		       skb->len, data_len, tp->copied_seq);
 		MPTCP_INC_STATS_BH(sock_net(sk), MPTCP_MIB_DSSTCPMISMATCH);
 		mptcp_send_reset(sk);
@@ -758,7 +759,7 @@ static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
 	}
 
 	/* Does the DSS had 64-bit seqnum's ? */
-	if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
+	if (!(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
 		/* Wrapped around? */
 		if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
 			tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
@@ -767,9 +768,9 @@ static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
 			tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
 		}
 	} else {
-		tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
+		tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
 
-		if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
+		if (unlikely(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
 			/* We make sure that the data_seq is invalid.
 			 * It will be dropped later.
 			 */
@@ -1108,7 +1109,7 @@ int mptcp_check_req(struct sk_buff *skb, struct net *net)
 	if (!meta_sk)
 		return 0;
 
-	TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
+	MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_JOIN;
 
 	bh_lock_sock_nested(meta_sk);
 	if (sock_owned_by_user(meta_sk)) {
@@ -1222,7 +1223,7 @@ int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
 		inet_twsk_put(tw);
 	}
 
-	TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
+	MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_JOIN;
 	/* OK, this is a new syn/join, let's create a new open request and
 	 * send syn+ack
 	 */
@@ -1279,7 +1280,7 @@ int mptcp_do_join_short(struct sk_buff *skb,
 		return -1;
 	}
 
-	TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
+	MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_JOIN;
 
 	/* OK, this is a new syn/join, let's create a new open request and
 	 * send syn+ack
@@ -1461,7 +1462,6 @@ static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
 {
 	struct sock *meta_sk = mptcp_meta_sk(sk);
 	struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 	u32 prior_snd_una = meta_tp->snd_una;
 	int prior_packets;
 	u32 nwin, data_ack, data_seq;
@@ -1481,7 +1481,7 @@ static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
 	/* If we are in infinite mapping mode, rx_opt.data_ack has been
 	 * set by mptcp_clean_rtx_infinite.
 	 */
-	if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
+	if (!(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
 		goto exit;
 
 	data_ack = tp->mptcp->rx_opt.data_ack;
@@ -1740,6 +1740,9 @@ void mptcp_parse_options(const uint8_t *ptr, int opsize,
 		const struct mp_dss *mdss = (struct mp_dss *)ptr;
 		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
+		MPTCP_SEG_INFO(skb)->mptcp_flags = 0;
+		MPTCP_SEG_INFO(skb)->dss_off = 0;
+
 		/* We check opsize for the csum and non-csum case. We do this,
 		 * because the draft says that the csum SHOULD be ignored if
 		 * it has not been negotiated in the MP_CAPABLE but still is
@@ -1757,7 +1760,7 @@ void mptcp_parse_options(const uint8_t *ptr, int opsize,
 		ptr += 4;
 
 		if (mdss->A) {
-			tcb->mptcp_flags |= MPTCPHDR_ACK;
+			MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_ACK;
 
 			if (mdss->a) {
 				mopt->data_ack = (u32) get_unaligned_be64(ptr);
@@ -1768,13 +1771,14 @@ void mptcp_parse_options(const uint8_t *ptr, int opsize,
 			}
 		}
 
-		tcb->dss_off = (ptr - skb_transport_header(skb));
+		MPTCP_SEG_INFO(skb)->dss_off = (ptr - skb_transport_header(skb));
 
 		if (mdss->M) {
 			if (mdss->m) {
 				u64 data_seq64 = get_unaligned_be64(ptr);
 
-				tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
+				MPTCP_SEG_INFO(skb)->mptcp_flags
+				    |= MPTCPHDR_SEQ64_SET;
 				mopt->data_seq = (u32) data_seq64;
 
 				ptr += 12; /* 64-bit dseq + subseq */
@@ -1784,15 +1788,15 @@ void mptcp_parse_options(const uint8_t *ptr, int opsize,
 			}
 			mopt->data_len = get_unaligned_be16(ptr);
 
-			tcb->mptcp_flags |= MPTCPHDR_SEQ;
+			MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_SEQ;
 
 			/* Is a check-sum present? */
 			if (opsize == mptcp_sub_len_dss(mdss, 1))
-				tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
+				MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_DSS_CSUM;
 
 			/* DATA_FIN only possible with DSS-mapping */
 			if (mdss->F)
-				tcb->mptcp_flags |= MPTCPHDR_FIN;
+				MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_FIN;
 		}
 
 		break;
@@ -2171,11 +2175,13 @@ bool mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
 	if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
 		return false;
 
-	if (mptcp_mp_fastclose_rcvd(sk))
+	if (mptcp_mp_fastclose_rcvd(sk)) {
 		return true;
+	}
 
-	if (sk->sk_state == TCP_RST_WAIT && !th->rst)
+	if (sk->sk_state == TCP_RST_WAIT && !th->rst) {
 		return true;
+	}
 
 	if (unlikely(mopt->mp_fail))
 		mptcp_mp_fail_rcvd(sk, th);
@@ -2185,7 +2191,7 @@ bool mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
 	 * receiver MUST close the subflow with a RST as it is considered broken.
 	 */
 	if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
-	    !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
+	    !(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
 		mptcp_send_reset(sk);
 		return true;
 	}
diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
index a147b20..4c85b8b 100644
--- a/net/mptcp/mptcp_ipv4.c
+++ b/net/mptcp/mptcp_ipv4.c
@@ -186,7 +186,7 @@ int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
 	struct sock *child, *rsk = NULL;
 	int ret;
 
-	if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
+	if (!(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
 		struct tcphdr *th = tcp_hdr(skb);
 		const struct iphdr *iph = ip_hdr(skb);
 		struct sock *sk;
@@ -217,7 +217,7 @@ int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
 
 		return ret;
 	}
-	TCP_SKB_CB(skb)->mptcp_flags = 0;
+	MPTCP_SEG_INFO(skb)->mptcp_flags = 0;
 
 	/* Has been removed from the tk-table. Thus, no new subflows.
 	 *
diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
index 0de953d..654e078 100644
--- a/net/mptcp/mptcp_ipv6.c
+++ b/net/mptcp/mptcp_ipv6.c
@@ -199,7 +199,7 @@ int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
 	struct sock *child, *rsk = NULL;
 	int ret;
 
-	if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
+	if (!(MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
 		struct tcphdr *th = tcp_hdr(skb);
 		const struct ipv6hdr *ip6h = ipv6_hdr(skb);
 		struct sock *sk;
@@ -232,7 +232,7 @@ int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
 
 		return ret;
 	}
-	TCP_SKB_CB(skb)->mptcp_flags = 0;
+	MPTCP_SEG_INFO(skb)->mptcp_flags = 0;
 
 	/* Has been removed from the tk-table. Thus, no new subflows.
 	 *
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 691ef6f..67943a7 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -59,7 +59,7 @@ EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
  */
 static bool mptcp_reconstruct_mapping(struct sk_buff *skb)
 {
-	const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
+	const struct mp_dss *mpdss = (struct mp_dss *)MPTCP_SEG_INFO(skb)->dss;
 	u32 *p32;
 	u16 *p16;
 
@@ -91,7 +91,7 @@ static bool mptcp_reconstruct_mapping(struct sk_buff *skb)
 
 static bool mptcp_is_reinjected(const struct sk_buff *skb)
 {
-	return TCP_SKB_CB(skb)->mptcp_flags & MPTCP_REINJECT;
+	return MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCP_REINJECT;
 }
 
 static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
@@ -105,7 +105,7 @@ static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_bu
 			break;
 
 		if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
-			TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
+			MPTCP_SEG_INFO(skb)->path_mask = MPTCP_SEG_INFO(skb_it)->path_mask;
 			break;
 		}
 	}
@@ -182,7 +182,7 @@ static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk
 	/* Segment goes back to the MPTCP-layer. So, we need to zero the
 	 * path_mask/dss.
 	 */
-	memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
+	memset(skb->head, 0 , mptcp_dss_len);
 
 	/* We need to find out the path-mask from the meta-write-queue
 	 * to properly select a subflow.
@@ -272,14 +272,14 @@ void mptcp_reinject_data(struct sock *sk, int clone_it)
 		if (mptcp_is_reinjected(skb_it))
 			continue;
 
-		tcb->mptcp_flags |= MPTCP_REINJECT;
+		MPTCP_SEG_INFO(skb_it)->mptcp_flags |= MPTCP_REINJECT;
 		__mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
 	}
 
 	skb_it = tcp_write_queue_tail(meta_sk);
 	/* If sk has sent the empty data-fin, we have to reinject it too. */
 	if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
-	    TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
+	    MPTCP_SEG_INFO(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
 		__mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
 	}
 
@@ -334,7 +334,7 @@ static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_bu
 	else
 		*ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
 
-	if (tcb->mptcp_flags & MPTCPHDR_INF)
+	if (MPTCP_SEG_INFO(skb)->mptcp_flags & MPTCPHDR_INF)
 		data_len = 0;
 	else
 		data_len = tcb->end_seq - tcb->seq;
@@ -397,13 +397,13 @@ static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_b
  */
 static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
 {
-	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-	__be32 *ptr = (__be32 *)tcb->dss;
+	__be32 *ptr = (__be32 *)MPTCP_SEG_INFO(skb)->dss;
 
-	tcb->mptcp_flags |= MPTCPHDR_SEQ;
+	MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_SEQ;
 
 	ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
 	ptr += mptcp_write_dss_mapping(tp, skb, ptr);
+	skb->copy_head = 1;
 }
 
 /* Write the saved DSS mapping to the header */
@@ -412,7 +412,7 @@ static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *s
 {
 	__be32 *start = ptr;
 
-	memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
+	memcpy(ptr, MPTCP_SEG_INFO(skb)->dss, mptcp_dss_len);
 
 	/* update the data_ack */
 	start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
@@ -420,9 +420,10 @@ static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *s
 	/* dss is in a union with inet_skb_parm and
 	 * the IP layer expects zeroed IPCB fields.
 	 */
-	memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
+	memset(MPTCP_SEG_INFO(skb), 0 , mptcp_dss_len);
 
 	return mptcp_dss_len/sizeof(*ptr);
+	
 }
 
 static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
@@ -434,7 +435,7 @@ static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
 	struct sk_buff *subskb = NULL;
 
 	if (!reinject)
-		TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
+		MPTCP_SEG_INFO(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
 						  MPTCPHDR_SEQ64_INDEX : 0);
 
 	subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
@@ -447,7 +448,7 @@ static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
 	 */
 	tcp_skb_pcount_set(subskb, 0);
 
-	TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
+	MPTCP_SEG_INFO(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
 
 	if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
 	    skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -463,7 +464,7 @@ static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
 		tp->mptcp->fully_established = 1;
 		tp->mpcb->infinite_mapping_snd = 1;
 		tp->mptcp->infinite_cutoff_seq = tp->write_seq;
-		tcb->mptcp_flags |= MPTCPHDR_INF;
+		MPTCP_SEG_INFO(subskb)->mptcp_flags |= MPTCPHDR_INF;
 	}
 
 	if (mptcp_is_data_fin(subskb))
@@ -536,10 +537,10 @@ static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
 
 	buff = skb->next;
 
-	flags = TCP_SKB_CB(skb)->mptcp_flags;
-	TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
-	TCP_SKB_CB(buff)->mptcp_flags = flags;
-	TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
+	flags = MPTCP_SEG_INFO(skb)->mptcp_flags;
+	MPTCP_SEG_INFO(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
+	MPTCP_SEG_INFO(buff)->mptcp_flags = flags;
+	MPTCP_SEG_INFO(buff)->path_mask = MPTCP_SEG_INFO(skb)->path_mask;
 
 	/* If reinject == 1, the buff will be added to the reinject
 	 * queue, which is currently not part of memory accounting. So
@@ -967,7 +968,7 @@ void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
 	if (unlikely(mpcb->infinite_mapping_snd) &&
 	    ((mpcb->send_infinite_mapping && tcb &&
 	      mptcp_is_data_seq(skb) &&
-	      !(tcb->mptcp_flags & MPTCPHDR_INF) &&
+	      !mptcp_is_infinite(skb) &&
 	      !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
 	     !mpcb->send_infinite_mapping))
 		return;
@@ -1197,10 +1198,11 @@ void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
 	}
 
 	if (OPTION_DATA_ACK & opts->mptcp_options) {
-		if (!mptcp_is_data_seq(skb))
+		if (!mptcp_is_data_seq(skb)) {
 			ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-		else
+		} else {
 			ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
+		}
 	}
 	if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
 		struct mp_prio *mpprio = (struct mp_prio *)ptr;
@@ -1233,24 +1235,26 @@ void mptcp_send_fin(struct sock *meta_sk)
 	mss_now = mptcp_current_mss(meta_sk);
 
 	if (tcp_send_head(meta_sk) != NULL) {
-		TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
+
+		MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_FIN;
 		TCP_SKB_CB(skb)->end_seq++;
 		meta_tp->write_seq++;
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+			skb = alloc_skb_fclone(MAX_TCP_HEADER + MPTCP_SKB_BYTES,
 					       meta_sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
 		}
 		/* Reserve space for headers and prepare control bits. */
-		skb_reserve(skb, MAX_TCP_HEADER);
+		skb_reserve(skb, MAX_TCP_HEADER + MPTCP_SKB_BYTES);
 
 		tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
 		TCP_SKB_CB(skb)->end_seq++;
-		TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
+		skb->copy_head = 1;
+		MPTCP_SEG_INFO(skb)->mptcp_flags |= MPTCPHDR_FIN;
 		tcp_queue_skb(meta_sk, skb);
 	}
 	__tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
diff --git a/net/mptcp/mptcp_redundant.c b/net/mptcp/mptcp_redundant.c
index 8fa2dba..1aa18aa 100644
--- a/net/mptcp/mptcp_redundant.c
+++ b/net/mptcp/mptcp_redundant.c
@@ -72,10 +72,10 @@ static bool redsched_use_subflow(struct sock *meta_sk,
 	if (!skb || !mptcp_is_available((struct sock *)tp, skb, false))
 		return false;
 
-	if (TCP_SKB_CB(skb)->path_mask != 0)
+	if (MPTCP_SEG_INFO(skb)->path_mask != 0)
 		return subflow_is_active(tp);
 
-	if (TCP_SKB_CB(skb)->path_mask == 0) {
+	if (MPTCP_SEG_INFO(skb)->path_mask == 0) {
 		if (active_valid_sks == -1)
 			active_valid_sks = redsched_get_active_valid_sks(meta_sk);
 
@@ -209,7 +209,7 @@ static struct sk_buff *redundant_next_segment(struct sock *meta_sk,
 			cb_data->next_subflow = tp->mptcp->next;
 			*subsk = (struct sock *)tp;
 
-			if (TCP_SKB_CB(skb)->path_mask)
+			if (MPTCP_SEG_INFO(skb)->path_mask)
 				*reinject = -1;
 			return skb;
 		}
diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
index 8910ba9..b99140b 100644
--- a/net/mptcp/mptcp_rr.c
+++ b/net/mptcp/mptcp_rr.c
@@ -93,7 +93,7 @@ static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk
 	 */
 	return skb &&
 		/* Has the skb already been enqueued into this subsocket? */
-		mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
+		mptcp_pi_to_flag(tp->mptcp->path_index) & MPTCP_SEG_INFO(skb)->path_mask;
 }
 
 /* We just look for any subflow that is available */
@@ -136,7 +136,7 @@ static struct sock *rr_get_available_subflow(struct sock *meta_sk,
 		 * chance again by restarting its pathmask.
 		 */
 		if (skb)
-			TCP_SKB_CB(skb)->path_mask = 0;
+			MPTCP_SEG_INFO(skb)->path_mask = 0;
 		sk = backupsk;
 	}
 
diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
index 54408ff..1c59b1b 100644
--- a/net/mptcp/mptcp_sched.c
+++ b/net/mptcp/mptcp_sched.c
@@ -118,7 +118,7 @@ static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_bu
 	 */
 	return skb &&
 		/* Has the skb already been enqueued into this subsocket? */
-		mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
+		mptcp_pi_to_flag(tp->mptcp->path_index) & MPTCP_SEG_INFO(skb)->path_mask;
 }
 
 bool subflow_is_backup(const struct tcp_sock *tp)
@@ -261,7 +261,7 @@ struct sock *get_available_subflow(struct sock *meta_sk, struct sk_buff *skb,
 		 * the skb passed through all the available active and backups
 		 * sks, so clean the path mask
 		 */
-		TCP_SKB_CB(skb)->path_mask = 0;
+		MPTCP_SEG_INFO(skb)->path_mask = 0;
 	return sk;
 }
 EXPORT_SYMBOL_GPL(get_available_subflow);
@@ -298,7 +298,7 @@ static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
 	/* Half the cwnd of the slow flow */
 	mptcp_for_each_tp(tp->mpcb, tp_it) {
 		if (tp_it != tp &&
-		    TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
+		    MPTCP_SEG_INFO(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
 			if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
 				u32 prior_cwnd = tp_it->snd_cwnd;
 
@@ -317,11 +317,11 @@ static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
 retrans:
 
 	/* Segment not yet injected into this path? Take it!!! */
-	if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
+	if (!(MPTCP_SEG_INFO(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
 		bool do_retrans = false;
 		mptcp_for_each_tp(tp->mpcb, tp_it) {
 			if (tp_it != tp &&
-			    TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
+			    MPTCP_SEG_INFO(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
 				if (tp_it->snd_cwnd <= 4) {
 					do_retrans = true;
 					break;
-- 
2.7.4


[-- Attachment #3: capture.pcap --]
[-- Type: application/vnd.tcpdump.pcap, Size: 32424 bytes --]

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-05-05 23:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-04 17:25 [MPTCP] Change to remove MPTCP members from tcp_skb_cb Rao Shoaib
  -- strict thread matches above, loose matches on Subject: below --
2017-05-05 23:14 Mat Martineau
2017-05-02 20:50 Mat Martineau
2017-04-29  8:59 Rao Shoaib
2017-04-29  8:39 Christoph Paasch
2017-04-29  8:29 Rao Shoaib
2017-04-29  8:10 Rao Shoaib
2017-04-29  7:36 Christoph Paasch
2017-04-29  4:03 Rao Shoaib
2017-04-29  3:23 Rao Shoaib
2017-04-28 22:52 Mat Martineau
2017-04-27  3:33 Rao Shoaib

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.