* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-09 16:26 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-11-09 16:26 UTC (permalink / raw)
  To: mptcp



Hi everyone,

On Thu, 9 Nov 2017, cpaasch(a)apple.com wrote:

> On 09/11/17 - 04:32:54, Boris Pismenny wrote:
>> +Ilya and Liran
>>
>> Hi,
>>
>>> -----Original Message-----
>>> From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
>>> Sent: Thursday, November 09, 2017 13:13
>>> To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
>>> <borisp(a)mellanox.com>
>>> Cc: mptcp(a)lists.01.org
>>> Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
>>>
>>> +Boris
>>>
>>> On 20/10/17 - 16:02:31, Mat Martineau wrote:
>>>> The sk_buff control buffer is of limited size, and cannot be enlarged
>>>> without significant impact on systemwide memory use. However, additional
>>>> per-packet state is needed for some protocols, like Multipath TCP.
>>>>
>>>> An optional shared control buffer placed after the normal struct
>>>> skb_shared_info can accommodate the necessary state without imposing
>>>> extra memory usage or code changes on normal struct sk_buff
>>>> users. __alloc_skb will now place a skb_shared_info_ext structure at
>>>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>>>> sk_buff continue to use the skb_shinfo() macro to access shared
>>>> info. skb_shinfo(skb)->is_ext is set if the extended structure is
>>>> available, and cleared if it is not.
>>>>
>>>> pskb_expand_head will preserve the shared control buffer if it is present.
>>>>
>>>> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
>>>> ---
>>>>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>>>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>>>>  2 files changed, 66 insertions(+), 14 deletions(-)
>>>
>>> Boris, below is the change I mentioned to you.
>>>
>>> It allows allocating 48 additional bytes on-demand after skb_shared_info.
>>> As it is on-demand, it won't increase the size of the skb for other users.
>>>
>>> For example, TLS could start using it when it creates the skb that it
>>> pushes down to the TCP-stack. That way you don't need to handle the
>>> tls_record lists.
>>>
>>
>> One small problem is that TLS doesn't create SKBs. As a ULP it calls the transport send
>> functions (do_tcp_sendpages for TLS). This function receives a page, not an SKB.
>
> yes, that's a good point. Mat has another patch as part of this series,
> that adds an skb-arg to sendpages
> (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
>
> That should do the job for you.

After working with the extended control block some more, I found that the 
arg to sendpages (at least as I implemented it in that patch) doesn't work 
out because skb_entail() isn't called. I'm experimenting with allocating 
and entailing an empty skb before calling do_tcp_sendpages(), which will 
then find the extended skb on the write queue and append to it. It involves 
a lot of code to handle the memory waits and error conditions, though - 
it's probably cleaner to plumb some parameters into do_tcp_sendpages() and 
sk_stream_alloc_skb() to request the extended skb.


Mat


>> We decided not to create the SKB outside of the TCP layer to reduce the number of
>> changes we made to TCP.
>>
>> It would be nice if we could use something like that. Did you talk to DaveM about
>> upstreaming this?
>
> No, we didn't talk yet to anyone outside of this list here about it.
>
> We were looking for a user of it outside of MPTCP but couldn't find one.
> Now, it seems like we found one that would be interested :)
>
>
> I think this infrastructure here would simplify your code quite a bit, no?
>
>> We will definitely find it useful for the receive side; there we allocate the SKB in the driver.
>
> Interesting! So even there you could use it. We were under the impression
> that it would be of less interest for the receive-side.
>
>
> Christoph
>
>>
>>> See below for the rest of the patch.
>>>
>>>
>>> Christoph
>>>
>>>
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 03634ec2f918..873910c66df9 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>>>>   * the end of the header data, ie. at skb->end.
>>>>   */
>>>>  struct skb_shared_info {
>>>> -	__u8		__unused;
>>>> +	__u8		is_ext:1,
>>>> +			__unused:7;
>>>>  	__u8		meta_len;
>>>>  	__u8		nr_frags;
>>>>  	__u8		tx_flags;
>>>> @@ -530,6 +531,24 @@ struct skb_shared_info {
>>>>  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>>>>
>>>>
>>>> +/* This is an extended version of skb_shared_info, also invariant across
>>>> + * clones and living at the end of the header data.
>>>> + */
>>>> +struct skb_shared_info_ext {
>>>> +	/* skb_shared_info must be the first member */
>>>> +	struct skb_shared_info	shinfo;
>>>> +
>>>> +	/* This is the shared control buffer. It is similar to sk_buff's
>>>> +	 * control buffer, but is shared across clones. It must not be
>>>> +	 * modified when multiple sk_buffs are referencing this structure.
>>>> +	 */
>>>> +	char			shcb[48];
>>>> +};
>>>> +
>>>> +#define SKB_SHINFO_EXT_OVERHEAD	\
>>>> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
>>>> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>>>> +
>>>>  enum {
>>>>  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>>>>  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
>>>> @@ -856,6 +875,7 @@ struct sk_buff {
>>>>  #define SKB_ALLOC_FCLONE	0x01
>>>>  #define SKB_ALLOC_RX		0x02
>>>>  #define SKB_ALLOC_NAPI		0x04
>>>> +#define SKB_ALLOC_SHINFO_EXT	0x08
>>>>
>>>>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>>>>  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>>>> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>>>>
>>>>  /* Internal */
>>>>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
>>>> +#define skb_shinfo_ext(SKB)	\
>>>> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>>>>
>>>>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>>>>  {
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 40717501cbdd..397edd5c0613 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -166,6 +168,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>>>>   *		instead of head cache and allocate a cloned (child) skb.
>>>>   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>>>>   *		allocations in case the data is required for writeback
>>>> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
>>>> + *		with an extended shared info struct.
>>>>   *	@node: numa node to allocate memory on
>>>>   *
>>>>   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
>>>> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  			    int flags, int node)
>>>>  {
>>>>  	struct kmem_cache *cache;
>>>> -	struct skb_shared_info *shinfo;
>>>>  	struct sk_buff *skb;
>>>>  	u8 *data;
>>>> +	unsigned int shinfo_size;
>>>>  	bool pfmemalloc;
>>>>
>>>>  	cache = (flags & SKB_ALLOC_FCLONE)
>>>> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	/* We do our best to align skb_shared_info on a separate cache
>>>>  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>>>>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>>>> -	 * Both skb->head and skb_shared_info are cache line aligned.
>>>> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
>>>> +	 * cache line aligned.
>>>>  	 */
>>>>  	size = SKB_DATA_ALIGN(size);
>>>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>>> +	else
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>>>>  	if (!data)
>>>>  		goto nodata;
>>>>  	/* kmalloc(size) might give us more room than requested.
>>>>  	 * Put skb_shared_info exactly at the end of allocated zone,
>>>>  	 * to allow max possible filling before reallocation.
>>>>  	 */
>>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>>> +	size = ksize(data) - shinfo_size;
>>>>  	prefetchw(data + size);
>>>>
>>>>  	/*
>>>> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	 */
>>>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>>>>  	/* Account for allocated memory : skb + skb->head */
>>>> -	skb->truesize = SKB_TRUESIZE(size);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>>> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
>>>> +	else
>>>> +		skb->truesize = SKB_TRUESIZE(size);
>>>>  	skb->pfmemalloc = pfmemalloc;
>>>>  	refcount_set(&skb->users, 1);
>>>>  	skb->head = data;
>>>> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	skb->transport_header = (typeof(skb->transport_header))~0U;
>>>>
>>>>  	/* make sure we initialize shinfo sequentially */
>>>> -	shinfo = skb_shinfo(skb);
>>>> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>>> -	atomic_set(&shinfo->dataref, 1);
>>>> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
>>>> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
>>>> +		shinfo_ext->shinfo.is_ext = 1;
>>>> +		memset(&shinfo_ext->shinfo.meta_len, 0,
>>>> +		       offsetof(struct skb_shared_info, dataref) -
>>>> +		       offsetof(struct skb_shared_info, meta_len));
>>>> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
>>>> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
>>>> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
>>>> +	} else {
>>>> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
>>>> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>>> +		atomic_set(&shinfo->dataref, 1);
>>>> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
>>>> +	}
>>>>
>>>>  	if (flags & SKB_ALLOC_FCLONE) {
>>>>  		struct sk_buff_fclones *fclones;
>>>> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>  {
>>>>  	int i, osize = skb_end_offset(skb);
>>>>  	int size = osize + nhead + ntail;
>>>> +	int shinfo_size;
>>>>  	long off;
>>>>  	u8 *data;
>>>>
>>>> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>
>>>>  	if (skb_pfmemalloc(skb))
>>>>  		gfp_mask |= __GFP_MEMALLOC;
>>>> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
>>>> -			       gfp_mask, NUMA_NO_NODE, NULL);
>>>> +	if (skb_shinfo(skb)->is_ext)
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>>> +	else
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>>>>  	if (!data)
>>>>  		goto nodata;
>>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>>> +	size = ksize(data) - shinfo_size;
>>>>
>>>>  	/* Copy only real data... and, alas, header. This should be
>>>>  	 * optimized for the cases when header is void.
>>>> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>  	memcpy((struct skb_shared_info *)(data + size),
>>>>  	       skb_shinfo(skb),
>>>>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>>>> +	if (skb_shinfo(skb)->is_ext) {
>>>> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
>>>> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
>>>> +		       &skb_shinfo_ext(skb)->shcb,
>>>> +		       sizeof(skb_shinfo_ext(skb)->shcb));
>>>> +	}
>>>>
>>>>  	/*
>>>>  	 * if shinfo is shared we must drop the old head gracefully, but if it
>>>> --
>>>> 2.14.2
>>>>
>>>> _______________________________________________
>>>> mptcp mailing list
>>>> mptcp(a)lists.01.org
>>>>
>>>> https://lists.01.org/mailman/listinfo/mptcp
>

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-13  6:47 cpaasch
  0 siblings, 0 replies; 36+ messages in thread
From: cpaasch @ 2017-11-13  6:47 UTC (permalink / raw)
  To: mptcp


Hello Mat,

On 09/11/17 - 08:26:23, Mat Martineau wrote:
> 
> Hi everyone,
> 
> On Thu, 9 Nov 2017, cpaasch(a)apple.com wrote:
> 
> > On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> > > +Ilya and Liran
> > > 
> > > Hi,
> > > 
> > > > -----Original Message-----
> > > > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > > > Sent: Thursday, November 09, 2017 13:13
> > > > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
> > > > <borisp(a)mellanox.com>
> > > > Cc: mptcp(a)lists.01.org
> > > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > > > 
> > > > +Boris
> > > > 
> > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > > > without significant impact on systemwide memory use. However, additional
> > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > > 
> > > > > An optional shared control buffer placed after the normal struct
> > > > > skb_shared_info can accommodate the necessary state without imposing
> > > > > extra memory usage or code changes on normal struct sk_buff
> > > > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > > > available, and cleared if it is not.
> > > > > 
> > > > > pskb_expand_head will preserve the shared control buffer if it is present.
> > > > > 
> > > > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > > > ---
> > > > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++-------
> > > > -----
> > > > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > > > 
> > > > Boris, below is the change I mentioned to you.
> > > > 
> > > > It allows allocating 48 additional bytes on-demand after skb_shared_info.
> > > > As it is on-demand, it won't increase the size of the skb for other users.
> > > > 
> > > > For example, TLS could start using it when it creates the skb that it
> > > > pushes down to the TCP-stack. That way you don't need to handle the
> > > > tls_record lists.
> > > > 
> > > 
> > > One small problem is that TLS doesn't create SKBs. As a ULP it calls the transport send
> > > functions (do_tcp_sendpages for TLS). This function receives a page, not an SKB.
> > 
> > yes, that's a good point. Mat has another patch as part of this series,
> > that adds an skb-arg to sendpages
> > (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
> > 
> > That should do the job for you.
> 
> After working with the extended control block some more, I found that the
> arg to sendpages (at least as I implemented it in that patch) doesn't work
> out because skb_entail() isn't called. I'm experimenting with allocating and
> entailing an empty skb before calling do_tcp_sendpages(), which will then
> find the extended skb on the write queue and append to it. It involves a
> lot of code to handle the memory waits and error conditions, though - it's
> probably cleaner to plumb some parameters into do_tcp_sendpages() and
> sk_stream_alloc_skb() to request the extended skb.

We then also need a way for the ULP to pass down to do_tcp_sendpages the
info that should be written in the extended region of the skb.


Christoph



* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-10  0:31 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-11-10  0:31 UTC (permalink / raw)
  To: mptcp



On Thu, 9 Nov 2017, cpaasch(a)apple.com wrote:

> On 09/11/17 - 07:31:40, Ilya Lesokhin wrote:
>> One of the issues I see with TLS is that we need to update
>> this control buffer when SKBs are split or merged.
>> Have you given any thought to how it should be done?

I'm trying not to update this control buffer after the skb is created for 
MPTCP.

In the MPTCP case, when packets are split the newly created skb doesn't 
need the metadata and can be a regular skb. The original skb (with data 
truncated) keeps the metadata.

For MPTCP merging, I'm going to try using TCP_SKB_CB(skb)->eor to prevent 
the collapse of an extended MPTCP skb into a previous skb by the generic 
TCP code. MPTCP could combine skbs in the write queue if needed since it 
understands the extended information.

TLS probably has different constraints, but I thought I'd offer some more 
context.

>> In any case, having a hint with a pointer to the record could help TLS
>> even if the hint is not always accurate.
>
> This would still incur the cost of maintaining and allocating the records in
> the list. Do you have an estimate as to the cost of it in terms of CPU
> cycles? (should be possible to have a rough measurement with perf)
>
>
> In general, there seems to be a need for adding meta-data to skb's. (I just
> looked at skb_shared_info->meta_len which was also added recently for XDP).
>
> With all these use-cases, it might be worth having something clean to store
> this meta-data.

I agree. skb_shared_info->meta_len seems limited (using the mac header 
pointer to determine the location of metadata). A cleaner way to store 
metadata might let MPTCP and TLS be used at the same time.


Mat

>
> Another idea we had was to store the meta-data somewhere in the linear
> memory of the skb (e.g., between skb->head and skb->data or between
> skb->end and skb->tail).
>
>
> Christoph
>
>>
>>> -----Original Message-----
>>> From: Boris Pismenny
>>> Sent: Thursday, November 09, 2017 8:48 AM
>>> To: cpaasch(a)apple.com
>>> Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
>>> <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>; mptcp(a)lists.01.org
>>> Subject: RE: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
>>>> Sent: Thursday, November 09, 2017 13:48
>>>> To: Boris Pismenny <borisp(a)mellanox.com>
>>>> Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
>>>> <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>;
>>>> mptcp(a)lists.01.org
>>>> Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
>>>>
>>>> On 09/11/17 - 04:32:54, Boris Pismenny wrote:
>>>>> +Ilya and Liran
>>>>>
>>>>> Hi,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
>>>>>> Sent: Thursday, November 09, 2017 13:13
>>>>>> To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
>>>>>> <borisp(a)mellanox.com>
>>>>>> Cc: mptcp(a)lists.01.org
>>>>>> Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
>>>>>>
>>>>>> +Boris
>>>>>>
>>>>>> On 20/10/17 - 16:02:31, Mat Martineau wrote:
>>>>>>> The sk_buff control buffer is of limited size, and cannot be
>>>>>>> enlarged without significant impact on systemwide memory use.
>>>>>>> However, additional
>>>>>>> per-packet state is needed for some protocols, like Multipath TCP.
>>>>>>>
>>>>>>> An optional shared control buffer placed after the normal struct
>>>>>>> skb_shared_info can accommodate the necessary state without
>>>>>>> imposing extra memory usage or code changes on normal struct
>>>>>>> sk_buff users. __alloc_skb will now place a skb_shared_info_ext
>>>>>>> structure at
>>>>>>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>>>>>>> sk_buff continue to use the skb_shinfo() macro to access shared
>>>>>>> info. skb_shinfo(skb)->is_ext is set if the extended structure
>>>>>>> is available, and cleared if it is not.
>>>>>>>
>>>>>>> pskb_expand_head will preserve the shared control buffer if it is
>>> present.
>>>>>>>
>>>>>>> Signed-off-by: Mat Martineau
>>>>>>> <mathew.j.martineau(a)linux.intel.com>
>>>>>>> ---
>>>>>>>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>>>>>>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>>>>>>>  2 files changed, 66 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> Boris, below is the change I mentioned to you.
>>>>>>
>>>>>> It allows allocating 48 additional bytes on-demand after skb_shared_info.
>>>>>> As it is on-demand, it won't increase the size of the skb for other users.
>>>>>>
>>>>>> For example, TLS could start using it when it creates the skb that
>>>>>> it pushes down to the TCP-stack. That way you don't need to handle
>>>>>> the tls_record lists.
>>>>>>
>>>>>
>>>>> One small problem is that TLS doesn't create SKBs. As a ULP it calls
>>>>> the transport send functions (do_tcp_sendpages for TLS). This function
>>>>> receives a page, not an SKB.
>>>>
>>>> yes, that's a good point. Mat has another patch as part of this
>>>> series, that adds an skb-arg to sendpages
>>>>
>>>> (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
>>>>
>>>> That should do the job for you.
>>>>
>>>>> We decided not to create the SKB outside of the TCP layer to reduce
>>>>> the number of changes we made to TCP.
>>>>>
>>>>> It would be nice if we could use something like that. Did you talk
>>>>> to DaveM about upstreaming this?
>>>>
>>>> No, we didn't talk yet to anyone outside of this list here about it.
>>>>
>>>> We were looking for a user of it outside of MPTCP but couldn't find one.
>>>> Now, it seems like we found one that would be interested :)
>>>>
>>>>
>>>> I think this infrastructure here would simplify your code quite a bit, no?
>>>
>>> I'm not sure. I'll need to think about that.
>>>
>>> I'm worried that in some cases it won't work, i.e. the shared_info wouldn't
>>> reach our driver and then again we would be forced to use the tls_get_record
>>> function.
>>> One example is the tcp_mtu_probe function, which creates a new skb. Another
>>> is tcp_collapse_retrans. We need to ensure all of these are covered when we
>>> pass any offloaded TLS skb to the driver.
>>>
>>>>
>>>>> We will definitely find it useful for the receive side; there we
>>>>> allocate the SKB in the driver.
>>>>
>>>> Interesting! So even there you could use it. We were under the
>>>> impression that it would be of less interest for the receive-side.
>>>>
>>>>
>>>> Christoph
>>>>
>>>>>
>>>>>> See below for the rest of the patch.
>>>>>>
>>>>>>
>>>>>> Christoph
>>>>>>
>>
>

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-09  7:56 cpaasch
  0 siblings, 0 replies; 36+ messages in thread
From: cpaasch @ 2017-11-09  7:56 UTC (permalink / raw)
  To: mptcp


On 09/11/17 - 07:31:40, Ilya Lesokhin wrote:
> One of the issues I see with TLS is that we need to update
> this control buffer when SKBs are split or merged.
> Have you given any thought to how it should be done?
> 
> In any case, having a hint with a pointer to the record could help TLS
> even if the hint is not always accurate.

This would still incur the cost of maintaining and allocating the records in
the list. Do you have an estimate as to the cost of it in terms of CPU
cycles? (should be possible to have a rough measurement with perf)


In general, there seems to be a need for adding meta-data to skb's. (I just
looked at skb_shared_info->meta_len which was also added recently for XDP).

With all these use-cases, it might be worth having something clean to store
this meta-data.


Another idea we had was to store the meta-data somewhere in the linear
memory of the skb (e.g., between skb->head and skb->data or between
skb->end and skb->tail).


Christoph


> 
> > -----Original Message-----
> > From: Boris Pismenny
> > Sent: Thursday, November 09, 2017 8:48 AM
> > To: cpaasch(a)apple.com
> > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>; mptcp(a)lists.01.org
> > Subject: RE: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > > Sent: Thursday, November 09, 2017 13:48
> > > To: Boris Pismenny <borisp(a)mellanox.com>
> > > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>;
> > > mptcp(a)lists.01.org
> > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > >
> > > On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> > > > +Ilya and Liran
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > > > > Sent: Thursday, November 09, 2017 13:13
> > > > > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
> > > > > <borisp(a)mellanox.com>
> > > > > Cc: mptcp(a)lists.01.org
> > > > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > > > >
> > > > > +Boris
> > > > >
> > > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > > The sk_buff control buffer is of limited size, and cannot be
> > > > > > enlarged without significant impact on systemwide memory use.
> > > > > > However, additional
> > > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > > >
> > > > > > An optional shared control buffer placed after the normal struct
> > > > > > skb_shared_info can accommodate the necessary state without
> > > > > > imposing extra memory usage or code changes on normal struct
> > > > > > sk_buff users. __alloc_skb will now place a skb_shared_info_ext
> > > > > > structure at
> > > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure
> > > > > > is available, and cleared if it is not.
> > > > > >
> > > > > > pskb_expand_head will preserve the shared control buffer if it is
> > present.
> > > > > >
> > > > > > Signed-off-by: Mat Martineau
> > > > > > <mathew.j.martineau(a)linux.intel.com>
> > > > > > ---
> > > > > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> > > > > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > > > >
> > > > > Boris, below is the change I mentioned to you.
> > > > >
> > > > > It allows allocating 48 additional bytes on-demand after skb_shared_info.
> > > > > As it is on-demand, it won't increase the size of the skb for other users.
> > > > >
> > > > > For example, TLS could start using it when it creates the skb that
> > > > > it pushes down to the TCP-stack. That way you don't need to handle
> > > > > the tls_record lists.
> > > > >
> > > >
> > > > One small problem is that TLS doesn't create SKBs. As a ULP it calls
> > > > the transport send functions (do_tcp_sendpages for TLS). This function
> > > > receives a page, not an SKB.
> > >
> > > yes, that's a good point. Mat has another patch as part of this
> > > series, that adds an skb-arg to sendpages
> > >
> > > (https://lists.01.org/pipermail/mptcp/2017-October/000130.html)
> > >
> > > That should do the job for you.
> > >
> > > > We decided not to create the SKB outside of the TCP layer to reduce
> > > > the number of changes we made to TCP.
> > > >
> > > > It would be nice if we could use something like that. Did you talk
> > > > to DaveM about upstreaming this?
> > >
> > > No, we didn't talk yet to anyone outside of this list here about it.
> > >
> > > We were looking for a user of it outside of MPTCP but couldn't find one.
> > > Now, it seems like we found one that would be interested :)
> > >
> > >
> > > I think this infrastructure here would simplify your code quite a bit, no?
> > 
> > I'm not sure. I'll need to think about that.
> > 
> > I'm worried that in some cases it won't work, i.e. the shared_info wouldn't
> > reach our driver and then again we would be forced to use the tls_get_record
> > function.
> > One example is the tcp_mtu_probe function, which creates a new skb. Another
> > is tcp_collapse_retrans. We need to ensure all of these are covered when we
> > pass any offloaded TLS skb to the driver.
> > 
> > >
> > > > We will definitely find it useful for the receive side; there we
> > > > allocate the SKB in the driver.
> > >
> > > Interesting! So even there you could use it. We were under the
> > > impression that it would be of less interest for the receive-side.
> > >
> > >
> > > Christoph
> > >
> > > >
> > > > > See below for the rest of the patch.
> > > > >
> > > > >
> > > > > Christoph
> > > > >
> 


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-09  7:51 cpaasch
  0 siblings, 0 replies; 36+ messages in thread
From: cpaasch @ 2017-11-09  7:51 UTC (permalink / raw)
  To: mptcp


On 09/11/17 - 06:48:09, Boris Pismenny wrote:
> > -----Original Message-----
> > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > Sent: Thursday, November 09, 2017 13:48
> > To: Boris Pismenny <borisp(a)mellanox.com>
> > Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Ilya Lesokhin
> > <ilyal(a)mellanox.com>; Liran Liss <liranl(a)mellanox.com>; mptcp(a)lists.01.org
> > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > 
> > On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> > > +Ilya and Liran
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > > > Sent: Thursday, November 09, 2017 13:13
> > > > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris
> > Pismenny
> > > > <borisp(a)mellanox.com>
> > > > Cc: mptcp(a)lists.01.org
> > > > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > > >
> > > > +Boris
> > > >
> > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > > > without significant impact on systemwide memory use. However, additional
> > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > >
> > > > > An optional shared control buffer placed after the normal struct
> > > > > skb_shared_info can accommodate the necessary state without imposing
> > > > > extra memory usage or code changes on normal struct sk_buff
> > > > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > > > available, and cleared if it is not.
> > > > >
> > > > > pskb_expand_head will preserve the shared control buffer if it is present.
> > > > >
> > > > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > > > ---
> > > > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> > > > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > > >
> > > > Boris, below is the change I mentioned to you.
> > > >
> > > > It allows to allocate 48 additional bytes on-demand after skb_shared_info.
> > > > As it is on-demand, it won't increase the size of the skb for other users.
> > > >
> > > > For example, TLS could start using it when it creates the skb that it
> > > > pushes down to the TCP-stack. That way you don't need to handle the
> > > > tls_record lists.
> > > >
> > >
> > > One small problem is that TLS doesn't create SKBs. As a ULP it calls the
> > > transport send functions (do_tcp_sendpages for TLS). This function
> > > receives a page and not an SKB.
> > 
> > Yes, that's a good point. Mat has another patch as part of this series
> > that adds an skb-arg to sendpages
> > (https://lists.01.org/pipermail/mptcp/2017-October/000130.html).
> > 
> > That should do the job for you.
> > 
> > > We decided not to create the SKB outside of the TCP layer to reduce the
> > > number of changes we made to TCP.
> > >
> > > It would be nice if we could use something like that. Did you talk to DaveM
> > > about upstreaming this?
> > 
> > No, we didn't talk yet to anyone outside of this list here about it.
> > 
> > We were looking for a user of it outside of MPTCP but couldn't find one.
> > Now, it seems like we found one that would be interested :)
> > 
> > 
> > I think this infrastructure here would simplify your code quite a bit, no?
> 
> I'm not sure. I'll need to think about that.
> 
> I'm worried that in some cases it won't work, i.e. the shared_info wouldn't reach our
> driver and then again we would be forced to use the tls_get_record function.
> One example is the tcp_mtu_probe function, which creates a new skb. Another
> is tcp_collapse_retrans. We need to ensure all of these are covered when we pass
> any offloaded TLS skb to the driver.

Yes, that's a good point. Making sure that all of these are covered will be
tricky.
There could even be other places further down the stack before the driver
where an skb is copied/modified.


Christoph

> 
> > 
> > > We will definitely find it useful for the receive side, where we allocate
> > > the SKB in the driver.
> > 
> > Interesting! So even there you could use it. We were under the impression
> > that it would be of less interest for the receive-side.
> > 
> > 
> > Christoph
> > 
> > >
> > > > See below for the rest of the patch.
> > > >
> > > >
> > > > Christoph
> > > >
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-09  4:48 cpaasch
  0 siblings, 0 replies; 36+ messages in thread
From: cpaasch @ 2017-11-09  4:48 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 12371 bytes --]

On 09/11/17 - 04:32:54, Boris Pismenny wrote:
> +Ilya and Liran
> 
> Hi,
> 
> > -----Original Message-----
> > From: cpaasch(a)apple.com [mailto:cpaasch(a)apple.com]
> > Sent: Thursday, November 09, 2017 13:13
> > To: Mat Martineau <mathew.j.martineau(a)linux.intel.com>; Boris Pismenny
> > <borisp(a)mellanox.com>
> > Cc: mptcp(a)lists.01.org
> > Subject: Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
> > 
> > +Boris
> > 
> > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > without significant impact on systemwide memory use. However, additional
> > > per-packet state is needed for some protocols, like Multipath TCP.
> > >
> > > An optional shared control buffer placed after the normal struct
> > > skb_shared_info can accommodate the necessary state without imposing
> > > extra memory usage or code changes on normal struct sk_buff
> > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > available, and cleared if it is not.
> > >
> > > pskb_expand_head will preserve the shared control buffer if it is present.
> > >
> > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > ---
> > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > 
> > Boris, below is the change I mentioned to you.
> > 
> > It allows to allocate 48 additional bytes on-demand after skb_shared_info.
> > As it is on-demand, it won't increase the size of the skb for other users.
> > 
> > For example, TLS could start using it when it creates the skb that it
> > pushes down to the TCP-stack. That way you don't need to handle the
> > tls_record lists.
> > 
> 
> One small problem is that TLS doesn't create SKBs. As a ULP it calls the transport send
> functions (do_tcp_sendpages for TLS). This function receives a page and not an SKB.

Yes, that's a good point. Mat has another patch as part of this series
that adds an skb-arg to sendpages
(https://lists.01.org/pipermail/mptcp/2017-October/000130.html)

That should do the job for you.

> We decided not to create the SKB outside of the TCP layer to reduce the number of
> changes we made to TCP.
> 
> It would be nice if we could use something like that. Did you talk to DaveM about
> upstreaming this?

No, we haven't talked to anyone outside of this list about it yet.

We were looking for a user of it outside of MPTCP but couldn't find one.
Now, it seems like we found one that would be interested :)


I think this infrastructure here would simplify your code quite a bit, no?

> We will definitely find it useful for the receive side, where we allocate the SKB in the driver.

Interesting! So even there you could use it. We were under the impression
that it would be of less interest for the receive-side.


Christoph

> 
> > See below for the rest of the patch.
> > 
> > 
> > Christoph
> > 
> > 
> > >
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 03634ec2f918..873910c66df9 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct
> > sk_buff *skb,
> > >   * the end of the header data, ie. at skb->end.
> > >   */
> > >  struct skb_shared_info {
> > > -	__u8		__unused;
> > > +	__u8		is_ext:1,
> > > +			__unused:7;
> > >  	__u8		meta_len;
> > >  	__u8		nr_frags;
> > >  	__u8		tx_flags;
> > > @@ -530,6 +531,24 @@ struct skb_shared_info {
> > >  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
> > >
> > >
> > > +/* This is an extended version of skb_shared_info, also invariant across
> > > + * clones and living at the end of the header data.
> > > + */
> > > +struct skb_shared_info_ext {
> > > +	/* skb_shared_info must be the first member */
> > > +	struct skb_shared_info	shinfo;
> > > +
> > > +	/* This is the shared control buffer. It is similar to sk_buff's
> > > +	 * control buffer, but is shared across clones. It must not be
> > > +	 * modified when multiple sk_buffs are referencing this structure.
> > > +	 */
> > > +	char			shcb[48];
> > > +};
> > > +
> > > +#define SKB_SHINFO_EXT_OVERHEAD	\
> > > +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> > > +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> > > +
> > >  enum {
> > >  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache)
> > */
> > >  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> > > @@ -856,6 +875,7 @@ struct sk_buff {
> > >  #define SKB_ALLOC_FCLONE	0x01
> > >  #define SKB_ALLOC_RX		0x02
> > >  #define SKB_ALLOC_NAPI		0x04
> > > +#define SKB_ALLOC_SHINFO_EXT	0x08
> > >
> > >  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> > >  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> > > @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const
> > struct sk_buff *skb)
> > >
> > >  /* Internal */
> > >  #define skb_shinfo(SKB)	((struct skb_shared_info
> > *)(skb_end_pointer(SKB)))
> > > +#define skb_shinfo_ext(SKB)	\
> > > +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
> > >
> > >  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff
> > *skb)
> > >  {
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 40717501cbdd..397edd5c0613 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t
> > flags, int node,
> > >   *		instead of head cache and allocate a cloned (child) skb.
> > >   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
> > >   *		allocations in case the data is required for writeback
> > > + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> > > + *		with an extended shared info struct.
> > >   *	@node: numa node to allocate memory on
> > >   *
> > >   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> > > @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t
> > gfp_mask,
> > >  			    int flags, int node)
> > >  {
> > >  	struct kmem_cache *cache;
> > > -	struct skb_shared_info *shinfo;
> > >  	struct sk_buff *skb;
> > >  	u8 *data;
> > > +	unsigned int shinfo_size;
> > >  	bool pfmemalloc;
> > >
> > >  	cache = (flags & SKB_ALLOC_FCLONE)
> > > @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size,
> > gfp_t gfp_mask,
> > >  	/* We do our best to align skb_shared_info on a separate cache
> > >  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
> > >  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > > -	 * Both skb->head and skb_shared_info are cache line aligned.
> > > +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> > > +	 * cache line aligned.
> > >  	 */
> > >  	size = SKB_DATA_ALIGN(size);
> > > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info_ext));
> > > +	else
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info));
> > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node,
> > &pfmemalloc);
> > >  	if (!data)
> > >  		goto nodata;
> > >  	/* kmalloc(size) might give us more room than requested.
> > >  	 * Put skb_shared_info exactly at the end of allocated zone,
> > >  	 * to allow max possible filling before reallocation.
> > >  	 */
> > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > +	size = ksize(data) - shinfo_size;
> > >  	prefetchw(data + size);
> > >
> > >  	/*
> > > @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t
> > gfp_mask,
> > >  	 */
> > >  	memset(skb, 0, offsetof(struct sk_buff, tail));
> > >  	/* Account for allocated memory : skb + skb->head */
> > > -	skb->truesize = SKB_TRUESIZE(size);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > +		skb->truesize = SKB_TRUESIZE(size) +
> > SKB_SHINFO_EXT_OVERHEAD;
> > > +	else
> > > +		skb->truesize = SKB_TRUESIZE(size);
> > >  	skb->pfmemalloc = pfmemalloc;
> > >  	refcount_set(&skb->users, 1);
> > >  	skb->head = data;
> > > @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size,
> > gfp_t gfp_mask,
> > >  	skb->transport_header = (typeof(skb->transport_header))~0U;
> > >
> > >  	/* make sure we initialize shinfo sequentially */
> > > -	shinfo = skb_shinfo(skb);
> > > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > -	atomic_set(&shinfo->dataref, 1);
> > > -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> > > +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> > > +		shinfo_ext->shinfo.is_ext = 1;
> > > +		memset(&shinfo_ext->shinfo.meta_len, 0,
> > > +		       offsetof(struct skb_shared_info, dataref) -
> > > +		       offsetof(struct skb_shared_info, meta_len));
> > > +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> > > +		kmemcheck_annotate_variable(shinfo_ext-
> > >shinfo.destructor_arg);
> > > +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> > > +	} else {
> > > +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > +		atomic_set(&shinfo->dataref, 1);
> > > +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > +	}
> > >
> > >  	if (flags & SKB_ALLOC_FCLONE) {
> > >  		struct sk_buff_fclones *fclones;
> > > @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > >  {
> > >  	int i, osize = skb_end_offset(skb);
> > >  	int size = osize + nhead + ntail;
> > > +	int shinfo_size;
> > >  	long off;
> > >  	u8 *data;
> > >
> > > @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > >
> > >  	if (skb_pfmemalloc(skb))
> > >  		gfp_mask |= __GFP_MEMALLOC;
> > > -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)),
> > > -			       gfp_mask, NUMA_NO_NODE, NULL);
> > > +	if (skb_shinfo(skb)->is_ext)
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info_ext));
> > > +	else
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info));
> > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask,
> > NUMA_NO_NODE, NULL);
> > >  	if (!data)
> > >  		goto nodata;
> > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > +	size = ksize(data) - shinfo_size;
> > >
> > >  	/* Copy only real data... and, alas, header. This should be
> > >  	 * optimized for the cases when header is void.
> > > @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int
> > nhead, int ntail,
> > >  	memcpy((struct skb_shared_info *)(data + size),
> > >  	       skb_shinfo(skb),
> > >  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> > > +	if (skb_shinfo(skb)->is_ext) {
> > > +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> > > +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> > > +		       &skb_shinfo_ext(skb)->shcb,
> > > +		       sizeof(skb_shinfo_ext(skb)->shcb));
> > > +	}
> > >
> > >  	/*
> > >  	 * if shinfo is shared we must drop the old head gracefully, but if it
> > > --
> > > 2.14.2
> > >
> > > _______________________________________________
> > > mptcp mailing list
> > > mptcp(a)lists.01.org
> > >
> > > https://lists.01.org/mailman/listinfo/mptcp

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-09  4:13 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-09  4:13 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 9472 bytes --]

+Boris

On 20/10/17 - 16:02:31, Mat Martineau wrote:
> The sk_buff control buffer is of limited size, and cannot be enlarged
> without significant impact on systemwide memory use. However, additional
> per-packet state is needed for some protocols, like Multipath TCP.
> 
> An optional shared control buffer placed after the normal struct
> skb_shared_info can accommodate the necessary state without imposing
> extra memory usage or code changes on normal struct sk_buff
> users. __alloc_skb will now place a skb_shared_info_ext structure at
> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> sk_buff continue to use the skb_shinfo() macro to access shared
> info. skb_shinfo(skb)->is_ext is set if the extended structure is
> available, and cleared if it is not.
> 
> pskb_expand_head will preserve the shared control buffer if it is present.
> 
> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> ---
>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 66 insertions(+), 14 deletions(-)

Boris, below is the change I mentioned to you.

It allows allocating 48 additional bytes on demand after skb_shared_info.
As it is on-demand, it won't increase the size of the skb for other users.

For example, TLS could start using it when it creates the skb that it
pushes down to the TCP-stack. That way you don't need to handle the
tls_record lists.

See below for the rest of the patch.


Christoph


> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 03634ec2f918..873910c66df9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>   * the end of the header data, ie. at skb->end.
>   */
>  struct skb_shared_info {
> -	__u8		__unused;
> +	__u8		is_ext:1,
> +			__unused:7;
>  	__u8		meta_len;
>  	__u8		nr_frags;
>  	__u8		tx_flags;
> @@ -530,6 +531,24 @@ struct skb_shared_info {
>  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>  
>  
> +/* This is an extended version of skb_shared_info, also invariant across
> + * clones and living at the end of the header data.
> + */
> +struct skb_shared_info_ext {
> +	/* skb_shared_info must be the first member */
> +	struct skb_shared_info	shinfo;
> +
> +	/* This is the shared control buffer. It is similar to sk_buff's
> +	 * control buffer, but is shared across clones. It must not be
> +	 * modified when multiple sk_buffs are referencing this structure.
> +	 */
> +	char			shcb[48];
> +};
> +
> +#define SKB_SHINFO_EXT_OVERHEAD	\
> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>  enum {
>  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> @@ -856,6 +875,7 @@ struct sk_buff {
>  #define SKB_ALLOC_FCLONE	0x01
>  #define SKB_ALLOC_RX		0x02
>  #define SKB_ALLOC_NAPI		0x04
> +#define SKB_ALLOC_SHINFO_EXT	0x08
>  
>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>  
>  /* Internal */
>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> +#define skb_shinfo_ext(SKB)	\
> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>  
>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>  {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 40717501cbdd..397edd5c0613 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>   *		instead of head cache and allocate a cloned (child) skb.
>   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>   *		allocations in case the data is required for writeback
> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> + *		with an extended shared info struct.
>   *	@node: numa node to allocate memory on
>   *
>   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  			    int flags, int node)
>  {
>  	struct kmem_cache *cache;
> -	struct skb_shared_info *shinfo;
>  	struct sk_buff *skb;
>  	u8 *data;
> +	unsigned int shinfo_size;
>  	bool pfmemalloc;
>  
>  	cache = (flags & SKB_ALLOC_FCLONE)
> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	/* We do our best to align skb_shared_info on a separate cache
>  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> -	 * Both skb->head and skb_shared_info are cache line aligned.
> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> +	 * cache line aligned.
>  	 */
>  	size = SKB_DATA_ALIGN(size);
> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>  	if (!data)
>  		goto nodata;
>  	/* kmalloc(size) might give us more room than requested.
>  	 * Put skb_shared_info exactly at the end of allocated zone,
>  	 * to allow max possible filling before reallocation.
>  	 */
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>  	prefetchw(data + size);
>  
>  	/*
> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	 */
>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>  	/* Account for allocated memory : skb + skb->head */
> -	skb->truesize = SKB_TRUESIZE(size);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> +	else
> +		skb->truesize = SKB_TRUESIZE(size);
>  	skb->pfmemalloc = pfmemalloc;
>  	refcount_set(&skb->users, 1);
>  	skb->head = data;
> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	skb->transport_header = (typeof(skb->transport_header))~0U;
>  
>  	/* make sure we initialize shinfo sequentially */
> -	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -	atomic_set(&shinfo->dataref, 1);
> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> +		shinfo_ext->shinfo.is_ext = 1;
> +		memset(&shinfo_ext->shinfo.meta_len, 0,
> +		       offsetof(struct skb_shared_info, dataref) -
> +		       offsetof(struct skb_shared_info, meta_len));
> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> +	} else {
> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +		atomic_set(&shinfo->dataref, 1);
> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	}
>  
>  	if (flags & SKB_ALLOC_FCLONE) {
>  		struct sk_buff_fclones *fclones;
> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  {
>  	int i, osize = skb_end_offset(skb);
>  	int size = osize + nhead + ntail;
> +	int shinfo_size;
>  	long off;
>  	u8 *data;
>  
> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  
>  	if (skb_pfmemalloc(skb))
>  		gfp_mask |= __GFP_MEMALLOC;
> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> -			       gfp_mask, NUMA_NO_NODE, NULL);
> +	if (skb_shinfo(skb)->is_ext)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>  	if (!data)
>  		goto nodata;
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>  
>  	/* Copy only real data... and, alas, header. This should be
>  	 * optimized for the cases when header is void.
> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	memcpy((struct skb_shared_info *)(data + size),
>  	       skb_shinfo(skb),
>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> +	if (skb_shinfo(skb)->is_ext) {
> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> +		       &skb_shinfo_ext(skb)->shcb,
> +		       sizeof(skb_shinfo_ext(skb)->shcb));
> +	}
>  
>  	/*
>  	 * if shinfo is shared we must drop the old head gracefully, but if it
> -- 
> 2.14.2
> 
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-08 21:02 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-08 21:02 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1809 bytes --]

Hello,

On 08/11/17 - 12:41:31, Rao Shoaib wrote:
> 
> 
> On 11/07/2017 04:25 PM, Christoph Paasch wrote:
> > The way it is handled currently is that the mapping is decided during
> > mptcp_skb_entail(), which writes the mapping to the skb->cb.
> > 
> >  From that moment on it won't change anymore and, as you know, it will simply
> > get copied from the skb->cb to the TCP-header in mptcp_options_write().
> > 
> > 
> > Christoph
> Yes thanks I get it. And the receiver is coded that way. That is very
> implementation specific.
> 
> RFC 6824 Says:
> 
> 
>    A data sequence mapping does not need to be included in every MPTCP
>    packet, as long as the subflow sequence space in that packet is
>    covered by a mapping known at the receiver.  This can be used to
>    reduce overhead in cases where the mapping is known in advance; one
>    such case is when there is a single subflow between the hosts,
>    another is when segments of data are scheduled in larger than packet-
>    sized chunks.
> 
> So DSS mapping is not required in every packet.

Yes, and currently Linux supports that gracefully.

> There could also be a sub
> mapping for the segment itself that does not violate the original larger
> mapping or there can be a super mapping which extends the current mapping
> without violating it. Current code does not handle that, I will update the
> code.

The difficulty here is in the DSS checksum verification. When you have
different overlapping mappings, you will need to do quite some magic to
verify the checksums.


Christoph

> Thanks for pointing out this case as I had not tested partial ACK. Without
> partial ACK everything will work. Luckily the fix is very straight forward.
> 
> Regards,
> 
> Shoaib
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-08 20:41 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-08 20:41 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1448 bytes --]



On 11/07/2017 04:25 PM, Christoph Paasch wrote:
> The way it is handled currently is that the mapping is decided during
> mptcp_skb_entail(), which writes the mapping to the skb->cb.
>
>  From that moment on it won't change anymore and, as you know, it will simply
> get copied from the skb->cb to the TCP-header in mptcp_options_write().
>
>
> Christoph
Yes, thanks, I get it. And the receiver is coded that way. That is very
implementation-specific.

RFC 6824 Says:


    A data sequence mapping does not need to be included in every MPTCP
    packet, as long as the subflow sequence space in that packet is
    covered by a mapping known at the receiver.  This can be used to
    reduce overhead in cases where the mapping is known in advance; one
    such case is when there is a single subflow between the hosts,
    another is when segments of data are scheduled in larger than packet-
    sized chunks.

So DSS mapping is not required in every packet. There could also be a
sub-mapping for the segment itself that does not violate the original
larger mapping, or a super-mapping which extends the current mapping
without violating it. The current code does not handle that; I will
update the code.

Thanks for pointing out this case as I had not tested partial ACK. 
Without partial ACK everything will work. Luckily the fix is very 
straight forward.

Regards,

Shoaib



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-08  0:25 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-08  0:25 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2015 bytes --]

On 07/11/17 - 15:23:42, Rao Shoaib wrote:
> 
> 
> On 11/07/2017 01:15 PM, Christoph Paasch wrote:
> > These are meta-level retransmissions that are being sent on a different
> > subflow and/or on the same subflow but with new TCP sequence numbers and a
> > new DSS-mapping. These indeed end up going through mptcp_skb_entail().
> > 
> > The retransmissions I mean are the TCP-level retransmissions (aka.,
> > fast-retransmits, tail-loss-probe, RTO,...). They don't go through
> > mptcp_skb_entail again.
> > 
> > I will take a look at the trace in the other mail.
> > 
> > 
> > Christoph
> RTO will go through this code.
> Partial ACK and fast-retransmit etc. are fine if they are transmitted with
> the same mapping (or else DSS will fail as well). In fact it is required. On
> the receiver an adjustment is made for the TCP flow's seq number. See
> mptcp_detect_mapping(); it requires that the (partial) skb has the exact
> same mapping as if it was transmitted as part of the original skb, and then
> look at mptcp_validate_mapping() and mptcp_prepare_skb(), which adjust the
> data sequence number based on the tcp sequence number of the packet.
> 
> If not, then can you explain how the current mechanism works and what happens
> to the DSS settings in case of retransmission? Where/how does it get
> updated, given that mptcp_options_write() only does a copy with the updated ACK?

The way it is handled currently is that the mapping is decided during
mptcp_skb_entail(), which writes the mapping to the skb->cb.

From that moment on it won't change anymore and, as you know, it will simply
get copied from the skb->cb to the TCP-header in mptcp_options_write().


Christoph

> 
> When I have some more time I will verify by hacking the kernel. Sorry, I
> cannot upload the complete kernel as it is a mess right now; I changed a
> lot of things, which is why it has taken this long to even get ssh
> working. I will try to get you just the patch for this fix.
> 
> Shoaib

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07 23:35 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-07 23:35 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]



On 11/07/2017 03:23 PM, Rao Shoaib wrote:
>
>
> On 11/07/2017 01:15 PM, Christoph Paasch wrote:
>> These are meta-level retransmissions that are being sent on a different
>> subflow and/or on the same subflow but with new TCP sequence numbers 
>> and a
>> new DSS-mapping. These indeed end up going through mptcp_skb_entail().
>>
>> The retransmissions I mean are the TCP-level retransmissions (aka.,
>> fast-retransmits, tail-loss-probe, RTO,...). They don't go through
>> mptcp_skb_entail again.
>>
>> I will take a look at the trace in the other mail.
>>
>>
>> Christoph
> RTO will go through this code.
> Partial ACK, fast-retransmit, etc. are fine if they are transmitted 
> with the same mapping (or else the DSS will fail as well); in fact, it 
> is required. On the receiver an adjustment is made for the TCP flow's 
> sequence number. See mptcp_detect_mapping(): it requires that the 
> (partial) skb has the exact same mapping as if it was transmitted as 
> part of the original skb. Then look at mptcp_validate_mapping() and 
> mptcp_prepare_skb(), which adjust the data sequence number based on 
> the TCP sequence number of the packet.
>
> If not, then can you explain how the current mechanism works and what 
> happens to the DSS settings in case of retransmission? Where/how does 
> it get updated, given that mptcp_options_write() only does a copy with 
> the updated ACK?
>
> When I have some more time I will verify by hacking the kernel. Sorry, 
> I cannot upload the complete kernel as it is a mess right now; I 
> changed a lot of things, which is why it has taken this long to even 
> get ssh working. I will try to get you just the patch for this fix.
>
> Shoaib
OK, I see what you are saying. My change updates the sequence number in 
the mapping but the DSS does not. Let me fix that. It should not require 
any changes to the skb.

Shoaib

> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07 23:23 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-07 23:23 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1610 bytes --]



On 11/07/2017 01:15 PM, Christoph Paasch wrote:
> These are meta-level retransmissions that are being sent on a different
> subflow and/or on the same subflow but with new TCP sequence numbers and a
> new DSS-mapping. These indeed end up going through mptcp_skb_entail().
>
> The retransmissions I mean are the TCP-level retransmissions (aka.,
> fast-retransmits, tail-loss-probe, RTO,...). They don't go through
> mptcp_skb_entail again.
>
> I will take a look at the trace in the other mail.
>
>
> Christoph
RTO will go through this code.
Partial ACK, fast-retransmit, etc. are fine if they are transmitted 
with the same mapping (or else the DSS will fail as well); in fact, it 
is required. On the receiver an adjustment is made for the TCP flow's 
sequence number. See mptcp_detect_mapping(): it requires that the 
(partial) skb has the exact same mapping as if it was transmitted as 
part of the original skb. Then look at mptcp_validate_mapping() and 
mptcp_prepare_skb(), which adjust the data sequence number based on 
the TCP sequence number of the packet.

If not, then can you explain how the current mechanism works and what 
happens to the DSS settings in case of retransmission? Where/how does 
it get updated, given that mptcp_options_write() only does a copy with 
the updated ACK?

When I have some more time I will verify by hacking the kernel. Sorry, 
I cannot upload the complete kernel as it is a mess right now; I 
changed a lot of things, which is why it has taken this long to even 
get ssh working. I will try to get you just the patch for this fix.

Shoaib

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07 21:15 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-07 21:15 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4041 bytes --]

On 07/11/17 - 09:13:21, Rao Shoaib wrote:
> 
> 
> On 11/06/2017 08:09 PM, Christoph Paasch wrote:
> > 
> > Maybe to clarify, I meant TCP-level retransmissions (e.g., due to 3
> > duplicate acks). Not MPTCP-level retransmissions that are triggered through
> > the meta-level retransmission timer.
> > 
> > TCP-level retransmissions don't go through mptcp_skb_entail().
> Actually they do. When a TCP timeout occurs, tcp_write_timer_handler() is
> called:
> 
>         case ICSK_TIME_RETRANS:
>                 icsk->icsk_pending = 0;
>                 tcp_sk(sk)->ops->retransmit_timer(sk);
>                 break;
> 
> retransmit_timer() for sub sockets is initialized to
> mptcp_sub_retransmit_timer(), which will call mptcp_reinject() except in
> the fast-open case (something that I need to look at). From what I have
> seen, there is no case where a packet is [re]transmitted without going
> through mptcp_skb_entail(), or else the DSS will not be updated and the
> current code will also not work. Anyway, I will try to find some time and
> test with some packet loss.

These are meta-level retransmissions that are being sent on a different
subflow and/or on the same subflow but with new TCP sequence numbers and a
new DSS-mapping. These indeed end up going through mptcp_skb_entail().

The retransmissions I mean are the TCP-level retransmissions (aka.,
fast-retransmits, tail-loss-probe, RTO,...). They don't go through
mptcp_skb_entail again.

I will take a look at the trace in the other mail.


Christoph


> 
> If we do find any corner cases, I prefer fixing them without exploding the
> size of the skb.
> 
> Shoaib
> > 
> > > Perhaps I should provide you a patch that you can apply and play with.
> > > If there are any corner-case issues, I think they can be resolved in the
> > > retransmission code etc. without requiring any change to the size of the
> > > skb. Is providing a patch for the latest and greatest MPTCP good enough?
> > Yes, patch would be great! Based on either mptcp_trunk or mptcp_v0.93.
> > 
> > I would actually love to see mptcp_trunk no longer bump up sk_buff->cb to
> > 80 bytes. So, you can post it also on the mptcp-dev mailing-list if you
> > think it is all fine. Make sure to test it with packet loss, because
> > that's where I feel the culprit is.
> > 
> > 
> > Christoph
> > 
> > > Shoaib
> > > 
> > > > Christoph
> > > > 
> > > > > 15c16
> > > > > <     if (skb->mptcp_flags & MPTCPHDR_INF)
> > > > > ---
> > > > > >       if (tcb->mptcp_flags & MPTCPHDR_INF)
> > > > > 17c18
> > > > > <     else {
> > > > > ---
> > > > > >       else
> > > > > 19,22d19
> > > > > <         /* mptcp_entail_skb adds one for FIN */
> > > > > <         if (tcb->tcp_flags & TCPHDR_FIN)
> > > > > <             data_len -= 1;
> > > > > <     }
> > > > > 41c38
> > > > > <     return mptcp_dss_len/sizeof(*ptr);
> > > > > ---
> > > > > >       return ptr - start;
> > > > > 44,45c41,42
> > > > > < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
> > > > > <     const struct sk_buff *skb, __be32 *ptr)
> > > > > ---
> > > > > > static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
> > > > > struct sk_buff *skb,
> > > > > >                       __be32 *ptr)
> > > > > 62d58
> > > > > <     /* data_ack */
> > > > > 
> > > > > And mptcp_options_write() is now:
> > > > > 
> > > > >           if (OPTION_DATA_ACK & opts->mptcp_options) {
> > > > >                   ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
> > > > >                   if (mptcp_is_data_seq(skb)) {
> > > > >                           ptr += mptcp_write_dss_mapping(tp, skb, ptr);
> > > > >                   }
> > > > >                   skb->dev = NULL;
> > > > >           }
> > > > > 
> > > > > 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07 17:13 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-07 17:13 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3234 bytes --]



On 11/06/2017 08:09 PM, Christoph Paasch wrote:
>
> Maybe to clarify, I meant TCP-level retransmissions (e.g., due to 3
> duplicate acks). Not MPTCP-level retransmissions that are triggered through
> the meta-level retransmission timer.
>
> TCP-level retransmissions don't go through mptcp_skb_entail().
Actually they do. When a TCP timeout occurs, tcp_write_timer_handler() is 
called:

         case ICSK_TIME_RETRANS:
                 icsk->icsk_pending = 0;
                 tcp_sk(sk)->ops->retransmit_timer(sk);
                 break;

retransmit_timer() for sub sockets is initialized to 
mptcp_sub_retransmit_timer(), which will call mptcp_reinject() except in 
the fast-open case (something that I need to look at). From what I have 
seen, there is no case where a packet is [re]transmitted without going 
through mptcp_skb_entail(), or else the DSS will not be updated and the 
current code will also not work. Anyway, I will try to find some time 
and test with some packet loss.

If we do find any corner cases, I prefer fixing them without exploding 
the size of the skb.

Shoaib
>
>> Perhaps I should provide you a patch that you can apply and play with. If
>> there are any corner-case issues, I think they can be resolved in the
>> retransmission code etc. without requiring any change to the size of the
>> skb. Is providing a patch for the latest and greatest MPTCP good enough?
> Yes, patch would be great! Based on either mptcp_trunk or mptcp_v0.93.
>
> I would actually love to see mptcp_trunk no longer bump up sk_buff->cb to
> 80 bytes. So, you can post it also on the mptcp-dev mailing-list if you
> think it is all fine. Make sure to test it with packet loss, because
> that's where I feel the culprit is.
>
>
> Christoph
>
>> Shoaib
>>
>>> Christoph
>>>
>>>> 15c16
>>>> <     if (skb->mptcp_flags & MPTCPHDR_INF)
>>>> ---
>>>>>       if (tcb->mptcp_flags & MPTCPHDR_INF)
>>>> 17c18
>>>> <     else {
>>>> ---
>>>>>       else
>>>> 19,22d19
>>>> <         /* mptcp_entail_skb adds one for FIN */
>>>> <         if (tcb->tcp_flags & TCPHDR_FIN)
>>>> <             data_len -= 1;
>>>> <     }
>>>> 41c38
>>>> <     return mptcp_dss_len/sizeof(*ptr);
>>>> ---
>>>>>       return ptr - start;
>>>> 44,45c41,42
>>>> < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
>>>> <     const struct sk_buff *skb, __be32 *ptr)
>>>> ---
>>>>> static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
>>>> struct sk_buff *skb,
>>>>>                       __be32 *ptr)
>>>> 62d58
>>>> <     /* data_ack */
>>>>
>>>> And mptcp_options_write() is now:
>>>>
>>>>           if (OPTION_DATA_ACK & opts->mptcp_options) {
>>>>                   ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
>>>>                   if (mptcp_is_data_seq(skb)) {
>>>>                           ptr += mptcp_write_dss_mapping(tp, skb, ptr);
>>>>                   }
>>>>                   skb->dev = NULL;
>>>>           }
>>>>
>>>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07  4:09 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-07  4:09 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 12410 bytes --]

On 06/11/17 - 18:46:12, Rao Shoaib wrote:
> 
> 
> On 11/06/2017 02:24 PM, Christoph Paasch wrote:
> > Hello Rao,
> > 
> > On 05/11/17 - 18:45:35, Rao Shoaib wrote:
> > > On 10/27/2017 12:57 PM, Christoph Paasch wrote:
> > > > I would love to see the rest of the patch. Especially wrt to writing the
> > > > mapping to the header.
> > > > 
> > > > How do you handle segments that are being split while sitting in the
> > > > subflow's send-queue and later on need to be retransmitted. Will the
> > > > retransmitted segment have the same DSS-mapping as the original
> > > > transmission?
> > > > 
> > > > 
> > > > Thanks,
> > > > Christoph
> > > > 
> > > I waited to make sure that my change works for net-next, and I have just
> > > been able to ssh into and out of a system running netdev + MPTCP using
> > > the same layout. Following are the relevant changes. The main observation
> > > is that we have all the information to build the DSS header except for
> > > the data sequence, which is saved and passed on and requires only a
> > > 32-bit field.
> > > 
> > > Let me know if anyone sees any issues.
> > thanks for sharing, please find inline:
> > 
> > > Shoaib
> > > 
> > > I have already provided the SKB changes; I am also using the dev field
> > > in the skb for the other values.
> > > Note: the above mapping is based on a dated version of the code. It does
> > > not work for net-next, for which I am using a different mapping, but the
> > > idea is the same and it does not require any additional information.
> > > 
> > > Changes to mptcp_skb_entail()
> > > 
> > > rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
> > > 10c10
> > > <         TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
> > > ---
> > > >          skb->mptcp_flags |= (mpcb->snd_hiseq_index ?
> > > 23c23
> > > <     TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
> > > ---
> > > >      TCP_SKB_CB(skb)->mptcp_path_mask |=
> > > mptcp_pi_to_flag(tp->mptcp->path_index);
> > > 39c39
> > > <         tcb->mptcp_flags |= MPTCPHDR_INF;
> > > ---
> > > >          skb->mptcp_flags |= MPTCPHDR_INF;
> > > 45c45,46
> > > <     mptcp_save_dss_data_seq(tp, subskb);
> > > ---
> > > >      subskb->mptcp_flags |= MPTCPHDR_SEQ;
> > > >      tcb->mptcp_data_seq = tcb->seq;
> > > 
> > > Changes to mptcp_write_dss_mapping which is now called when the options are
> > > written.
> > > 
> > > rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
> > > 1,2c1,2
> > > < static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
> > > <     const struct sk_buff *skb, __be32 *ptr)
> > > ---
> > > > static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct
> > > sk_buff *skb,
> > > >                     __be32 *ptr)
> > > 4a5
> > > >      __be32 *start = ptr;
> > > 7c8
> > > <     *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
> > > ---
> > > >      *ptr++ = htonl(tcb->seq); /* data_seq */
> > mptcp_write_dss_mapping is now being called from tcp_options_write, through
> > mptcp_options_write, right?
> > 
> > At this point, tcb->seq will be the TCP-subflow's sequence number.
> > 
> > So, I'm not sure how you are able to get the data-sequence number here.
> Look at the code: it is stashed in the skb.
> > 
> > > 13c14
> > > <         *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
> > > ---
> > > >          *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
> > this here is what I was worried about. When we are retransmitting a segment,
> > tp->write_seq won't match anymore with the segment's sequence number.
> > 
> > You would need to pick tcb->seq here.
> Nope, everything will just work because nothing is different from what is
> being done now. I guess an example is worth a lot. So here is a tcpdump.
> Host .3 is running net-next while host .32 is running a stock MPTCP
> kernel. I am ssh'ed from .3 to .32.
> 
> 
> 17:45:02.174393 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack
> 8534, win 290, options [nop,nop,TS val 2104717628 ecr 5246410,mptcp dss ack
> 846209813], length 0
> 17:45:02.174420 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack
> 8626, win 290, options [nop,nop,TS val 2104717628 ecr 5246410,mptcp dss ack
> 846209849], length 0
> 17:45:02.636920 IP 192.168.1.32.ssh > 192.168.1.3.57222: Flags [P.], seq
> 8626:8718, ack 3670, win 263, options [nop,nop,TS val 5247726 ecr
> 2104717628,mptcp dss ack 2862106049 seq 846209849 subseq 8626 len 92 csum
> 0x658], length 92
> 17:45:02.636982 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack
> 8718, win 290, options [nop,nop,TS val 2104718091 ecr 5246410,mptcp dss ack
> 846209941], length 0
> 17:45:22.920549 IP 192.168.1.32.netbios-ns > 192.168.1.255.netbios-ns: NBT
> UDP PACKET(137): QUERY; REQUEST; BROADCAST
> 
> <Bring Down the Interface on .32 >
> 
> 17:45:37.953517 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], seq
> 3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753408 ecr
> 5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 csum
> 0xe2fa], length 36
> 17:45:38.021963 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], seq
> 3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753476 ecr
> 5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 csum
> 0xe2fa], length 36
> 
> When I bring the interface back the connection continues on.
> 
> Now let's do the reverse
> 
> 17:57:33.441698 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq
> 4190:4274, ack 2626, win 291, options [nop,nop,TS val 2105468922 ecr
> 5435422,mptcp dss ack 1290518463 seq 2562035193 subseq 4190 len 84 csum
> 0x6c5], length 84
> 17:57:33.441804 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack
> 4274, win 271, options [nop,nop,TS val 5435430 ecr 2105468922,mptcp dss ack
> 2562035277], length 0
> 17:57:33.576747 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2626:2662, ack 4274, win 271, options [nop,nop,TS val 5435464 ecr
> 2105468922,mptcp dss ack 2562035277 seq 1290518463 subseq 2626 len 36 csum
> 0xc5b9], length 36
> 17:57:33.577426 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq
> 4274:4310, ack 2662, win 291, options [nop,nop,TS val 2105469058 ecr
> 5435464,mptcp dss ack 1290518499 seq 2562035277 subseq 4274 len 36 csum
> 0x64d], length 36
> 17:57:33.577603 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack
> 4310, win 271, options [nop,nop,TS val 5435464 ecr 2105469058,mptcp dss ack
> 2562035313], length 0
> 17:57:33.609786 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq
> 4310:4394, ack 2662, win 291, options [nop,nop,TS val 2105469090 ecr
> 5435464,mptcp dss ack 1290518499 seq 2562035313 subseq 4310 len 84 csum
> 0x5d5], length 84
> 17:57:33.609930 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack
> 4394, win 271, options [nop,nop,TS val 5435472 ecr 2105469090,mptcp dss ack
> 2562035397], length 0
> 
> 
> Interface on .3 is down
> 
> 
> 17:58:04.455423 IP 192.168.1.32.59734 > 192.168.1.3.ssh: Flags [.], ack
> 705070687, win 272, options [nop,nop,TS val 5443184 ecr 3451436041,mptcp dss
> ack 426484590], length 0
> 17:58:05.321012 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2662:2698, ack 4394, win 271, options [nop,nop,TS val 5443400 ecr
> 2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 csum
> 0xd5ad], length 36
> 17:58:05.513018 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443448 ecr
> 2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 csum
> 0x7da8], length 36
> 17:58:05.527395 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443452 ecr
> 2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 csum
> 0x7da8], length 36
> 17:58:05.696935 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2734:2770, ack 4394, win 271, options [nop,nop,TS val 5443494 ecr
> 2105469090,mptcp dss ack 2562035397 seq 1290518571 subseq 2734 len 36 csum
> 0xebf2], length 36
> 
> interface is up and the tcp session continues
> 
> 17:58:31.975712 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2662:2698, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr
> 2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 csum
> 0xd5ad], length 36
> 17:58:31.976255 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq
> 4394:4430, ack 2698, win 291, options [nop,nop,TS val 2105527458 ecr
> 5450064,mptcp dss ack 1290518535 seq 2562035397 subseq 4394 len 36 csum
> 0x55d], length 36
> 17:58:31.976307 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq
> 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr
> 2105527458,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 csum
> 0x7da8], length 36
> 17:58:31.976352 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack
> 4430, win 271, options [nop,nop,TS val 5450064 ecr 2105527458,mptcp dss ack
> 2562035433], length 0
> 17:58:32.008769 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq
> 4430:4514, ack 2734, win 291, options [nop,nop,TS val 2105527490 ecr
> 5450064,mptcp dss ack 1290518571 seq 2562035433 subseq 4430 len 84 csum
> 0x4e5], length 84
> 
> > However even then, if the segment that is being retransmitted is due to a
> > partial ack (e.g., the original transmission was 100 bytes, and we received
> > an ack for only 50 bytes). We will then only retransmit the remaining 50
> > bytes and thus the relative sequence number won't be the same anymore as in
> > the original transmission.
> The code will do what happens in the current code. Any [re]transmission
> goes through mptcp_skb_entail(), where the mapping will be updated.

Maybe to clarify, I meant TCP-level retransmissions (e.g., due to 3
duplicate acks). Not MPTCP-level retransmissions that are triggered through
the meta-level retransmission timer.

TCP-level retransmissions don't go through mptcp_skb_entail().

> Perhaps I should provide you a patch that you can apply and play with. If
> there are any corner-case issues, I think they can be resolved in the
> retransmission code etc. without requiring any change to the size of the
> skb. Is providing a patch for the latest and greatest MPTCP good enough?

Yes, patch would be great! Based on either mptcp_trunk or mptcp_v0.93.

I would actually love to see mptcp_trunk no longer bump up sk_buff->cb to
80 bytes. So, you can post it also on the mptcp-dev mailing-list if you
think it is all fine. Make sure to test it with packet loss, because that's
where I feel the culprit is.


Christoph

> 
> Shoaib
> 
> > 
> > Christoph
> > 
> > > 15c16
> > > <     if (skb->mptcp_flags & MPTCPHDR_INF)
> > > ---
> > > >      if (tcb->mptcp_flags & MPTCPHDR_INF)
> > > 17c18
> > > <     else {
> > > ---
> > > >      else
> > > 19,22d19
> > > <         /* mptcp_entail_skb adds one for FIN */
> > > <         if (tcb->tcp_flags & TCPHDR_FIN)
> > > <             data_len -= 1;
> > > <     }
> > > 41c38
> > > <     return mptcp_dss_len/sizeof(*ptr);
> > > ---
> > > >      return ptr - start;
> > > 44,45c41,42
> > > < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
> > > <     const struct sk_buff *skb, __be32 *ptr)
> > > ---
> > > > static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
> > > struct sk_buff *skb,
> > > >                      __be32 *ptr)
> > > 62d58
> > > <     /* data_ack */
> > > 
> > > And mptcp_options_write() is now:
> > > 
> > >          if (OPTION_DATA_ACK & opts->mptcp_options) {
> > >                  ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
> > >                  if (mptcp_is_data_seq(skb)) {
> > >                          ptr += mptcp_write_dss_mapping(tp, skb, ptr);
> > >                  }
> > >                  skb->dev = NULL;
> > >          }
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07  3:16 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-07  3:16 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 12111 bytes --]



On 11/06/2017 06:46 PM, Rao Shoaib wrote:
>
>
> On 11/06/2017 02:24 PM, Christoph Paasch wrote:
>> Hello Rao,
>>
>> On 05/11/17 - 18:45:35, Rao Shoaib wrote:
>>> On 10/27/2017 12:57 PM, Christoph Paasch wrote:
>>>> I would love to see the rest of the patch. Especially wrt to 
>>>> writing the
>>>> mapping to the header.
>>>>
>>>> How do you handle segments that are being split while sitting in the
>>>> subflow's send-queue and later on need to be retransmitted. Will the
>>>> retransmitted segment have the same DSS-mapping as the original
>>>> transmission?
>>>>
>>>>
>>>> Thanks,
>>>> Christoph
>>>>
>>> I waited to make sure that my change works for net-next, and I have 
>>> just been able to ssh into and out of a system running netdev + MPTCP 
>>> using the same layout. Following are the relevant changes. The main 
>>> observation is that we have all the information to build the DSS 
>>> header except for the data sequence, which is saved and passed on and 
>>> requires only a 32-bit field.
>>>
>>> Let me know if anyone sees any issues.
>> thanks for sharing, please find inline:
>>
>>> Shoaib
>>>
>>> I have already provided the SKB changes; I am also using the dev 
>>> field in the skb for the other values.
>>> Note: the above mapping is based on a dated version of the code. It 
>>> does not work for net-next, for which I am using a different mapping, 
>>> but the idea is the same and it does not require any additional 
>>> information.
>>>
>>> Changes to mptcp_skb_entail()
>>>
>>> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
>>> 10c10
>>> <         TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
>>> ---
>>>>          skb->mptcp_flags |= (mpcb->snd_hiseq_index ?
>>> 23c23
>>> <     TCP_SKB_CB(skb)->path_mask |= 
>>> mptcp_pi_to_flag(tp->mptcp->path_index);
>>> ---
>>>> TCP_SKB_CB(skb)->mptcp_path_mask |=
>>> mptcp_pi_to_flag(tp->mptcp->path_index);
>>> 39c39
>>> <         tcb->mptcp_flags |= MPTCPHDR_INF;
>>> ---
>>>>          skb->mptcp_flags |= MPTCPHDR_INF;
>>> 45c45,46
>>> <     mptcp_save_dss_data_seq(tp, subskb);
>>> ---
>>>>      subskb->mptcp_flags |= MPTCPHDR_SEQ;
>>>>      tcb->mptcp_data_seq = tcb->seq;
>>>
>>> Changes to mptcp_write_dss_mapping which is now called when the 
>>> options are
>>> written.
>>>
>>> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
>>> 1,2c1,2
>>> < static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
>>> <     const struct sk_buff *skb, __be32 *ptr)
>>> ---
>>>> static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const 
>>>> struct
>>> sk_buff *skb,
>>>>                     __be32 *ptr)
>>> 4a5
>>>>      __be32 *start = ptr;
>>> 7c8
>>> <     *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
>>> ---
>>>>      *ptr++ = htonl(tcb->seq); /* data_seq */
>> mptcp_write_dss_mapping is now being called from tcp_options_write, 
>> through
>> mptcp_options_write, right?
>>
>> At this point, tcb->seq will be the TCP-subflow's sequence number.
>>
>> So, I'm not sure how you are able to get the data-sequence number here.
> Look at the code: it is stashed in the skb.
>>
>>> 13c14
>>> <         *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
>>> ---
>>>>          *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* 
>>>> subseq */
>> this here is what I was worried about. When we are retransmitting a 
>> segment,
>> tp->write_seq won't match anymore with the segment's sequence number.
>>
>> You would need to pick tcb->seq here.
> Nope, everything will just work because nothing is different from what 
> is being done now. I guess an example is worth a lot. So here is a 
> tcpdump. Host .3 is running net-next while host .32 is running a stock 
> MPTCP kernel. I am ssh'ed from .3 to .32.
>
>
> 17:45:02.174393 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], 
> ack 8534, win 290, options [nop,nop,TS val 2104717628 ecr 
> 5246410,mptcp dss ack 846209813], length 0
> 17:45:02.174420 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], 
> ack 8626, win 290, options [nop,nop,TS val 2104717628 ecr 
> 5246410,mptcp dss ack 846209849], length 0
> 17:45:02.636920 IP 192.168.1.32.ssh > 192.168.1.3.57222: Flags [P.], 
> seq 8626:8718, ack 3670, win 263, options [nop,nop,TS val 5247726 ecr 
> 2104717628,mptcp dss ack 2862106049 seq 846209849 subseq 8626 len 92 
> csum 0x658], length 92
> 17:45:02.636982 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], 
> ack 8718, win 290, options [nop,nop,TS val 2104718091 ecr 
> 5246410,mptcp dss ack 846209941], length 0
> 17:45:22.920549 IP 192.168.1.32.netbios-ns > 192.168.1.255.netbios-ns: 
> NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST
>
> <Bring Down the Interface on .32 >
>
> 17:45:37.953517 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], 
> seq 3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753408 
> ecr 5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 
> csum 0xe2fa], length 36
> 17:45:38.021963 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], 
> seq 3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753476 
> ecr 5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 
> csum 0xe2fa], length 36
>
> When I bring the interface back the connection continues on.
>
> Now let's do the reverse
>
> 17:57:33.441698 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], 
> seq 4190:4274, ack 2626, win 291, options [nop,nop,TS val 2105468922 
> ecr 5435422,mptcp dss ack 1290518463 seq 2562035193 subseq 4190 len 84 
> csum 0x6c5], length 84
> 17:57:33.441804 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], 
> ack 4274, win 271, options [nop,nop,TS val 5435430 ecr 
> 2105468922,mptcp dss ack 2562035277], length 0
> 17:57:33.576747 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2626:2662, ack 4274, win 271, options [nop,nop,TS val 5435464 ecr 
> 2105468922,mptcp dss ack 2562035277 seq 1290518463 subseq 2626 len 36 
> csum 0xc5b9], length 36
> 17:57:33.577426 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], 
> seq 4274:4310, ack 2662, win 291, options [nop,nop,TS val 2105469058 
> ecr 5435464,mptcp dss ack 1290518499 seq 2562035277 subseq 4274 len 36 
> csum 0x64d], length 36
> 17:57:33.577603 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], 
> ack 4310, win 271, options [nop,nop,TS val 5435464 ecr 
> 2105469058,mptcp dss ack 2562035313], length 0
> 17:57:33.609786 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], 
> seq 4310:4394, ack 2662, win 291, options [nop,nop,TS val 2105469090 
> ecr 5435464,mptcp dss ack 1290518499 seq 2562035313 subseq 4310 len 84 
> csum 0x5d5], length 84
> 17:57:33.609930 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], 
> ack 4394, win 271, options [nop,nop,TS val 5435472 ecr 
> 2105469090,mptcp dss ack 2562035397], length 0
>
>
> Interface on .3 is down
>
>
> 17:58:04.455423 IP 192.168.1.32.59734 > 192.168.1.3.ssh: Flags [.], 
> ack 705070687, win 272, options [nop,nop,TS val 5443184 ecr 
> 3451436041,mptcp dss ack 426484590], length 0
> 17:58:05.321012 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2662:2698, ack 4394, win 271, options [nop,nop,TS val 5443400 ecr 
> 2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 
> csum 0xd5ad], length 36
> 17:58:05.513018 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443448 ecr 
> 2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
> csum 0x7da8], length 36
> 17:58:05.527395 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443452 ecr 
> 2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
> csum 0x7da8], length 36
> 17:58:05.696935 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2734:2770, ack 4394, win 271, options [nop,nop,TS val 5443494 ecr 
> 2105469090,mptcp dss ack 2562035397 seq 1290518571 subseq 2734 len 36 
> csum 0xebf2], length 36
>
> interface is up and the tcp session continues
>
> 17:58:31.975712 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2662:2698, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr 
> 2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 
> csum 0xd5ad], length 36
> 17:58:31.976255 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], 
> seq 4394:4430, ack 2698, win 291, options [nop,nop,TS val 2105527458 
> ecr 5450064,mptcp dss ack 1290518535 seq 2562035397 subseq 4394 len 36 
> csum 0x55d], length 36
> 17:58:31.976307 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], 
> seq 2698:2734, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr 
> 2105527458,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
> csum 0x7da8], length 36
> 17:58:31.976352 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], 
> ack 4430, win 271, options [nop,nop,TS val 5450064 ecr 
> 2105527458,mptcp dss ack 2562035433], length 0
> 17:58:32.008769 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], 
> seq 4430:4514, ack 2734, win 291, options [nop,nop,TS val 2105527490 
> ecr 5450064,mptcp dss ack 1290518571 seq 2562035433 subseq 4430 len 84 
> csum 0x4e5], length 84
>
>> However, even then there is a problem if the segment being retransmitted
>> is due to a partial ack (e.g., the original transmission was 100 bytes,
>> and we received an ack for only 50 bytes): we will then only retransmit
>> the remaining 50 bytes, and thus the relative sequence number won't be
>> the same anymore as in the original transmission.
> The code will do what the current code does: any [re]transmission goes
> through mptcp_skb_entail(), where the mapping will be updated.
>
> Perhaps I should provide you a patch that you can apply and play with.
> If there are any corner-case issues, I think they can be resolved in
> the retransmission code etc. without requiring any change to the size
> of the skb. Is providing a patch for the latest and greatest MPTCP good
> enough?
>
> Shoaib

The following events will happen:

*) A partial ACK will update the skb.

*) The remaining data will be acked, or a timeout will occur.

*) If the data is acked, great; if a timeout occurs, then

*) the re-injection code will kick in and will reconstruct the mapping.

Shoaib
>>
>> Christoph
>>
>>> 15c16
>>> <     if (skb->mptcp_flags & MPTCPHDR_INF)
>>> ---
>>>>      if (tcb->mptcp_flags & MPTCPHDR_INF)
>>> 17c18
>>> <     else {
>>> ---
>>>>      else
>>> 19,22d19
>>> <         /* mptcp_entail_skb adds one for FIN */
>>> <         if (tcb->tcp_flags & TCPHDR_FIN)
>>> <             data_len -= 1;
>>> <     }
>>> 41c38
>>> <     return mptcp_dss_len/sizeof(*ptr);
>>> ---
>>>>      return ptr - start;
>>> 44,45c41,42
>>> < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
>>> <     const struct sk_buff *skb, __be32 *ptr)
>>> ---
>>>> static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
>>> struct sk_buff *skb,
>>>>                      __be32 *ptr)
>>> 62d58
>>> <     /* data_ack */
>>>
>>> And mptcp_options_write() is now:
>>>
>>>          if (OPTION_DATA_ACK & opts->mptcp_options) {
>>>                  ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
>>>                  if (mptcp_is_data_seq(skb)) {
>>>                          ptr += mptcp_write_dss_mapping(tp, skb, ptr);
>>>                  }
>>>                  skb->dev = NULL;
>>>          }
>>>
>>>
>
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-07  2:46 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-07  2:46 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 11162 bytes --]



On 11/06/2017 02:24 PM, Christoph Paasch wrote:
> Hello Rao,
>
> On 05/11/17 - 18:45:35, Rao Shoaib wrote:
>> On 10/27/2017 12:57 PM, Christoph Paasch wrote:
>>> I would love to see the rest of the patch, especially wrt writing the
>>> mapping to the header.
>>>
>>> How do you handle segments that are being split while sitting in the
>>> subflow's send-queue and later on need to be retransmitted. Will the
>>> retransmitted segment have the same DSS-mapping as the original
>>> transmission?
>>>
>>>
>>> Thanks,
>>> Christoph
>>>
>> I waited to make sure that my change works for net-next, and I have just been
>> able to ssh into and out of a system running netdev + MPTCP using the same
>> layout. Following are the relevant changes. The main observation is that we
>> have all the information to build the DSS header except for the data
>> sequence, which is saved and passed on and requires only a 32-bit field.
> thanks for sharing, please find inline:
>
>> Shoaib
>>
>> I have already provided the SKB changes; I am also using the dev field in
>> the skb for the other values.
>> Note that the above mapping is based on a dated version of the code. It does
>> not work for net-next, for which I am using a different mapping, but the idea
>> is the same and it does not require any additional information.
>>
>> Changes to mptcp_skb_entail()
>>
>> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
>> 10c10
>> <         TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
>> ---
>>>          skb->mptcp_flags |= (mpcb->snd_hiseq_index ?
>> 23c23
>> <     TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
>> ---
>>>      TCP_SKB_CB(skb)->mptcp_path_mask |=
>> mptcp_pi_to_flag(tp->mptcp->path_index);
>> 39c39
>> <         tcb->mptcp_flags |= MPTCPHDR_INF;
>> ---
>>>          skb->mptcp_flags |= MPTCPHDR_INF;
>> 45c45,46
>> <     mptcp_save_dss_data_seq(tp, subskb);
>> ---
>>>      subskb->mptcp_flags |= MPTCPHDR_SEQ;
>>>      tcb->mptcp_data_seq = tcb->seq;
>>
>> Changes to mptcp_write_dss_mapping which is now called when the options are
>> written.
>>
>> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
>> 1,2c1,2
>> < static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
>> <     const struct sk_buff *skb, __be32 *ptr)
>> ---
>>> static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct
>> sk_buff *skb,
>>>                     __be32 *ptr)
>> 4a5
>>>      __be32 *start = ptr;
>> 7c8
>> <     *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
>> ---
>>>      *ptr++ = htonl(tcb->seq); /* data_seq */
> mptcp_write_dss_mapping is now being called from tcp_options_write, through
> mptcp_options_write, right?
>
> At this point, tcb->seq will be the TCP-subflow's sequence number.
>
> So, I'm not sure how you are able to get the data-sequence number here.
Look at the code: it is stashed in the skb.
>
>> 13c14
>> <         *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
>> ---
>>>          *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
> this here is what I was worried about. When we are retransmitting a segment,
> tp->write_seq won't match anymore with the segment's sequence number.
>
> You would need to pick tcb->seq here.
Nope, everything will just work because nothing is different from what is
being done now. I guess an example is worth a lot, so here is a tcpdump.
Host .3 is running net-next while host .32 is running the stock MPTCP
kernel. I am ssh'ed from .3 to .32.


17:45:02.174393 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack 
8534, win 290, options [nop,nop,TS val 2104717628 ecr 5246410,mptcp dss 
ack 846209813], length 0
17:45:02.174420 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack 
8626, win 290, options [nop,nop,TS val 2104717628 ecr 5246410,mptcp dss 
ack 846209849], length 0
17:45:02.636920 IP 192.168.1.32.ssh > 192.168.1.3.57222: Flags [P.], seq 
8626:8718, ack 3670, win 263, options [nop,nop,TS val 5247726 ecr 
2104717628,mptcp dss ack 2862106049 seq 846209849 subseq 8626 len 92 
csum 0x658], length 92
17:45:02.636982 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [.], ack 
8718, win 290, options [nop,nop,TS val 2104718091 ecr 5246410,mptcp dss 
ack 846209941], length 0
17:45:22.920549 IP 192.168.1.32.netbios-ns > 192.168.1.255.netbios-ns: 
NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST

<Bring Down the Interface on .32 >

17:45:37.953517 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], seq 
3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753408 ecr 
5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 csum 
0xe2fa], length 36
17:45:38.021963 IP 192.168.1.3.57222 > 192.168.1.32.ssh: Flags [P.], seq 
3670:3706, ack 8718, win 290, options [nop,nop,TS val 2104753476 ecr 
5246410,mptcp dss ack 846209941 seq 2862106049 subseq 3670 len 36 csum 
0xe2fa], length 36

When I bring the interface back the connection continues on.

Now let's do the reverse

17:57:33.441698 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq 
4190:4274, ack 2626, win 291, options [nop,nop,TS val 2105468922 ecr 
5435422,mptcp dss ack 1290518463 seq 2562035193 subseq 4190 len 84 csum 
0x6c5], length 84
17:57:33.441804 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack 
4274, win 271, options [nop,nop,TS val 5435430 ecr 2105468922,mptcp dss 
ack 2562035277], length 0
17:57:33.576747 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2626:2662, ack 4274, win 271, options [nop,nop,TS val 5435464 ecr 
2105468922,mptcp dss ack 2562035277 seq 1290518463 subseq 2626 len 36 
csum 0xc5b9], length 36
17:57:33.577426 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq 
4274:4310, ack 2662, win 291, options [nop,nop,TS val 2105469058 ecr 
5435464,mptcp dss ack 1290518499 seq 2562035277 subseq 4274 len 36 csum 
0x64d], length 36
17:57:33.577603 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack 
4310, win 271, options [nop,nop,TS val 5435464 ecr 2105469058,mptcp dss 
ack 2562035313], length 0
17:57:33.609786 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq 
4310:4394, ack 2662, win 291, options [nop,nop,TS val 2105469090 ecr 
5435464,mptcp dss ack 1290518499 seq 2562035313 subseq 4310 len 84 csum 
0x5d5], length 84
17:57:33.609930 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack 
4394, win 271, options [nop,nop,TS val 5435472 ecr 2105469090,mptcp dss 
ack 2562035397], length 0


Interface on .3 is down


17:58:04.455423 IP 192.168.1.32.59734 > 192.168.1.3.ssh: Flags [.], ack 
705070687, win 272, options [nop,nop,TS val 5443184 ecr 3451436041,mptcp 
dss ack 426484590], length 0
17:58:05.321012 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2662:2698, ack 4394, win 271, options [nop,nop,TS val 5443400 ecr 
2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 
csum 0xd5ad], length 36
17:58:05.513018 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443448 ecr 
2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
csum 0x7da8], length 36
17:58:05.527395 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2698:2734, ack 4394, win 271, options [nop,nop,TS val 5443452 ecr 
2105469090,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
csum 0x7da8], length 36
17:58:05.696935 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2734:2770, ack 4394, win 271, options [nop,nop,TS val 5443494 ecr 
2105469090,mptcp dss ack 2562035397 seq 1290518571 subseq 2734 len 36 
csum 0xebf2], length 36

interface is up and the tcp session continues

17:58:31.975712 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2662:2698, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr 
2105469090,mptcp dss ack 2562035397 seq 1290518499 subseq 2662 len 36 
csum 0xd5ad], length 36
17:58:31.976255 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq 
4394:4430, ack 2698, win 291, options [nop,nop,TS val 2105527458 ecr 
5450064,mptcp dss ack 1290518535 seq 2562035397 subseq 4394 len 36 csum 
0x55d], length 36
17:58:31.976307 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [P.], seq 
2698:2734, ack 4394, win 271, options [nop,nop,TS val 5450064 ecr 
2105527458,mptcp dss ack 2562035397 seq 1290518535 subseq 2698 len 36 
csum 0x7da8], length 36
17:58:31.976352 IP 192.168.1.32.60024 > 192.168.1.3.ssh: Flags [.], ack 
4430, win 271, options [nop,nop,TS val 5450064 ecr 2105527458,mptcp dss 
ack 2562035433], length 0
17:58:32.008769 IP 192.168.1.3.ssh > 192.168.1.32.60024: Flags [P.], seq 
4430:4514, ack 2734, win 291, options [nop,nop,TS val 2105527490 ecr 
5450064,mptcp dss ack 1290518571 seq 2562035433 subseq 4430 len 84 csum 
0x4e5], length 84

> However, even then there is a problem if the segment being retransmitted
> is due to a partial ack (e.g., the original transmission was 100 bytes,
> and we received an ack for only 50 bytes): we will then only retransmit
> the remaining 50 bytes, and thus the relative sequence number won't be
> the same anymore as in the original transmission.
The code will do what the current code does: any [re]transmission goes
through mptcp_skb_entail(), where the mapping will be updated.

Perhaps I should provide you a patch that you can apply and play with.
If there are any corner-case issues, I think they can be resolved in
the retransmission code etc. without requiring any change to the size of
the skb. Is providing a patch for the latest and greatest MPTCP good enough?

Shoaib

>
> Christoph
>
>> 15c16
>> <     if (skb->mptcp_flags & MPTCPHDR_INF)
>> ---
>>>      if (tcb->mptcp_flags & MPTCPHDR_INF)
>> 17c18
>> <     else {
>> ---
>>>      else
>> 19,22d19
>> <         /* mptcp_entail_skb adds one for FIN */
>> <         if (tcb->tcp_flags & TCPHDR_FIN)
>> <             data_len -= 1;
>> <     }
>> 41c38
>> <     return mptcp_dss_len/sizeof(*ptr);
>> ---
>>>      return ptr - start;
>> 44,45c41,42
>> < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
>> <     const struct sk_buff *skb, __be32 *ptr)
>> ---
>>> static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
>> struct sk_buff *skb,
>>>                      __be32 *ptr)
>> 62d58
>> <     /* data_ack */
>>
>> And mptcp_options_write() is now:
>>
>>          if (OPTION_DATA_ACK & opts->mptcp_options) {
>>                  ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
>>                  if (mptcp_is_data_seq(skb)) {
>>                          ptr += mptcp_write_dss_mapping(tp, skb, ptr);
>>                  }
>>                  skb->dev = NULL;
>>          }
>>
>>



* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-06 22:24 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-06 22:24 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 4927 bytes --]

Hello Rao,

On 05/11/17 - 18:45:35, Rao Shoaib wrote:
> 
> On 10/27/2017 12:57 PM, Christoph Paasch wrote:
> > 
> > I would love to see the rest of the patch, especially wrt writing the
> > mapping to the header.
> > 
> > How do you handle segments that are being split while sitting in the
> > subflow's send-queue and later on need to be retransmitted. Will the
> > retransmitted segment have the same DSS-mapping as the original
> > transmission?
> > 
> > 
> > Thanks,
> > Christoph
> > 
> I waited to make sure that my change works for net-next, and I have just been
> able to ssh into and out of a system running netdev + MPTCP using the same
> layout. Following are the relevant changes. The main observation is that we
> have all the information to build the DSS header except for the data
> sequence, which is saved and passed on and requires only a 32-bit field.
> 
> Let me know if anyone sees any issues.

thanks for sharing, please find inline:

> 
> Shoaib
> 
> I have already provided the SKB changes; I am also using the dev field in
> the skb for the other values.
> Note that the above mapping is based on a dated version of the code. It does
> not work for net-next, for which I am using a different mapping, but the idea
> is the same and it does not require any additional information.
> 
> Changes to mptcp_skb_entail()
> 
> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
> 10c10
> <         TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
> ---
> >         skb->mptcp_flags |= (mpcb->snd_hiseq_index ?
> 23c23
> <     TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
> ---
> >     TCP_SKB_CB(skb)->mptcp_path_mask |=
> mptcp_pi_to_flag(tp->mptcp->path_index);
> 39c39
> <         tcb->mptcp_flags |= MPTCPHDR_INF;
> ---
> >         skb->mptcp_flags |= MPTCPHDR_INF;
> 45c45,46
> <     mptcp_save_dss_data_seq(tp, subskb);
> ---
> >     subskb->mptcp_flags |= MPTCPHDR_SEQ;
> >     tcb->mptcp_data_seq = tcb->seq;
> 
> 
> Changes to mptcp_write_dss_mapping which is now called when the options are
> written.
> 
> rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
> 1,2c1,2
> < static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
> <     const struct sk_buff *skb, __be32 *ptr)
> ---
> > static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct
> sk_buff *skb,
> >                    __be32 *ptr)
> 4a5
> >     __be32 *start = ptr;
> 7c8
> <     *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
> ---
> >     *ptr++ = htonl(tcb->seq); /* data_seq */

mptcp_write_dss_mapping is now being called from tcp_options_write, through
mptcp_options_write, right?

At this point, tcb->seq will be the TCP-subflow's sequence number.

So, I'm not sure how you are able to get the data-sequence number here.

> 13c14
> <         *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
> ---
> >         *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */

this here is what I was worried about. When we are retransmitting a segment,
tp->write_seq won't match anymore with the segment's sequence number.

You would need to pick tcb->seq here.

However, even then there is a problem if the segment being retransmitted
is due to a partial ack (e.g., the original transmission was 100 bytes,
and we received an ack for only 50 bytes): we will then only retransmit
the remaining 50 bytes, and thus the relative sequence number won't be
the same anymore as in the original transmission.


Christoph

> 15c16
> <     if (skb->mptcp_flags & MPTCPHDR_INF)
> ---
> >     if (tcb->mptcp_flags & MPTCPHDR_INF)
> 17c18
> <     else {
> ---
> >     else
> 19,22d19
> <         /* mptcp_entail_skb adds one for FIN */
> <         if (tcb->tcp_flags & TCPHDR_FIN)
> <             data_len -= 1;
> <     }
> 41c38
> <     return mptcp_dss_len/sizeof(*ptr);
> ---
> >     return ptr - start;
> 44,45c41,42
> < static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
> <     const struct sk_buff *skb, __be32 *ptr)
> ---
> > static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const
> struct sk_buff *skb,
> >                     __be32 *ptr)
> 62d58
> <     /* data_ack */
> 
> And mptcp_options_write() is now:
> 
>         if (OPTION_DATA_ACK & opts->mptcp_options) {
>                 ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
>                 if (mptcp_is_data_seq(skb)) {
>                         ptr += mptcp_write_dss_mapping(tp, skb, ptr);
>                 }
>                 skb->dev = NULL;
>         }
> 
> 


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-06  2:45 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-11-06  2:45 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3832 bytes --]


On 10/27/2017 12:57 PM, Christoph Paasch wrote:
>
> I would love to see the rest of the patch, especially wrt writing the
> mapping to the header.
>
> How do you handle segments that are being split while sitting in the
> subflow's send-queue and later on need to be retransmitted. Will the
> retransmitted segment have the same DSS-mapping as the original
> transmission?
>
>
> Thanks,
> Christoph
>
I waited to make sure that my change works for net-next, and I have just
been able to ssh into and out of a system running netdev + MPTCP using
the same layout. Following are the relevant changes. The main observation
is that we have all the information to build the DSS header except for
the data sequence, which is saved and passed on and requires only a
32-bit field.

Let me know if anyone sees any issues.

Shoaib

I have already provided the SKB changes; I am also using the dev field
in the skb for the other values.
Note that the above mapping is based on a dated version of the code. It
does not work for net-next, for which I am using a different mapping, but
the idea is the same and it does not require any additional information.

Changes to mptcp_skb_entail()

rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
10c10
<         TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
---
 >         skb->mptcp_flags |= (mpcb->snd_hiseq_index ?
23c23
<     TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
---
 >     TCP_SKB_CB(skb)->mptcp_path_mask |= 
mptcp_pi_to_flag(tp->mptcp->path_index);
39c39
<         tcb->mptcp_flags |= MPTCPHDR_INF;
---
 >         skb->mptcp_flags |= MPTCPHDR_INF;
45c45,46
<     mptcp_save_dss_data_seq(tp, subskb);
---
 >     subskb->mptcp_flags |= MPTCPHDR_SEQ;
 >     tcb->mptcp_data_seq = tcb->seq;


Changes to mptcp_write_dss_mapping which is now called when the options 
are written.

rshoaib(a)caduceus5:~/independent_mptcp$ diff /tmp/modified /tmp/original
1,2c1,2
< static int mptcp_write_dss_mapping(const struct tcp_sock *tp,
<     const struct sk_buff *skb, __be32 *ptr)
---
 > static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const 
struct sk_buff *skb,
 >                    __be32 *ptr)
4a5
 >     __be32 *start = ptr;
7c8
<     *ptr++ = htonl(tcb->mptcp_data_seq); /* data_seq */
---
 >     *ptr++ = htonl(tcb->seq); /* data_seq */
13c14
<         *ptr++ = htonl(tcb->seq - tp->mptcp->snt_isn); /* subseq */
---
 >         *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
15c16
<     if (skb->mptcp_flags & MPTCPHDR_INF)
---
 >     if (tcb->mptcp_flags & MPTCPHDR_INF)
17c18
<     else {
---
 >     else
19,22d19
<         /* mptcp_entail_skb adds one for FIN */
<         if (tcb->tcp_flags & TCPHDR_FIN)
<             data_len -= 1;
<     }
41c38
<     return mptcp_dss_len/sizeof(*ptr);
---
 >     return ptr - start;
44,45c41,42
< static int mptcp_write_dss_data_ack(const struct tcp_sock *tp,
<     const struct sk_buff *skb, __be32 *ptr)
---
 > static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const 
struct sk_buff *skb,
 >                     __be32 *ptr)
62d58
<     /* data_ack */

And mptcp_options_write() is now:

         if (OPTION_DATA_ACK & opts->mptcp_options) {
                 ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
                 if (mptcp_is_data_seq(skb)) {
                         ptr += mptcp_write_dss_mapping(tp, skb, ptr);
                 }
                 skb->dev = NULL;
         }




* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-03  5:10 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-11-03  5:10 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2327 bytes --]

On 02/11/17 - 14:41:22, Mat Martineau wrote:
> On Tue, 31 Oct 2017, Mat Martineau wrote:
> 
> > 
> > Hi Christoph,
> > 
> > On Mon, 30 Oct 2017, Christoph Paasch wrote:
> > 
> > > On 30/10/17 - 15:44:03, Mat Martineau wrote:
> > > > 
> > > > On Sun, 29 Oct 2017, Christoph Paasch wrote:
> > > > 
> > > > > How do you want to proceed for submitting it upstream? I
> > > > > fear that without a
> > > > > user of skb_shared_info_ext, netdev won't accept it.
> > > > > 
> > > > > Shouldn't we try to find a user that would benefit from it?
> > > > 
> > > > I was expecting to wait for an MPTCP patch set that would make use of the
> > > > shared control block, but I agree it would be better to find an existing
> > > > struct sk_buff user that would benefit. I'll look around for a
> > > > protocol with
> > > > a crowded skb->cb - there could be some cases where there are painful work
> > > > arounds for the size constraints.
> > > 
> > > Yes, having another user of it is definitely better. I looked a bit through
> > > the code but couldn't find a good use case.
> > 
> > The requirements are fairly narrow to use the shared cb, which makes it
> > hard to use outside MPTCP:
> > 
> > * Only tx skbs (rx skbs allocated by a driver don't know when to add the
> > extra space)
> > 
> > * Data in the shared cb is read-only (or safely shared)
> > 
> > * Other users don't conflict with MPTCP
> > 
> > 
> > The closest I've found so far is dccp_skb_cb... but:
> > 
> > We could do something with __build_skb() instead, allocating our own
> > data with extra bytes at the end. I think a bit in tcp_skb_cb would
> > still be necessary so the header writing code could tell if the extended
> > MPTCP cb is present. The size of the extra data is more flexible this
> > way, since it's not baked in to the common skb code. Zero changes to the
> > global skb code is very compelling and makes it easier to upstream.
> > Thoughts?
> 
> To follow up to myself: while taking the __build_skb() approach does avoid
> changes to __alloc_skb(), it still requires pskb_expand_head() to know about
> and copy the area after skb_shared_info.

Yes, I think we can't go without changing pskb_expand_head(). Maybe that's
the easiest and cleanest approach in the end.


Christoph




* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-11-02 21:41 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-11-02 21:41 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2026 bytes --]

On Tue, 31 Oct 2017, Mat Martineau wrote:

>
> Hi Christoph,
>
> On Mon, 30 Oct 2017, Christoph Paasch wrote:
>
>> On 30/10/17 - 15:44:03, Mat Martineau wrote:
>>> 
>>> On Sun, 29 Oct 2017, Christoph Paasch wrote:
>>> 
>>>> How do you want to proceed for submitting it upstream? I fear that 
>>>> without a
>>>> user of skb_shared_info_ext, netdev won't accept it.
>>>> 
>>>> Shouldn't we try to find a user that would benefit from it?
>>> 
>>> I was expecting to wait for an MPTCP patch set that would make use of the
>>> shared control block, but I agree it would be better to find an existing
>>> struct sk_buff user that would benefit. I'll look around for a protocol 
>>> with
>>> a crowded skb->cb - there could be some cases where there are painful work
>>> arounds for the size constraints.
>> 
>> Yes, having another user of it is definitely better. I looked a bit through
>> the code but couldn't find a good use case.
>
> The requirements are fairly narrow to use the shared cb, which makes it hard 
> to use outside MPTCP:
>
> * Only tx skbs (rx skbs allocated by a driver don't know when to add the 
> extra space)
>
> * Data in the shared cb is read-only (or safely shared)
>
> * Other users don't conflict with MPTCP
>
>
> The closest I've found so far is dccp_skb_cb... but:
>
> We could do something with __build_skb() instead, allocating our own data 
> with extra bytes at the end. I think a bit in tcp_skb_cb would still be 
> necessary so the header writing code could tell if the extended MPTCP cb is 
> present. The size of the extra data is more flexible this way, since it's not 
> baked in to the common skb code. Zero changes to the global skb code is very 
> compelling and makes it easier to upstream. Thoughts?

To follow up to myself: while taking the __build_skb() approach does avoid
changes to __alloc_skb(), it still requires pskb_expand_head() to know about
and copy the area after skb_shared_info.

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-31 21:58 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-31 21:58 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 1707 bytes --]


Hi Christoph,

On Mon, 30 Oct 2017, Christoph Paasch wrote:

> On 30/10/17 - 15:44:03, Mat Martineau wrote:
>>
>> On Sun, 29 Oct 2017, Christoph Paasch wrote:
>>
>>> How do you want to proceed for submitting it upstream? I fear that without a
>>> user of skb_shared_info_ext, netdev won't accept it.
>>>
>>> Shouldn't we try to find a user that would benefit from it?
>>
>> I was expecting to wait for an MPTCP patch set that would make use of the
>> shared control block, but I agree it would be better to find an existing
>> struct sk_buff user that would benefit. I'll look around for a protocol with
>> a crowded skb->cb - there could be some cases where there are painful work
>> arounds for the size constraints.
>
> Yes, having another user of it is definitely better. I looked a bit through
> the code but couldn't find a good use case.

The requirements are fairly narrow to use the shared cb, which makes it 
hard to use outside MPTCP:

* Only tx skbs (rx skbs allocated by a driver don't know when to add the extra 
space)

* Data in the shared cb is read-only (or safely shared)

* Other users don't conflict with MPTCP


The closest I've found so far is dccp_skb_cb... but:

We could do something with __build_skb() instead, allocating our own data 
with extra bytes at the end. I think a bit in tcp_skb_cb would still be 
necessary so the header writing code could tell if the extended MPTCP cb 
is present. The size of the extra data is more flexible this way, since 
it's not baked in to the common skb code. Zero changes to the global skb 
code is very compelling and makes it easier to upstream. Thoughts?

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-31  4:17 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-10-31  4:17 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 14783 bytes --]

On 30/10/17 - 15:44:03, Mat Martineau wrote:
> 
> On Sun, 29 Oct 2017, Christoph Paasch wrote:
> 
> > On 23/10/17 - 15:51:26, Mat Martineau wrote:
> > > 
> > > Hi Christoph,
> > > 
> > > On Mon, 23 Oct 2017, Christoph Paasch wrote:
> > > 
> > > > Hello Mat,
> > > > 
> > > > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > > > without significant impact on systemwide memory use. However, additional
> > > > > per-packet state is needed for some protocols, like Multipath TCP.
> > > > > 
> > > > > An optional shared control buffer placed after the normal struct
> > > > > skb_shared_info can accommodate the necessary state without imposing
> > > > > extra memory usage or code changes on normal struct sk_buff
> > > > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > > > available, and cleared if it is not.
> > > > > 
> > > > > pskb_expand_head will preserve the shared control buffer if it is present.
> > > > > 
> > > > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > > > ---
> > > > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > > > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> > > > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > > > 
> > > > While digging through your patch, I realized that there could be one issue.
> > > > 
> > > > In many places in the code we use sizeof(struct skb_shared_info) to compute
> > > > the overhead (e.g., see tcp_sndbuf_expand).
> > > > 
> > > > There are also countless users of sizeof(struct skb_shared_info) in the
> > > > drivers. They all seem to be in the rx-path, so it should be fine.
> > > > 
> > > > But the prevalent use of skb_shared_info throughout the stack seems a bit
> > > > scary to me. My concern here is that the driver assumes that the overhead is
> > > > always sizeof(skb_shared_info) and thus underestimates the size of the skb.
> > > > 
> > > > At least, in tcp_sndbuf_expand() the overhead-estimation would be wrong. I
> > > > don't think the consequences will be catastrophic, but it's something to
> > > > think about.
> > > > 
> > > > 
> > > > What do you think?
> > > > 
> > > 
> > > I saw two typical cases when I looked around existing code:
> > > 
> > > 1. Drivers doing calculations before calling build_skb() on the rx path.
> > > 
> > > 2. Various code making size estimates before creating skbs.
> > > 
> > > 
> > > For #1, build_skb() is still creating "normal" skbs with skb_shared_info, so
> > > their calculations remain valid.
> > > 
> > > Similarly, code that uses size estimates for #2 goes on to allocate "normal"
> > > skbs, so their estimates remain valid. Code that allocates extended skbs
> > > would need another set of extended skb macros to make better estimates.
> > > 
> > > tcp_sndbuf_expand() is a little different, since it changes sk_sndbuf which
> > > could impact behavior when longer skbs are in use. In this case, it still
> > > seems like a fairly coarse estimate (rounding up to a power of two and then
> > > multiplying). I'm not worried about this use, but I do want to understand it
> > > better before upstreaming. If it's important to add in the extra space for
> > > sockets using skb_shared_info_ext, that could be accounted for in
> > > tcp_sndbuf_expand by adding a flag to tcp_sock.
> > 
> > I agree that tcp_sndbuf_expand() should be fine.
> > 
> > > Anywhere that allocates extended skbs will involve new code, and if that
> > > code needs to make size estimates it should take the extra overhead into
> > > account. For existing code that may handle (rather than allocate) skbs with
> > > extended shared data, these skbs will be truthful about their truesize and
> > > will be freed correctly.
> > > 
> > > I still want to review this approach carefully and do additional testing. I
> > > should run the patch by some network driver authors too.
> > 
> > How do you want to proceed for submitting it upstream? I fear that without a
> > user of skb_shared_info_ext, netdev won't accept it.
> > 
> > Shouldn't we try to find a user that would benefit from it?
> 
> I was expecting to wait for an MPTCP patch set that would make use of the
> shared control block, but I agree it would be better to find an existing
> struct sk_buff user that would benefit. I'll look around for a protocol with
> a crowded skb->cb - there could be some cases where there are painful
> workarounds for the size constraints.

Yes, having another user of it is definitely better. I looked a bit through
the code but couldn't find a good use case.


Christoph

> 
> > 
> > Btw., I am trying to get the MD5-changes/extra-option changes ready. But I
> > got hooked up in other pieces last week...
> 
> Thanks!
> 
> Mat
> 
> 
> > 
> > Christoph
> > 
> > > 
> > > 
> > > Thanks,
> > > Mat
> > > 
> > > 
> > > > > 
> > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > index 03634ec2f918..873910c66df9 100644
> > > > > --- a/include/linux/skbuff.h
> > > > > +++ b/include/linux/skbuff.h
> > > > > @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
> > > > >   * the end of the header data, ie. at skb->end.
> > > > >   */
> > > > >  struct skb_shared_info {
> > > > > -	__u8		__unused;
> > > > > +	__u8		is_ext:1,
> > > > > +			__unused:7;
> > > > >  	__u8		meta_len;
> > > > >  	__u8		nr_frags;
> > > > >  	__u8		tx_flags;
> > > > > @@ -530,6 +531,24 @@ struct skb_shared_info {
> > > > >  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
> > > > > 
> > > > > 
> > > > > +/* This is an extended version of skb_shared_info, also invariant across
> > > > > + * clones and living at the end of the header data.
> > > > > + */
> > > > > +struct skb_shared_info_ext {
> > > > > +	/* skb_shared_info must be the first member */
> > > > > +	struct skb_shared_info	shinfo;
> > > > > +
> > > > > +	/* This is the shared control buffer. It is similar to sk_buff's
> > > > > +	 * control buffer, but is shared across clones. It must not be
> > > > > +	 * modified when multiple sk_buffs are referencing this structure.
> > > > > +	 */
> > > > > +	char			shcb[48];
> > > > > +};
> > > > > +
> > > > > +#define SKB_SHINFO_EXT_OVERHEAD	\
> > > > > +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> > > > > +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> > > > > +
> > > > >  enum {
> > > > >  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
> > > > >  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> > > > > @@ -856,6 +875,7 @@ struct sk_buff {
> > > > >  #define SKB_ALLOC_FCLONE	0x01
> > > > >  #define SKB_ALLOC_RX		0x02
> > > > >  #define SKB_ALLOC_NAPI		0x04
> > > > > +#define SKB_ALLOC_SHINFO_EXT	0x08
> > > > > 
> > > > >  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> > > > >  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> > > > > @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
> > > > > 
> > > > >  /* Internal */
> > > > >  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> > > > > +#define skb_shinfo_ext(SKB)	\
> > > > > +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
> > > > > 
> > > > >  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
> > > > >  {
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 40717501cbdd..397edd5c0613 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
> > > > >   *		instead of head cache and allocate a cloned (child) skb.
> > > > >   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
> > > > >   *		allocations in case the data is required for writeback
> > > > > + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> > > > > + *		with an extended shared info struct.
> > > > >   *	@node: numa node to allocate memory on
> > > > >   *
> > > > >   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> > > > > @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > >  			    int flags, int node)
> > > > >  {
> > > > >  	struct kmem_cache *cache;
> > > > > -	struct skb_shared_info *shinfo;
> > > > >  	struct sk_buff *skb;
> > > > >  	u8 *data;
> > > > > +	unsigned int shinfo_size;
> > > > >  	bool pfmemalloc;
> > > > > 
> > > > >  	cache = (flags & SKB_ALLOC_FCLONE)
> > > > > @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > >  	/* We do our best to align skb_shared_info on a separate cache
> > > > >  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
> > > > >  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > > > > -	 * Both skb->head and skb_shared_info are cache line aligned.
> > > > > +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> > > > > +	 * cache line aligned.
> > > > >  	 */
> > > > >  	size = SKB_DATA_ALIGN(size);
> > > > > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > > -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > > > > +	else
> > > > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
> > > > >  	if (!data)
> > > > >  		goto nodata;
> > > > >  	/* kmalloc(size) might give us more room than requested.
> > > > >  	 * Put skb_shared_info exactly at the end of allocated zone,
> > > > >  	 * to allow max possible filling before reallocation.
> > > > >  	 */
> > > > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > > > +	size = ksize(data) - shinfo_size;
> > > > >  	prefetchw(data + size);
> > > > > 
> > > > >  	/*
> > > > > @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > >  	 */
> > > > >  	memset(skb, 0, offsetof(struct sk_buff, tail));
> > > > >  	/* Account for allocated memory : skb + skb->head */
> > > > > -	skb->truesize = SKB_TRUESIZE(size);
> > > > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > > > +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> > > > > +	else
> > > > > +		skb->truesize = SKB_TRUESIZE(size);
> > > > >  	skb->pfmemalloc = pfmemalloc;
> > > > >  	refcount_set(&skb->users, 1);
> > > > >  	skb->head = data;
> > > > > @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > >  	skb->transport_header = (typeof(skb->transport_header))~0U;
> > > > > 
> > > > >  	/* make sure we initialize shinfo sequentially */
> > > > > -	shinfo = skb_shinfo(skb);
> > > > > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > > > -	atomic_set(&shinfo->dataref, 1);
> > > > > -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > > > +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> > > > > +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> > > > > +		shinfo_ext->shinfo.is_ext = 1;
> > > > > +		memset(&shinfo_ext->shinfo.meta_len, 0,
> > > > > +		       offsetof(struct skb_shared_info, dataref) -
> > > > > +		       offsetof(struct skb_shared_info, meta_len));
> > > > > +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> > > > > +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> > > > > +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> > > > > +	} else {
> > > > > +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > > > +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > > > +		atomic_set(&shinfo->dataref, 1);
> > > > > +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > > > +	}
> > > > > 
> > > > >  	if (flags & SKB_ALLOC_FCLONE) {
> > > > >  		struct sk_buff_fclones *fclones;
> > > > > @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > >  {
> > > > >  	int i, osize = skb_end_offset(skb);
> > > > >  	int size = osize + nhead + ntail;
> > > > > +	int shinfo_size;
> > > > >  	long off;
> > > > >  	u8 *data;
> > > > > 
> > > > > @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > > 
> > > > >  	if (skb_pfmemalloc(skb))
> > > > >  		gfp_mask |= __GFP_MEMALLOC;
> > > > > -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> > > > > -			       gfp_mask, NUMA_NO_NODE, NULL);
> > > > > +	if (skb_shinfo(skb)->is_ext)
> > > > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > > > > +	else
> > > > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
> > > > >  	if (!data)
> > > > >  		goto nodata;
> > > > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > > > +	size = ksize(data) - shinfo_size;
> > > > > 
> > > > >  	/* Copy only real data... and, alas, header. This should be
> > > > >  	 * optimized for the cases when header is void.
> > > > > @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > >  	memcpy((struct skb_shared_info *)(data + size),
> > > > >  	       skb_shinfo(skb),
> > > > >  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> > > > > +	if (skb_shinfo(skb)->is_ext) {
> > > > > +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> > > > > +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> > > > > +		       &skb_shinfo_ext(skb)->shcb,
> > > > > +		       sizeof(skb_shinfo_ext(skb)->shcb));
> > > > > +	}
> > > > > 
> > > > >  	/*
> > > > >  	 * if shinfo is shared we must drop the old head gracefully, but if it
> > > > > --
> > > > > 2.14.2
> > > > > 
> > > > > _______________________________________________
> > > > > mptcp mailing list
> > > > > mptcp(a)lists.01.org
> > > > > https://lists.01.org/mailman/listinfo/mptcp
> > > > 
> > > 
> > > --
> > > Mat Martineau
> > > Intel OTC
> > 
> 
> --
> Mat Martineau
> Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-30 22:44 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-30 22:44 UTC (permalink / raw)
  To: mptcp



On Sun, 29 Oct 2017, Christoph Paasch wrote:

> On 23/10/17 - 15:51:26, Mat Martineau wrote:
>>
>> Hi Christoph,
>>
>> On Mon, 23 Oct 2017, Christoph Paasch wrote:
>>
>>> Hello Mat,
>>>
>>> On 20/10/17 - 16:02:31, Mat Martineau wrote:
>>>> The sk_buff control buffer is of limited size, and cannot be enlarged
>>>> without significant impact on systemwide memory use. However, additional
>>>> per-packet state is needed for some protocols, like Multipath TCP.
>>>>
>>>> An optional shared control buffer placed after the normal struct
>>>> skb_shared_info can accommodate the necessary state without imposing
>>>> extra memory usage or code changes on normal struct sk_buff
>>>> users. __alloc_skb will now place a skb_shared_info_ext structure at
>>>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>>>> sk_buff continue to use the skb_shinfo() macro to access shared
>>>> info. skb_shinfo(skb)->is_ext is set if the extended structure is
>>>> available, and cleared if it is not.
>>>>
>>>> pskb_expand_head will preserve the shared control buffer if it is present.
>>>>
>>>> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
>>>> ---
>>>>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>>>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>>>>  2 files changed, 66 insertions(+), 14 deletions(-)
>>>
>>> While digging through your patch, I realized that there could be one issue.
>>>
>>> In many places in the code we use sizeof(struct skb_shared_info) to compute
>>> the overhead (e.g., see tcp_sndbuf_expand).
>>>
>>> There are also countless users of sizeof(struct skb_shared_info) in the
>>> drivers. They all seem to be in the rx-path, so it should be fine.
>>>
>>> But the prevalent use of skb_shared_info throughout the stack seems a bit
>>> scary to me. My concern here is that the driver assumes that the overhead is
>>> always sizeof(skb_shared_info) and thus underestimates the size of the skb.
>>>
>>> At least, in tcp_sndbuf_expand() the overhead-estimation would be wrong. I
>>> don't think the consequences will be catastrophic, but it's something to
>>> think about.
>>>
>>>
>>> What do you think?
>>>
>>
>> I saw two typical cases when I looked around existing code:
>>
>> 1. Drivers doing calculations before calling build_skb() on the rx path.
>>
>> 2. Various code making size estimates before creating skbs.
>>
>>
>> For #1, build_skb() is still creating "normal" skbs with skb_shared_info, so
>> their calculations remain valid.
>>
>> Similarly, code that uses size estimates for #2 goes on to allocate "normal"
>> skbs, so their estimates remain valid. Code that allocates extended skbs
>> would need another set of extended skb macros to make better estimates.
>>
>> tcp_sndbuf_expand() is a little different, since it changes sk_sndbuf which
>> could impact behavior when longer skbs are in use. In this case, it still
>> seems like a fairly coarse estimate (rounding up to a power of two and then
>> multiplying). I'm not worried about this use, but I do want to understand it
>> better before upstreaming. If it's important to add in the extra space for
>> sockets using skb_shared_info_ext, that could be accounted for in
>> tcp_sndbuf_expand by adding a flag to tcp_sock.
>
> I agree that tcp_sndbuf_expand() should be fine.
>
>> Anywhere that allocates extended skbs will involve new code, and if that
>> code needs to make size estimates it should take the extra overhead into
>> account. For existing code that may handle (rather than allocate) skbs with
>> extended shared data, these skbs will be truthful about their truesize and
>> will be freed correctly.
>>
>> I still want to review this approach carefully and do additional testing. I
>> should run the patch by some network driver authors too.
>
> How do you want to proceed for submitting it upstream? I fear that without a
> user of skb_shared_info_ext, netdev won't accept it.
>
> Shouldn't we try to find a user that would benefit from it?

I was expecting to wait for an MPTCP patch set that would make use of the 
shared control block, but I agree it would be better to find an existing 
struct sk_buff user that would benefit. I'll look around for a protocol 
with a crowded skb->cb - there could be some cases where there are painful 
workarounds for the size constraints.

>
> Btw., I am trying to get the MD5-changes/extra-option changes ready. But I
> got hooked up in other pieces last week...

Thanks!

Mat


>
> Christoph
>
>>
>>
>> Thanks,
>> Mat
>>
>>
>>>>
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 03634ec2f918..873910c66df9 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>>>>   * the end of the header data, ie. at skb->end.
>>>>   */
>>>>  struct skb_shared_info {
>>>> -	__u8		__unused;
>>>> +	__u8		is_ext:1,
>>>> +			__unused:7;
>>>>  	__u8		meta_len;
>>>>  	__u8		nr_frags;
>>>>  	__u8		tx_flags;
>>>> @@ -530,6 +531,24 @@ struct skb_shared_info {
>>>>  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>>>>
>>>>
>>>> +/* This is an extended version of skb_shared_info, also invariant across
>>>> + * clones and living at the end of the header data.
>>>> + */
>>>> +struct skb_shared_info_ext {
>>>> +	/* skb_shared_info must be the first member */
>>>> +	struct skb_shared_info	shinfo;
>>>> +
>>>> +	/* This is the shared control buffer. It is similar to sk_buff's
>>>> +	 * control buffer, but is shared across clones. It must not be
>>>> +	 * modified when multiple sk_buffs are referencing this structure.
>>>> +	 */
>>>> +	char			shcb[48];
>>>> +};
>>>> +
>>>> +#define SKB_SHINFO_EXT_OVERHEAD	\
>>>> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
>>>> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>>>> +
>>>>  enum {
>>>>  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>>>>  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
>>>> @@ -856,6 +875,7 @@ struct sk_buff {
>>>>  #define SKB_ALLOC_FCLONE	0x01
>>>>  #define SKB_ALLOC_RX		0x02
>>>>  #define SKB_ALLOC_NAPI		0x04
>>>> +#define SKB_ALLOC_SHINFO_EXT	0x08
>>>>
>>>>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>>>>  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>>>> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>>>>
>>>>  /* Internal */
>>>>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
>>>> +#define skb_shinfo_ext(SKB)	\
>>>> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>>>>
>>>>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>>>>  {
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 40717501cbdd..397edd5c0613 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>>>>   *		instead of head cache and allocate a cloned (child) skb.
>>>>   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>>>>   *		allocations in case the data is required for writeback
>>>> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
>>>> + *		with an extended shared info struct.
>>>>   *	@node: numa node to allocate memory on
>>>>   *
>>>>   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
>>>> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  			    int flags, int node)
>>>>  {
>>>>  	struct kmem_cache *cache;
>>>> -	struct skb_shared_info *shinfo;
>>>>  	struct sk_buff *skb;
>>>>  	u8 *data;
>>>> +	unsigned int shinfo_size;
>>>>  	bool pfmemalloc;
>>>>
>>>>  	cache = (flags & SKB_ALLOC_FCLONE)
>>>> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	/* We do our best to align skb_shared_info on a separate cache
>>>>  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>>>>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>>>> -	 * Both skb->head and skb_shared_info are cache line aligned.
>>>> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
>>>> +	 * cache line aligned.
>>>>  	 */
>>>>  	size = SKB_DATA_ALIGN(size);
>>>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>>> +	else
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>>>>  	if (!data)
>>>>  		goto nodata;
>>>>  	/* kmalloc(size) might give us more room than requested.
>>>>  	 * Put skb_shared_info exactly at the end of allocated zone,
>>>>  	 * to allow max possible filling before reallocation.
>>>>  	 */
>>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>>> +	size = ksize(data) - shinfo_size;
>>>>  	prefetchw(data + size);
>>>>
>>>>  	/*
>>>> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	 */
>>>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>>>>  	/* Account for allocated memory : skb + skb->head */
>>>> -	skb->truesize = SKB_TRUESIZE(size);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>>> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
>>>> +	else
>>>> +		skb->truesize = SKB_TRUESIZE(size);
>>>>  	skb->pfmemalloc = pfmemalloc;
>>>>  	refcount_set(&skb->users, 1);
>>>>  	skb->head = data;
>>>> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>>  	skb->transport_header = (typeof(skb->transport_header))~0U;
>>>>
>>>>  	/* make sure we initialize shinfo sequentially */
>>>> -	shinfo = skb_shinfo(skb);
>>>> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>>> -	atomic_set(&shinfo->dataref, 1);
>>>> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
>>>> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
>>>> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
>>>> +		shinfo_ext->shinfo.is_ext = 1;
>>>> +		memset(&shinfo_ext->shinfo.meta_len, 0,
>>>> +		       offsetof(struct skb_shared_info, dataref) -
>>>> +		       offsetof(struct skb_shared_info, meta_len));
>>>> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
>>>> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
>>>> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
>>>> +	} else {
>>>> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
>>>> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>>> +		atomic_set(&shinfo->dataref, 1);
>>>> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
>>>> +	}
>>>>
>>>>  	if (flags & SKB_ALLOC_FCLONE) {
>>>>  		struct sk_buff_fclones *fclones;
>>>> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>  {
>>>>  	int i, osize = skb_end_offset(skb);
>>>>  	int size = osize + nhead + ntail;
>>>> +	int shinfo_size;
>>>>  	long off;
>>>>  	u8 *data;
>>>>
>>>> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>
>>>>  	if (skb_pfmemalloc(skb))
>>>>  		gfp_mask |= __GFP_MEMALLOC;
>>>> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
>>>> -			       gfp_mask, NUMA_NO_NODE, NULL);
>>>> +	if (skb_shinfo(skb)->is_ext)
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>>> +	else
>>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>>>>  	if (!data)
>>>>  		goto nodata;
>>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>>> +	size = ksize(data) - shinfo_size;
>>>>
>>>>  	/* Copy only real data... and, alas, header. This should be
>>>>  	 * optimized for the cases when header is void.
>>>> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>>  	memcpy((struct skb_shared_info *)(data + size),
>>>>  	       skb_shinfo(skb),
>>>>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>>>> +	if (skb_shinfo(skb)->is_ext) {
>>>> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
>>>> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
>>>> +		       &skb_shinfo_ext(skb)->shcb,
>>>> +		       sizeof(skb_shinfo_ext(skb)->shcb));
>>>> +	}
>>>>
>>>>  	/*
>>>>  	 * if shinfo is shared we must drop the old head gracefully, but if it
>>>> --
>>>> 2.14.2
>>>>
>>>> _______________________________________________
>>>> mptcp mailing list
>>>> mptcp(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/mptcp
>>>
>>
>> --
>> Mat Martineau
>> Intel OTC
>

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-30  4:16 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-10-30  4:16 UTC (permalink / raw)
  To: mptcp


On 23/10/17 - 15:51:26, Mat Martineau wrote:
> 
> Hi Christoph,
> 
> On Mon, 23 Oct 2017, Christoph Paasch wrote:
> 
> > Hello Mat,
> > 
> > On 20/10/17 - 16:02:31, Mat Martineau wrote:
> > > The sk_buff control buffer is of limited size, and cannot be enlarged
> > > without significant impact on systemwide memory use. However, additional
> > > per-packet state is needed for some protocols, like Multipath TCP.
> > > 
> > > An optional shared control buffer placed after the normal struct
> > > skb_shared_info can accommodate the necessary state without imposing
> > > extra memory usage or code changes on normal struct sk_buff
> > > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > > sk_buff continue to use the skb_shinfo() macro to access shared
> > > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > > available, and cleared if it is not.
> > > 
> > > pskb_expand_head will preserve the shared control buffer if it is present.
> > > 
> > > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > > ---
> > >  include/linux/skbuff.h | 24 +++++++++++++++++++++-
> > >  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> > >  2 files changed, 66 insertions(+), 14 deletions(-)
> > 
> > While digging through your patch, I realized that there could be one issue.
> > 
> > In many places in the code we use sizeof(struct skb_shared_info) to compute
> > the overhead (e.g., see tcp_sndbuf_expand).
> > 
> > There are also countless users of sizeof(struct skb_shared_info) in the
> > drivers. They all seem to be in the rx-path, so it should be fine.
> > 
> > But the prevalent use of skb_shared_info throughout the stack seems a bit
> > scary to me. My concern here is that the driver assumes that the overhead is
> > always sizeof(skb_shared_info) and thus underestimates the size of the skb.
> > 
> > At least, in tcp_sndbuf_expand() the overhead-estimation would be wrong. I
> > don't think the consequences will be catastrophic, but it's something to
> > think about.
> > 
> > 
> > What do you think?
> > 
> 
> I saw two typical cases when I looked around existing code:
> 
> 1. Drivers doing calculations before calling build_skb() on the rx path.
> 
> 2. Various code making size estimates before creating skbs.
> 
> 
> For #1, build_skb() is still creating "normal" skbs with skb_shared_info, so
> their calculations remain valid.
> 
> Similarly, code that uses size estimates for #2 goes on to allocate "normal"
> skbs, so their estimates remain valid. Code that allocates extended skbs
> would need another set of extended skb macros to make better estimates.
> 
> tcp_sndbuf_expand() is a little different, since it changes sk_sndbuf which
> could impact behavior when longer skbs are in use. In this case, it still
> seems like a fairly coarse estimate (rounding up to a power of two and then
> multiplying). I'm not worried about this use, but I do want to understand it
> better before upstreaming. If it's important to add in the extra space for
> sockets using skb_shared_info_ext, that could be accounted for in
> tcp_sndbuf_expand by adding a flag to tcp_sock.

I agree that tcp_sndbuf_expand() should be fine.

> Anywhere that allocates extended skbs will involve new code, and if that
> code needs to make size estimates it should take the extra overhead into
> account. For existing code that may handle (rather than allocate) skbs with
> extended shared data, these skbs will be truthful about their truesize and
> will be freed correctly.
> 
> I still want to review this approach carefully and do additional testing. I
> should run the patch by some network driver authors too.

How do you want to proceed for submitting it upstream? I fear that without a
user of skb_shared_info_ext, netdev won't accept it.

Shouldn't we try to find a user that would benefit from it?


Btw., I am trying to get the MD5-changes/extra-option changes ready. But I
got caught up in other things last week...


Christoph

> 
> 
> Thanks,
> Mat
> 
> 
> > > 
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 03634ec2f918..873910c66df9 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
> > >   * the end of the header data, ie. at skb->end.
> > >   */
> > >  struct skb_shared_info {
> > > -	__u8		__unused;
> > > +	__u8		is_ext:1,
> > > +			__unused:7;
> > >  	__u8		meta_len;
> > >  	__u8		nr_frags;
> > >  	__u8		tx_flags;
> > > @@ -530,6 +531,24 @@ struct skb_shared_info {
> > >  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
> > > 
> > > 
> > > +/* This is an extended version of skb_shared_info, also invariant across
> > > + * clones and living at the end of the header data.
> > > + */
> > > +struct skb_shared_info_ext {
> > > +	/* skb_shared_info must be the first member */
> > > +	struct skb_shared_info	shinfo;
> > > +
> > > +	/* This is the shared control buffer. It is similar to sk_buff's
> > > +	 * control buffer, but is shared across clones. It must not be
> > > +	 * modified when multiple sk_buffs are referencing this structure.
> > > +	 */
> > > +	char			shcb[48];
> > > +};
> > > +
> > > +#define SKB_SHINFO_EXT_OVERHEAD	\
> > > +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> > > +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> > > +
> > >  enum {
> > >  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
> > >  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> > > @@ -856,6 +875,7 @@ struct sk_buff {
> > >  #define SKB_ALLOC_FCLONE	0x01
> > >  #define SKB_ALLOC_RX		0x02
> > >  #define SKB_ALLOC_NAPI		0x04
> > > +#define SKB_ALLOC_SHINFO_EXT	0x08
> > > 
> > >  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> > >  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> > > @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
> > > 
> > >  /* Internal */
> > >  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> > > +#define skb_shinfo_ext(SKB)	\
> > > +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
> > > 
> > >  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
> > >  {
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 40717501cbdd..397edd5c0613 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
> > >   *		instead of head cache and allocate a cloned (child) skb.
> > >   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
> > >   *		allocations in case the data is required for writeback
> > > + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> > > + *		with an extended shared info struct.
> > >   *	@node: numa node to allocate memory on
> > >   *
> > >   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> > > @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >  			    int flags, int node)
> > >  {
> > >  	struct kmem_cache *cache;
> > > -	struct skb_shared_info *shinfo;
> > >  	struct sk_buff *skb;
> > >  	u8 *data;
> > > +	unsigned int shinfo_size;
> > >  	bool pfmemalloc;
> > > 
> > >  	cache = (flags & SKB_ALLOC_FCLONE)
> > > @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >  	/* We do our best to align skb_shared_info on a separate cache
> > >  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
> > >  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > > -	 * Both skb->head and skb_shared_info are cache line aligned.
> > > +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> > > +	 * cache line aligned.
> > >  	 */
> > >  	size = SKB_DATA_ALIGN(size);
> > > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > > +	else
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
> > >  	if (!data)
> > >  		goto nodata;
> > >  	/* kmalloc(size) might give us more room than requested.
> > >  	 * Put skb_shared_info exactly at the end of allocated zone,
> > >  	 * to allow max possible filling before reallocation.
> > >  	 */
> > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > +	size = ksize(data) - shinfo_size;
> > >  	prefetchw(data + size);
> > > 
> > >  	/*
> > > @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >  	 */
> > >  	memset(skb, 0, offsetof(struct sk_buff, tail));
> > >  	/* Account for allocated memory : skb + skb->head */
> > > -	skb->truesize = SKB_TRUESIZE(size);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > > +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> > > +	else
> > > +		skb->truesize = SKB_TRUESIZE(size);
> > >  	skb->pfmemalloc = pfmemalloc;
> > >  	refcount_set(&skb->users, 1);
> > >  	skb->head = data;
> > > @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >  	skb->transport_header = (typeof(skb->transport_header))~0U;
> > > 
> > >  	/* make sure we initialize shinfo sequentially */
> > > -	shinfo = skb_shinfo(skb);
> > > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > -	atomic_set(&shinfo->dataref, 1);
> > > -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> > > +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> > > +		shinfo_ext->shinfo.is_ext = 1;
> > > +		memset(&shinfo_ext->shinfo.meta_len, 0,
> > > +		       offsetof(struct skb_shared_info, dataref) -
> > > +		       offsetof(struct skb_shared_info, meta_len));
> > > +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> > > +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> > > +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> > > +	} else {
> > > +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > > +		atomic_set(&shinfo->dataref, 1);
> > > +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> > > +	}
> > > 
> > >  	if (flags & SKB_ALLOC_FCLONE) {
> > >  		struct sk_buff_fclones *fclones;
> > > @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > >  {
> > >  	int i, osize = skb_end_offset(skb);
> > >  	int size = osize + nhead + ntail;
> > > +	int shinfo_size;
> > >  	long off;
> > >  	u8 *data;
> > > 
> > > @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > 
> > >  	if (skb_pfmemalloc(skb))
> > >  		gfp_mask |= __GFP_MEMALLOC;
> > > -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> > > -			       gfp_mask, NUMA_NO_NODE, NULL);
> > > +	if (skb_shinfo(skb)->is_ext)
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > > +	else
> > > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
> > >  	if (!data)
> > >  		goto nodata;
> > > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > > +	size = ksize(data) - shinfo_size;
> > > 
> > >  	/* Copy only real data... and, alas, header. This should be
> > >  	 * optimized for the cases when header is void.
> > > @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > >  	memcpy((struct skb_shared_info *)(data + size),
> > >  	       skb_shinfo(skb),
> > >  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> > > +	if (skb_shinfo(skb)->is_ext) {
> > > +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> > > +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> > > +		       &skb_shinfo_ext(skb)->shcb,
> > > +		       sizeof(skb_shinfo_ext(skb)->shcb));
> > > +	}
> > > 
> > >  	/*
> > >  	 * if shinfo is shared we must drop the old head gracefully, but if it
> > > --
> > > 2.14.2
> > > 
> > > _______________________________________________
> > > mptcp mailing list
> > > mptcp(a)lists.01.org
> > > https://lists.01.org/mailman/listinfo/mptcp
> > 
> 
> --
> Mat Martineau
> Intel OTC

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-27 19:57 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-10-27 19:57 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 8032 bytes --]

Hello,

On 26/10/17 - 15:26:00, Rao Shoaib wrote:
> 
> Hullo,
> 
> On 10/23/2017 04:10 PM, Mat Martineau wrote:
> > 
> > Yes, I'm interested too! I know David Miller is interested in making
> > good use of space in data structures:
> > https://netdevconf.org/2.2/session.html?miller-datastructurebloat-keynote
> > 
> > If there's a lot of unnecessary space being used they'd probably want to
> > reclaim it and then we'd need the extra shared info again :)
> > 
> Thanks for the pointer, I will take a look.
> Take a look at my last couple of patches, they are all using regular size
> CB. The relevant changes are in mptcp_skb_entail() and
> mptcp_write_dss_mapping() and the obvious  tcp_skb_cb and sk_buff
> structures. The change is to calculate the mapping when the header is being
> written not before.
> 
> I am working on net-next 4.14.0-rc4 and have been able to achieve the same.
> Here are the changes to the structure, cut and paste of the relevant code
> from any patch just works.
> 
> The change to tcp_skb_cb is
> 
>         __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
>         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
>                         eor:1,          /* Is skb MSG_EOR marked? */
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
>                         unused:5;
>         union {
>                         __u32   ack_seq;        /* Sequence number ACK'd */
>                         __u32   mptcp_data_seq;
>                         __u32   mptcp_path_mask;
>         };
> 
> And used a scratch field in sk_buff, these are only needed on the Rx side.

I would love to see the rest of the patch, especially wrt writing the
mapping to the header.

How do you handle segments that are being split while sitting in the
subflow's send-queue and later on need to be retransmitted. Will the
retransmitted segment have the same DSS-mapping as the original
transmission?


Thanks,
Christoph

> 
> struct sk_buff {
>         union {
>                 struct {
>                         /* These two members must be first. */
>                         struct sk_buff          *next;
>                         struct sk_buff          *prev;
> 
>                         union {
>                                 struct net_device       *dev;
>                                 /* Some protocols might use this space to
> store information,
>                                  * while device pointer would be NULL.
>                                  * UDP receive path is one user.
>                                  */
>                                 unsigned long           dev_scratch;
>                                 struct {
>                                         __u8 mptcp_flags;
>                                         __u8 mptcp_dss_off;
>                                 };
>                         };
>                 };
>                 struct rb_node  rbnode; /* used in netem & tcp stack */
>         };
> 
> And here is a tcpdump. Host 192.168.1.3 is running net-next "4.14.0 Fearless
> Coyote" and host 192.168.1.32 is running "4.4.88 Blurry Fish Butt"
> 
> 
> 15:21:28.581388 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [S], seq
> 1515497212, win 29200, options [mss 1460,sackOK,TS val 2460468540 ecr
> 0,nop,wscale 7,mptcp capable csum {0x26b923d415f6c4d4}], length 0
> 15:21:28.581449 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [S.], seq
> 3412665074, ack 1515497213, win 28560, options [mss 1460,sackOK,TS val
> 25517800 ecr 2460468540,nop,wscale 7,mptcp capable csum
> {0x8fd2dd8f92b5f72c}], length 0
> 15:21:28.582533 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1:42, ack 1, win 229, options [nop,nop,TS val 2460468541 ecr 25517800],
> length 41
> 15:21:28.582612 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 42,
> win 224, options [nop,nop,TS val 25517800 ecr 2460468541], length 0
> 15:21:28.591889 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq
> 1:42, ack 42, win 224, options [nop,nop,TS val 25517802 ecr 2460468541],
> length 41
> 15:21:28.592922 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 42:1378, ack 42, win 229, options [nop,nop,TS val 2460468552 ecr 25517802],
> length 1336
> 15:21:28.593028 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq
> 42:1018, ack 42, win 224, options [nop,nop,TS val 25517803 ecr 2460468552],
> length 976
> 15:21:28.633043 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack
> 1378, win 246, options [nop,nop,TS val 25517813 ecr 2460468552], length 0
> 15:21:28.633956 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1378:1426, ack 1018, win 244, options [nop,nop,TS val 2460468593 ecr
> 25517803], length 48
> 15:21:28.634012 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack
> 1426, win 246, options [nop,nop,TS val 25517813 ecr 2460468593], length 0
> 15:21:28.647684 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq
> 1018:1382, ack 1426, win 246, options [nop,nop,TS val 25517816 ecr
> 2460468593], length 364
> 15:21:28.653439 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1426:1442, ack 1382, win 259, options [nop,nop,TS val 2460468612 ecr
> 25517816], length 16
> 15:21:28.692996 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack
> 1442, win 246, options [nop,nop,TS val 25517828 ecr 2460468612], length 0
> 15:21:28.693584 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1442:1486, ack 1382, win 259, options [nop,nop,TS val 2460468652 ecr
> 25517828], length 44
> 15:21:28.693632 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack
> 1486, win 246, options [nop,nop,TS val 25517828 ecr 2460468652], length 0
> 15:21:28.693706 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq
> 1382:1426, ack 1486, win 246, options [nop,nop,TS val 25517828 ecr
> 2460468652], length 44
> 15:21:28.694310 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1486:1554, ack 1426, win 259, options [nop,nop,TS val 2460468653 ecr
> 25517828], length 68
> 15:21:28.694976 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq
> 1426:1478, ack 1554, win 246, options [nop,nop,TS val 25517828 ecr
> 2460468653], length 52
> 15:21:28.737617 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [.], ack
> 1478, win 259, options [nop,nop,TS val 2460468654 ecr 25517828], length 0
> 15:21:31.599860 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq
> 1554:1702, ack 1478, win 259, options [nop,nop,TS val 2460471559 ecr
> 25517828], length 148
> 
> I have some other ideas that I want to try but right now I focused on
> getting something working so we can all start hacking.
> 
> Rao
> 
> 
> > > I had to bump cb up to 80 bytes in the current mptcp_trunk :/
> > > 
> > > 
> > > Christoph
> > > 
> > > > FYI. I am working on restructuring MPTCP code so it's not
> > > > intrusive to main
> > > > TCP code. I will probably try a few other techniques but can use
> > > > this as
> > > > well. I will post the changes at the appropriate time.
> > 
> > Nice, thanks for the update.
> > 
> > Regards,
> > Mat
> 


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-27 18:19 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-27 18:19 UTC (permalink / raw)
  To: mptcp



On Thu, 26 Oct 2017, Rao Shoaib wrote:

> 
> 
> Hullo,
> 
> On 10/23/2017 04:10 PM, Mat Martineau wrote:
>
>       Yes, I'm interested too! I know David Miller is interested in making good use of space in data structures:
>       https://netdevconf.org/2.2/session.html?miller-datastructurebloat-keynote
>
>       If there's a lot of unnecessary space being used they'd probably want to reclaim it and then we'd need the
>       extra shared info again :)
> 
> Thanks for the pointer, I will take a look.
> Take a look at my last couple of patches, they are all using regular size CB. The relevant changes are in
> mptcp_skb_entail() and mptcp_write_dss_mapping() and the obvious  tcp_skb_cb and sk_buff structures. The change is to
> calculate the mapping when the header is being written not before.
> 
> I am working on net-next 4.14.0-rc4 and have been able to achieve the same. Here are the changes to the structure, cut and
> paste of the relevant code from any patch just works.
> 
> The change to tcp_skb_cb is
>
>         __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
>         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
>                         eor:1,          /* Is skb MSG_EOR marked? */
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
>                         unused:5;
>         union {
>                         __u32   ack_seq;        /* Sequence number ACK'd */
>                         __u32   mptcp_data_seq;
>                         __u32   mptcp_path_mask;
>         };

Some comments here to let people know when each part of the union is used 
(input vs output) would be helpful.

> And used a scratch field in sk_buff, these are only needed on the Rx side.
> 
> struct sk_buff {
>         union {
>                 struct {
>                         /* These two members must be first. */
>                         struct sk_buff          *next;
>                         struct sk_buff          *prev;
>
>                         union {
>                                 struct net_device       *dev;
>                                 /* Some protocols might use this space to store information,
>                                  * while device pointer would be NULL.
>                                  * UDP receive path is one user.
>                                  */
>                                 unsigned long           dev_scratch;
>                                 struct {
>                                         __u8 mptcp_flags;
>                                         __u8 mptcp_dss_off;
>                                 };

To make this more upstream-friendly, it's important to use dev_scratch 
without adding MPTCP-specific stuff to this union. Look at the use of 
struct udp_dev_scratch and udp_skb_scratch(); a similar approach would 
work for MPTCP.

>                         };
>                 };
>                 struct rb_node  rbnode; /* used in netem & tcp stack */
>         };
>

--
Mat Martineau
Intel OTC


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-26 23:20 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-10-26 23:20 UTC (permalink / raw)
  To: mptcp


Please disregard the earlier tcpdump, as I just noticed that the 
connection was falling back to regular TCP. That trace was taken on a 
kernel that I am still working on. However, the change does work; here 
is a tcpdump from an older working kernel.

16:15:51.392175 ARP, Request who-has 192.168.1.32 tell 192.168.1.3, 
length 46
16:15:51.392201 ARP, Reply 192.168.1.32 is-at 00:25:64:d7:d7:db (oui 
Unknown), length 28
16:15:51.392404 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [S], seq 
2312071799, win 29200, options [mss 1460,sackOK,TS val 4294909772 ecr 
0,nop,wscale 7,mptcp capable csum {0xf7920474e5c2051e}], length 0
16:15:51.392474 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [S.], seq 
3881715284, ack 2312071800, win 28560, options [mss 1460,sackOK,TS val 
26333502 ecr 4294909772,nop,wscale 7,mptcp capable csum 
{0x902725838161c390}], length 0
16:15:51.392701 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [.], ack 
1, win 229, options [nop,nop,TS val 4294909772 ecr 26333502,mptcp 
capable csum {0xf7920474e5c2051e,0x902725838161c390},mptcp dss ack 
1863922122], length 0
16:15:51.392822 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [P.], seq 
1:42, ack 1, win 229, options [nop,nop,TS val 4294909772 ecr 
26333502,mptcp dss ack 1863922122 seq 3903772137 subseq 1 len 41 csum 
0x7389], length 41
16:15:51.392853 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [.], ack 
42, win 224, options [nop,nop,TS val 26333502 ecr 4294909772,mptcp dss 
ack 3903772178], length 0
16:15:51.403073 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [P.], seq 
1:42, ack 42, win 224, options [nop,nop,TS val 26333505 ecr 
4294909772,mptcp dss ack 3903772178 seq 1863922122 subseq 1 len 41 csum 
0x91a9], length 41
16:15:51.403353 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [.], ack 
42, win 229, options [nop,nop,TS val 4294909775 ecr 26333502,mptcp dss 
ack 1863922163], length 0
16:15:51.404010 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [P.], seq 
42:1018, ack 42, win 224, options [nop,nop,TS val 26333505 ecr 
4294909775,mptcp dss ack 3903772178 seq 1863922163 subseq 42 len 976 
csum 0xa893], length 976
16:15:51.404260 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [.], ack 
1018, win 244, options [nop,nop,TS val 4294909775 ecr 26333502,mptcp dss 
ack 1863923139], length 0
16:15:51.424501 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [P.], seq 
42:1378, ack 1018, win 244, options [nop,nop,TS val 4294909780 ecr 
26333502,mptcp dss ack 1863923139 seq 3903772178 subseq 42 len 1336 csum 
0x8a2c], length 1336
16:15:51.460964 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [.], ack 
1378, win 246, options [nop,nop,TS val 26333520 ecr 4294909780,mptcp dss 
ack 3903773514], length 0
16:15:51.461176 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [P.], seq 
1378:1426, ack 1018, win 244, options [nop,nop,TS val 4294909789 ecr 
26333520,mptcp dss ack 1863923139 seq 3903773514 subseq 1378 len 48 csum 
0x8282], length 48
16:15:51.461201 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [.], ack 
1426, win 246, options [nop,nop,TS val 26333520 ecr 4294909789,mptcp dss 
ack 3903773562], length 0
16:15:51.468913 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [P.], seq 
1018:1382, ack 1426, win 246, options [nop,nop,TS val 26333521 ecr 
4294909789,mptcp dss ack 3903773562 seq 1863923139 subseq 1018 len 364 
csum 0x9409], length 364
16:15:51.471589 IP 192.168.1.3.52119 > 192.168.1.32.ssh: Flags [P.], seq 
1426:1442, ack 1382, win 250, options [nop,nop,TS val 4294909792 ecr 
26333520,mptcp dss ack 1863923503 seq 3903773562 subseq 1426 len 16 csum 
0x8287], length 16
16:15:51.509044 IP 192.168.1.32.ssh > 192.168.1.3.52119: Flags [.], ack 
1442, win 246, options [nop,nop,TS val 26333532 ecr 4294909792,mptcp dss 
ack 3903773578], length 0

Rao.

On 10/26/2017 03:26 PM, Rao Shoaib wrote:
>
>
> Hullo,
>
> On 10/23/2017 04:10 PM, Mat Martineau wrote:
>>
>> Yes, I'm interested too! I know David Miller is interested in making 
>> good use of space in data structures: 
>> https://netdevconf.org/2.2/session.html?miller-datastructurebloat-keynote 
>>
>>
>> If there's a lot of unnecessary space being used they'd probably want 
>> to reclaim it and then we'd need the extra shared info again :)
>>
> Thanks for the pointer, I will take a look.
> Take a look at my last couple of patches, they are all using regular 
> size CB. The relevant changes are in mptcp_skb_entail() and 
> mptcp_write_dss_mapping() and the obvious  tcp_skb_cb and sk_buff 
> structures. The change is to calculate the mapping when the header is 
> being written not before.
>
> I am working on net-next 4.14.0-rc4 and have been able to achieve the 
> same. Here are the changes to the structure, cut and paste of the 
> relevant code from any patch just works.
>
> The change to tcp_skb_cb is
>
>         __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
>         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
>                         eor:1,          /* Is skb MSG_EOR marked? */
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
>                         unused:5;
>         union {
>                         __u32   ack_seq;        /* Sequence number 
> ACK'd */
>                         __u32   mptcp_data_seq;
>                         __u32   mptcp_path_mask;
>         };
>
> And used a scratch field in sk_buff, these are only needed on the Rx side.
>
> struct sk_buff {
>         union {
>                 struct {
>                         /* These two members must be first. */
>                         struct sk_buff          *next;
>                         struct sk_buff          *prev;
>
>                         union {
>                                 struct net_device       *dev;
>                                 /* Some protocols might use this space 
> to store information,
>                                  * while device pointer would be NULL.
>                                  * UDP receive path is one user.
>                                  */
>                                 unsigned long dev_scratch;
>                                 struct {
>                                         __u8 mptcp_flags;
>                                         __u8 mptcp_dss_off;
>                                 };
>                         };
>                 };
>                 struct rb_node  rbnode; /* used in netem & tcp stack */
>         };
>
> And here is a tcpdump. Host 192.168.1.3 is running net-next "4.14.0 
> Fearless Coyote" and host 192.168.1.32 is running "4.4.88 Blurry Fish 
> Butt"
>
>
> 15:21:28.581388 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [S], 
> seq 1515497212, win 29200, options [mss 1460,sackOK,TS val 2460468540 
> ecr 0,nop,wscale 7,mptcp capable csum {0x26b923d415f6c4d4}], length 0
> 15:21:28.581449 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [S.], 
> seq 3412665074, ack 1515497213, win 28560, options [mss 1460,sackOK,TS 
> val 25517800 ecr 2460468540,nop,wscale 7,mptcp capable csum 
> {0x8fd2dd8f92b5f72c}], length 0
> 15:21:28.582533 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1:42, ack 1, win 229, options [nop,nop,TS val 2460468541 ecr 
> 25517800], length 41
> 15:21:28.582612 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], 
> ack 42, win 224, options [nop,nop,TS val 25517800 ecr 2460468541], 
> length 0
> 15:21:28.591889 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], 
> seq 1:42, ack 42, win 224, options [nop,nop,TS val 25517802 ecr 
> 2460468541], length 41
> 15:21:28.592922 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 42:1378, ack 42, win 229, options [nop,nop,TS val 2460468552 ecr 
> 25517802], length 1336
> 15:21:28.593028 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], 
> seq 42:1018, ack 42, win 224, options [nop,nop,TS val 25517803 ecr 
> 2460468552], length 976
> 15:21:28.633043 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], 
> ack 1378, win 246, options [nop,nop,TS val 25517813 ecr 2460468552], 
> length 0
> 15:21:28.633956 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1378:1426, ack 1018, win 244, options [nop,nop,TS val 2460468593 
> ecr 25517803], length 48
> 15:21:28.634012 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], 
> ack 1426, win 246, options [nop,nop,TS val 25517813 ecr 2460468593], 
> length 0
> 15:21:28.647684 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], 
> seq 1018:1382, ack 1426, win 246, options [nop,nop,TS val 25517816 ecr 
> 2460468593], length 364
> 15:21:28.653439 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1426:1442, ack 1382, win 259, options [nop,nop,TS val 2460468612 
> ecr 25517816], length 16
> 15:21:28.692996 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], 
> ack 1442, win 246, options [nop,nop,TS val 25517828 ecr 2460468612], 
> length 0
> 15:21:28.693584 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1442:1486, ack 1382, win 259, options [nop,nop,TS val 2460468652 
> ecr 25517828], length 44
> 15:21:28.693632 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], 
> ack 1486, win 246, options [nop,nop,TS val 25517828 ecr 2460468652], 
> length 0
> 15:21:28.693706 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], 
> seq 1382:1426, ack 1486, win 246, options [nop,nop,TS val 25517828 ecr 
> 2460468652], length 44
> 15:21:28.694310 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1486:1554, ack 1426, win 259, options [nop,nop,TS val 2460468653 
> ecr 25517828], length 68
> 15:21:28.694976 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], 
> seq 1426:1478, ack 1554, win 246, options [nop,nop,TS val 25517828 ecr 
> 2460468653], length 52
> 15:21:28.737617 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [.], 
> ack 1478, win 259, options [nop,nop,TS val 2460468654 ecr 25517828], 
> length 0
> 15:21:31.599860 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], 
> seq 1554:1702, ack 1478, win 259, options [nop,nop,TS val 2460471559 
> ecr 25517828], length 148
>
> I have some other ideas that I want to try but right now I focused on 
> getting something working so we can all start hacking.
>
> Rao
>
>
>>> I had to bump cb up to 80 bytes in the current mptcp_trunk :/
>>>
>>>
>>> Christoph
>>>
>>>> FYI. I am working on restructuring MPTCP code so it's not intrusive 
>>>> to main
>>>> TCP code. I will probably try a few other techniques but can use 
>>>> this as
>>>> well. I will post the changes at the appropriate time.
>>
>> Nice, thanks for the update.
>>
>> Regards,
>> Mat
>
>
>




* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-26 22:26 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-10-26 22:26 UTC (permalink / raw)
  To: mptcp



Hullo,

On 10/23/2017 04:10 PM, Mat Martineau wrote:
>
> Yes, I'm interested too! I know David Miller is interested in making 
> good use of space in data structures: 
> https://netdevconf.org/2.2/session.html?miller-datastructurebloat-keynote
>
> If there's a lot of unnecessary space being used they'd probably want 
> to reclaim it and then we'd need the extra shared info again :)
>
Thanks for the pointer, I will take a look.
Take a look at my last couple of patches; they all use the regular-size 
CB. The relevant changes are in mptcp_skb_entail(), 
mptcp_write_dss_mapping(), and the obvious tcp_skb_cb and sk_buff 
structures. The change is to calculate the mapping when the header is 
being written, not before.

I am working on net-next 4.14.0-rc4 and have been able to achieve the 
same. Here are the changes to the structures; cutting and pasting the 
relevant code from any patch just works.

The change to tcp_skb_cb is

         __u8            ip_dsfield;     /* IPv4 tos or IPv6 dsfield     */
         __u8            txstamp_ack:1,  /* Record TX timestamp for ack? */
                         eor:1,          /* Is skb MSG_EOR marked? */
                         has_rxtstamp:1, /* SKB has a RX timestamp       */
                         unused:5;
         union {
                         __u32   ack_seq;        /* Sequence number ACK'd */
                         __u32   mptcp_data_seq;
                         __u32   mptcp_path_mask;
         };

And used a scratch field in sk_buff, these are only needed on the Rx side.

struct sk_buff {
         union {
                 struct {
                         /* These two members must be first. */
                         struct sk_buff          *next;
                         struct sk_buff          *prev;

                         union {
                                 struct net_device       *dev;
                                 /* Some protocols might use this space 
to store information,
                                  * while device pointer would be NULL.
                                  * UDP receive path is one user.
                                  */
                                 unsigned long           dev_scratch;
                                 struct {
                                         __u8 mptcp_flags;
                                         __u8 mptcp_dss_off;
                                 };
                         };
                 };
                 struct rb_node  rbnode; /* used in netem & tcp stack */
         };

And here is a tcpdump. Host 192.168.1.3 is running net-next "4.14.0 
Fearless Coyote" and host 192.168.1.32 is running "4.4.88 Blurry Fish Butt":


15:21:28.581388 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [S], seq 
1515497212, win 29200, options [mss 1460,sackOK,TS val 2460468540 ecr 
0,nop,wscale 7,mptcp capable csum {0x26b923d415f6c4d4}], length 0
15:21:28.581449 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [S.], seq 
3412665074, ack 1515497213, win 28560, options [mss 1460,sackOK,TS val 
25517800 ecr 2460468540,nop,wscale 7,mptcp capable csum 
{0x8fd2dd8f92b5f72c}], length 0
15:21:28.582533 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1:42, ack 1, win 229, options [nop,nop,TS val 2460468541 ecr 25517800], 
length 41
15:21:28.582612 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 
42, win 224, options [nop,nop,TS val 25517800 ecr 2460468541], length 0
15:21:28.591889 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq 
1:42, ack 42, win 224, options [nop,nop,TS val 25517802 ecr 2460468541], 
length 41
15:21:28.592922 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
42:1378, ack 42, win 229, options [nop,nop,TS val 2460468552 ecr 
25517802], length 1336
15:21:28.593028 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq 
42:1018, ack 42, win 224, options [nop,nop,TS val 25517803 ecr 
2460468552], length 976
15:21:28.633043 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 
1378, win 246, options [nop,nop,TS val 25517813 ecr 2460468552], length 0
15:21:28.633956 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1378:1426, ack 1018, win 244, options [nop,nop,TS val 2460468593 ecr 
25517803], length 48
15:21:28.634012 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 
1426, win 246, options [nop,nop,TS val 25517813 ecr 2460468593], length 0
15:21:28.647684 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq 
1018:1382, ack 1426, win 246, options [nop,nop,TS val 25517816 ecr 
2460468593], length 364
15:21:28.653439 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1426:1442, ack 1382, win 259, options [nop,nop,TS val 2460468612 ecr 
25517816], length 16
15:21:28.692996 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 
1442, win 246, options [nop,nop,TS val 25517828 ecr 2460468612], length 0
15:21:28.693584 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1442:1486, ack 1382, win 259, options [nop,nop,TS val 2460468652 ecr 
25517828], length 44
15:21:28.693632 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [.], ack 
1486, win 246, options [nop,nop,TS val 25517828 ecr 2460468652], length 0
15:21:28.693706 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq 
1382:1426, ack 1486, win 246, options [nop,nop,TS val 25517828 ecr 
2460468652], length 44
15:21:28.694310 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1486:1554, ack 1426, win 259, options [nop,nop,TS val 2460468653 ecr 
25517828], length 68
15:21:28.694976 IP 192.168.1.32.ssh > 192.168.1.3.40096: Flags [P.], seq 
1426:1478, ack 1554, win 246, options [nop,nop,TS val 25517828 ecr 
2460468653], length 52
15:21:28.737617 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [.], ack 
1478, win 259, options [nop,nop,TS val 2460468654 ecr 25517828], length 0
15:21:31.599860 IP 192.168.1.3.40096 > 192.168.1.32.ssh: Flags [P.], seq 
1554:1702, ack 1478, win 259, options [nop,nop,TS val 2460471559 ecr 
25517828], length 148

I have some other ideas that I want to try, but right now I am focused on 
getting something working so we can all start hacking.

Rao


>> I had to bump cb up to 80 bytes in the current mptcp_trunk :/
>>
>>
>> Christoph
>>
>>> FYI. I am working on restructuring MPTCP code so it's not intrusive 
>>> to main
>>> TCP code. I will probably try a few other techniques but can use 
>>> this as
>>> well. I will post the changes at the appropriate time.
>
> Nice, thanks for the update.
>
> Regards,
> Mat



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 23:10 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-23 23:10 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 11936 bytes --]


Rao and Christoph,

On Mon, 23 Oct 2017, Christoph Paasch wrote:

> Hello,
>
> On 23/10/17 - 12:49:53, Rao Shoaib wrote:
>>
>> On 10/20/2017 04:02 PM, Mat Martineau wrote:
>>> The sk_buff control buffer is of limited size, and cannot be enlarged
>>> without significant impact on systemwide memory use. However, additional
>>> per-packet state is needed for some protocols, like Multipath TCP.
>>>
>>> An optional shared control buffer placed after the normal struct
>>> skb_shared_info can accommodate the necessary state without imposing
>>> extra memory usage or code changes on normal struct sk_buff
>>> users. __alloc_skb will now place a skb_shared_info_ext structure at
>>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>>> sk_buff continue to use the skb_shinfo() macro to access shared
>>> info. skb_shinfo(skb)->is_ext is set if the extended structure is
>>> available, and cleared if it is not.
>>>
>>> pskb_expand_head will preserve the shared control buffer if it is present.
>>
>> Hi Matt,
>>
>> I personally think the change looks good. It does have some minor issues
>> that still need to be resolved.
>>
>> Like Christoph stated, the use of sizeof() is something to think about; for
>> example, the Solarflare driver uses sizeof() on the Tx side (see
>> efx_enqueue_skb_pio). Maybe create a macro, so that the code can be replaced
>> by a call to a macro which looks at the flag and returns the correct size.
>>

In efx_enqueue_skb_pio, they're only interested in verifying that 
skb_shared_info is bigger than a cache line in order to optimize a copy 
and ensure that it doesn't access invalid memory. We only add more valid 
memory at the end of the struct, so their logic still works as designed.

>> There are some issues to look at around connection establishment time and fallback
>> to regular TCP. What if the other side does not support MPTCP?
>
> It should be fine. What happens in that case is that the initiator realizes
> that the other side does not support MPTCP and at that point simply stops
> adding the extra bytes to the SKBs it is pushing down the connection.

I agree.

>> Will we be able to switch to an allocation that does not allocate these extra 48 bytes?
>> BTW we really do not need this change for MPTCP, overloading current fields
>> just works fine; however, the enhancement does open other possibilities.
>
> You were able to reduce the cb-size back down to 48 bytes with overloading
> fields? I would love to see the patch :)

Yes, I'm interested too! I know David Miller is interested in making good 
use of space in data structures: 
https://netdevconf.org/2.2/session.html?miller-datastructurebloat-keynote

If there's a lot of unnecessary space being used they'd probably want to 
reclaim it and then we'd need the extra shared info again :)

> I had to bump cb up to 80 bytes in the current mptcp_trunk :/
>
>
> Christoph
>
>> FYI. I am working on restructuring MPTCP code so it's not intrusive to main
>> TCP code. I will probably try a few other techniques but can use this as
>> well. I will post the changes at the appropriate time.

Nice, thanks for the update.

Regards,
Mat


>> Shoaib
>>
>>>
>>> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
>>> ---
>>>   include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>>   net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>>>   2 files changed, 66 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 03634ec2f918..873910c66df9 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>>>    * the end of the header data, ie. at skb->end.
>>>    */
>>>   struct skb_shared_info {
>>> -	__u8		__unused;
>>> +	__u8		is_ext:1,
>>> +			__unused:7;
>>>   	__u8		meta_len;
>>>   	__u8		nr_frags;
>>>   	__u8		tx_flags;
>>> @@ -530,6 +531,24 @@ struct skb_shared_info {
>>>   #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>>> +/* This is an extended version of skb_shared_info, also invariant across
>>> + * clones and living at the end of the header data.
>>> + */
>>> +struct skb_shared_info_ext {
>>> +	/* skb_shared_info must be the first member */
>>> +	struct skb_shared_info	shinfo;
>>> +
>>> +	/* This is the shared control buffer. It is similar to sk_buff's
>>> +	 * control buffer, but is shared across clones. It must not be
>>> +	 * modified when multiple sk_buffs are referencing this structure.
>>> +	 */
>>> +	char			shcb[48];
>>> +};
>>> +
>>> +#define SKB_SHINFO_EXT_OVERHEAD	\
>>> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
>>> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>>> +
>>>   enum {
>>>   	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>>>   	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
>>> @@ -856,6 +875,7 @@ struct sk_buff {
>>>   #define SKB_ALLOC_FCLONE	0x01
>>>   #define SKB_ALLOC_RX		0x02
>>>   #define SKB_ALLOC_NAPI		0x04
>>> +#define SKB_ALLOC_SHINFO_EXT	0x08
>>>   /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>>>   static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>>> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>>>   /* Internal */
>>>   #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
>>> +#define skb_shinfo_ext(SKB)	\
>>> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>>>   static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>>>   {
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 40717501cbdd..397edd5c0613 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>>>    *		instead of head cache and allocate a cloned (child) skb.
>>>    *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>>>    *		allocations in case the data is required for writeback
>>> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
>>> + *		with an extended shared info struct.
>>>    *	@node: numa node to allocate memory on
>>>    *
>>>    *	Allocate a new &sk_buff. The returned buffer has no headroom and a
>>> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>   			    int flags, int node)
>>>   {
>>>   	struct kmem_cache *cache;
>>> -	struct skb_shared_info *shinfo;
>>>   	struct sk_buff *skb;
>>>   	u8 *data;
>>> +	unsigned int shinfo_size;
>>>   	bool pfmemalloc;
>>>   	cache = (flags & SKB_ALLOC_FCLONE)
>>> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>   	/* We do our best to align skb_shared_info on a separate cache
>>>   	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>>>   	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>>> -	 * Both skb->head and skb_shared_info are cache line aligned.
>>> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
>>> +	 * cache line aligned.
>>>   	 */
>>>   	size = SKB_DATA_ALIGN(size);
>>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>> +	else
>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>>>   	if (!data)
>>>   		goto nodata;
>>>   	/* kmalloc(size) might give us more room than requested.
>>>   	 * Put skb_shared_info exactly at the end of allocated zone,
>>>   	 * to allow max possible filling before reallocation.
>>>   	 */
>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>> +	size = ksize(data) - shinfo_size;
>>>   	prefetchw(data + size);
>>>   	/*
>>> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>   	 */
>>>   	memset(skb, 0, offsetof(struct sk_buff, tail));
>>>   	/* Account for allocated memory : skb + skb->head */
>>> -	skb->truesize = SKB_TRUESIZE(size);
>>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>>> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
>>> +	else
>>> +		skb->truesize = SKB_TRUESIZE(size);
>>>   	skb->pfmemalloc = pfmemalloc;
>>>   	refcount_set(&skb->users, 1);
>>>   	skb->head = data;
>>> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>>   	skb->transport_header = (typeof(skb->transport_header))~0U;
>>>   	/* make sure we initialize shinfo sequentially */
>>> -	shinfo = skb_shinfo(skb);
>>> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>> -	atomic_set(&shinfo->dataref, 1);
>>> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
>>> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
>>> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
>>> +		shinfo_ext->shinfo.is_ext = 1;
>>> +		memset(&shinfo_ext->shinfo.meta_len, 0,
>>> +		       offsetof(struct skb_shared_info, dataref) -
>>> +		       offsetof(struct skb_shared_info, meta_len));
>>> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
>>> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
>>> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
>>> +	} else {
>>> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
>>> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>>> +		atomic_set(&shinfo->dataref, 1);
>>> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
>>> +	}
>>>   	if (flags & SKB_ALLOC_FCLONE) {
>>>   		struct sk_buff_fclones *fclones;
>>> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>   {
>>>   	int i, osize = skb_end_offset(skb);
>>>   	int size = osize + nhead + ntail;
>>> +	int shinfo_size;
>>>   	long off;
>>>   	u8 *data;
>>> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>   	if (skb_pfmemalloc(skb))
>>>   		gfp_mask |= __GFP_MEMALLOC;
>>> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
>>> -			       gfp_mask, NUMA_NO_NODE, NULL);
>>> +	if (skb_shinfo(skb)->is_ext)
>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>>> +	else
>>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>>>   	if (!data)
>>>   		goto nodata;
>>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>>> +	size = ksize(data) - shinfo_size;
>>>   	/* Copy only real data... and, alas, header. This should be
>>>   	 * optimized for the cases when header is void.
>>> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>>   	memcpy((struct skb_shared_info *)(data + size),
>>>   	       skb_shinfo(skb),
>>>   	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>>> +	if (skb_shinfo(skb)->is_ext) {
>>> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
>>> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
>>> +		       &skb_shinfo_ext(skb)->shcb,
>>> +		       sizeof(skb_shinfo_ext(skb)->shcb));
>>> +	}
>>>   	/*
>>>   	 * if shinfo is shared we must drop the old head gracefully, but if it
>>
>> _______________________________________________
>> mptcp mailing list
>> mptcp(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/mptcp
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp
>

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 22:51 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-23 22:51 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 11696 bytes --]


Hi Christoph,

On Mon, 23 Oct 2017, Christoph Paasch wrote:

> Hello Mat,
>
> On 20/10/17 - 16:02:31, Mat Martineau wrote:
>> The sk_buff control buffer is of limited size, and cannot be enlarged
>> without significant impact on systemwide memory use. However, additional
>> per-packet state is needed for some protocols, like Multipath TCP.
>>
>> An optional shared control buffer placed after the normal struct
>> skb_shared_info can accommodate the necessary state without imposing
>> extra memory usage or code changes on normal struct sk_buff
>> users. __alloc_skb will now place a skb_shared_info_ext structure at
>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>> sk_buff continue to use the skb_shinfo() macro to access shared
>> info. skb_shinfo(skb)->is_ext is set if the extended structure is
>> available, and cleared if it is not.
>>
>> pskb_expand_head will preserve the shared control buffer if it is present.
>>
>> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
>> ---
>>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>>  2 files changed, 66 insertions(+), 14 deletions(-)
>
> While digging through your patch, I realized that there could be one issue.
>
> In many places in the code we use sizeof(struct skb_shared_info) to compute
> the overhead (e.g., see tcp_sndbuf_expand).
>
> There are also countless users of (sizeof(struct skb_shared_info) in the
> drivers. They all seem to be in the rx-path, so it should be fine.
>
> But the prevalent use of skb_shared_info throughout the stack seems a bit
> scary to me. My concern here is that the driver assumes that the overhead is
> always sizeof(skb_shared_info) and thus underestimates the size of the skb.
>
> At least, in tcp_sndbuf_expand() the overhead-estimation would be wrong. I
> don't think the consequences will be catastrophic, but it's something to
> think about.
>
>
> What do you think?
>

I saw two typical cases when I looked around existing code:

1. Drivers doing calculations before calling build_skb() on the rx path.

2. Various code making size estimates before creating skbs.


For #1, build_skb() is still creating "normal" skbs with skb_shared_info, 
so their calculations remain valid.

Similarly, code that uses size estimates for #2 goes on to allocate 
"normal" skbs, so their estimates remain valid. Code that allocates 
extended skbs would need another set of extended skb macros to make better 
estimates.

tcp_sndbuf_expand() is a little different, since it changes sk_sndbuf, 
which could impact behavior when longer skbs are in use. In this case, it 
still seems like a fairly coarse estimate (rounding up to a power of two 
and then multiplying). I'm not worried about this use, but I do want to 
understand it better before upstreaming. If it's important to add in the 
extra space for sockets using skb_shared_info_ext, that could be accounted 
for in tcp_sndbuf_expand by adding a flag to tcp_sock.


Anywhere that allocates extended skbs will involve new code, and if that 
code needs to make size estimates it should take the extra overhead into 
account. For existing code that may handle (rather than allocate) skbs 
with extended shared data, these skbs will be truthful about their 
truesize and will be freed correctly.

I still want to review this approach carefully and do additional testing. 
I should run the patch by some network driver authors too.


Thanks,
Mat


>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 03634ec2f918..873910c66df9 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>>   * the end of the header data, ie. at skb->end.
>>   */
>>  struct skb_shared_info {
>> -	__u8		__unused;
>> +	__u8		is_ext:1,
>> +			__unused:7;
>>  	__u8		meta_len;
>>  	__u8		nr_frags;
>>  	__u8		tx_flags;
>> @@ -530,6 +531,24 @@ struct skb_shared_info {
>>  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>>
>>
>> +/* This is an extended version of skb_shared_info, also invariant across
>> + * clones and living at the end of the header data.
>> + */
>> +struct skb_shared_info_ext {
>> +	/* skb_shared_info must be the first member */
>> +	struct skb_shared_info	shinfo;
>> +
>> +	/* This is the shared control buffer. It is similar to sk_buff's
>> +	 * control buffer, but is shared across clones. It must not be
>> +	 * modified when multiple sk_buffs are referencing this structure.
>> +	 */
>> +	char			shcb[48];
>> +};
>> +
>> +#define SKB_SHINFO_EXT_OVERHEAD	\
>> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
>> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>> +
>>  enum {
>>  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>>  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
>> @@ -856,6 +875,7 @@ struct sk_buff {
>>  #define SKB_ALLOC_FCLONE	0x01
>>  #define SKB_ALLOC_RX		0x02
>>  #define SKB_ALLOC_NAPI		0x04
>> +#define SKB_ALLOC_SHINFO_EXT	0x08
>>
>>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>>  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>>
>>  /* Internal */
>>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
>> +#define skb_shinfo_ext(SKB)	\
>> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>>
>>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>>  {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 40717501cbdd..397edd5c0613 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>>   *		instead of head cache and allocate a cloned (child) skb.
>>   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>>   *		allocations in case the data is required for writeback
>> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
>> + *		with an extended shared info struct.
>>   *	@node: numa node to allocate memory on
>>   *
>>   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
>> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  			    int flags, int node)
>>  {
>>  	struct kmem_cache *cache;
>> -	struct skb_shared_info *shinfo;
>>  	struct sk_buff *skb;
>>  	u8 *data;
>> +	unsigned int shinfo_size;
>>  	bool pfmemalloc;
>>
>>  	cache = (flags & SKB_ALLOC_FCLONE)
>> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  	/* We do our best to align skb_shared_info on a separate cache
>>  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>> -	 * Both skb->head and skb_shared_info are cache line aligned.
>> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
>> +	 * cache line aligned.
>>  	 */
>>  	size = SKB_DATA_ALIGN(size);
>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>> +	else
>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>>  	if (!data)
>>  		goto nodata;
>>  	/* kmalloc(size) might give us more room than requested.
>>  	 * Put skb_shared_info exactly at the end of allocated zone,
>>  	 * to allow max possible filling before reallocation.
>>  	 */
>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>> +	size = ksize(data) - shinfo_size;
>>  	prefetchw(data + size);
>>
>>  	/*
>> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  	 */
>>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>>  	/* Account for allocated memory : skb + skb->head */
>> -	skb->truesize = SKB_TRUESIZE(size);
>> +	if (flags & SKB_ALLOC_SHINFO_EXT)
>> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
>> +	else
>> +		skb->truesize = SKB_TRUESIZE(size);
>>  	skb->pfmemalloc = pfmemalloc;
>>  	refcount_set(&skb->users, 1);
>>  	skb->head = data;
>> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  	skb->transport_header = (typeof(skb->transport_header))~0U;
>>
>>  	/* make sure we initialize shinfo sequentially */
>> -	shinfo = skb_shinfo(skb);
>> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>> -	atomic_set(&shinfo->dataref, 1);
>> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
>> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
>> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
>> +		shinfo_ext->shinfo.is_ext = 1;
>> +		memset(&shinfo_ext->shinfo.meta_len, 0,
>> +		       offsetof(struct skb_shared_info, dataref) -
>> +		       offsetof(struct skb_shared_info, meta_len));
>> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
>> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
>> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
>> +	} else {
>> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
>> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>> +		atomic_set(&shinfo->dataref, 1);
>> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
>> +	}
>>
>>  	if (flags & SKB_ALLOC_FCLONE) {
>>  		struct sk_buff_fclones *fclones;
>> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  {
>>  	int i, osize = skb_end_offset(skb);
>>  	int size = osize + nhead + ntail;
>> +	int shinfo_size;
>>  	long off;
>>  	u8 *data;
>>
>> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>
>>  	if (skb_pfmemalloc(skb))
>>  		gfp_mask |= __GFP_MEMALLOC;
>> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
>> -			       gfp_mask, NUMA_NO_NODE, NULL);
>> +	if (skb_shinfo(skb)->is_ext)
>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>> +	else
>> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>>  	if (!data)
>>  		goto nodata;
>> -	size = SKB_WITH_OVERHEAD(ksize(data));
>> +	size = ksize(data) - shinfo_size;
>>
>>  	/* Copy only real data... and, alas, header. This should be
>>  	 * optimized for the cases when header is void.
>> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  	memcpy((struct skb_shared_info *)(data + size),
>>  	       skb_shinfo(skb),
>>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>> +	if (skb_shinfo(skb)->is_ext) {
>> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
>> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
>> +		       &skb_shinfo_ext(skb)->shcb,
>> +		       sizeof(skb_shinfo_ext(skb)->shcb));
>> +	}
>>
>>  	/*
>>  	 * if shinfo is shared we must drop the old head gracefully, but if it
>> --
>> 2.14.2
>>
>> _______________________________________________
>> mptcp mailing list
>> mptcp(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/mptcp
>

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 20:13 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-10-23 20:13 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 12002 bytes --]


On 10/23/2017 12:49 PM, Rao Shoaib wrote:
>
> On 10/20/2017 04:02 PM, Mat Martineau wrote:
>> The sk_buff control buffer is of limited size, and cannot be enlarged
>> without significant impact on systemwide memory use. However, additional
>> per-packet state is needed for some protocols, like Multipath TCP.
>>
>> An optional shared control buffer placed after the normal struct
>> skb_shared_info can accommodate the necessary state without imposing
>> extra memory usage or code changes on normal struct sk_buff
>> users. __alloc_skb will now place a skb_shared_info_ext structure at
>> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
>> sk_buff continue to use the skb_shinfo() macro to access shared
>> info. skb_shinfo(skb)->is_ext is set if the extended structure is
>> available, and cleared if it is not.
>>
>> pskb_expand_head will preserve the shared control buffer if it is 
>> present.
>
> Hi Matt,
>
> I personally think the change looks good. It does have some minor 
> issues that still need to be resolved.
>
> Like Christoph stated, the use of sizeof() is something to think about; 
> for example, the Solarflare driver uses sizeof() on the Tx side (see 
> efx_enqueue_skb_pio). Maybe create a macro, so that the code can be 
> replaced by a call to a macro which looks at the flag and returns 
> the correct size.
>
> There are some issues to look at around connection establishment time and 
> fallback to regular TCP. What if the other side does not support MPTCP? 
> Will we be able to switch to an allocation that does not allocate 
> these extra 48 bytes? BTW we really do not need this change for MPTCP; 
> overloading current fields just works fine. However, the enhancement 
> does open other possibilities.
>
> FYI. I am working on restructuring MPTCP code so it's not intrusive to 
> main TCP code. I will probably try a few other techniques but can use 
> this as well. I will post the changes at the appropriate time.
>
> Shoaib

BTW, do handheld and other smaller devices that are more likely to use 
MPTCP use shared_skb, or do they optimize and not use it?

Shoaib

>
>>
>> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
>> ---
>>   include/linux/skbuff.h | 24 +++++++++++++++++++++-
>>   net/core/skbuff.c      | 56 
>> ++++++++++++++++++++++++++++++++++++++------------
>>   2 files changed, 66 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 03634ec2f918..873910c66df9 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, 
>> struct sk_buff *skb,
>>    * the end of the header data, ie. at skb->end.
>>    */
>>   struct skb_shared_info {
>> -    __u8        __unused;
>> +    __u8        is_ext:1,
>> +            __unused:7;
>>       __u8        meta_len;
>>       __u8        nr_frags;
>>       __u8        tx_flags;
>> @@ -530,6 +531,24 @@ struct skb_shared_info {
>>   #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>>     +/* This is an extended version of skb_shared_info, also 
>> invariant across
>> + * clones and living at the end of the header data.
>> + */
>> +struct skb_shared_info_ext {
>> +    /* skb_shared_info must be the first member */
>> +    struct skb_shared_info    shinfo;
>> +
>> +    /* This is the shared control buffer. It is similar to sk_buff's
>> +     * control buffer, but is shared across clones. It must not be
>> +     * modified when multiple sk_buffs are referencing this structure.
>> +     */
>> +    char            shcb[48];
>> +};
>> +
>> +#define SKB_SHINFO_EXT_OVERHEAD    \
>> +    (SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
>> +     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>> +
>>   enum {
>>       SKB_FCLONE_UNAVAILABLE,    /* skb has no fclone (from head_cache) */
>>       SKB_FCLONE_ORIG,    /* orig skb (from fclone_cache) */
>> @@ -856,6 +875,7 @@ struct sk_buff {
>>   #define SKB_ALLOC_FCLONE    0x01
>>   #define SKB_ALLOC_RX        0x02
>>   #define SKB_ALLOC_NAPI        0x04
>> +#define SKB_ALLOC_SHINFO_EXT    0x08
>>     /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>>   static inline bool skb_pfmemalloc(const struct sk_buff *skb)
>> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>>     /* Internal */
>>   #define skb_shinfo(SKB)    ((struct skb_shared_info *)(skb_end_pointer(SKB)))
>> +#define skb_shinfo_ext(SKB)    \
>> +    ((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>>     static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>>   {
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 40717501cbdd..397edd5c0613 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>>    *        instead of head cache and allocate a cloned (child) skb.
>>    *        If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>>    *        allocations in case the data is required for writeback
>> + *        If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
>> + *        with an extended shared info struct.
>>    *    @node: numa node to allocate memory on
>>    *
>>    *    Allocate a new &sk_buff. The returned buffer has no headroom and a
>> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>                   int flags, int node)
>>   {
>>       struct kmem_cache *cache;
>> -    struct skb_shared_info *shinfo;
>>       struct sk_buff *skb;
>>       u8 *data;
>> +    unsigned int shinfo_size;
>>       bool pfmemalloc;
>>         cache = (flags & SKB_ALLOC_FCLONE)
>> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>       /* We do our best to align skb_shared_info on a separate cache
>>        * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>>        * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>> -     * Both skb->head and skb_shared_info are cache line aligned.
>> +     * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
>> +     * cache line aligned.
>>        */
>>       size = SKB_DATA_ALIGN(size);
>> -    size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> -    data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>> +    if (flags & SKB_ALLOC_SHINFO_EXT)
>> +        shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>> +    else
>> +        shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +    data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>>       if (!data)
>>           goto nodata;
>>       /* kmalloc(size) might give us more room than requested.
>>        * Put skb_shared_info exactly at the end of allocated zone,
>>        * to allow max possible filling before reallocation.
>>        */
>> -    size = SKB_WITH_OVERHEAD(ksize(data));
>> +    size = ksize(data) - shinfo_size;
>>       prefetchw(data + size);
>>         /*
>> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>        */
>>       memset(skb, 0, offsetof(struct sk_buff, tail));
>>       /* Account for allocated memory : skb + skb->head */
>> -    skb->truesize = SKB_TRUESIZE(size);
>> +    if (flags & SKB_ALLOC_SHINFO_EXT)
>> +        skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
>> +    else
>> +        skb->truesize = SKB_TRUESIZE(size);
>>       skb->pfmemalloc = pfmemalloc;
>>       refcount_set(&skb->users, 1);
>>       skb->head = data;
>> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>       skb->transport_header = (typeof(skb->transport_header))~0U;
>>         /* make sure we initialize shinfo sequentially */
>> -    shinfo = skb_shinfo(skb);
>> -    memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>> -    atomic_set(&shinfo->dataref, 1);
>> -    kmemcheck_annotate_variable(shinfo->destructor_arg);
>> +    if (flags & SKB_ALLOC_SHINFO_EXT) {
>> +        struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
>> +        shinfo_ext->shinfo.is_ext = 1;
>> +        memset(&shinfo_ext->shinfo.meta_len, 0,
>> +               offsetof(struct skb_shared_info, dataref) -
>> +               offsetof(struct skb_shared_info, meta_len));
>> +        atomic_set(&shinfo_ext->shinfo.dataref, 1);
>> +        kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
>> +        memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
>> +    } else {
>> +        struct skb_shared_info *shinfo = skb_shinfo(skb);
>> +        memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
>> +        atomic_set(&shinfo->dataref, 1);
>> +        kmemcheck_annotate_variable(shinfo->destructor_arg);
>> +    }
>>         if (flags & SKB_ALLOC_FCLONE) {
>>           struct sk_buff_fclones *fclones;
>> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>   {
>>       int i, osize = skb_end_offset(skb);
>>       int size = osize + nhead + ntail;
>> +    int shinfo_size;
>>       long off;
>>       u8 *data;
>>   @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>         if (skb_pfmemalloc(skb))
>>           gfp_mask |= __GFP_MEMALLOC;
>> -    data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
>> -                   gfp_mask, NUMA_NO_NODE, NULL);
>> +    if (skb_shinfo(skb)->is_ext)
>> +        shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
>> +    else
>> +        shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +    data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>>       if (!data)
>>           goto nodata;
>> -    size = SKB_WITH_OVERHEAD(ksize(data));
>> +    size = ksize(data) - shinfo_size;
>>         /* Copy only real data... and, alas, header. This should be
>>        * optimized for the cases when header is void.
>> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>       memcpy((struct skb_shared_info *)(data + size),
>>              skb_shinfo(skb),
>>              offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
>> +    if (skb_shinfo(skb)->is_ext) {
>> +        int offset = offsetof(struct skb_shared_info_ext, shcb);
>> +        memcpy((struct skb_shared_info_ext *)(data + size + offset),
>> +               &skb_shinfo_ext(skb)->shcb,
>> +               sizeof(skb_shinfo_ext(skb)->shcb));
>> +    }
>>         /*
>>        * if shinfo is shared we must drop the old head gracefully, but if it
>
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 20:10 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-10-23 20:10 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 10959 bytes --]

Hello,

On 23/10/17 - 12:49:53, Rao Shoaib wrote:
> 
> On 10/20/2017 04:02 PM, Mat Martineau wrote:
> > The sk_buff control buffer is of limited size, and cannot be enlarged
> > without significant impact on systemwide memory use. However, additional
> > per-packet state is needed for some protocols, like Multipath TCP.
> > 
> > An optional shared control buffer placed after the normal struct
> > skb_shared_info can accommodate the necessary state without imposing
> > extra memory usage or code changes on normal struct sk_buff
> > users. __alloc_skb will now place a skb_shared_info_ext structure at
> > skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> > sk_buff continue to use the skb_shinfo() macro to access shared
> > info. skb_shinfo(skb)->is_ext is set if the extended structure is
> > available, and cleared if it is not.
> > 
> > pskb_expand_head will preserve the shared control buffer if it is present.
> 
> Hi Mat,
> 
> I personally think the change looks good. It does have some minor issues
> that still need to be resolved.
> 
> Like Christoph stated, the use of sizeof() is something to think about; for
> example, the Solarflare driver uses sizeof() on the Tx side. See
> efx_enqueue_skb_pio. Maybe create a macro that looks at the flag and returns
> the correct size, so the open-coded sizeof() calls can be replaced by calls to that macro.
> 
> There are some issues to look at around connection establishment time and
> fallback to regular TCP. What if the other side does not support MPTCP?

It should be fine. What happens in that case is that the initiator realizes
that the other side does not support MPTCP and at that point simply stops
adding the extra bytes to the SKBs it is pushing down the connection.

> Will we be able to switch to an allocation that does not allocate these
> extra 48 bytes? BTW, we do not really need this change for MPTCP;
> overloading current fields just works fine. However, the enhancement does
> open other possibilities.

You were able to reduce the cb-size back down to 48 bytes with overloading
fields? I would love to see the patch :)
I had to bump cb up to 80 bytes in the current mptcp_trunk :/


Christoph

> FYI. I am working on restructuring MPTCP code so it's not intrusive to main
> TCP code. I will probably try a few other techniques but can use this as
> well. I will post the changes at the appropriate time.
> 
> Shoaib
> 
> > 
> > Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> > ---
> >   include/linux/skbuff.h | 24 +++++++++++++++++++++-
> >   net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
> >   2 files changed, 66 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 03634ec2f918..873910c66df9 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
> >    * the end of the header data, ie. at skb->end.
> >    */
> >   struct skb_shared_info {
> > -	__u8		__unused;
> > +	__u8		is_ext:1,
> > +			__unused:7;
> >   	__u8		meta_len;
> >   	__u8		nr_frags;
> >   	__u8		tx_flags;
> > @@ -530,6 +531,24 @@ struct skb_shared_info {
> >   #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
> > +/* This is an extended version of skb_shared_info, also invariant across
> > + * clones and living at the end of the header data.
> > + */
> > +struct skb_shared_info_ext {
> > +	/* skb_shared_info must be the first member */
> > +	struct skb_shared_info	shinfo;
> > +
> > +	/* This is the shared control buffer. It is similar to sk_buff's
> > +	 * control buffer, but is shared across clones. It must not be
> > +	 * modified when multiple sk_buffs are referencing this structure.
> > +	 */
> > +	char			shcb[48];
> > +};
> > +
> > +#define SKB_SHINFO_EXT_OVERHEAD	\
> > +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> > +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> > +
> >   enum {
> >   	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
> >   	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> > @@ -856,6 +875,7 @@ struct sk_buff {
> >   #define SKB_ALLOC_FCLONE	0x01
> >   #define SKB_ALLOC_RX		0x02
> >   #define SKB_ALLOC_NAPI		0x04
> > +#define SKB_ALLOC_SHINFO_EXT	0x08
> >   /* Returns true if the skb was allocated from PFMEMALLOC reserves */
> >   static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> > @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
> >   /* Internal */
> >   #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> > +#define skb_shinfo_ext(SKB)	\
> > +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
> >   static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
> >   {
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 40717501cbdd..397edd5c0613 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
> >    *		instead of head cache and allocate a cloned (child) skb.
> >    *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
> >    *		allocations in case the data is required for writeback
> > + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> > + *		with an extended shared info struct.
> >    *	@node: numa node to allocate memory on
> >    *
> >    *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> > @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >   			    int flags, int node)
> >   {
> >   	struct kmem_cache *cache;
> > -	struct skb_shared_info *shinfo;
> >   	struct sk_buff *skb;
> >   	u8 *data;
> > +	unsigned int shinfo_size;
> >   	bool pfmemalloc;
> >   	cache = (flags & SKB_ALLOC_FCLONE)
> > @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >   	/* We do our best to align skb_shared_info on a separate cache
> >   	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
> >   	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> > -	 * Both skb->head and skb_shared_info are cache line aligned.
> > +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> > +	 * cache line aligned.
> >   	 */
> >   	size = SKB_DATA_ALIGN(size);
> > -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > +	else
> > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
> >   	if (!data)
> >   		goto nodata;
> >   	/* kmalloc(size) might give us more room than requested.
> >   	 * Put skb_shared_info exactly at the end of allocated zone,
> >   	 * to allow max possible filling before reallocation.
> >   	 */
> > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > +	size = ksize(data) - shinfo_size;
> >   	prefetchw(data + size);
> >   	/*
> > @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >   	 */
> >   	memset(skb, 0, offsetof(struct sk_buff, tail));
> >   	/* Account for allocated memory : skb + skb->head */
> > -	skb->truesize = SKB_TRUESIZE(size);
> > +	if (flags & SKB_ALLOC_SHINFO_EXT)
> > +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> > +	else
> > +		skb->truesize = SKB_TRUESIZE(size);
> >   	skb->pfmemalloc = pfmemalloc;
> >   	refcount_set(&skb->users, 1);
> >   	skb->head = data;
> > @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >   	skb->transport_header = (typeof(skb->transport_header))~0U;
> >   	/* make sure we initialize shinfo sequentially */
> > -	shinfo = skb_shinfo(skb);
> > -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > -	atomic_set(&shinfo->dataref, 1);
> > -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> > +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> > +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> > +		shinfo_ext->shinfo.is_ext = 1;
> > +		memset(&shinfo_ext->shinfo.meta_len, 0,
> > +		       offsetof(struct skb_shared_info, dataref) -
> > +		       offsetof(struct skb_shared_info, meta_len));
> > +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> > +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> > +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> > +	} else {
> > +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > +		atomic_set(&shinfo->dataref, 1);
> > +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> > +	}
> >   	if (flags & SKB_ALLOC_FCLONE) {
> >   		struct sk_buff_fclones *fclones;
> > @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> >   {
> >   	int i, osize = skb_end_offset(skb);
> >   	int size = osize + nhead + ntail;
> > +	int shinfo_size;
> >   	long off;
> >   	u8 *data;
> > @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> >   	if (skb_pfmemalloc(skb))
> >   		gfp_mask |= __GFP_MEMALLOC;
> > -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> > -			       gfp_mask, NUMA_NO_NODE, NULL);
> > +	if (skb_shinfo(skb)->is_ext)
> > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> > +	else
> > +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
> >   	if (!data)
> >   		goto nodata;
> > -	size = SKB_WITH_OVERHEAD(ksize(data));
> > +	size = ksize(data) - shinfo_size;
> >   	/* Copy only real data... and, alas, header. This should be
> >   	 * optimized for the cases when header is void.
> > @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> >   	memcpy((struct skb_shared_info *)(data + size),
> >   	       skb_shinfo(skb),
> >   	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> > +	if (skb_shinfo(skb)->is_ext) {
> > +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> > +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> > +		       &skb_shinfo_ext(skb)->shcb,
> > +		       sizeof(skb_shinfo_ext(skb)->shcb));
> > +	}
> >   	/*
> >   	 * if shinfo is shared we must drop the old head gracefully, but if it
> 


* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 19:49 Rao Shoaib
  0 siblings, 0 replies; 36+ messages in thread
From: Rao Shoaib @ 2017-10-23 19:49 UTC (permalink / raw)
  To: mptcp



On 10/20/2017 04:02 PM, Mat Martineau wrote:
> The sk_buff control buffer is of limited size, and cannot be enlarged
> without significant impact on systemwide memory use. However, additional
> per-packet state is needed for some protocols, like Multipath TCP.
>
> An optional shared control buffer placed after the normal struct
> skb_shared_info can accommodate the necessary state without imposing
> extra memory usage or code changes on normal struct sk_buff
> users. __alloc_skb will now place a skb_shared_info_ext structure at
> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> sk_buff continue to use the skb_shinfo() macro to access shared
> info. skb_shinfo(skb)->is_ext is set if the extended structure is
> available, and cleared if it is not.
>
> pskb_expand_head will preserve the shared control buffer if it is present.

Hi Mat,

I personally think the change looks good. It does have some minor issues 
that still need to be resolved.

Like Christoph stated, the use of sizeof() is something to think about; 
for example, the Solarflare driver uses sizeof() on the Tx side. See 
efx_enqueue_skb_pio. Maybe create a macro that looks at the flag and 
returns the correct size, so the open-coded sizeof() calls can be 
replaced by calls to that macro.

There are some issues to look at around connection establishment time and 
fallback to regular TCP. What if the other side does not support MPTCP? 
Will we be able to switch to an allocation that does not allocate these 
extra 48 bytes? BTW, we do not really need this change for MPTCP; 
overloading current fields just works fine. However, the enhancement does 
open other possibilities.

FYI. I am working on restructuring MPTCP code so it's not intrusive to 
main TCP code. I will probably try a few other techniques but can use 
this as well. I will post the changes at the appropriate time.

Shoaib

>
> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> ---
>   include/linux/skbuff.h | 24 +++++++++++++++++++++-
>   net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>   2 files changed, 66 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 03634ec2f918..873910c66df9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>    * the end of the header data, ie. at skb->end.
>    */
>   struct skb_shared_info {
> -	__u8		__unused;
> +	__u8		is_ext:1,
> +			__unused:7;
>   	__u8		meta_len;
>   	__u8		nr_frags;
>   	__u8		tx_flags;
> @@ -530,6 +531,24 @@ struct skb_shared_info {
>   #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>   
>   
> +/* This is an extended version of skb_shared_info, also invariant across
> + * clones and living at the end of the header data.
> + */
> +struct skb_shared_info_ext {
> +	/* skb_shared_info must be the first member */
> +	struct skb_shared_info	shinfo;
> +
> +	/* This is the shared control buffer. It is similar to sk_buff's
> +	 * control buffer, but is shared across clones. It must not be
> +	 * modified when multiple sk_buffs are referencing this structure.
> +	 */
> +	char			shcb[48];
> +};
> +
> +#define SKB_SHINFO_EXT_OVERHEAD	\
> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>   enum {
>   	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>   	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> @@ -856,6 +875,7 @@ struct sk_buff {
>   #define SKB_ALLOC_FCLONE	0x01
>   #define SKB_ALLOC_RX		0x02
>   #define SKB_ALLOC_NAPI		0x04
> +#define SKB_ALLOC_SHINFO_EXT	0x08
>   
>   /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>   static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>   
>   /* Internal */
>   #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> +#define skb_shinfo_ext(SKB)	\
> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>   
>   static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>   {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 40717501cbdd..397edd5c0613 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>    *		instead of head cache and allocate a cloned (child) skb.
>    *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>    *		allocations in case the data is required for writeback
> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> + *		with an extended shared info struct.
>    *	@node: numa node to allocate memory on
>    *
>    *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>   			    int flags, int node)
>   {
>   	struct kmem_cache *cache;
> -	struct skb_shared_info *shinfo;
>   	struct sk_buff *skb;
>   	u8 *data;
> +	unsigned int shinfo_size;
>   	bool pfmemalloc;
>   
>   	cache = (flags & SKB_ALLOC_FCLONE)
> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>   	/* We do our best to align skb_shared_info on a separate cache
>   	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>   	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> -	 * Both skb->head and skb_shared_info are cache line aligned.
> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> +	 * cache line aligned.
>   	 */
>   	size = SKB_DATA_ALIGN(size);
> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>   	if (!data)
>   		goto nodata;
>   	/* kmalloc(size) might give us more room than requested.
>   	 * Put skb_shared_info exactly at the end of allocated zone,
>   	 * to allow max possible filling before reallocation.
>   	 */
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>   	prefetchw(data + size);
>   
>   	/*
> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>   	 */
>   	memset(skb, 0, offsetof(struct sk_buff, tail));
>   	/* Account for allocated memory : skb + skb->head */
> -	skb->truesize = SKB_TRUESIZE(size);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> +	else
> +		skb->truesize = SKB_TRUESIZE(size);
>   	skb->pfmemalloc = pfmemalloc;
>   	refcount_set(&skb->users, 1);
>   	skb->head = data;
> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>   	skb->transport_header = (typeof(skb->transport_header))~0U;
>   
>   	/* make sure we initialize shinfo sequentially */
> -	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -	atomic_set(&shinfo->dataref, 1);
> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> +		shinfo_ext->shinfo.is_ext = 1;
> +		memset(&shinfo_ext->shinfo.meta_len, 0,
> +		       offsetof(struct skb_shared_info, dataref) -
> +		       offsetof(struct skb_shared_info, meta_len));
> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> +	} else {
> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +		atomic_set(&shinfo->dataref, 1);
> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	}
>   
>   	if (flags & SKB_ALLOC_FCLONE) {
>   		struct sk_buff_fclones *fclones;
> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>   {
>   	int i, osize = skb_end_offset(skb);
>   	int size = osize + nhead + ntail;
> +	int shinfo_size;
>   	long off;
>   	u8 *data;
>   
> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>   
>   	if (skb_pfmemalloc(skb))
>   		gfp_mask |= __GFP_MEMALLOC;
> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> -			       gfp_mask, NUMA_NO_NODE, NULL);
> +	if (skb_shinfo(skb)->is_ext)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>   	if (!data)
>   		goto nodata;
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>   
>   	/* Copy only real data... and, alas, header. This should be
>   	 * optimized for the cases when header is void.
> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>   	memcpy((struct skb_shared_info *)(data + size),
>   	       skb_shinfo(skb),
>   	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> +	if (skb_shinfo(skb)->is_ext) {
> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> +		       &skb_shinfo_ext(skb)->shcb,
> +		       sizeof(skb_shinfo_ext(skb)->shcb));
> +	}
>   
>   	/*
>   	 * if shinfo is shared we must drop the old head gracefully, but if it



* Re: [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-23 16:37 Christoph Paasch
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Paasch @ 2017-10-23 16:37 UTC (permalink / raw)
  To: mptcp


Hello Mat,

On 20/10/17 - 16:02:31, Mat Martineau wrote:
> The sk_buff control buffer is of limited size, and cannot be enlarged
> without significant impact on systemwide memory use. However, additional
> per-packet state is needed for some protocols, like Multipath TCP.
> 
> An optional shared control buffer placed after the normal struct
> skb_shared_info can accommodate the necessary state without imposing
> extra memory usage or code changes on normal struct sk_buff
> users. __alloc_skb will now place a skb_shared_info_ext structure at
> skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
> sk_buff continue to use the skb_shinfo() macro to access shared
> info. skb_shinfo(skb)->is_ext is set if the extended structure is
> available, and cleared if it is not.
> 
> pskb_expand_head will preserve the shared control buffer if it is present.
> 
> Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
> ---
>  include/linux/skbuff.h | 24 +++++++++++++++++++++-
>  net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
>  2 files changed, 66 insertions(+), 14 deletions(-)

While digging through your patch, I realized that there could be one issue.

In many places in the code we use sizeof(struct skb_shared_info) to compute
the overhead (e.g., see tcp_sndbuf_expand).

There are also countless users of sizeof(struct skb_shared_info) in the
drivers. They all seem to be in the rx-path, so it should be fine.

But the prevalent use of skb_shared_info throughout the stack seems a bit
scary to me. My concern here is that the driver assumes that the overhead is
always sizeof(skb_shared_info) and thus underestimates the size of the skb.

At least, in tcp_sndbuf_expand() the overhead-estimation would be wrong. I
don't think the consequences will be catastrophic, but it's something to
think about.


What do you think?


Christoph

> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 03634ec2f918..873910c66df9 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>   * the end of the header data, ie. at skb->end.
>   */
>  struct skb_shared_info {
> -	__u8		__unused;
> +	__u8		is_ext:1,
> +			__unused:7;
>  	__u8		meta_len;
>  	__u8		nr_frags;
>  	__u8		tx_flags;
> @@ -530,6 +531,24 @@ struct skb_shared_info {
>  #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
>  
>  
> +/* This is an extended version of skb_shared_info, also invariant across
> + * clones and living at the end of the header data.
> + */
> +struct skb_shared_info_ext {
> +	/* skb_shared_info must be the first member */
> +	struct skb_shared_info	shinfo;
> +
> +	/* This is the shared control buffer. It is similar to sk_buff's
> +	 * control buffer, but is shared across clones. It must not be
> +	 * modified when multiple sk_buffs are referencing this structure.
> +	 */
> +	char			shcb[48];
> +};
> +
> +#define SKB_SHINFO_EXT_OVERHEAD	\
> +	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
> +	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +
>  enum {
>  	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
>  	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
> @@ -856,6 +875,7 @@ struct sk_buff {
>  #define SKB_ALLOC_FCLONE	0x01
>  #define SKB_ALLOC_RX		0x02
>  #define SKB_ALLOC_NAPI		0x04
> +#define SKB_ALLOC_SHINFO_EXT	0x08
>  
>  /* Returns true if the skb was allocated from PFMEMALLOC reserves */
>  static inline bool skb_pfmemalloc(const struct sk_buff *skb)
> @@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>  
>  /* Internal */
>  #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
> +#define skb_shinfo_ext(SKB)	\
> +	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
>  
>  static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
>  {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 40717501cbdd..397edd5c0613 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
>   *		instead of head cache and allocate a cloned (child) skb.
>   *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
>   *		allocations in case the data is required for writeback
> + *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
> + *		with an extended shared info struct.
>   *	@node: numa node to allocate memory on
>   *
>   *	Allocate a new &sk_buff. The returned buffer has no headroom and a
> @@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  			    int flags, int node)
>  {
>  	struct kmem_cache *cache;
> -	struct skb_shared_info *shinfo;
>  	struct sk_buff *skb;
>  	u8 *data;
> +	unsigned int shinfo_size;
>  	bool pfmemalloc;
>  
>  	cache = (flags & SKB_ALLOC_FCLONE)
> @@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	/* We do our best to align skb_shared_info on a separate cache
>  	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> -	 * Both skb->head and skb_shared_info are cache line aligned.
> +	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
> +	 * cache line aligned.
>  	 */
>  	size = SKB_DATA_ALIGN(size);
> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> -	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
>  	if (!data)
>  		goto nodata;
>  	/* kmalloc(size) might give us more room than requested.
>  	 * Put skb_shared_info exactly at the end of allocated zone,
>  	 * to allow max possible filling before reallocation.
>  	 */
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>  	prefetchw(data + size);
>  
>  	/*
> @@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	 */
>  	memset(skb, 0, offsetof(struct sk_buff, tail));
>  	/* Account for allocated memory : skb + skb->head */
> -	skb->truesize = SKB_TRUESIZE(size);
> +	if (flags & SKB_ALLOC_SHINFO_EXT)
> +		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
> +	else
> +		skb->truesize = SKB_TRUESIZE(size);
>  	skb->pfmemalloc = pfmemalloc;
>  	refcount_set(&skb->users, 1);
>  	skb->head = data;
> @@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	skb->transport_header = (typeof(skb->transport_header))~0U;
>  
>  	/* make sure we initialize shinfo sequentially */
> -	shinfo = skb_shinfo(skb);
> -	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> -	atomic_set(&shinfo->dataref, 1);
> -	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	if (flags & SKB_ALLOC_SHINFO_EXT) {
> +		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
> +		shinfo_ext->shinfo.is_ext = 1;
> +		memset(&shinfo_ext->shinfo.meta_len, 0,
> +		       offsetof(struct skb_shared_info, dataref) -
> +		       offsetof(struct skb_shared_info, meta_len));
> +		atomic_set(&shinfo_ext->shinfo.dataref, 1);
> +		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
> +		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
> +	} else {
> +		struct skb_shared_info *shinfo = skb_shinfo(skb);
> +		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +		atomic_set(&shinfo->dataref, 1);
> +		kmemcheck_annotate_variable(shinfo->destructor_arg);
> +	}
>  
>  	if (flags & SKB_ALLOC_FCLONE) {
>  		struct sk_buff_fclones *fclones;
> @@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  {
>  	int i, osize = skb_end_offset(skb);
>  	int size = osize + nhead + ntail;
> +	int shinfo_size;
>  	long off;
>  	u8 *data;
>  
> @@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  
>  	if (skb_pfmemalloc(skb))
>  		gfp_mask |= __GFP_MEMALLOC;
> -	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
> -			       gfp_mask, NUMA_NO_NODE, NULL);
> +	if (skb_shinfo(skb)->is_ext)
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
> +	else
> +		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
>  	if (!data)
>  		goto nodata;
> -	size = SKB_WITH_OVERHEAD(ksize(data));
> +	size = ksize(data) - shinfo_size;
>  
>  	/* Copy only real data... and, alas, header. This should be
>  	 * optimized for the cases when header is void.
> @@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  	memcpy((struct skb_shared_info *)(data + size),
>  	       skb_shinfo(skb),
>  	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
> +	if (skb_shinfo(skb)->is_ext) {
> +		int offset = offsetof(struct skb_shared_info_ext, shcb);
> +		memcpy((struct skb_shared_info_ext *)(data + size + offset),
> +		       &skb_shinfo_ext(skb)->shcb,
> +		       sizeof(skb_shinfo_ext(skb)->shcb));
> +	}
>  
>  	/*
>  	 * if shinfo is shared we must drop the old head gracefully, but if it
> -- 
> 2.14.2
> 
> _______________________________________________
> mptcp mailing list
> mptcp(a)lists.01.org
> https://lists.01.org/mailman/listinfo/mptcp


* [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer
@ 2017-10-20 23:02 Mat Martineau
  0 siblings, 0 replies; 36+ messages in thread
From: Mat Martineau @ 2017-10-20 23:02 UTC (permalink / raw)
  To: mptcp


The sk_buff control buffer is of limited size, and cannot be enlarged
without significant impact on systemwide memory use. However, additional
per-packet state is needed for some protocols, like Multipath TCP.

An optional shared control buffer placed after the normal struct
skb_shared_info can accommodate the necessary state without imposing
extra memory usage or code changes on normal struct sk_buff
users. __alloc_skb will now place a skb_shared_info_ext structure at
skb->end when given the SKB_ALLOC_SHINFO_EXT flag. Most users of struct
sk_buff continue to use the skb_shinfo() macro to access shared
info. skb_shinfo(skb)->is_ext is set if the extended structure is
available, and cleared if it is not.

pskb_expand_head will preserve the shared control buffer if it is present.

Signed-off-by: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
---
 include/linux/skbuff.h | 24 +++++++++++++++++++++-
 net/core/skbuff.c      | 56 ++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03634ec2f918..873910c66df9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -489,7 +489,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	__u8		__unused;
+	__u8		is_ext:1,
+			__unused:7;
 	__u8		meta_len;
 	__u8		nr_frags;
 	__u8		tx_flags;
@@ -530,6 +531,24 @@ struct skb_shared_info {
 #define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
 
 
+/* This is an extended version of skb_shared_info, also invariant across
+ * clones and living at the end of the header data.
+ */
+struct skb_shared_info_ext {
+	/* skb_shared_info must be the first member */
+	struct skb_shared_info	shinfo;
+
+	/* This is the shared control buffer. It is similar to sk_buff's
+	 * control buffer, but is shared across clones. It must not be
+	 * modified when multiple sk_buffs are referencing this structure.
+	 */
+	char			shcb[48];
+};
+
+#define SKB_SHINFO_EXT_OVERHEAD	\
+	(SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext)) - \
+	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 enum {
 	SKB_FCLONE_UNAVAILABLE,	/* skb has no fclone (from head_cache) */
 	SKB_FCLONE_ORIG,	/* orig skb (from fclone_cache) */
@@ -856,6 +875,7 @@ struct sk_buff {
 #define SKB_ALLOC_FCLONE	0x01
 #define SKB_ALLOC_RX		0x02
 #define SKB_ALLOC_NAPI		0x04
+#define SKB_ALLOC_SHINFO_EXT	0x08
 
 /* Returns true if the skb was allocated from PFMEMALLOC reserves */
 static inline bool skb_pfmemalloc(const struct sk_buff *skb)
@@ -1271,6 +1291,8 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
 
 /* Internal */
 #define skb_shinfo(SKB)	((struct skb_shared_info *)(skb_end_pointer(SKB)))
+#define skb_shinfo_ext(SKB)	\
+	((struct skb_shared_info_ext *)(skb_end_pointer(SKB)))
 
 static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 40717501cbdd..397edd5c0613 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -166,6 +166,8 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
  *		instead of head cache and allocate a cloned (child) skb.
  *		If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
  *		allocations in case the data is required for writeback
+ *		If SKB_ALLOC_SHINFO_EXT is set, the skb will be allocated
+ *		with an extended shared info struct.
  *	@node: numa node to allocate memory on
  *
  *	Allocate a new &sk_buff. The returned buffer has no headroom and a
@@ -179,9 +181,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			    int flags, int node)
 {
 	struct kmem_cache *cache;
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
+	unsigned int shinfo_size;
 	bool pfmemalloc;
 
 	cache = (flags & SKB_ALLOC_FCLONE)
@@ -199,18 +201,22 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	/* We do our best to align skb_shared_info on a separate cache
 	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
 	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
-	 * Both skb->head and skb_shared_info are cache line aligned.
+	 * Both skb->head and skb_shared_info (or skb_shared_info_ext) are
+	 * cache line aligned.
 	 */
 	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
+	if (flags & SKB_ALLOC_SHINFO_EXT)
+		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
+	else
+		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	data = kmalloc_reserve(size + shinfo_size, gfp_mask, node, &pfmemalloc);
 	if (!data)
 		goto nodata;
 	/* kmalloc(size) might give us more room than requested.
 	 * Put skb_shared_info exactly at the end of allocated zone,
 	 * to allow max possible filling before reallocation.
 	 */
-	size = SKB_WITH_OVERHEAD(ksize(data));
+	size = ksize(data) - shinfo_size;
 	prefetchw(data + size);
 
 	/*
@@ -220,7 +226,10 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	/* Account for allocated memory : skb + skb->head */
-	skb->truesize = SKB_TRUESIZE(size);
+	if (flags & SKB_ALLOC_SHINFO_EXT)
+		skb->truesize = SKB_TRUESIZE(size) + SKB_SHINFO_EXT_OVERHEAD;
+	else
+		skb->truesize = SKB_TRUESIZE(size);
 	skb->pfmemalloc = pfmemalloc;
 	refcount_set(&skb->users, 1);
 	skb->head = data;
@@ -231,10 +240,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->transport_header = (typeof(skb->transport_header))~0U;
 
 	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-	kmemcheck_annotate_variable(shinfo->destructor_arg);
+	if (flags & SKB_ALLOC_SHINFO_EXT) {
+		struct skb_shared_info_ext *shinfo_ext = skb_shinfo_ext(skb);
+		shinfo_ext->shinfo.is_ext = 1;
+		memset(&shinfo_ext->shinfo.meta_len, 0,
+		       offsetof(struct skb_shared_info, dataref) -
+		       offsetof(struct skb_shared_info, meta_len));
+		atomic_set(&shinfo_ext->shinfo.dataref, 1);
+		kmemcheck_annotate_variable(shinfo_ext->shinfo.destructor_arg);
+		memset(&shinfo_ext->shcb, 0, sizeof(shinfo_ext->shcb));
+	} else {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+		atomic_set(&shinfo->dataref, 1);
+		kmemcheck_annotate_variable(shinfo->destructor_arg);
+	}
 
 	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff_fclones *fclones;
@@ -1443,6 +1463,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 {
 	int i, osize = skb_end_offset(skb);
 	int size = osize + nhead + ntail;
+	int shinfo_size;
 	long off;
 	u8 *data;
 
@@ -1454,11 +1475,14 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 
 	if (skb_pfmemalloc(skb))
 		gfp_mask |= __GFP_MEMALLOC;
-	data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
-			       gfp_mask, NUMA_NO_NODE, NULL);
+	if (skb_shinfo(skb)->is_ext)
+		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info_ext));
+	else
+		shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	data = kmalloc_reserve(size + shinfo_size, gfp_mask, NUMA_NO_NODE, NULL);
 	if (!data)
 		goto nodata;
-	size = SKB_WITH_OVERHEAD(ksize(data));
+	size = ksize(data) - shinfo_size;
 
 	/* Copy only real data... and, alas, header. This should be
 	 * optimized for the cases when header is void.
@@ -1468,6 +1492,12 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	memcpy((struct skb_shared_info *)(data + size),
 	       skb_shinfo(skb),
 	       offsetof(struct skb_shared_info, frags[skb_shinfo(skb)->nr_frags]));
+	if (skb_shinfo(skb)->is_ext) {
+		int offset = offsetof(struct skb_shared_info_ext, shcb);
+		memcpy((struct skb_shared_info_ext *)(data + size + offset),
+		       &skb_shinfo_ext(skb)->shcb,
+		       sizeof(skb_shinfo_ext(skb)->shcb));
+	}
 
 	/*
 	 * if shinfo is shared we must drop the old head gracefully, but if it
-- 
2.14.2



Thread overview: 36+ messages
2017-11-09 16:26 [MPTCP] [PATCH 1/2] skbuff: Add shared control buffer Mat Martineau
  -- strict thread matches above, loose matches on Subject: below --
2017-11-13  6:47 cpaasch
2017-11-10  0:31 Mat Martineau
2017-11-09  7:56 cpaasch
2017-11-09  7:51 cpaasch
2017-11-09  4:48 cpaasch
2017-11-09  4:13 Christoph Paasch
2017-11-08 21:02 Christoph Paasch
2017-11-08 20:41 Rao Shoaib
2017-11-08  0:25 Christoph Paasch
2017-11-07 23:35 Rao Shoaib
2017-11-07 23:23 Rao Shoaib
2017-11-07 21:15 Christoph Paasch
2017-11-07 17:13 Rao Shoaib
2017-11-07  4:09 Christoph Paasch
2017-11-07  3:16 Rao Shoaib
2017-11-07  2:46 Rao Shoaib
2017-11-06 22:24 Christoph Paasch
2017-11-06  2:45 Rao Shoaib
2017-11-03  5:10 Christoph Paasch
2017-11-02 21:41 Mat Martineau
2017-10-31 21:58 Mat Martineau
2017-10-31  4:17 Christoph Paasch
2017-10-30 22:44 Mat Martineau
2017-10-30  4:16 Christoph Paasch
2017-10-27 19:57 Christoph Paasch
2017-10-27 18:19 Mat Martineau
2017-10-26 23:20 Rao Shoaib
2017-10-26 22:26 Rao Shoaib
2017-10-23 23:10 Mat Martineau
2017-10-23 22:51 Mat Martineau
2017-10-23 20:13 Rao Shoaib
2017-10-23 20:10 Christoph Paasch
2017-10-23 19:49 Rao Shoaib
2017-10-23 16:37 Christoph Paasch
2017-10-20 23:02 Mat Martineau
