All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [MPTCP] Next steps discussion
@ 2018-02-28 23:08 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-02-28 23:08 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 7931 bytes --]

On 27/02/18 - 10:50:38, Mat Martineau wrote:
> 
> Hi Christoph,
> 
> On Mon, 26 Feb 2018, Christoph Paasch wrote:
> 
> > Hello,
> > 
> > as for next steps after the submission of the TCP-option framework to netdev
> > and DaveM's feedback on it.
> > 
> > Even if the submission got rejected, I think we still have a very useful set
> > of patches here. The need for a framework might pop up again in the future,
> > and so these patches could come in handy.
> > Mat, maybe you can put our latest submission on your kernel.org-git repo
> > just so that we don't lose track of these patches?
> 
> Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5

Thanks!

> > I can also create a github repo if you prefer that.
> > 
> > 
> > As for DaveM's feedback, the main takeaway - as Mat already noted on his other
> > mail - is that fast-path performance he the highest priority. Branching and
> > indirect function calls are hardly accepted there.
> > 
> > 
> > So, in that spirit I think we need to work towards reducing MPTCP's
> > intrusiveness to the TCP stack.
> > 
> > * Stop taking meta-lock when receiving subflow data (all the changes where
> >  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
> >  The reason we do this in today's implementation is because it allows to
> >  access the meta data-structure at any point. If we stop taking the
> >  meta-lock a few things need to change:
> >  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
> >  2. Group the more intrusive accesses to few select points in the TCP-stack
> >     where we then take the meta-lock (e.g., when receiving data).
> >     (this would be equivalent as if the TCP-option framework would be there
> >     - thus we need to move code to these or similar points in the stack)
> >  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
> >     lock-ordering issues (e.g., when we can't take the meta-lock because
> >     it's already held by another thread).
> > 
> >  I think, the way to approach this here, is by working iteratively and start
> >  moving code in such a way that accesses to the meta-socket are grouped
> >  together.
> > 
> >  Also, we have a few callbacks that we added (cfr., struct tcp_sock_ops).
> >  We added them to avoid duplicating the code. Let's review those and see if
> >  we can get rid of them. (as an example: .send_fin could be removed as it is only
> >  called from tcp_shutdown, called from the .shutdown callback in tcp_prot -
> >  thus if we expose a separate MPTCP socket-type with its own struct proto,
> >  we can get rid of the .send_fin callback)
> > 
> 
> I think a separate MPTCP socket type will be important for upstream
> acceptance. My team has been working on some code with this separate socket
> type that we can share.

Great! I would love to move MPTCP to a separate socket-type.

> I'm thinking that it will be useful to share once a
> connection can stay up without falling back to TCP.

Hmm... I'm not sure I understand. What do you mean with "connection can stay
up without falling back to TCP".

> 
> > * Investigate how/if we can make MPTCP adopt KCM or ULP.
> 
> My main concern about ULP is that only one upper layer protocol can be set
> up (at least as the code is now), so you wouldn't be able to do something
> like use in-kernel TLS over MPTCP. Other than that, it seems like a natural
> fit for MPTCP.

Do you think it would be feasible to make ULP use multiple ULPs ?

> 
> So far I've been looking at KCM as a source of good ideas rather than
> something we could use directly. KCM uses SOCK_SEQPACKET or SOCK_DGRAM, but
> maybe it could be extended to include SOCK_STREAM. Where MPTCP places DSS
> mappings in the TCP options, KCM handles message boundaries within the data
> stream - that made me ponder using XDP to place the DSS mappings in the data
> payload (with the necessary TCP sequence number adjustments). I'm not sure
> it's workable because it can be expensive to change the length of an
> incoming skb and adjusting the acks gets complicated, but it's at least an
> interesting thought experiment :)
> 
> > * There is still the open question of the API, path-management,... Tessares
> >  has some experience with that, so maybe they can provide some ideas here.
> 
> We (at OTC) are working on a generic netlink proposal for path management as
> well.
> 
> > 
> > * The size of the skb. Well, we have been discussing this for quite a while :)
> >  One option is always to have a lookup table as they do for the
> >  TLS-records. That will hurt performance, but at least it's a step forward.
> >  And we have a bunch of other ideas that might be worth exploring as well.
> >  If I'm not mistaken, Rao had an approach that could work as well, right?
> 
> This is what I'm working on now. For outgoing packets, I have a way to
> optionally allocate sk_buffs with extra control block space. For incoming
> packets, my initial experiment is with preventing packet coalesce/collapse
> so TCP options are still in the skb headroom. I don't consider that a
> long-term solution, though. Some kind of lookup table will probably be
> needed.
> 
> > Any other comments, suggestions,...? :-)
> 
> I had these thoughts on evolving the multipath-tcp.org kernel fork last
> summer (excerpt from
> https://lists.01.org/pipermail/mptcp/2017-July/000064.html), which I think
> are still relevant:
> 
> """
> 
> One approach is to attempt to merge the multipath-tcp.org fork. This is an
> implementation in which the multipath-tcp.org community has invested a lot
> of time and effort, and it is in production for major applications (see
> https://tools.ietf.org/html/rfc8041). This is a tremendous amount of code to
> review at once (even separating out modules), and currently doesn't fit with
> what the maintainers have asked for (non-intrusive, sk_buff size, MPTCP by
> default). I don't think the maintainers would consider merging such an
> extensive piece of git history, especially where there are a fair number of
> commits without an "mptcp:" label on the subject line or without a DCO
> signoff (https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
> Today, the fork is at kernel v4.4 and current upstream development is at
> v4.13-rc1, so the fork would have to catch up and stay current. (2018 note:
> Christoph has merged up to more recent kernels now)
> 
> The other extreme is to rewrite from scratch. This would allow incremental
> development with maintainer review from the start, but doesn't take
> advantage of existing code.
> 
> The most realistic approach is somewhere in between, where we write new
> code that fits maintainer expectations and utilize components from the
> fork where licensing allows and the code fits. We'll have to find the
> right balance: over-reliance on new code could take extra time, but
> constantly reworking the fork and keeping it up-to-date with net-next is
> also a lot of overhead.
> 
> """
> 
> Gregory and Matthieu, do you have any thoughts on where the right balance is
> on evolving the fork vs. adding new code?
> 
> 
> > On my side, as a first concrete step, I will work towards lockless subflow
> > establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
> > when the socket-lookup matches on a request-socket. Now that TCP supports
> > lockless listeners, MPTCP should do that as well.
> 
> I'll work on getting my team's MPTCP socket type code posted to
> git.kernel.org, and getting our generic netlink proposal posted to this
> list.

Cool! I think, the MPTCP socket-type code should also find its way into
mptcp-dev. It would allow to move that one forward as well.


Christoph


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-05 10:25 Matthieu Baerts
  0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2018-03-05 10:25 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 6201 bytes --]

Hi Mat, Christoph,

On Fri, Mar 2, 2018 at 4:37 PM, Mat Martineau <
mathew.j.martineau(a)linux.intel.com> wrote:

>                   Another issue may be the time. I think you are all in
>> the west coast of the
>>                   US :-)
>>                   But I am sure I can find time on the evening for a
>> short meeting!
>>
>>
>>             Ideally for me, it would be at 9am (for you 6pm). Otherwise,
>> I can make 8am
>>             as well (5pm for you).
>>
>>             I don't know what the opinion of the others is on the timing.
>>
>>
>>       I'm also in the Pacific time zone, as are my coworkers Peter and
>> Ossama. Personally, 9am works better.
>>
>>
>> 9am (6pm here) is good for me!
>>
>> When would you like to have our first meeting? At 6pm, I should not have
>> any other meetings.
>> Exceptionally I will not be available this Tuesday 6th of March. Just to
>> propose a date, what about Wednesday 7th of March?
>>
>
> Wednesday the 7th of March is the one morning next week I have a conflict
> at 9am, but I am available at 9:30. Other mornings at 9 are ok.


Just to be safe and not force you to rush finishing your meeting on time,
we can do that this Thursday at 9am? (8th of March, International Women's
Day)

I am going to send a calendar invitation to both of you. Can I add someone
else? Rao maybe?

      From earlier in this sub-thread:
>>
>>                  This is what I'm working on now. For outgoing packets, I
>> have a way
>>                  to optionally allocate sk_buffs with extra control block
>> space. For
>>                  incoming packets, my initial experiment is with
>> preventing packet
>>                  coalesce/collapse so TCP options are still in the skb
>> headroom. I
>>                  don't consider that a long-term solution, though. Some
>> kind of
>>                  lookup table will probably be needed.
>>
>>             That looks interesting!
>>
>>             On some systems with hardware acceleration to bypass the main
>> CPU for established connections, I guess they also
>>             have to modify the skb to share info between the main CPU and
>> another component. Do you have any ideas how they
>>             are doing that? It is maybe not a real problem for them to
>> increase the size of the skb if most of the traffic
>>             goes in the "fast path".
>>
>>
>>       Are you referring to userspace stacks (like DPDK), or optimizations
>> like GRO? I'm not sure of the specific answer to your
>>       question, but from what I see the drivers try to allocate enough
>> space in the skbs to accomodate the storage needs through
>>       the skb's lifetime.
>>
>>
>> Sorry, I was not clear. I don't know if it is common to do that, maybe
>> not because it is not upstream.
>>
>> In short, the NIC (or another external HW module) of a router/switch is
>> doing some learning about the flows it needs to forward. Once
>> this NIC/module knows the flow, the Linux network stack no longer see the
>> rest of a traffic because it has been "accelerated".
>> Here is a longer description when this is done by the NIC:
>> https://www.netronome.com/media/documents/WP_Hardware_Accele
>> ration.pdf#WP_Hardware_Acceleration.indd%3A.3203%3A27
>>
>> But I guess there are some info that the NIC/module cannot learn by
>> itself when looking at the beginning of the flow like what QoS to
>> apply, if a specific flow can be "accelerated" or not, etc. These info
>> are certainly shared in the skb.
>>
>> Now that I am thinking more about this scenario, it will maybe not help
>> for our MPTCP case. I was supposing these info shared via the skb
>> could be randomly present. But even if the skb is bigger all the time, I
>> guess this is not a problem for them because at the end, the
>> Linux network stack will only see a very small part of the traffic, not a
>> big deal to have more cache misses for the slow path only (and
>> some other exceptions).
>> In other words, their skb is maybe too big as well but that's not a
>> problem.
>>
>
> The only case where I'm allocating larger skbs is on the outgoing path
> where MPTCP code can control allocation, and it happens in a way that's
> transparent to most users of the skb. While it would be great to have the
> space on received skbs I wanted to leave control of incoming skb allocation
> with the drivers.
>
> I'm using an updated version of this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/martineau/
> linux.git/commit/?h=sharedcb&id=8fffadb6b1ee0dc23e8eed43e40e15f5a6277307
>
> In short, the shinfo part of the skb has additional bytes allocated but
> code that expects a "normal" skb sees only the normal part of the skb. It's
> fine for the payload part of the skb to be any size, since the shinfo area
> is tracked separately.


Thank you for the explanations! That looks promising!
Do you think this approach could also be interesting for TCP TLS project?

Best regards,
Matthieu
-- 
[image: Tessares SA] <http://www.tessares.net> Matthieu Baerts | R&D
Engineer
matthieu.baerts(a)tessares.net
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
<https://www.google.com/maps?q=1+Avenue+Jean+Monnet,+1348+Ottignies-Louvain-la-Neuve,+Belgium>

-- 

------------------------------
DISCLAIMER.
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. 
If you have received this email in error please notify the system manager. 
This message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. If you are not the intended recipient 
you are notified that disclosing, copying, distributing or taking any 
action in reliance on the contents of this information is strictly 
prohibited.

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 9255 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-02 17:57 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-03-02 17:57 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 7831 bytes --]

On 02/03/18 - 07:37:40, Mat Martineau wrote:
> 
> On Fri, 2 Mar 2018, Matthieu Baerts wrote:
> 
> > Hi Mat, Christoph,
> > 
> > On Thu, Mar 1, 2018 at 9:05 PM, Mat Martineau <mathew.j.martineau(a)linux.intel.com> wrote:
> >       On Thu, 1 Mar 2018, Christoph Paasch wrote:
> > 
> >             On 01/03/18 - 18:05:40, Matthieu Baerts wrote:
> >                   Hi Christoph,
> > 
> >                   On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
> >                   wrote:
> > 
> >                         Yeah, feedback was disappointingly scarce :/
> > 
> >                         It probably was not what he had completely in mind, as he didn't talk about
> >                         a TCP-option framework but simply about moving TCP-MD5 code out of the
> >                         TCP-stack. So, with our proposal we went a bit further.
> > 
> > 
> >                   Or maybe simply unlucky with who was available at this time to have a look
> >                   at them :)
> > 
> > 
> >                         TLS is a good example. They actually do have a lookup table where the
> >                         driver
> >                         is calling back into the TCP-stack to get the TLS-record information.
> > 
> >                         We had discussed this with people from Mellanox that proposed the
> >                         TLS hardware offloading: https://lists.01.org/
> >                         pipermail/mptcp/2017-November/000165.html
> > 
> > 
> > 
> >                   Thank you for the link!
> > 
> >                   Yes, that would be good! Just a short weekly sync could be useful.
> > 
> >                         Do you have a Webex or other means to set this up?
> > 
> > 
> >                   I think I can setup a webex (even if it is less fun to use that on a Linux
> >                   desktop :) ). We already used talky.io, more Linux friendly if we need to
> >                   share screen and others, maybe not needed.
> > 
> > 
> >             talky.io is fine for me as well.
> > 
> > 
> >       I haven't used it before, but it looks like a fine option. Thank you for setting this up.
> > 
> > 
> > The good thing is that we can have a public URL, just in case someone else would help for this upstreaming task:
> > https://talky.io/mptcp_upstreaming
> > 
> > I can also define a shared key if needed.
> 
> Yes, since this is an open list it's important to have a public URL.
> 
> > 
> >                   Another issue may be the time. I think you are all in the west coast of the
> >                   US :-)
> >                   But I am sure I can find time on the evening for a short meeting!
> > 
> > 
> >             Ideally for me, it would be at 9am (for you 6pm). Otherwise, I can make 8am
> >             as well (5pm for you).
> > 
> >             I don't know what the opinion of the others is on the timing.
> > 
> > 
> >       I'm also in the Pacific time zone, as are my coworkers Peter and Ossama. Personally, 9am works better.
> > 
> > 
> > 9am (6pm here) is good for me!
> > 
> > When would you like to have our first meeting? At 6pm, I should not have any other meetings.
> > Exceptionally I will not be available this Tuesday 6th of March. Just to propose a date, what about Wednesday 7th of March?
> 
> Wednesday the 7th of March is the one morning next week I have a conflict at
> 9am, but I am available at 9:30. Other mornings at 9 are ok.

On my side, any day at 9am next week is fine.


Cheers,
Christoph

> 
> > 
> >                         Looking forward to the netlink PM! :-)
> > 
> > 
> >                   I hope I will have time very soon to clean this.
> >                   Mat, do you prefer to talk about that directly by commenting the patches on
> >                   mptcp-dev ML?
> > 
> > 
> >       That's fine with me. I did read the paper (unfortunately I hadn't come across it before!), and it looks like our approaches
> >       have a lot in common - it will be informative to compare the details.
> > 
> > 
> > Good idea! I will work on the rebase/clean-up and send it to mptcp-dev ML. I will add you in cc.
> > 
> >       From earlier in this sub-thread:
> > 
> >                  This is what I'm working on now. For outgoing packets, I have a way
> >                  to optionally allocate sk_buffs with extra control block space. For
> >                  incoming packets, my initial experiment is with preventing packet
> >                  coalesce/collapse so TCP options are still in the skb headroom. I
> >                  don't consider that a long-term solution, though. Some kind of
> >                  lookup table will probably be needed.
> > 
> >             That looks interesting!
> > 
> >             On some systems with hardware acceleration to bypass the main CPU for established connections, I guess they also
> >             have to modify the skb to share info between the main CPU and another component. Do you have any ideas how they
> >             are doing that? It is maybe not a real problem for them to increase the size of the skb if most of the traffic
> >             goes in the "fast path".
> > 
> > 
> >       Are you referring to userspace stacks (like DPDK), or optimizations like GRO? I'm not sure of the specific answer to your
> >       question, but from what I see the drivers try to allocate enough space in the skbs to accomodate the storage needs through
> >       the skb's lifetime.
> > 
> > 
> > Sorry, I was not clear. I don't know if it is common to do that, maybe not because it is not upstream.
> > 
> > In short, the NIC (or another external HW module) of a router/switch is doing some learning about the flows it needs to forward. Once
> > this NIC/module knows the flow, the Linux network stack no longer see the rest of a traffic because it has been "accelerated".
> > Here is a longer description when this is done by the NIC:
> > https://www.netronome.com/media/documents/WP_Hardware_Acceleration.pdf#WP_Hardware_Acceleration.indd%3A.3203%3A27
> > 
> > But I guess there are some info that the NIC/module cannot learn by itself when looking at the beginning of the flow like what QoS to
> > apply, if a specific flow can be "accelerated" or not, etc. These info are certainly shared in the skb.
> > 
> > Now that I am thinking more about this scenario, it will maybe not help for our MPTCP case. I was supposing these info shared via the skb
> > could be randomly present. But even if the skb is bigger all the time, I guess this is not a problem for them because at the end, the
> > Linux network stack will only see a very small part of the traffic, not a big deal to have more cache misses for the slow path only (and
> > some other exceptions).
> > In other words, their skb is maybe too big as well but that's not a problem.
> 
> The only case where I'm allocating larger skbs is on the outgoing path where
> MPTCP code can control allocation, and it happens in a way that's
> transparent to most users of the skb. While it would be great to have the
> space on received skbs I wanted to leave control of incoming skb allocation
> with the drivers.
> 
> I'm using an updated version of this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/commit/?h=sharedcb&id=8fffadb6b1ee0dc23e8eed43e40e15f5a6277307
> 
> In short, the shinfo part of the skb has additional bytes allocated but code
> that expects a "normal" skb sees only the normal part of the skb. It's fine
> for the payload part of the skb to be any size, since the shinfo area is
> tracked separately.
> 
> Thanks,
> 
> --
> Mat Martineau
> Intel OTC


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-02 15:37 Mat Martineau
  0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2018-03-02 15:37 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 7379 bytes --]


On Fri, 2 Mar 2018, Matthieu Baerts wrote:

> Hi Mat, Christoph,
> 
> On Thu, Mar 1, 2018 at 9:05 PM, Mat Martineau <mathew.j.martineau(a)linux.intel.com> wrote:
>       On Thu, 1 Mar 2018, Christoph Paasch wrote:
>
>             On 01/03/18 - 18:05:40, Matthieu Baerts wrote:
>                   Hi Christoph,
>
>                   On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
>                   wrote:
>
>                         Yeah, feedback was disappointingly scarce :/
>
>                         It probably was not what he had completely in mind, as he didn't talk about
>                         a TCP-option framework but simply about moving TCP-MD5 code out of the
>                         TCP-stack. So, with our proposal we went a bit further.
> 
>
>                   Or maybe simply unlucky with who was available at this time to have a look
>                   at them :)
> 
>
>                         TLS is a good example. They actually do have a lookup table where the
>                         driver
>                         is calling back into the TCP-stack to get the TLS-record information.
>
>                         We had discussed this with people from Mellanox that proposed the
>                         TLS hardware offloading: https://lists.01.org/
>                         pipermail/mptcp/2017-November/000165.html
> 
> 
>
>                   Thank you for the link!
>
>                   Yes, that would be good! Just a short weekly sync could be useful.
>
>                         Do you have a Webex or other means to set this up?
> 
>
>                   I think I can setup a webex (even if it is less fun to use that on a Linux
>                   desktop :) ). We already used talky.io, more Linux friendly if we need to
>                   share screen and others, maybe not needed.
> 
>
>             talky.io is fine for me as well.
> 
>
>       I haven't used it before, but it looks like a fine option. Thank you for setting this up.
> 
> 
> The good thing is that we can have a public URL, just in case someone else would help for this upstreaming task:
> https://talky.io/mptcp_upstreaming
> 
> I can also define a shared key if needed.

Yes, since this is an open list it's important to have a public URL.

>
>                   Another issue may be the time. I think you are all in the west coast of the
>                   US :-)
>                   But I am sure I can find time on the evening for a short meeting!
> 
>
>             Ideally for me, it would be at 9am (for you 6pm). Otherwise, I can make 8am
>             as well (5pm for you).
>
>             I don't know what the opinion of the others is on the timing.
> 
>
>       I'm also in the Pacific time zone, as are my coworkers Peter and Ossama. Personally, 9am works better.
> 
> 
> 9am (6pm here) is good for me!
> 
> When would you like to have our first meeting? At 6pm, I should not have any other meetings.
> Exceptionally I will not be available this Tuesday 6th of March. Just to propose a date, what about Wednesday 7th of March?

Wednesday the 7th of March is the one morning next week I have a conflict 
at 9am, but I am available at 9:30. Other mornings at 9 are ok.

>
>                         Looking forward to the netlink PM! :-)
> 
>
>                   I hope I will have time very soon to clean this.
>                   Mat, do you prefer to talk about that directly by commenting the patches on
>                   mptcp-dev ML?
> 
>
>       That's fine with me. I did read the paper (unfortunately I hadn't come across it before!), and it looks like our approaches
>       have a lot in common - it will be informative to compare the details.
> 
> 
> Good idea! I will work on the rebase/clean-up and send it to mptcp-dev ML. I will add you in cc.
>
>       From earlier in this sub-thread:
>
>                  This is what I'm working on now. For outgoing packets, I have a way
>                  to optionally allocate sk_buffs with extra control block space. For
>                  incoming packets, my initial experiment is with preventing packet
>                  coalesce/collapse so TCP options are still in the skb headroom. I
>                  don't consider that a long-term solution, though. Some kind of
>                  lookup table will probably be needed.
>
>             That looks interesting!
>
>             On some systems with hardware acceleration to bypass the main CPU for established connections, I guess they also
>             have to modify the skb to share info between the main CPU and another component. Do you have any ideas how they
>             are doing that? It is maybe not a real problem for them to increase the size of the skb if most of the traffic
>             goes in the "fast path".
> 
>
>       Are you referring to userspace stacks (like DPDK), or optimizations like GRO? I'm not sure of the specific answer to your
>       question, but from what I see the drivers try to allocate enough space in the skbs to accomodate the storage needs through
>       the skb's lifetime.
> 
> 
> Sorry, I was not clear. I don't know if it is common to do that, maybe not because it is not upstream.
> 
> In short, the NIC (or another external HW module) of a router/switch is doing some learning about the flows it needs to forward. Once
> this NIC/module knows the flow, the Linux network stack no longer see the rest of a traffic because it has been "accelerated".
> Here is a longer description when this is done by the NIC:
> https://www.netronome.com/media/documents/WP_Hardware_Acceleration.pdf#WP_Hardware_Acceleration.indd%3A.3203%3A27
> 
> But I guess there are some info that the NIC/module cannot learn by itself when looking at the beginning of the flow like what QoS to
> apply, if a specific flow can be "accelerated" or not, etc. These info are certainly shared in the skb.
> 
> Now that I am thinking more about this scenario, it will maybe not help for our MPTCP case. I was supposing these info shared via the skb
> could be randomly present. But even if the skb is bigger all the time, I guess this is not a problem for them because at the end, the
> Linux network stack will only see a very small part of the traffic, not a big deal to have more cache misses for the slow path only (and
> some other exceptions).
> In other words, their skb is maybe too big as well but that's not a problem.

The only case where I'm allocating larger skbs is on the outgoing path 
where MPTCP code can control allocation, and it happens in a way that's 
transparent to most users of the skb. While it would be great to have the 
space on received skbs I wanted to leave control of incoming skb 
allocation with the drivers.

I'm using an updated version of this:

https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/commit/?h=sharedcb&id=8fffadb6b1ee0dc23e8eed43e40e15f5a6277307

In short, the shinfo part of the skb has additional bytes allocated but 
code that expects a "normal" skb sees only the normal part of the skb. 
It's fine for the payload part of the skb to be any size, since the shinfo 
area is tracked separately.

Thanks,

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-02 15:03 Matthieu Baerts
  0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2018-03-02 15:03 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 6753 bytes --]

Hi Mat, Christoph,

On Thu, Mar 1, 2018 at 9:05 PM, Mat Martineau <mathew.j.martineau(a)linux.inte
l.com> wrote:

> On Thu, 1 Mar 2018, Christoph Paasch wrote:
>
> On 01/03/18 - 18:05:40, Matthieu Baerts wrote:
>>
>>> Hi Christoph,
>>>
>>> On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
>>> wrote:
>>>
>>> Yeah, feedback was disappointingly scarce :/
>>>>
>>>> It probably was not what he had completely in mind, as he didn't talk
>>>> about
>>>> a TCP-option framework but simply about moving TCP-MD5 code out of the
>>>> TCP-stack. So, with our proposal we went a bit further.
>>>>
>>>>
>>> Or maybe simply unlucky with who was available at this time to have a
>>> look
>>> at them :)
>>>
>>>
>>> TLS is a good example. They actually do have a lookup table where the
>>>> driver
>>>> is calling back into the TCP-stack to get the TLS-record information.
>>>>
>>>> We had discussed this with people from Mellanox that proposed the
>>>> TLS hardware offloading: https://lists.01.org/
>>>> pipermail/mptcp/2017-November/000165.html
>>>>
>>>
>>>
>>> Thank you for the link!
>>>
>>> Yes, that would be good! Just a short weekly sync could be useful.
>>>
>>>>
>>>> Do you have a Webex or other means to set this up?
>>>>
>>>>
>>> I think I can setup a webex (even if it is less fun to use that on a
>>> Linux
>>> desktop :) ). We already used talky.io, more Linux friendly if we need
>>> to
>>> share screen and others, maybe not needed.
>>>
>>
>> talky.io is fine for me as well.
>>
>
> I haven't used it before, but it looks like a fine option. Thank you for
> setting this up.
>

The good thing is that we can have a public URL, just in case someone else
would help for this upstreaming task: https://talky.io/mptcp_upstreaming

I can also define a shared key if needed.

Another issue may be the time. I think you are all in the west coast of the
>>> US :-)
>>> But I am sure I can find time on the evening for a short meeting!
>>>
>>
>> Ideally for me, it would be at 9am (for you 6pm). Otherwise, I can make
>> 8am
>> as well (5pm for you).
>>
>> I don't know what the opinion of the others is on the timing.
>>
>
> I'm also in the Pacific time zone, as are my coworkers Peter and Ossama.
> Personally, 9am works better.
>

9am (6pm here) is good for me!

When would you like to have our first meeting? At 6pm, I should not have
any other meetings.
Exceptionally I will not be available this Tuesday 6th of March. Just to
propose a date, what about Wednesday 7th of March?

Looking forward to the netlink PM! :-)
>>>>
>>>>
>>> I hope I will have time very soon to clean this.
>>> Mat, do you prefer to talk about that directly by commenting the patches
>>> on
>>> mptcp-dev ML?
>>>
>>
> That's fine with me. I did read the paper (unfortunately I hadn't come
> across it before!), and it looks like our approaches have a lot in common -
> it will be informative to compare the details.
>

Good idea! I will work on the rebase/clean-up and send it to mptcp-dev ML.
I will add you in cc.

From earlier in this sub-thread:
>
>      This is what I'm working on now. For outgoing packets, I have a way
>>      to optionally allocate sk_buffs with extra control block space. For
>>      incoming packets, my initial experiment is with preventing packet
>>      coalesce/collapse so TCP options are still in the skb headroom. I
>>      don't consider that a long-term solution, though. Some kind of
>>      lookup table will probably be needed.
>>
>> That looks interesting!
>>
>> On some systems with hardware acceleration to bypass the main CPU for
>> established connections, I guess they also have to modify the skb to share
>> info between the main CPU and another component. Do you have any ideas how
>> they are doing that? It is maybe not a real problem for them to increase
>> the size of the skb if most of the traffic goes in the "fast path".
>>
>
> Are you referring to userspace stacks (like DPDK), or optimizations like
> GRO? I'm not sure of the specific answer to your question, but from what I
> see the drivers try to allocate enough space in the skbs to accomodate the
> storage needs through the skb's lifetime.


Sorry, I was not clear. I don't know if it is common to do that, maybe not
because it is not upstream.

In short, the NIC (or another external HW module) of a router/switch is
doing some learning about the flows it needs to forward. Once this
NIC/module knows the flow, the Linux network stack no longer see the rest
of a traffic because it has been "accelerated".
Here is a longer description when this is done by the NIC:
https://www.netronome.com/media/documents/WP_Hardware_
Acceleration.pdf#WP_Hardware_Acceleration.indd%3A.3203%3A27

But I guess there are some info that the NIC/module cannot learn by itself
when looking at the beginning of the flow like what QoS to apply, if a
specific flow can be "accelerated" or not, etc. These info are certainly
shared in the skb.

Now that I am thinking more about this scenario, it will maybe not help for
our MPTCP case. I was supposing these info shared via the skb could be
randomly present. But even if the skb is bigger all the time, I guess this
is not a problem for them because at the end, the Linux network stack will
only see a very small part of the traffic, not a big deal to have more
cache misses for the slow path only (and some other exceptions).
In other words, their skb is maybe too big as well but that's not a problem.

Best regards,
Matthieu
-- 
[image: Tessares SA] <http://www.tessares.net> Matthieu Baerts | R&D
Engineer
matthieu.baerts(a)tessares.net
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
<https://www.google.com/maps?q=1+Avenue+Jean+Monnet,+1348+Ottignies-Louvain-la-Neuve,+Belgium>

-- 

------------------------------
DISCLAIMER.
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. 
If you have received this email in error please notify the system manager. 
This message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. If you are not the intended recipient 
you are notified that disclosing, copying, distributing or taking any 
action in reliance on the contents of this information is strictly 
prohibited.

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 11732 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-01 20:05 Mat Martineau
  0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2018-03-01 20:05 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3510 bytes --]

On Thu, 1 Mar 2018, Christoph Paasch wrote:

> On 01/03/18 - 18:05:40, Matthieu Baerts wrote:
>> Hi Christoph,
>>
>> On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
>> wrote:
>>
>>> Yeah, feedback was disappointingly scarce :/
>>>
>>> It probably was not what he had completely in mind, as he didn't talk about
>>> a TCP-option framework but simply about moving TCP-MD5 code out of the
>>> TCP-stack. So, with our proposal we went a bit further.
>>>
>>
>> Or maybe simply unlucky with who was available at this time to have a look
>> at them :)
>>
>>
>>> TLS is a good example. They actually do have a lookup table where the
>>> driver
>>> is calling back into the TCP-stack to get the TLS-record information.
>>>
>>> We had discussed this with people from Mellanox that proposed the
>>> TLS hardware offloading: https://lists.01.org/
>>> pipermail/mptcp/2017-November/000165.html
>>
>>
>> Thank you for the link!
>>
>> Yes, that would be good! Just a short weekly sync could be useful.
>>>
>>> Do you have a Webex or other means to set this up?
>>>
>>
>> I think I can setup a webex (even if it is less fun to use that on a Linux
>> desktop :) ). We already used talky.io, more Linux friendly if we need to
>> share screen and others, maybe not needed.
>
> talky.io is fine for me as well.

I haven't used it before, but it looks like a fine option. Thank you for 
setting this up.

>
>> Another issue may be the time. I think you are all in the west coast of the
>> US :-)
>> But I am sure I can find time on the evening for a short meeting!
>
> Ideally for me, it would be at 9am (for you 6pm). Otherwise, I can make 8am
> as well (5pm for you).
>
> I don't know what the opinion of the others is on the timing.

I'm also in the Pacific time zone, as are my coworkers Peter and Ossama. 
Personally, 9am works better.

>
>>
>>
>>> Looking forward to the netlink PM! :-)
>>>
>>
>> I hope I will have time very soon to clean this.
>> Mat, do you prefer to talk about that directly by commenting the patches on
>> mptcp-dev ML?

That's fine with me. I did read the paper (unfortunately I hadn't come 
across it before!), and it looks like our approaches have a lot in common 
- it will be informative to compare the details.


From earlier in this sub-thread:

>      This is what I'm working on now. For outgoing packets, I have a way
>      to optionally allocate sk_buffs with extra control block space. For
>      incoming packets, my initial experiment is with preventing packet
>      coalesce/collapse so TCP options are still in the skb headroom. I
>      don't consider that a long-term solution, though. Some kind of
>      lookup table will probably be needed.
>
> That looks interesting!
>
> On some systems with hardware acceleration to bypass the main CPU for 
> established connections, I guess they also have to modify the skb to 
> share info between the main CPU and another component. Do you have any 
> ideas how they are doing that? It is maybe not a real problem for them 
> to increase the size of the skb if most of the traffic goes in the "fast 
> path".

Are you referring to userspace stacks (like DPDK), or optimizations like 
GRO? I'm not sure of the specific answer to your question, but from what I 
see the drivers try to allocate enough space in the skbs to accomodate the 
storage needs through the skb's lifetime.

Thanks,

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-01 19:38 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-03-01 19:38 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3046 bytes --]

On 01/03/18 - 18:05:40, Matthieu Baerts wrote:
> Hi Christoph,
> 
> On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
> wrote:
> 
> > Yeah, feedback was disappointingly scarce :/
> >
> > It probably was not what he had completely in mind, as he didn't talk about
> > a TCP-option framework but simply about moving TCP-MD5 code out of the
> > TCP-stack. So, with our proposal we went a bit further.
> >
> 
> Or maybe simply unlucky with who was available at this time to have a look
> at them :)
> 
> 
> > TLS is a good example. They actually do have a lookup table where the
> > driver
> > is calling back into the TCP-stack to get the TLS-record information.
> >
> > We had discussed this with people from Mellanox that proposed the
> > TLS hardware offloading: https://lists.01.org/
> > pipermail/mptcp/2017-November/000165.html
> 
> 
> Thank you for the link!
> 
> Yes, that would be good! Just a short weekly sync could be useful.
> >
> > Do you have a Webex or other means to set this up?
> >
> 
> I think I can setup a webex (even if it is less fun to use that on a Linux
> desktop :) ). We already used talky.io, more Linux friendly if we need to
> share screen and others, maybe not needed.

talky.io is fine for me as well.

> Another issue may be the time. I think you are all in the west coast of the
> US :-)
> But I am sure I can find time on the evening for a short meeting!

Ideally for me, it would be at 9am (for you 6pm). Otherwise, I can make 8am
as well (5pm for you).

I don't know what the opinion of the others is on the timing.


Christoph


> 
> 
> > Looking forward to the netlink PM! :-)
> >
> 
> I hope I will have time very soon to clean this.
> Mat, do you prefer to talk about that directly by commenting the patches on
> mptcp-dev ML?
> 
> Matthieu
> -- 
> [image: Tessares SA] <http://www.tessares.net> Matthieu Baerts | R&D
> Engineer
> matthieu.baerts(a)tessares.net
> Tessares SA | Hybrid Access Solutions
> www.tessares.net
> 1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
> <https://www.google.com/maps?q=1+Avenue+Jean+Monnet,+1348+Ottignies-Louvain-la-Neuve,+Belgium>
> 
> -- 
> 
> ------------------------------
> DISCLAIMER.
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. 
> If you have received this email in error please notify the system manager. 
> This message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and 
> delete this e-mail from your system. If you are not the intended recipient 
> you are notified that disclosing, copying, distributing or taking any 
> action in reliance on the contents of this information is strictly 
> prohibited.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-01 19:37 Mat Martineau
  0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2018-03-01 19:37 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 5803 bytes --]

On Wed, 28 Feb 2018, Christoph Paasch wrote:

> On 28/02/18 - 15:44:39, Mat Martineau wrote:
>> On Wed, 28 Feb 2018, Christoph Paasch wrote:
>>
>>> On 27/02/18 - 10:50:38, Mat Martineau wrote:
>>>>
>>>> Hi Christoph,
>>>>
>>>> On Mon, 26 Feb 2018, Christoph Paasch wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> as for next steps after the submission of the TCP-option framework to netdev
>>>>> and DaveM's feedback on it.
>>>>>
>>>>> Even if the submission got rejected, I think we still have a very useful set
>>>>> of patches here. The need for a framework might pop up again in the future,
>>>>> and so these patches could come in handy.
>>>>> Mat, maybe you can put our latest submission on your kernel.org-git repo
>>>>> just so that we don't lose track of these patches?
>>>>
>>>> Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5
>>>
>>> Thanks!
>>>
>>>>> I can also create a github repo if you prefer that.
>>>>>
>>>>>
>>>>> As for DaveM's feedback, the main takeaway - as Mat already noted on his other
>>>>> mail - is that fast-path performance he the highest priority. Branching and
>>>>> indirect function calls are hardly accepted there.
>>>>>
>>>>>
>>>>> So, in that spirit I think we need to work towards reducing MPTCP's
>>>>> intrusiveness to the TCP stack.
>>>>>
>>>>> * Stop taking meta-lock when receiving subflow data (all the changes where
>>>>>  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
>>>>>  The reason we do this in today's implementation is because it allows to
>>>>>  access the meta data-structure at any point. If we stop taking the
>>>>>  meta-lock a few things need to change:
>>>>>  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
>>>>>  2. Group the more intrusive accesses to few select points in the TCP-stack
>>>>>     where we then take the meta-lock (e.g., when receiving data).
>>>>>     (this would be equivalent as if the TCP-option framework would be there
>>>>>     - thus we need to move code to these or similar points in the stack)
>>>>>  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
>>>>>     lock-ordering issues (e.g., when we can't take the meta-lock because
>>>>>     it's already held by another thread).
>>>>>
>>>>>  I think, the way to approach this here, is by working iteratively and start
>>>>>  moving code in such a way that accesses to the meta-socket are grouped
>>>>>  together.
>>>>>
>>>>>  Also, we have a few callbacks that we added (cfr., struct tcp_sock_ops).
>>>>>  We added them to avoid duplicating the code. Let's review those and see if
>>>>>  we can get rid of them. (as an example: .send_fin could be removed as it is only
>>>>>  called from tcp_shutdown, called from the .shutdown callback in tcp_prot -
>>>>>  thus if we expose a separate MPTCP socket-type with its own struct proto,
>>>>>  we can get rid of the .send_fin callback)
>>>>>
>>>>
>>>> I think a separate MPTCP socket type will be important for upstream
>>>> acceptance. My team has been working on some code with this separate socket
>>>> type that we can share.
>>>
>>> Great! I would love to move MPTCP to a separate socket-type.
>>>
>>>> I'm thinking that it will be useful to share once a
>>>> connection can stay up without falling back to TCP.
>>>
>>> Hmm... I'm not sure I understand. What do you mean with "connection can stay
>>> up without falling back to TCP".
>>
>> Sorry, to clarify:
>>
>> The code I have today does not process incoming DSS options or send a data
>> ack, so the connection falls back to regular TCP after a couple of packets.
>> It's based on net-next.
>
> Oh, so it is not integrated on top of the MPTCP-code and rather a fresh
> start from scratch?

Not entirely from scratch, but mostly.

>> I would like to get my code to a point where it sends a correct data ack so
>> the connection does not fall back, and at that point post the code on
>> git.kernel.org and as a patch set on this list.
>
> Maybe, even if the data-reception is not yet working, I still would be
> intersted in the MPTCP socket-type patches. Because that could be integrated
> into multipath-tcp.org.
>
> Because, getting the data-reception to work if you started from scratch is
> going to take you a lot of time.

We've been working on it for a while, and I'm pretty close to having data 
reception working (just a single subflow, no join, various other early 
implementation constraints).

>
>>>>> * Investigate how/if we can make MPTCP adopt KCM or ULP.
>>>>
>>>> My main concern about ULP is that only one upper layer protocol can be set
>>>> up (at least as the code is now), so you wouldn't be able to do something
>>>> like use in-kernel TLS over MPTCP. Other than that, it seems like a natural
>>>> fit for MPTCP.
>>>
>>> Do you think it would be feasible to make ULP use multiple ULPs ?
>>
>> I think it's possible to have a layered approach to ULP, maybe by providing
>> a list of protocols to setsockopt. It would be nice to have some
>> infrastructure to help implement it correctly, rather than depending on each
>> ULP to correctly call in to the layer below it whether it's the "normal" TCP
>> code or some other ULP. The combinatorics would be a headache for sure:
>> running TLS over an MPTCP connection makes sense, but MPTCP layered on top
>> of TLS doesn't make sense. At least if all the protocols were specified in
>> one setsockopt we could limit which combinations are allowed.
>
> Yes, MPTCP in this case would be special as it is so tightly coupled to TCP.
>
> MPTCP is always the protocol that would be at the bottom of the stack I
> think.

Ok, thanks for the confirmation.


--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-01 17:05 Matthieu Baerts
  0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2018-03-01 17:05 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 2631 bytes --]

Hi Christoph,

On Wed, Feb 28, 2018 at 11:41 PM, Christoph Paasch <cpaasch(a)apple.com>
wrote:

> Yeah, feedback was disappointingly scarce :/
>
> It probably was not what he had completely in mind, as he didn't talk about
> a TCP-option framework but simply about moving TCP-MD5 code out of the
> TCP-stack. So, with our proposal we went a bit further.
>

Or maybe simply unlucky with who was available at this time to have a look
at them :)


> TLS is a good example. They actually do have a lookup table where the
> driver
> is calling back into the TCP-stack to get the TLS-record information.
>
> We had discussed this with people from Mellanox that proposed the
> TLS hardware offloading: https://lists.01.org/
> pipermail/mptcp/2017-November/000165.html


Thank you for the link!

Yes, that would be good! Just a short weekly sync could be useful.
>
> Do you have a Webex or other means to set this up?
>

I think I can setup a webex (even if it is less fun to use that on a Linux
desktop :) ). We already used talky.io, more Linux friendly if we need to
share screen and others, maybe not needed.

Another issue may be the time. I think you are all in the west coast of the
US :-)
But I am sure I can find time on the evening for a short meeting!


> Looking forward to the netlink PM! :-)
>

I hope I will have time very soon to clean this.
Mat, do you prefer to talk about that directly by commenting the patches on
mptcp-dev ML?

Matthieu
-- 
[image: Tessares SA] <http://www.tessares.net> Matthieu Baerts | R&D
Engineer
matthieu.baerts(a)tessares.net
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
<https://www.google.com/maps?q=1+Avenue+Jean+Monnet,+1348+Ottignies-Louvain-la-Neuve,+Belgium>

-- 

------------------------------
DISCLAIMER.
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. 
If you have received this email in error please notify the system manager. 
This message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and 
delete this e-mail from your system. If you are not the intended recipient 
you are notified that disclosing, copying, distributing or taking any 
action in reliance on the contents of this information is strictly 
prohibited.

[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 5377 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-03-01  0:44 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-03-01  0:44 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 5567 bytes --]

On 28/02/18 - 15:44:39, Mat Martineau wrote:
> On Wed, 28 Feb 2018, Christoph Paasch wrote:
> 
> > On 27/02/18 - 10:50:38, Mat Martineau wrote:
> > > 
> > > Hi Christoph,
> > > 
> > > On Mon, 26 Feb 2018, Christoph Paasch wrote:
> > > 
> > > > Hello,
> > > > 
> > > > as for next steps after the submission of the TCP-option framework to netdev
> > > > and DaveM's feedback on it.
> > > > 
> > > > Even if the submission got rejected, I think we still have a very useful set
> > > > of patches here. The need for a framework might pop up again in the future,
> > > > and so these patches could come in handy.
> > > > Mat, maybe you can put our latest submission on your kernel.org-git repo
> > > > just so that we don't lose track of these patches?
> > > 
> > > Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5
> > 
> > Thanks!
> > 
> > > > I can also create a github repo if you prefer that.
> > > > 
> > > > 
> > > > As for DaveM's feedback, the main takeaway - as Mat already noted on his other
> > > > mail - is that fast-path performance he the highest priority. Branching and
> > > > indirect function calls are hardly accepted there.
> > > > 
> > > > 
> > > > So, in that spirit I think we need to work towards reducing MPTCP's
> > > > intrusiveness to the TCP stack.
> > > > 
> > > > * Stop taking meta-lock when receiving subflow data (all the changes where
> > > >  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
> > > >  The reason we do this in today's implementation is because it allows to
> > > >  access the meta data-structure at any point. If we stop taking the
> > > >  meta-lock a few things need to change:
> > > >  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
> > > >  2. Group the more intrusive accesses to few select points in the TCP-stack
> > > >     where we then take the meta-lock (e.g., when receiving data).
> > > >     (this would be equivalent as if the TCP-option framework would be there
> > > >     - thus we need to move code to these or similar points in the stack)
> > > >  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
> > > >     lock-ordering issues (e.g., when we can't take the meta-lock because
> > > >     it's already held by another thread).
> > > > 
> > > >  I think, the way to approach this here, is by working iteratively and start
> > > >  moving code in such a way that accesses to the meta-socket are grouped
> > > >  together.
> > > > 
> > > >  Also, we have a few callbacks that we added (cfr., struct tcp_sock_ops).
> > > >  We added them to avoid duplicating the code. Let's review those and see if
> > > >  we can get rid of them. (as an example: .send_fin could be removed as it is only
> > > >  called from tcp_shutdown, called from the .shutdown callback in tcp_prot -
> > > >  thus if we expose a separate MPTCP socket-type with its own struct proto,
> > > >  we can get rid of the .send_fin callback)
> > > > 
> > > 
> > > I think a separate MPTCP socket type will be important for upstream
> > > acceptance. My team has been working on some code with this separate socket
> > > type that we can share.
> > 
> > Great! I would love to move MPTCP to a separate socket-type.
> > 
> > > I'm thinking that it will be useful to share once a
> > > connection can stay up without falling back to TCP.
> > 
> > Hmm... I'm not sure I understand. What do you mean with "connection can stay
> > up without falling back to TCP".
> 
> Sorry, to clarify:
> 
> The code I have today does not process incoming DSS options or send a data
> ack, so the connection falls back to regular TCP after a couple of packets.
> It's based on net-next.

Oh, so it is not built on top of the MPTCP code but rather a fresh start
from scratch?

> I would like to get my code to a point where it sends a correct data ack so
> the connection does not fall back, and at that point post the code on
> git.kernel.org and as a patch set on this list.

Even if the data-reception is not yet working, I would still be interested
in the MPTCP socket-type patches, because they could be integrated into
multipath-tcp.org.

Getting the data-reception to work when starting from scratch is going to
take you a lot of time.

> > > > * Investigate how/if we can make MPTCP adopt KCM or ULP.
> > > 
> > > My main concern about ULP is that only one upper layer protocol can be set
> > > up (at least as the code is now), so you wouldn't be able to do something
> > > like use in-kernel TLS over MPTCP. Other than that, it seems like a natural
> > > fit for MPTCP.
> > 
> > Do you think it would be feasible to make ULP support multiple ULPs?
> 
> I think it's possible to have a layered approach to ULP, maybe by providing
> a list of protocols to setsockopt. It would be nice to have some
> infrastructure to help implement it correctly, rather than depending on each
> ULP to correctly call into the layer below it, whether that's the "normal"
> TCP code or some other ULP. The combinatorics would be a headache for sure:
> running TLS over an MPTCP connection makes sense, but MPTCP layered on top
> of TLS doesn't make sense. At least if all the protocols were specified in
> one setsockopt we could limit which combinations are allowed.

Yes, MPTCP in this case would be special as it is so tightly coupled to TCP.

MPTCP is always the protocol that would sit at the bottom of the stack, I
think.


Christoph


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-02-28 23:44 Mat Martineau
  0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2018-02-28 23:44 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 9177 bytes --]

On Wed, 28 Feb 2018, Christoph Paasch wrote:

> On 27/02/18 - 10:50:38, Mat Martineau wrote:
>>
>> Hi Christoph,
>>
>> On Mon, 26 Feb 2018, Christoph Paasch wrote:
>>
>>> Hello,
>>>
>>> as for next steps after the submission of the TCP-option framework to netdev
>>> and DaveM's feedback on it.
>>>
>>> Even if the submission got rejected, I think we still have a very useful set
>>> of patches here. The need for a framework might pop up again in the future,
>>> and so these patches could come in handy.
>>> Mat, maybe you can put our latest submission on your kernel.org-git repo
>>> just so that we don't lose track of these patches?
>>
>> Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5
>
> Thanks!
>
>>> I can also create a github repo if you prefer that.
>>>
>>>
>>> As for DaveM's feedback, the main takeaway - as Mat already noted on his other
>>> mail - is that fast-path performance has the highest priority. Branching and
>>> indirect function calls are hardly accepted there.
>>>
>>>
>>> So, in that spirit I think we need to work towards reducing MPTCP's
>>> intrusiveness to the TCP stack.
>>>
>>> * Stop taking meta-lock when receiving subflow data (all the changes where
>>>  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
>>>  The reason we do this in today's implementation is that it allows access
>>>  to the meta data-structure at any point. If we stop taking the meta-lock,
>>>  a few things need to change:
>>>  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
>>>  2. Group the more intrusive accesses into a few select points in the
>>>     TCP-stack where we then take the meta-lock (e.g., when receiving data).
>>>     (this would be equivalent to having the TCP-option framework in place
>>>     - thus we need to move code to these or similar points in the stack)
>>>  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
>>>     lock-ordering issues (e.g., when we can't take the meta-lock because
>>>     it's already held by another thread).
>>>
>>>  I think the way to approach this is to work iteratively and start moving
>>>  code in such a way that accesses to the meta-socket are grouped together.
>>>
>>>  Also, we have a few callbacks that we added (cf. struct tcp_sock_ops) to
>>>  avoid duplicating code. Let's review those and see if we can get rid of
>>>  them. (As an example, .send_fin could be removed: it is only called from
>>>  tcp_shutdown, which is itself reached via the .shutdown callback in
>>>  tcp_prot - thus, if we expose a separate MPTCP socket-type with its own
>>>  struct proto, we can get rid of the .send_fin callback.)
>>>
>>
>> I think a separate MPTCP socket type will be important for upstream
>> acceptance. My team has been working on some code with this separate socket
>> type that we can share.
>
> Great! I would love to move MPTCP to a separate socket-type.
>
>> I'm thinking that it will be useful to share once a
>> connection can stay up without falling back to TCP.
>
> Hmm... I'm not sure I understand. What do you mean by "connection can stay
> up without falling back to TCP"?

Sorry, to clarify:

The code I have today does not process incoming DSS options or send a 
data ack, so the connection falls back to regular TCP after a couple of 
packets. It's based on net-next.

I would like to get my code to a point where it sends a correct data ack 
so the connection does not fall back, and at that point post the code on 
git.kernel.org and as a patch set on this list.
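
For reference, the piece that is missing on the wire is small. Below is a
minimal sketch (not code from my tree) of emitting a DSS option that
carries only a 32-bit data ACK, per RFC 6824: kind 30, length 8, subtype 2,
with the 'A' flag set. The helper name and calling context are
hypothetical.

/* Hypothetical helper: write a DSS option carrying only a 32-bit
 * data-level ACK (RFC 6824). First word: kind (8 bits) | length (8) |
 * subtype (4) | reserved+flags (12); flag bit 0 ('A') announces the
 * presence of the data ACK. */
static void mptcp_write_data_ack(__be32 *ptr, u32 data_ack)
{
	*ptr++ = htonl((30 << 24) |	/* kind: MPTCP */
		       (8 << 16)  |	/* length: 8 octets */
		       (2 << 12)  |	/* subtype: DSS */
		       0x1);		/* flags: A (data ACK present) */
	*ptr = htonl(data_ack);		/* the data-level ACK itself */
}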

>>
>>> * Investigate how/if we can make MPTCP adopt KCM or ULP.
>>
>> My main concern about ULP is that only one upper layer protocol can be set
>> up (at least as the code is now), so you wouldn't be able to do something
>> like use in-kernel TLS over MPTCP. Other than that, it seems like a natural
>> fit for MPTCP.
>
> Do you think it would be feasible to make ULP support multiple ULPs?

I think it's possible to have a layered approach to ULP, maybe by 
providing a list of protocols to setsockopt. It would be nice to have some 
infrastructure to help implement it correctly, rather than depending on 
each ULP to correctly call into the layer below it, whether that's the 
"normal" TCP code or some other ULP. The combinatorics would be a headache 
for sure: running TLS over an MPTCP connection makes sense, but MPTCP 
layered on top of TLS doesn't make sense. At least if all the protocols 
were specified in one setsockopt we could limit which combinations are 
allowed.
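
To make the setsockopt idea concrete, here is a userspace sketch: the first
call is the existing single-ULP interface (TCP_ULP, kernels >= 4.13); the
commented-out second form shows the hypothetical ordered-list variant
discussed above - it is not a real kernel interface.

#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_ULP
#define TCP_ULP 31	/* from linux/tcp.h */
#endif

static int attach_ulp(int fd)
{
	/* Existing interface: exactly one ULP per socket. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", strlen("tls")) < 0)
		return -1;

	/* Hypothetical layered form, listed bottom-to-top and validated
	 * in one place, so that e.g. "tls,mptcp" could be rejected:
	 *
	 *	setsockopt(fd, IPPROTO_TCP, TCP_ULP, "mptcp,tls",
	 *		   strlen("mptcp,tls"));
	 */
	return 0;
}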

>>
>> So far I've been looking at KCM as a source of good ideas rather than
>> something we could use directly. KCM uses SOCK_SEQPACKET or SOCK_DGRAM, but
>> maybe it could be extended to include SOCK_STREAM. Where MPTCP places DSS
>> mappings in the TCP options, KCM handles message boundaries within the data
>> stream - that made me ponder using XDP to place the DSS mappings in the data
>> payload (with the necessary TCP sequence number adjustments). I'm not sure
>> it's workable because it can be expensive to change the length of an
>> incoming skb and adjusting the acks gets complicated, but it's at least an
>> interesting thought experiment :)
>>
>>> * There is still the open question of the API, path-management,... Tessares
>>>  has some experience with that, so maybe they can provide some ideas here.
>>
>> We (at OTC) are working on a generic netlink proposal for path management as
>> well.
>>
>>>
>>> * The size of the skb. Well, we have been discussing this for quite a while :)
>>>  One option is always to have a lookup table as they do for the
>>>  TLS-records. That will hurt performance, but at least it's a step forward.
>>>  And we have a bunch of other ideas that might be worth exploring as well.
>>>  If I'm not mistaken, Rao had an approach that could work as well, right?
>>
>> This is what I'm working on now. For outgoing packets, I have a way to
>> optionally allocate sk_buffs with extra control block space. For incoming
>> packets, my initial experiment is with preventing packet coalesce/collapse
>> so TCP options are still in the skb headroom. I don't consider that a
>> long-term solution, though. Some kind of lookup table will probably be
>> needed.
>>
>>> Any other comments, suggestions,...? :-)
>>
>> I had these thoughts on evolving the multipath-tcp.org kernel fork last
>> summer (excerpt from
>> https://lists.01.org/pipermail/mptcp/2017-July/000064.html), which I think
>> are still relevant:
>>
>> """
>>
>> One approach is to attempt to merge the multipath-tcp.org fork. This is an
>> implementation in which the multipath-tcp.org community has invested a lot
>> of time and effort, and it is in production for major applications (see
>> https://tools.ietf.org/html/rfc8041). This is a tremendous amount of code to
>> review at once (even separating out modules), and currently doesn't fit with
>> what the maintainers have asked for (non-intrusive, sk_buff size, MPTCP by
>> default). I don't think the maintainers would consider merging such an
>> extensive piece of git history, especially where there are a fair number of
>> commits without an "mptcp:" label on the subject line or without a DCO
>> signoff (https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
>> Today, the fork is at kernel v4.4 and current upstream development is at
>> v4.13-rc1, so the fork would have to catch up and stay current. (2018 note:
>> Christoph has merged up to more recent kernels now)
>>
>> The other extreme is to rewrite from scratch. This would allow incremental
>> development with maintainer review from the start, but doesn't take
>> advantage of existing code.
>>
>> The most realistic approach is somewhere in between, where we write new
>> code that fits maintainer expectations and utilize components from the
>> fork where licensing allows and the code fits. We'll have to find the
>> right balance: over-reliance on new code could take extra time, but
>> constantly reworking the fork and keeping it up-to-date with net-next is
>> also a lot of overhead.
>>
>> """
>>
>> Gregory and Matthieu, do you have any thoughts on where the right balance is
>> on evolving the fork vs. adding new code?
>>
>>
>>> On my side, as a first concrete step, I will work towards lockless subflow
>>> establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
>>> when the socket-lookup matches on a request-socket. Now that TCP supports
>>> lockless listeners, MPTCP should do that as well.
>>
>> I'll work on getting my team's MPTCP socket type code posted to
>> git.kernel.org, and getting our generic netlink proposal posted to this
>> list.
>
> Cool! I think the MPTCP socket-type code should also find its way into
> mptcp-dev. It would allow us to move that one forward as well.

Thanks. It's a fairly deep change from the metasocket architecture, so we 
will see how it works out.

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-02-28 22:41 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-02-28 22:41 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 9351 bytes --]

On 28/02/18 - 18:01:20, Matthieu Baerts wrote:
> Hi Mat, Christoph,
> 
> On Tue, Feb 27, 2018 at 7:50 PM, Mat Martineau <
> mathew.j.martineau(a)linux.intel.com> wrote:
> 
> >
> > Hi Christoph,
> >
> > On Mon, 26 Feb 2018, Christoph Paasch wrote:
> >
> > Hello,
> >>
> >> as for next steps after the submission of the TCP-option framework to
> >> netdev
> >> and DaveM's feedback on it.
> >>
> >> Even if the submission got rejected, I think we still have a very useful
> >> set
> >> of patches here. The need for a framework might pop up again in the
> >> future,
> >> and so these patches could come in handy.
> >> Mat, maybe you can put our latest submission on your kernel.org-git repo
> >> just so that we don't lose track of these patches?
> >>
> >
> > Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5
> 
> 
> By chance, did you get any feedback from people interested in this
> framework?
> Or even from Eric Dumazet, who wanted to clean up the TCP MD5 part, if I am
> not mistaken. It is maybe not what he wanted.

Yeah, feedback was disappointingly scarce :/

It probably was not exactly what he had in mind, as he didn't talk about a
TCP-option framework but simply about moving the TCP-MD5 code out of the
TCP-stack. So, with our proposal we went a bit further.

> 
> * There is still the open question of the API, path-management,... Tessares
> >>  has some experience with that, so maybe they can provide some ideas here.
> >>
> >
> > We (at OTC) are working on a generic netlink proposal for path management
> > as well.
> >
> 
> At Tessares, we were going to propose a new PM supporting Netlink for
> management from userspace. There is a short paper describing it:
> http://www.tessares.net/tessares-founders-win-an-ietf-award/
> 
> Even if we think that having a PM customisable via BPF could be even
> better, this PM still has some value. It is mature and should be ready to
> be shared, but we first need to rebase it on the latest kernel. We can
> discuss that in a separate email if you want :)
> 
> Note that the PDF of the short paper is also available there:
> http://www.tessares.net/wp-content/uploads/2016/06/SMAPP-Towards-Smart-Multipath-TCP-enabled-APPlications.pdf
> 
> 
> > * The size of the skb. Well, we have been discussing this for quite a
> >> while :)
> >>  One option is always to have a lookup table as they do for the
> >>  TLS-records. That will hurt performance, but at least it's a step
> >> forward.
> >>  And we have a bunch of other ideas that might be worth exploring as well.
> >>  If I'm not mistaken, Rao had an approach that could work as well, right?
> >>
> >
> > This is what I'm working on now. For outgoing packets, I have a way to
> > optionally allocate sk_buffs with extra control block space. For incoming
> > packets, my initial experiment is with preventing packet coalesce/collapse
> > so TCP options are still in the skb headroom. I don't consider that a
> > long-term solution, though. Some kind of lookup table will probably be
> > needed.
> >
> 
> That looks interesting!
> 
> On some systems with hardware acceleration to bypass the main CPU for
> established connections, I guess they also have to modify the skb to share
> info between the main CPU and another component.
> Do you have any idea how they do that? Increasing the size of the skb is
> maybe not a real problem for them if most of the traffic goes through the
> "fast path".

TLS is a good example. They actually do have a lookup table: the driver
calls back into the TCP-stack to get the TLS-record information.

We discussed this with the people from Mellanox who proposed TLS hardware
offloading: https://lists.01.org/pipermail/mptcp/2017-November/000165.html

> > Any other comments, suggestions,...? :-)
> >>
> >
> > I had these thoughts on evolving the multipath-tcp.org kernel fork last
> > summer (excerpt from https://lists.01.org/pipermail/mptcp/2017-July/000064.html),
> > which I think are still relevant:
> >
> > """
> >
> > One approach is to attempt to merge the multipath-tcp.org fork. This is
> > an implementation in which the multipath-tcp.org community has invested a
> > lot of time and effort, and it is in production for major applications (see
> > https://tools.ietf.org/html/rfc8041). This is a tremendous amount of code
> > to review at once (even separating out modules), and currently doesn't fit
> > with what the maintainers have asked for (non-intrusive, sk_buff size,
> > MPTCP by default). I don't think the maintainers would consider merging
> > such an extensive piece of git history, especially where there are a fair
> > number of commits without an "mptcp:" label on the subject line or without
> > a DCO signoff (https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
> > Today,
> > the fork is at kernel v4.4 and current upstream development is at
> > v4.13-rc1, so the fork would have to catch up and stay current. (2018 note:
> > Christoph has merged up to more recent kernels now)
> >
> > The other extreme is to rewrite from scratch. This would allow incremental
> > development with maintainer review from the start, but doesn't take
> > advantage of existing code.
> >
> > The most realistic approach is somewhere in between, where we write new
> > code that fits maintainer expectations and utilize components from the
> > fork where licensing allows and the code fits. We'll have to find the
> > right balance: over-reliance on new code could take extra time, but
> > constantly reworking the fork and keeping it up-to-date with net-next is
> > also a lot of overhead.
> >
> > """
> >
> > Gregory and Matthieu, do you have any thoughts on where the right balance
> > is on evolving the fork vs. adding new code?
> >
> 
> We think we don't need to rewrite it from scratch. The MPTCP code is
> substantial and complex, but a rewrite will not simplify it much if we want
> to keep supporting middleboxes, fallback to TCP, etc. Most of the MPTCP
> code is therefore reusable, and some of it hasn't really changed in years.
> 
> We clearly understand that the current implementation cannot go upstream.
> It is not just a question of git history but mostly the fact that the code
> is too intrusive, as you both said. We agree that this doesn't need a full
> rewrite but some adaptations from both the upstream side and the current
> MPTCP implementation.
> 
> Sorry for not really offering new thoughts, but we agree with all the
> points Christoph made :-)
> 
> What we also think is that this upstreaming job is not easy. The best
> approach is certainly to organise it well and prepare questions/RFCs for
> netdev as a team.
> I am personally not used to working on big changes spanning different
> organisations, but we think it would be helpful to organise regular
> (virtual) meetings to share plans of what can be done by whom.

Yes, that would be good! Just a short weekly sync could be useful.

Do you have a Webex or other means to set this up?


Looking forward to the netlink PM! :-)


Christoph


> 
> Unlike Gregory, I don't have the same deep knowledge of the overall Linux
> network stack that you all have here, but I would be happy to help improve
> the organisation if that allows me to contribute efficiently here :)
> 
> On my side, as a first concrete step, I will work towards lockless subflow
> >> establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
> >> when the socket-lookup matches on a request-socket. Now that TCP supports
> >> lockless listeners, MPTCP should do that as well.
> >>
> >
> > I'll work on getting my team's MPTCP socket type code posted to
> > git.kernel.org, and getting our generic netlink proposal posted to this
> > list.
> >
> 
> We can also work on our side to accelerate the sharing of our netlink PM.
> 
> Best regards,
> Matt, ok Matthieu :-)
> 
> -- 
> Matthieu Baerts | R&D Engineer
> matthieu.baerts(a)tessares.net
> Tessares SA | Hybrid Access Solutions
> www.tessares.net
> 1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-02-28 17:01 Matthieu Baerts
  0 siblings, 0 replies; 15+ messages in thread
From: Matthieu Baerts @ 2018-02-28 17:01 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 8177 bytes --]

Hi Mat, Christoph,

On Tue, Feb 27, 2018 at 7:50 PM, Mat Martineau <
mathew.j.martineau(a)linux.intel.com> wrote:

>
> Hi Christoph,
>
> On Mon, 26 Feb 2018, Christoph Paasch wrote:
>
> Hello,
>>
>> as for next steps after the submission of the TCP-option framework to
>> netdev
>> and DaveM's feedback on it.
>>
>> Even if the submission got rejected, I think we still have a very useful
>> set
>> of patches here. The need for a framework might pop up again in the
>> future,
>> and so these patches could come in handy.
>> Mat, maybe you can put our latest submission on your kernel.org-git repo
>> just so that we don't lose track of these patches?
>>
>
> Done: https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5


By chance, did you get any feedback from people interested in this
framework?
Or even from Eric Dumazet, who wanted to clean up the TCP MD5 part, if I am
not mistaken. It is maybe not what he wanted.

* There is still the open question of the API, path-management,... Tessares
>>  has some experience with that, so maybe they can provide some ideas here.
>>
>
> We (at OTC) are working on a generic netlink proposal for path management
> as well.
>

At Tessares, we were going to propose a new PM supporting Netlink for
management from userspace. There is a short paper describing it:
http://www.tessares.net/tessares-founders-win-an-ietf-award/

Even if we think that having a PM customisable via BPF could be even
better, this PM still has some value. It is mature and should be ready to
be shared, but we first need to rebase it on the latest kernel. We can
discuss that in a separate email if you want :)

Note that the PDF of the short paper is also available there:
http://www.tessares.net/wp-content/uploads/2016/06/SMAPP-Towards-Smart-Multipath-TCP-enabled-APPlications.pdf


> * The size of the skb. Well, we have been discussing this for quite a
>> while :)
>>  One option is always to have a lookup table as they do for the
>>  TLS-records. That will hurt performance, but at least it's a step
>> forward.
>>  And we have a bunch of other ideas that might be worth exploring as well.
>>  If I'm not mistaken, Rao had an approach that could work as well, right?
>>
>
> This is what I'm working on now. For outgoing packets, I have a way to
> optionally allocate sk_buffs with extra control block space. For incoming
> packets, my initial experiment is with preventing packet coalesce/collapse
> so TCP options are still in the skb headroom. I don't consider that a
> long-term solution, though. Some kind of lookup table will probably be
> needed.
>

That looks interesting!

On some systems with hardware acceleration to bypass the main CPU for
established connections, I guess they also have to modify the skb to share
info between the main CPU and another component.
Do you have any idea how they do that? Increasing the size of the skb is
maybe not a real problem for them if most of the traffic goes through the
"fast path".


> Any other comments, suggestions,...? :-)
>>
>
> I had these thoughts on evolving the multipath-tcp.org kernel fork last
summer (excerpt from https://lists.01.org/pipermail/mptcp/2017-July/000064.html),
which I think are still relevant:
>
> """
>
> One approach is to attempt to merge the multipath-tcp.org fork. This is
> an implementation in which the multipath-tcp.org community has invested a
> lot of time and effort, and it is in production for major applications (see
> https://tools.ietf.org/html/rfc8041). This is a tremendous amount of code
> to review at once (even separating out modules), and currently doesn't fit
> with what the maintainers have asked for (non-intrusive, sk_buff size,
> MPTCP by default). I don't think the maintainers would consider merging
> such an extensive piece of git history, especially where there are a fair
> number of commits without an "mptcp:" label on the subject line or without
> a DCO signoff (https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
> Today,
> the fork is at kernel v4.4 and current upstream development is at
> v4.13-rc1, so the fork would have to catch up and stay current. (2018 note:
> Christoph has merged up to more recent kernels now)
>
> The other extreme is to rewrite from scratch. This would allow incremental
> development with maintainer review from the start, but doesn't take
> advantage of existing code.
>
> The most realistic approach is somewhere in between, where we write new
> code that fits maintainer expectations and utilize components from the
> fork where licensing allows and the code fits. We'll have to find the
> right balance: over-reliance on new code could take extra time, but
> constantly reworking the fork and keeping it up-to-date with net-next is
> also a lot of overhead.
>
> """
>
> Gregory and Matthieu, do you have any thoughts on where the right balance
> is on evolving the fork vs. adding new code?
>

We think we don't need to rewrite it from scratch. The MPTCP code is
substantial and complex, but a rewrite will not simplify it much if we want
to keep supporting middleboxes, fallback to TCP, etc. Most of the MPTCP
code is therefore reusable, and some of it hasn't really changed in years.

We clearly understand that the current implementation cannot go upstream.
It is not just a question of git history but mostly the fact that the code
is too intrusive, as you both said. We agree that this doesn't need a full
rewrite but some adaptations from both the upstream side and the current
MPTCP implementation.

Sorry for not really offering new thoughts, but we agree with all the
points Christoph made :-)

What we also think is that this upstreaming job is not easy. The best
approach is certainly to organise it well and prepare questions/RFCs for
netdev as a team.
I am personally not used to working on big changes spanning different
organisations, but we think it would be helpful to organise regular
(virtual) meetings to share plans of what can be done by whom.

Unlike Gregory, I don't have the same deep knowledge of the overall Linux
network stack that you all have here, but I would be happy to help improve
the organisation if that allows me to contribute efficiently here :)

On my side, as a first concrete step, I will work towards lockless subflow
>> establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
>> when the socket-lookup matches on a request-socket. Now that TCP supports
>> lockless listeners, MPTCP should do that as well.
>>
>
> I'll work on getting my team's MPTCP socket type code posted to
> git.kernel.org, and getting our generic netlink proposal posted to this
> list.
>

We can also work on our side to accelerate the sharing of our netlink PM.

Best regards,
Matt, ok Matthieu :-)

-- 
Matthieu Baerts | R&D Engineer
matthieu.baerts(a)tessares.net
Tessares SA | Hybrid Access Solutions
www.tessares.net
1 Avenue Jean Monnet, 1348 Louvain-la-Neuve, Belgium


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 12795 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [MPTCP] Next steps discussion
@ 2018-02-27 18:50 Mat Martineau
  0 siblings, 0 replies; 15+ messages in thread
From: Mat Martineau @ 2018-02-27 18:50 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 7254 bytes --]


Hi Christoph,

On Mon, 26 Feb 2018, Christoph Paasch wrote:

> Hello,
>
> as for next steps after the submission of the TCP-option framework to netdev
> and DaveM's feedback on it.
>
> Even if the submission got rejected, I think we still have a very useful set
> of patches here. The need for a framework might pop up again in the future,
> and so these patches could come in handy.
> Mat, maybe you can put our latest submission on your kernel.org-git repo
> just so that we don't lose track of these patches?

Done: 
https://git.kernel.org/pub/scm/linux/kernel/git/martineau/linux.git/log/?h=md5


> I can also create a github repo if you prefer that.
>
>
> As for DaveM's feedback, the main takeaway - as Mat already noted on his other
> mail - is that fast-path performance has the highest priority. Branching and
> indirect function calls are hardly accepted there.
>
>
> So, in that spirit I think we need to work towards reducing MPTCP's
> intrusiveness to the TCP stack.
>
> * Stop taking meta-lock when receiving subflow data (all the changes where
>  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
>  The reason we do this in today's implementation is that it allows access
>  to the meta data-structure at any point. If we stop taking the meta-lock,
>  a few things need to change:
>  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
>  2. Group the more intrusive accesses into a few select points in the
>     TCP-stack where we then take the meta-lock (e.g., when receiving data).
>     (this would be equivalent to having the TCP-option framework in place
>     - thus we need to move code to these or similar points in the stack)
>  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
>     lock-ordering issues (e.g., when we can't take the meta-lock because
>     it's already held by another thread).
>
>  I think the way to approach this is to work iteratively and start moving
>  code in such a way that accesses to the meta-socket are grouped together.
>
>  Also, we have a few callbacks that we added (cf. struct tcp_sock_ops) to
>  avoid duplicating code. Let's review those and see if we can get rid of
>  them. (As an example, .send_fin could be removed: it is only called from
>  tcp_shutdown, which is itself reached via the .shutdown callback in
>  tcp_prot - thus, if we expose a separate MPTCP socket-type with its own
>  struct proto, we can get rid of the .send_fin callback.)
>

I think a separate MPTCP socket type will be important for upstream 
acceptance. My team has been working on some code with this separate 
socket type that we can share. I'm thinking that it will be useful to 
share once a connection can stay up without falling back to TCP.

> * Investigate how/if we can make MPTCP adopt KCM or ULP.

My main concern about ULP is that only one upper layer protocol can be set 
up (at least as the code is now), so you wouldn't be able to do something 
like use in-kernel TLS over MPTCP. Other than that, it seems like a 
natural fit for MPTCP.

So far I've been looking at KCM as a source of good ideas rather than 
something we could use directly. KCM uses SOCK_SEQPACKET or SOCK_DGRAM, 
but maybe it could be extended to include SOCK_STREAM. Where MPTCP places 
DSS mappings in the TCP options, KCM handles message boundaries within the 
data stream - that made me ponder using XDP to place the DSS mappings in 
the data payload (with the necessary TCP sequence number adjustments). I'm 
not sure it's workable because it can be expensive to change the length of 
an incoming skb and adjusting the acks gets complicated, but it's at least 
an interesting thought experiment :)

> * There is still the open question of the API, path-management,... Tessares
>  has some experience with that, so maybe they can provide some ideas here.

We (at OTC) are working on a generic netlink proposal for path management 
as well.
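
As a rough illustration of the shape this could take (all names below are
invented for illustration, not our actual proposal), a path manager driven
from userspace might register a generic netlink family like this:

#include <linux/module.h>
#include <net/genetlink.h>

enum {
	MPTCP_PM_CMD_ADD_ADDR,	/* announce a new local address */
	MPTCP_PM_CMD_DEL_ADDR,	/* withdraw it */
};

static int mptcp_pm_add_addr(struct sk_buff *skb, struct genl_info *info)
{
	return 0;	/* parse attributes, trigger ADD_ADDR here */
}

static int mptcp_pm_del_addr(struct sk_buff *skb, struct genl_info *info)
{
	return 0;	/* parse attributes, trigger REMOVE_ADDR here */
}

static const struct genl_ops mptcp_pm_ops[] = {
	{ .cmd = MPTCP_PM_CMD_ADD_ADDR, .doit = mptcp_pm_add_addr },
	{ .cmd = MPTCP_PM_CMD_DEL_ADDR, .doit = mptcp_pm_del_addr },
};

static struct genl_family mptcp_pm_family = {
	.name	 = "mptcp_pm",
	.version = 1,
	.ops	 = mptcp_pm_ops,
	.n_ops	 = ARRAY_SIZE(mptcp_pm_ops),
	.module	 = THIS_MODULE,
};

Calling genl_register_family(&mptcp_pm_family) at module init would then
expose the commands to a userspace daemon.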

>
> * The size of the skb. Well, we have been discussing this for quite a while :)
>  One option is always to have a lookup table as they do for the
>  TLS-records. That will hurt performance, but at least it's a step forward.
>  And we have a bunch of other ideas that might be worth exploring as well.
>  If I'm not mistaken, Rao had an approach that could work as well, right?

This is what I'm working on now. For outgoing packets, I have a way to 
optionally allocate sk_buffs with extra control block space. For incoming 
packets, my initial experiment is with preventing packet coalesce/collapse
so TCP options are still in the skb headroom. I don't consider that a 
long-term solution, though. Some kind of lookup table will probably be 
needed.
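
As a sketch of the lookup-table direction (loosely modelled on how TLS
offload keeps record info outside the skb; every name below is
hypothetical, and locking is omitted), per-subflow DSS mappings could be
stored keyed by the subflow sequence number at which each mapping starts:

#include <linux/hashtable.h>

struct mptcp_dss_map {
	struct hlist_node node;
	u32 ssn;	/* subflow sequence where the mapping starts */
	u16 len;	/* bytes covered by this mapping */
	u64 data_seq;	/* data-level sequence for the same bytes */
};

static DEFINE_HASHTABLE(dss_maps, 8);

static void dss_map_store(struct mptcp_dss_map *map)
{
	hash_add(dss_maps, &map->node, map->ssn);
}

/* Assumes lookups happen on mapping boundaries; a tree keyed on ssn
 * would be needed to find the mapping covering an arbitrary seq. */
static struct mptcp_dss_map *dss_map_find(u32 ssn)
{
	struct mptcp_dss_map *map;

	hash_for_each_possible(dss_maps, map, node, ssn)
		if (map->ssn == ssn)
			return map;
	return NULL;
}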

> Any other comments, suggestions,...? :-)

I had these thoughts on evolving the multipath-tcp.org kernel fork last 
summer (excerpt from 
https://lists.01.org/pipermail/mptcp/2017-July/000064.html), which I think 
are still relevant:

"""

One approach is to attempt to merge the multipath-tcp.org fork. This is an 
implementation in which the multipath-tcp.org community has invested a lot 
of time and effort, and it is in production for major applications (see 
https://tools.ietf.org/html/rfc8041). This is a tremendous amount of code 
to review at once (even separating out modules), and currently doesn't fit 
with what the maintainers have asked for (non-intrusive, sk_buff size, 
MPTCP by default). I don't think the maintainers would consider merging 
such an extensive piece of git history, especially where there are a fair 
number of commits without an "mptcp:" label on the subject line or without 
a DCO signoff 
(https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin). 
Today, the fork is at kernel v4.4 and current upstream development is at 
v4.13-rc1, so the fork would have to catch up and stay current. (2018 
note: Christoph has merged up to more recent kernels now)

The other extreme is to rewrite from scratch. This would allow incremental
development with maintainer review from the start, but doesn't take
advantage of existing code.

The most realistic approach is somewhere in between, where we write new
code that fits maintainer expectations and utilize components from the
fork where licensing allows and the code fits. We'll have to find the
right balance: over-reliance on new code could take extra time, but
constantly reworking the fork and keeping it up-to-date with net-next is
also a lot of overhead.

"""

Gregory and Matthieu, do you have any thoughts on where the right balance 
is on evolving the fork vs. adding new code?


> On my side, as a first concrete step, I will work towards lockless subflow
> establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
> when the socket-lookup matches on a request-socket. Now that TCP supports
> lockless listeners, MPTCP should do that as well.

I'll work on getting my team's MPTCP socket type code posted to 
git.kernel.org, and getting our generic netlink proposal posted to this 
list.


Thanks,

--
Mat Martineau
Intel OTC

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [MPTCP] Next steps discussion
@ 2018-02-26  9:57 Christoph Paasch
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Paasch @ 2018-02-26  9:57 UTC (permalink / raw)
  To: mptcp

[-- Attachment #1: Type: text/plain, Size: 3265 bytes --]

Hello,

as for next steps after the submission of the TCP-option framework to netdev
and DaveM's feedback on it.

Even if the submission got rejected, I think we still have a very useful set
of patches here. The need for a framework might pop up again in the future,
and so these patches could come in handy.
Mat, maybe you can put our latest submission on your kernel.org-git repo
just so that we don't lose track of these patches? I can also create a
github repo if you prefer that.


As for DaveM's feedback, the main takeaway - as Mat already noted on his other
mail - is that fast-path performance has the highest priority. Branching and
indirect function calls are hardly accepted there.


So, in that spirit I think we need to work towards reducing MPTCP's
intrusiveness to the TCP stack.

* Stop taking meta-lock when receiving subflow data (all the changes where
  we check for mptcp(tp) and then do bh_lock_sock(meta_sk)).
  The reason we do this in today's implementation is that it allows access
  to the meta data-structure at any point. If we stop taking the meta-lock,
  a few things need to change:
  1. Do lockless accesses for selected fields (e.g., for the DATA_ACK).
  2. Group the more intrusive accesses into a few select points in the
     TCP-stack where we then take the meta-lock (e.g., when receiving data).
     (this would be equivalent to having the TCP-option framework in place
     - thus we need to move code to these or similar points in the stack)
  3. Sometimes schedule work-queues when we need to avoid deadlocks due to
     lock-ordering issues (e.g., when we can't take the meta-lock because
     it's already held by another thread).

  I think the way to approach this is to work iteratively and start moving
  code in such a way that accesses to the meta-socket are grouped together.
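
  To illustrate item 1 above with a minimal sketch (the meta-socket type
  and field names are assumptions, not existing code):

	/* Lockless DATA_ACK update: annotated accesses instead of
	 * bh_lock_sock(meta_sk). Assumes a single writer context per
	 * meta-socket; concurrent writers would need a cmpxchg loop.
	 * The data-level sequence space is 64 bits, so a plain
	 * comparison is acceptable for the sketch. */
	static void mptcp_update_snd_una(struct mptcp_meta_sock *meta,
					 u64 data_ack)
	{
		/* Pairs with READ_ONCE() in readers of meta->snd_una. */
		if (data_ack > READ_ONCE(meta->snd_una))
			WRITE_ONCE(meta->snd_una, data_ack);
	}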

  Also, we have a few callbacks that we added (cf. struct tcp_sock_ops) to
  avoid duplicating code. Let's review those and see if we can get rid of
  them. (As an example, .send_fin could be removed: it is only called from
  tcp_shutdown, which is itself reached via the .shutdown callback in
  tcp_prot - thus, if we expose a separate MPTCP socket-type with its own
  struct proto, we can get rid of the .send_fin callback.)
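
  To make the .send_fin example concrete, a sketch of what the separate
  socket-type could look like (mptcp_shutdown and mptcp_sendmsg are
  assumed names, not existing symbols):

	#include <net/sock.h>

	static void mptcp_shutdown(struct sock *sk, int how); /* sends DATA_FIN */
	static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg,
				 size_t len);

	/* With MPTCP as its own protocol, tcp_prot's .shutdown no longer
	 * needs to detour through a .send_fin callback: mptcp_shutdown()
	 * emits the DATA_FIN itself. */
	static struct proto mptcp_prot = {
		.name		= "MPTCP",
		.owner		= THIS_MODULE,
		.shutdown	= mptcp_shutdown,
		.sendmsg	= mptcp_sendmsg,
		/* ...remaining ops mirroring tcp_prot... */
	};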

* Investigate how/if we can make MPTCP adopt KCM or ULP.

* There is still the open question of the API, path-management,... Tessares
  has some experience with that, so maybe they can provide some ideas here.

* The size of the skb. Well, we have been discussing this for quite a while :)
  One option is always to have a lookup table as they do for the
  TLS-records. That will hurt performance, but at least it's a step forward.
  And we have a bunch of other ideas that might be worth exploring as well.
  If I'm not mistaken, Rao had an approach that could work as well, right?


Any other comments, suggestions,...? :-)


On my side, as a first concrete step, I will work towards lockless subflow
establishment. In tcp_v4_rcv, we are currently taking the meta-level lock
when the socket-lookup matches on a request-socket. Now that TCP supports
lockless listeners, MPTCP should do that as well.


Cheers,
Christoph


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-03-05 10:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28 23:08 [MPTCP] Next steps discussion Christoph Paasch
  -- strict thread matches above, loose matches on Subject: below --
2018-03-05 10:25 Matthieu Baerts
2018-03-02 17:57 Christoph Paasch
2018-03-02 15:37 Mat Martineau
2018-03-02 15:03 Matthieu Baerts
2018-03-01 20:05 Mat Martineau
2018-03-01 19:38 Christoph Paasch
2018-03-01 19:37 Mat Martineau
2018-03-01 17:05 Matthieu Baerts
2018-03-01  0:44 Christoph Paasch
2018-02-28 23:44 Mat Martineau
2018-02-28 22:41 Christoph Paasch
2018-02-28 17:01 Matthieu Baerts
2018-02-27 18:50 Mat Martineau
2018-02-26  9:57 Christoph Paasch
