* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-05-29 20:10 Florian Westphal
0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2020-05-29 20:10 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]
Christoph Paasch <cpaasch(a)apple.com> wrote:
> > After:
> > ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
> > ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5429ms) [ OK ]
> > ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5418ms) [ OK ]
> > ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5423ms) [ OK ]
> > ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5715ms) [ OK ]
> > ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5415ms) [ OK ]
> > Time: 275 seconds
> >
> > Signed-off-by: Florian Westphal <fw(a)strlen.de>
> > ---
> Need to also handle the fallback to TCP-case. Otherwise there is a
> divide-by-0:
>
> server login: [ 1998.037445] divide error: 0000 [#1] SMP KASAN PTI
> [ 1998.038321] CPU: 3 PID: 1671 Comm: http Kdump: loaded Not tainted 5.7.0-rc6.mptcp #43
> [ 1998.039599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [ 1998.041526] RIP: 0010:mptcp_recvmsg+0x6ae/0xa40
> [ 1998.042370] Code: 44 89 e0 89 da 4c 8b 44 24 08 45 8d b4 24 40 03 00 00 c1 e0 04 48 8d 0c 50 89 d8 49 8d b8 60 04 00 00 31 d2 29 e8 48 0f af c1 <48> f7 f5 4c 8d 2c 41 e8 66 6b 6a ff 4c 8b 44 24 08 41 8b a8 64
> [ 1998.045394] RSP: 0018:ffff8881174bf968 EFLAGS: 00010206
> [ 1998.046237] RAX: 0000000140764000 RBX: 000000000000b500 RCX: 000000000001c540
> [ 1998.047397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8291b460
> [ 1998.048564] RBP: 0000000000000000 R08: ffffffff8291b000 R09: ffffffff82f1d0c8
> [ 1998.049724] R10: ffffffff82f1d0c3 R11: fffffbfff05e3a18 R12: 00000000000005b4
> [ 1998.050872] R13: 000000000022b368 R14: 00000000000008f4 R15: ffff888118740000
> [ 1998.052059] FS: 00007f081dd1ea40(0000) GS:ffff88811b980000(0000) knlGS:0000000000000000
> [ 1998.053357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1998.054189] CR2: 000055b5364a7000 CR3: 0000000115d5e000 CR4: 00000000000006e0
> [ 1998.055231] Call Trace:
> [ 1998.057574] inet_recvmsg+0x207/0x220
> [ 1998.058832] sock_read_iter+0x1fe/0x230
> [ 1998.060656] new_sync_read+0x33a/0x350
> [ 1998.063934] vfs_read+0xbc/0x1b0
> [ 1998.064466] ksys_read+0x11b/0x150
> [ 1998.066975] do_syscall_64+0xbc/0x790
> [ 1998.070940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
WTF? How can this happen? All TCP -> MPTCP tests passed for me.
This would mean we call mptcp_recvmsg *without* gping through
either mptcp_finish_connect or mptcp_accept?
> @@ -259,8 +259,10 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
> MPTCP_MIB_MPCAPABLEACTIVEFALLBACK);
> }
>
> - if (mptcp_check_fallback(sk))
> + if (mptcp_check_fallback(sk)) {
> + mptcp_rcv_space_init(mptcp_sk(parent), sk);
> return;
> + }
Oh, wait. That code is not present in net-next.
In that case I will wait until this has propagated to net-next; i assume
I would see the crash with the self test too.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-06-03 15:45 Christoph Paasch
0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch @ 2020-06-03 15:45 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]
Hello Davide,
On 06/03/20 - 17:08, Davide Caratti wrote:
> On Wed, 2020-06-03 at 16:39 +0200, Florian Westphal wrote:
> > Davide Caratti <dcaratti(a)redhat.com> wrote:
> > > hello Florian,
> > >
> > > net-next will not be accepting patches for a while - but I would like to
> > > fix this on some tree, so that periodic tests can be run over there. So, I
> > > will rebase the fallback rework on top of the rx buffer auto-tuning patch.
> >
> > Do you plan to submit your patches to net or net-next when it reopens?
>
> net-next, because the patch doesn't change functionality (moreover, the rx
> support for infinte maps is splitted into another patch). It just
> "improves" fallback to tcp.
>
> > My plan wrt. autotuning is to send the selftest patch for net, and the
> > autotune one for net-next. So, to me it would make more sense for your
> > work to go into mptcp-next first. I would then apply the delta from
> > Christoph and send a new version for mptcp-next.
>
> ok. Then, I just need to re-send [1] targeting mptcp-next (or Matthieu can
> change it with pwclient?), and let it settle for some days while reviews /
> further troubles occur. Correct?
I don't know if you saw this one, but your patch seems to reveal another
issue when MPTCP tries to connect to itself
(https://github.com/multipath-tcp/mptcp_net-next/issues/35).
AFAICS, it does not trigger because of your change, but rather just triggers
the WARN you added in subflow_data_ready.
Christoph
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-06-03 15:32 Florian Westphal
0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2020-06-03 15:32 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]
Davide Caratti <dcaratti(a)redhat.com> wrote:
> On Wed, 2020-06-03 at 16:39 +0200, Florian Westphal wrote:
> > Davide Caratti <dcaratti(a)redhat.com> wrote:
> > > hello Florian,
> > >
> > > net-next will not be accepting patches for a while - but I would like to
> > > fix this on some tree, so that periodic tests can be run over there. So, I
> > > will rebase the fallback rework on top of the rx buffer auto-tuning patch.
> >
> > Do you plan to submit your patches to net or net-next when it reopens?
>
> net-next, because the patch doesn't change functionality (moreover, the rx
> support for infinte maps is splitted into another patch). It just
> "improves" fallback to tcp.
Ah, ok. So ordering doesn't matter, good!
> > My plan wrt. autotuning is to send the selftest patch for net, and the
> > autotune one for net-next. So, to me it would make more sense for your
> > work to go into mptcp-next first. I would then apply the delta from
> > Christoph and send a new version for mptcp-next.
>
> ok. Then, I just need to re-send [1] targeting mptcp-next (or Matthieu can
> change it with pwclient?), and let it settle for some days while reviews /
> further troubles occur. Correct?
Works for me. Matthieu, can you apply Davides patches?
I will then fix up the crash issue with the autotune patch.
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-06-03 15:08 Davide Caratti
0 siblings, 0 replies; 8+ messages in thread
From: Davide Caratti @ 2020-06-03 15:08 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 1240 bytes --]
On Wed, 2020-06-03 at 16:39 +0200, Florian Westphal wrote:
> Davide Caratti <dcaratti(a)redhat.com> wrote:
> > hello Florian,
> >
> > net-next will not be accepting patches for a while - but I would like to
> > fix this on some tree, so that periodic tests can be run over there. So, I
> > will rebase the fallback rework on top of the rx buffer auto-tuning patch.
>
> Do you plan to submit your patches to net or net-next when it reopens?
net-next, because the patch doesn't change functionality (moreover, the rx
support for infinte maps is splitted into another patch). It just
"improves" fallback to tcp.
> My plan wrt. autotuning is to send the selftest patch for net, and the
> autotune one for net-next. So, to me it would make more sense for your
> work to go into mptcp-next first. I would then apply the delta from
> Christoph and send a new version for mptcp-next.
ok. Then, I just need to re-send [1] targeting mptcp-next (or Matthieu can
change it with pwclient?), and let it settle for some days while reviews /
further troubles occur. Correct?
--
davide
[1] https://patchwork.ozlabs.org/project/mptcp/patch/dda3f31d39426bd93b8d5ac26f132c6630fcb815.1590740223.git.dcaratti(a)redhat.com/
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-06-03 14:39 Florian Westphal
0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2020-06-03 14:39 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 822 bytes --]
Davide Caratti <dcaratti(a)redhat.com> wrote:
> hello Florian,
>
> net-next will not be accepting patches for a while - but I would like to
> fix this on some tree, so that periodic tests can be run over there. So, I
> will rebase the fallback rework on top of the rx buffer auto-tuning patch.
Do you plan to submit your patches to net or net-next when it reopens?
If you are targetting net, it would be better to apply your patches to
the current mptcp-next tree as-is, so they don't need to be manged again
when submitted for net.
My plan wrt. autotuning is to send the selftest patch for net, and the
autotune one for net-next. So, to me it would make more sense for your
work to go into mptcp-next first. I would then apply the delta from
Christoph and send a new version for mptcp-next.
WDYT?
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-06-03 14:06 Davide Caratti
0 siblings, 0 replies; 8+ messages in thread
From: Davide Caratti @ 2020-06-03 14:06 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 2134 bytes --]
On Fri, 2020-05-29 at 22:10 +0200, Florian Westphal wrote:
> Christoph Paasch <cpaasch(a)apple.com> wrote:
> > > After:
> > > ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
> > > ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5429ms) [ OK ]
> > > ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5418ms) [ OK ]
> > > ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5423ms) [ OK ]
> > > ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5715ms) [ OK ]
> > > ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5415ms) [ OK ]
> > > Time: 275 seconds
> > >
> > > Signed-off-by: Florian Westphal <fw(a)strlen.de>
> > > ---
> > Need to also handle the fallback to TCP-case. Otherwise there is a
> > divide-by-0:
[...]
> > server login: [ 1998.037445] divide error: 0000 [#1] SMP KASAN PTI
> > [ 1998.038321] CPU: 3 PID: 1671 Comm: http Kdump: loaded Not tainted 5.7.0-rc6.mptcp #43
> > [ 1998.039599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > [ 1998.041526] RIP: 0010:mptcp_recvmsg+0x6ae/0xa40
[...]
> WTF? How can this happen? All TCP -> MPTCP tests passed for me.
> This would mean we call mptcp_recvmsg *without* gping through
> either mptcp_finish_connect or mptcp_accept?
> > @@ -259,8 +259,10 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
> > MPTCP_MIB_MPCAPABLEACTIVEFALLBACK);
> > }
> >
> > - if (mptcp_check_fallback(sk))
> > + if (mptcp_check_fallback(sk)) {
> > + mptcp_rcv_space_init(mptcp_sk(parent), sk);
> > return;
> > + }
>
> Oh, wait. That code is not present in net-next.
>
> In that case I will wait until this has propagated to net-next; i assume
> I would see the crash with the self test too.
hello Florian,
net-next will not be accepting patches for a while - but I would like to
fix this on some tree, so that periodic tests can be run over there. So, I
will rebase the fallback rework on top of the rx buffer auto-tuning patch.
Ok?
--
davide
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-05-29 22:41 Christoph Paasch
0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch @ 2020-05-29 22:41 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 3223 bytes --]
On 05/29/20 - 22:10, Florian Westphal wrote:
> Christoph Paasch <cpaasch(a)apple.com> wrote:
> > > After:
> > > ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
> > > ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5429ms) [ OK ]
> > > ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5418ms) [ OK ]
> > > ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5423ms) [ OK ]
> > > ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5715ms) [ OK ]
> > > ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5415ms) [ OK ]
> > > Time: 275 seconds
> > >
> > > Signed-off-by: Florian Westphal <fw(a)strlen.de>
> > > ---
> > Need to also handle the fallback to TCP-case. Otherwise there is a
> > divide-by-0:
> >
> > server login: [ 1998.037445] divide error: 0000 [#1] SMP KASAN PTI
> > [ 1998.038321] CPU: 3 PID: 1671 Comm: http Kdump: loaded Not tainted 5.7.0-rc6.mptcp #43
> > [ 1998.039599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > [ 1998.041526] RIP: 0010:mptcp_recvmsg+0x6ae/0xa40
> > [ 1998.042370] Code: 44 89 e0 89 da 4c 8b 44 24 08 45 8d b4 24 40 03 00 00 c1 e0 04 48 8d 0c 50 89 d8 49 8d b8 60 04 00 00 31 d2 29 e8 48 0f af c1 <48> f7 f5 4c 8d 2c 41 e8 66 6b 6a ff 4c 8b 44 24 08 41 8b a8 64
> > [ 1998.045394] RSP: 0018:ffff8881174bf968 EFLAGS: 00010206
> > [ 1998.046237] RAX: 0000000140764000 RBX: 000000000000b500 RCX: 000000000001c540
> > [ 1998.047397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8291b460
> > [ 1998.048564] RBP: 0000000000000000 R08: ffffffff8291b000 R09: ffffffff82f1d0c8
> > [ 1998.049724] R10: ffffffff82f1d0c3 R11: fffffbfff05e3a18 R12: 00000000000005b4
> > [ 1998.050872] R13: 000000000022b368 R14: 00000000000008f4 R15: ffff888118740000
> > [ 1998.052059] FS: 00007f081dd1ea40(0000) GS:ffff88811b980000(0000) knlGS:0000000000000000
> > [ 1998.053357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1998.054189] CR2: 000055b5364a7000 CR3: 0000000115d5e000 CR4: 00000000000006e0
> > [ 1998.055231] Call Trace:
> > [ 1998.057574] inet_recvmsg+0x207/0x220
> > [ 1998.058832] sock_read_iter+0x1fe/0x230
> > [ 1998.060656] new_sync_read+0x33a/0x350
> > [ 1998.063934] vfs_read+0xbc/0x1b0
> > [ 1998.064466] ksys_read+0x11b/0x150
> > [ 1998.066975] do_syscall_64+0xbc/0x790
> > [ 1998.070940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> WTF? How can this happen? All TCP -> MPTCP tests passed for me.
> This would mean we call mptcp_recvmsg *without* gping through
> either mptcp_finish_connect or mptcp_accept?
>
> > @@ -259,8 +259,10 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
> > MPTCP_MIB_MPCAPABLEACTIVEFALLBACK);
> > }
> >
> > - if (mptcp_check_fallback(sk))
> > + if (mptcp_check_fallback(sk)) {
> > + mptcp_rcv_space_init(mptcp_sk(parent), sk);
> > return;
> > + }
>
> Oh, wait. That code is not present in net-next.
>
> In that case I will wait until this has propagated to net-next; i assume
> I would see the crash with the self test too.
True, that code came from Davide!
Christoph
^ permalink raw reply [flat|nested] 8+ messages in thread
* [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning
@ 2020-05-29 16:56 Christoph Paasch
0 siblings, 0 replies; 8+ messages in thread
From: Christoph Paasch @ 2020-05-29 16:56 UTC (permalink / raw)
To: mptcp
[-- Attachment #1: Type: text/plain, Size: 11415 bytes --]
On 05/29/20 - 11:17, Florian Westphal wrote:
> When mptcp is used, userspace doesn't read from the tcp (subflow)
> socket but from the parent (mptcp) socket receive queue.
>
> skbs are moved from the subflow socket to the mptcp rx queue either from
> 'data_ready' callback (if mptcp socket can be locked), a work queue, or
> the socket receive function.
>
> This means tcp_rcv_space_adjust() is never called and thus no receive
> buffer size auto-tuning is done.
>
> An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
> function that moves skbs from subflow to mptcp socket.
> While this enabled autotuning, it also meant tuning was done even if
> userspace was reading the mptcp socket very slowly.
>
> This adds mptcp_rcv_space_adjust() and calls it after userspace has
> read data from the mptcp socket rx queue.
>
> Its very similar to tcp_rcv_space_adjust, with two differences:
>
> 1. The rtt estimate is the largest one observed on a subflow
> 2. The rcvbuf size and window clamp of all subflows is adjusted
> to the mptcp-level rcvbuf.
>
> Otherwise, we get spurious drops at tcp (subflow) socket level if
> the skbs are not moved to the mptcp socket fast enough and reduced
> throughput..
>
> Before:
> time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
> [..]
> ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40562ms) [ OK ]
> ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5415ms) [ OK ]
> ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5413ms) [ OK ]
> ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41331ms) [ OK ]
> ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5415ms) [ OK ]
> ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5714ms) [ OK ]
> Time: 846 seconds
>
> After:
> ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ]
> ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5429ms) [ OK ]
> ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5418ms) [ OK ]
> ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5423ms) [ OK ]
> ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5715ms) [ OK ]
> ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5415ms) [ OK ]
> Time: 275 seconds
>
> Signed-off-by: Florian Westphal <fw(a)strlen.de>
> ---
> changes in v2:
> - cache last rtt_us value used
> - don't store seq value
> - reset 'copied' to 0 when starting
> new measurement to simplify adjust function.
> - make sure space.space is not inited to 0, else div-by-0 occurs
>
> net/mptcp/protocol.c | 124 ++++++++++++++++++++++++++++++++++++++++---
> net/mptcp/protocol.h | 6 +++
> 2 files changed, 123 insertions(+), 7 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index b2c8b57e7942..3827e4004877 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -207,13 +207,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
> return false;
> }
>
> - if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> - int rcvbuf = max(ssk->sk_rcvbuf, sk->sk_rcvbuf);
> -
> - if (rcvbuf > sk->sk_rcvbuf)
> - sk->sk_rcvbuf = rcvbuf;
> - }
> -
> tp = tcp_sk(ssk);
> do {
> u32 map_remaining, offset;
> @@ -928,6 +921,100 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
> return copied;
> }
>
> +/* receive buffer autotuning. See tcp_rcv_space_adjust for more information.
> + *
> + * Only difference: Use highest rtt estimate of the subflows in use.
> + */
> +static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
> +{
> + struct mptcp_subflow_context *subflow;
> + struct sock *sk = (struct sock *)msk;
> + u32 time, advmss = 1;
> + u64 rtt_us, mstamp;
> +
> + sock_owned_by_me(sk);
> +
> + if (copied <= 0)
> + return;
> +
> + msk->rcvq_space.copied += copied;
> +
> + mstamp = div_u64(tcp_clock_ns(), NSEC_PER_USEC);
> + time = tcp_stamp_us_delta(mstamp, msk->rcvq_space.time);
> +
> + rtt_us = msk->rcvq_space.rtt_us;
> + if (rtt_us && time < (rtt_us >> 3))
> + return;
> +
> + rtt_us = 0;
> + mptcp_for_each_subflow(msk, subflow) {
> + const struct tcp_sock *tp;
> + u64 sf_rtt_us;
> + u32 sf_advmss;
> +
> + tp = tcp_sk(mptcp_subflow_tcp_sock(subflow));
> +
> + sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us);
> + sf_advmss = READ_ONCE(tp->advmss);
> +
> + rtt_us = max(sf_rtt_us, rtt_us);
> + advmss = max(sf_advmss, advmss);
> + }
> +
> + msk->rcvq_space.rtt_us = rtt_us;
> + if (time < (rtt_us >> 3) || rtt_us == 0)
> + return;
> +
> + if (msk->rcvq_space.copied <= msk->rcvq_space.space)
> + goto new_measure;
> +
> + if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf &&
> + !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> + int rcvmem, rcvbuf;
> + u64 rcvwin, grow;
> +
> + rcvwin = ((u64)msk->rcvq_space.copied << 1) + 16 * advmss;
> +
> + grow = rcvwin *(msk->rcvq_space.copied - msk->rcvq_space.space);
> +
> + do_div(grow, msk->rcvq_space.space);
> + rcvwin += (grow << 1);
> +
> + rcvmem = SKB_TRUESIZE(advmss + MAX_TCP_HEADER);
> + while (tcp_win_from_space(sk, rcvmem) < advmss)
> + rcvmem += 128;
> +
> + do_div(rcvwin, advmss);
> + rcvbuf = min_t(u64, rcvwin * rcvmem,
> + sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
> +
> + if (rcvbuf > sk->sk_rcvbuf) {
> + u32 window_clamp;
> +
> + window_clamp = tcp_win_from_space(sk, rcvbuf);
> + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
> +
> + /* Make subflows follow along. If we do not do this, we
> + * get drops at subflow level if skbs can't be moved to
> + * the mptcp rx queue fast enough (announced rcv_win can
> + * exceed ssk->sk_rcvbuf).
> + */
> + mptcp_for_each_subflow(msk, subflow) {
> + struct sock *ssk;
> +
> + ssk = mptcp_subflow_tcp_sock(subflow);
> + WRITE_ONCE(ssk->sk_rcvbuf, rcvbuf);
> + tcp_sk(ssk)->window_clamp = window_clamp;
> + }
> + }
> + }
> +
> + msk->rcvq_space.space = msk->rcvq_space.copied;
> +new_measure:
> + msk->rcvq_space.copied = 0;
> + msk->rcvq_space.time = mstamp;
> +}
> +
> static bool __mptcp_move_skbs(struct mptcp_sock *msk)
> {
> unsigned int moved = 0;
> @@ -1050,6 +1137,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> set_bit(MPTCP_DATA_READY, &msk->flags);
> }
> out_err:
> + mptcp_rcv_space_adjust(msk, copied);
> +
> release_sock(sk);
> return copied;
> }
> @@ -1280,6 +1369,7 @@ static int mptcp_init_sock(struct sock *sk)
> return ret;
>
> sk_sockets_allocated_inc(sk);
> + sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1];
> sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[2];
>
> return 0;
> @@ -1475,6 +1565,23 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
> return nsk;
> }
>
> +static void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk)
> +{
> + struct tcp_sock *tp = tcp_sk(ssk);
> +
> + msk->rcvq_space.copied = 0;
> + msk->rcvq_space.rtt_us = 0;
> +
> + tcp_mstamp_refresh(tp);
> + msk->rcvq_space.time = tp->tcp_mstamp;
> +
> + /* initial rcv_space offering made to peer */
> + msk->rcvq_space.space = min_t(u32, tp->rcv_wnd,
> + TCP_INIT_CWND * tp->advmss);
> + if (msk->rcvq_space.space == 0)
> + msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT;
> +}
> +
> static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
> bool kern)
> {
> @@ -1524,6 +1631,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err,
> list_add(&subflow->node, &msk->conn_list);
> inet_sk_state_store(newsk, TCP_ESTABLISHED);
>
> + mptcp_rcv_space_init(msk, ssk);
> bh_unlock_sock(new_mptcp_sock);
>
> __MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPCAPABLEPASSIVEACK);
> @@ -1678,6 +1786,8 @@ void mptcp_finish_connect(struct sock *ssk)
> atomic64_set(&msk->snd_una, msk->write_seq);
>
> mptcp_pm_new_connection(msk, 0);
> +
> + mptcp_rcv_space_init(msk, ssk);
> }
Need to also handle the fallback to TCP-case. Otherwise there is a
divide-by-0:
server login: [ 1998.037445] divide error: 0000 [#1] SMP KASAN PTI
[ 1998.038321] CPU: 3 PID: 1671 Comm: http Kdump: loaded Not tainted 5.7.0-rc6.mptcp #43
[ 1998.039599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 1998.041526] RIP: 0010:mptcp_recvmsg+0x6ae/0xa40
[ 1998.042370] Code: 44 89 e0 89 da 4c 8b 44 24 08 45 8d b4 24 40 03 00 00 c1 e0 04 48 8d 0c 50 89 d8 49 8d b8 60 04 00 00 31 d2 29 e8 48 0f af c1 <48> f7 f5 4c 8d 2c 41 e8 66 6b 6a ff 4c 8b 44 24 08 41 8b a8 64
[ 1998.045394] RSP: 0018:ffff8881174bf968 EFLAGS: 00010206
[ 1998.046237] RAX: 0000000140764000 RBX: 000000000000b500 RCX: 000000000001c540
[ 1998.047397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8291b460
[ 1998.048564] RBP: 0000000000000000 R08: ffffffff8291b000 R09: ffffffff82f1d0c8
[ 1998.049724] R10: ffffffff82f1d0c3 R11: fffffbfff05e3a18 R12: 00000000000005b4
[ 1998.050872] R13: 000000000022b368 R14: 00000000000008f4 R15: ffff888118740000
[ 1998.052059] FS: 00007f081dd1ea40(0000) GS:ffff88811b980000(0000) knlGS:0000000000000000
[ 1998.053357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1998.054189] CR2: 000055b5364a7000 CR3: 0000000115d5e000 CR4: 00000000000006e0
[ 1998.055231] Call Trace:
[ 1998.057574] inet_recvmsg+0x207/0x220
[ 1998.058832] sock_read_iter+0x1fe/0x230
[ 1998.060656] new_sync_read+0x33a/0x350
[ 1998.063934] vfs_read+0xbc/0x1b0
[ 1998.064466] ksys_read+0x11b/0x150
[ 1998.066975] do_syscall_64+0xbc/0x790
[ 1998.070940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
This diff fixes it:
========
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2820da7c787a..a5b36a4c04f7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1533,7 +1533,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
return nsk;
}
-static void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk)
+void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk)
{
struct tcp_sock *tp = tcp_sk(ssk);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index e53410830ec3..cef677ebfd09 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -383,6 +383,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk,
void mptcp_get_options(const struct sk_buff *skb,
struct mptcp_options_received *mp_opt);
+void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk);
void mptcp_finish_connect(struct sock *sk);
void mptcp_data_ready(struct sock *sk, struct sock *ssk);
bool mptcp_finish_join(struct sock *sk);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 7222503b9583..0bf2b9d575fb 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -259,8 +259,10 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb)
MPTCP_MIB_MPCAPABLEACTIVEFALLBACK);
}
- if (mptcp_check_fallback(sk))
+ if (mptcp_check_fallback(sk)) {
+ mptcp_rcv_space_init(mptcp_sk(parent), sk);
return;
+ }
if (subflow->mp_capable) {
pr_debug("subflow=%p, remote_key=%llu", mptcp_subflow_ctx(sk),
^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2020-06-03 15:45 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29 20:10 [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning Florian Westphal
-- strict thread matches above, loose matches on Subject: below --
2020-06-03 15:45 Christoph Paasch
2020-06-03 15:32 Florian Westphal
2020-06-03 15:08 Davide Caratti
2020-06-03 14:39 Florian Westphal
2020-06-03 14:06 Davide Caratti
2020-05-29 22:41 Christoph Paasch
2020-05-29 16:56 Christoph Paasch
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.