From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: <ycheng@google.com>
Cc: <andrii@kernel.org>, <ast@kernel.org>, <benh@amazon.com>,
	<bpf@vger.kernel.org>, <daniel@iogearbox.net>,
	<davem@davemloft.net>, <edumazet@google.com>, <kafai@fb.com>,
	<kuba@kernel.org>, <kuni1840@gmail.com>, <kuniyu@amazon.co.jp>,
	<linux-kernel@vger.kernel.org>, <ncardwell@google.com>,
	<netdev@vger.kernel.org>
Subject: Re: [PATCH v7 bpf-next 00/11] Socket migration for SO_REUSEPORT.
Date: Wed, 9 Jun 2021 09:34:34 +0900
Message-ID: <20210609003434.49627-1-kuniyu@amazon.co.jp>
In-Reply-To: <CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com>

From:   Yuchung Cheng <ycheng@google.com>
Date:   Tue, 8 Jun 2021 16:47:37 -0700
> On Tue, Jun 8, 2021 at 4:04 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > From:   Yuchung Cheng <ycheng@google.com>
> > Date:   Tue, 8 Jun 2021 10:48:06 -0700
> > > On Tue, May 25, 2021 at 11:42 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >
> > > > On 5/21/21 8:20 PM, Kuniyuki Iwashima wrote:
> > > > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > > > accept connections evenly. However, there is a defect in the current
> > > > > implementation [1]. When a SYN packet is received, the connection is tied
> > > > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > > > requests during the three-way handshake and child sockets in the accept
> > > > > queue are dropped even if other listeners on the same port could accept
> > > > > such connections.
> > > > >
> > > > > This situation can happen when various server management tools restart
> > > > > server (such as nginx) processes. For instance, when we change nginx
> > > > > configurations and restart it, it spins up new workers that respect the new
> > > > > > configuration and closes all listeners on the old workers, so the
> > > > > > in-flight ACK of 3WHS is answered with RST.
> > > > >
> > > > > To avoid such a situation, users have to know deeply how the kernel handles
> > > > > SYN packets and implement connection draining by eBPF [2]:
> > > > >
> > > > >    1. Stop routing SYN packets to the listener by eBPF.
> > > > > >    2. Wait for all timers to expire to complete requests.
> > > > >    3. Accept connections until EAGAIN, then close the listener.
> > > > >
> > > > >    or
> > > > >
> > > > >    1. Start counting SYN packets and accept syscalls using the eBPF map.
> > > > >    2. Stop routing SYN packets.
> > > > >    3. Accept connections up to the count, then close the listener.
> > > > >
> > > > > > Either way, we cannot close a listener immediately. However, ideally,
> > > > > the application need not drain the not yet accepted sockets because 3WHS
> > > > > and tying a connection to a listener are just the kernel behaviour. The
> > > > > root cause is within the kernel, so the issue should be addressed in kernel
> > > > > space and should not be visible to user space. This patchset fixes it so
> > > > > that users need not take care of kernel implementation and connection
> > > > > draining. With this patchset, the kernel redistributes requests and
> > > > > connections from a listener to the others in the same reuseport group
> > > > > at/after close or shutdown syscalls.
> > > > >
> > > > > Although some software does connection draining, there are still merits in
> > > > > migration. For some security reasons, such as replacing TLS certificates,
> > > > > we may want to apply new settings as soon as possible and/or we may not be
> > > > > able to wait for connection draining. The sockets in the accept queue have
> > > > > not started application sessions yet. So, if we do not drain such sockets,
> > > > > they can be handled by the newer listeners and could have a longer
> > > > > lifetime. It is difficult to drain all connections in every case, but we
> > > > > can decrease such aborted connections by migration. In that sense,
> > > > > migration is always better than draining.
> > > > >
> > > > > Moreover, auto-migration simplifies user space logic and also works well in
> > > > > a case where we cannot modify and build a server program to implement the
> > > > > workaround.
> > > > >
> > > > > Note that the source and destination listeners MUST have the same settings
> > > > > at the socket API level; otherwise, applications may face inconsistency and
> > > > > cause errors. In such a case, we have to use the eBPF program to select a
> > > > > specific listener or to cancel migration.
> > > This looks to be a useful feature. What happens when migrating a
> > > passively fast-opened socket that is in the old listener but has not
> > > yet been accepted (TFO is both a mini-socket and a full-socket)?
> > > It gets tricky when the old and new listeners have different TFO keys.
> >
> > The tricky situation can happen without this patch set. We can change
> > the listener's TFO key when TCP_SYN_RECV sockets are still in the accept
> > queue. The change is already handled properly, so it does not crash
> > applications.
> >
> > In the normal 3WHS case, a full-socket is created after 3WHS. In the TFO
> > case, a full-socket is created after validating the TFO cookie in the
> > initial SYN packet.
> >
> > After that, the connection is basically handled via the full-socket, except
> > for the accept() syscall. So in both cases, the mini-socket is popped out of
> > the old listener's queue, cloned, and put into the new listener's queue. Then
> > we can accept() its full-socket via the cloned mini-socket.
> 
> Thanks, that makes sense. Eric is the expert to review the correctness
> of this part. My only suggestion is to add some stats tracking the
> mini-sockets that fail to migrate for a variety of reasons (the
> code locations where the requests need to be dropped). This can be
> useful to evaluate the effectiveness of this new feature.

That's a nice idea.
I'll implement it as a follow-up patch or in the next spin.

For now, I would like to wait for Eric's review.

Thank you.
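
For readers following the thread, the migration step and the suggested
counters can be sketched as a toy user-space model. This is only an
illustration under stated assumptions, not the kernel implementation: the
names Listener, ReuseportGroup, migrated and migrate_failed are hypothetical,
and the "shortest queue with room" selection policy is an assumption made for
the example (the patch set selects the new listener inside the kernel or via
an eBPF program).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Listener:
    """Toy stand-in for a listening socket in a reuseport group."""
    name: str
    backlog: int                      # maximum accept-queue length
    queue: deque = field(default_factory=deque)
    closed: bool = False


class ReuseportGroup:
    """Toy model of listeners sharing one port (hypothetical, not kernel code)."""

    def __init__(self, listeners):
        self.listeners = list(listeners)
        # Hypothetical counters, in the spirit of the stats suggested above.
        self.migrated = 0
        self.migrate_failed = 0

    def enqueue(self, listener, conn):
        """Tie a not-yet-accepted connection to a listener."""
        listener.queue.append(conn)

    def close(self, listener):
        """Close a listener and hand its queued connections to the others."""
        listener.closed = True
        while listener.queue:
            conn = listener.queue.popleft()
            target = self._select(exclude=listener)
            if target is None:
                # No live listener has room: the connection is dropped.
                self.migrate_failed += 1
                continue
            target.queue.append(conn)
            self.migrated += 1

    def _select(self, exclude):
        """Assumed policy: live listener with the shortest queue that has room."""
        candidates = [
            lst for lst in self.listeners
            if lst is not exclude and not lst.closed
            and len(lst.queue) < lst.backlog
        ]
        return min(candidates, key=lambda lst: len(lst.queue), default=None)


old = Listener("old", backlog=4)
new = Listener("new", backlog=2)
group = ReuseportGroup([old, new])
for conn in ("c1", "c2", "c3"):
    group.enqueue(old, conn)

group.close(old)
print(list(new.queue), group.migrated, group.migrate_failed)
# prints: ['c1', 'c2'] 2 1
```

In this run the new listener has room for only two connections, so the third
one is dropped and counted as a failed migration. A per-reason counter of
that kind is what would show how effective migration is in practice.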
