netdev.vger.kernel.org archive mirror
From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: <eric.dumazet@gmail.com>
Cc: <ast@kernel.org>, <benh@amazon.com>, <bpf@vger.kernel.org>,
	<daniel@iogearbox.net>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <kuni1840@gmail.com>,
	<kuniyu@amazon.co.jp>, <linux-kernel@vger.kernel.org>,
	<netdev@vger.kernel.org>
Subject: Re: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
Date: Fri, 20 Nov 2020 07:05:09 +0900
Message-ID: <20201119220509.74768-1-kuniyu@amazon.co.jp>
In-Reply-To: <5feaafd3-72ca-72da-0fe8-cc4206bc29e6@gmail.com>

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed, 18 Nov 2020 17:25:44 +0100
> On 11/17/20 10:40 AM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server processes (such as nginx). For instance, when we change the nginx
> > configuration and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, so that an
> > in-flight ACK of the 3WHS is answered with an RST.
> > 
> 
> I know some programs are simply removing a listener from the group,
> so that they no longer handle new SYN packets,
> and wait until all timers or 3WHS have completed before closing them.
> 
> They pass fd of newly accepted children to more recent programs using af_unix fd passing,
> while in this draining mode.

Just out of curiosity, could you tell me which software does this, so that
I can study it further?
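
For reference, the fd passing described above can be sketched in userspace
roughly as follows. This is only a minimal illustration, not taken from any
of the programs mentioned: it assumes Linux and Python 3.9+ (for
socket.send_fds() and socket.recv_fds()), and the helper names hand_over(),
take_over() and demo() are hypothetical. Both "workers" live in one process
and talk over a socketpair, which stands in for the AF_UNIX channel between
an old and a new server process.

```python
import socket


def hand_over(channel: socket.socket, conn: socket.socket) -> None:
    """Old worker: send the fd of an accepted connection over AF_UNIX."""
    # The one-byte-or-more payload is required; the fd travels as
    # SCM_RIGHTS ancillary data alongside it.
    socket.send_fds(channel, [b"conn"], [conn.fileno()])


def take_over(channel: socket.socket) -> socket.socket:
    """New worker: receive one fd and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 16, 1)
    return socket.socket(fileno=fds[0])


def demo() -> bytes:
    """Accept in the 'old' worker, then serve from the 'new' worker."""
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    client.sendall(b"hello")

    # Old worker: accept the child, stop listening, pass the child on.
    conn, _peer = listener.accept()
    listener.close()
    hand_over(old_end, conn)
    conn.close()  # the new worker now holds its own reference

    # New worker: adopt the connection and read what the client sent.
    adopted = take_over(new_end)
    data = adopted.recv(5)

    for s in (adopted, client, old_end, new_end):
        s.close()
    return data


if __name__ == "__main__":
    print(demo())
```

The connection survives the old worker closing both its listener and its own
copy of the fd, because the new worker holds a separate reference to the
same socket.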


> Quite frankly, mixing eBPF in the picture is distracting.

I agree.
I also think eBPF itself is unnecessary in many cases, and I want this
patchset to make user programs simpler.

The SO_REUSEPORT implementation is excellent for improving scalability. As
a trade-off, however, users have to understand in depth how the kernel
handles SYN packets and have to implement connection draining with eBPF.


> It seems you want some way to transfer request sockets (and/or not yet accepted established ones)
> from fd1 to fd2, isn't it something that should be discussed independently ?

I understand you to be asking me to discuss the issue and the way to
transfer sockets independently. Please correct me if I have misunderstood
your question.

The kernel handles the 3WHS, and users cannot observe it (without eBPF).
Many users believe that SO_REUSEPORT should ideally distribute all
connections across the available listeners, but in practice some
connections may be aborted silently. Some users may think that those
connections would not have been dropped if the kernel had selected another
listener.
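
To illustrate the setup in question, here is a minimal userspace sketch of
two listeners sharing one port through SO_REUSEPORT. It assumes Linux and
the loopback interface; the helper names make_listener(), accept_all() and
demo() are hypothetical and not part of the patchset. Each connection is
completed by the kernel and placed in the accept queue of whichever listener
it selected at SYN time, before userspace calls accept().

```python
import select
import socket


def make_listener(port: int = 0) -> socket.socket:
    """Create a TCP listener on 127.0.0.1 with SO_REUSEPORT set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s


def accept_all(listeners, expected, timeout=2.0):
    """Accept up to `expected` connections; return counts per listener."""
    counts = [0] * len(listeners)
    accepted = []
    while len(accepted) < expected:
        ready, _, _ = select.select(listeners, [], [], timeout)
        if not ready:
            break  # nothing left in any accept queue
        for lsock in ready:
            conn, _peer = lsock.accept()
            counts[listeners.index(lsock)] += 1
            accepted.append(conn)
    for conn in accepted:
        conn.close()
    return counts


def demo(n_clients: int = 8):
    """Connect n_clients to a two-listener reuseport group."""
    first = make_listener()
    port = first.getsockname()[1]
    second = make_listener(port)  # same port, same reuseport group

    # create_connection() returns once the kernel has finished the 3WHS,
    # which happens before either listener calls accept().
    clients = [socket.create_connection(("127.0.0.1", port))
               for _ in range(n_clients)]
    counts = accept_all([first, second], n_clients)

    for s in clients + [first, second]:
        s.close()
    return counts


if __name__ == "__main__":
    print(demo())
```

The split between the two listeners depends on the kernel's hash of each
connection, so the per-listener counts vary from run to run; only the total
is fixed. Closing one listener before accept_all() runs would discard
whatever was already queued on it, which is the case discussed here.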

The root cause is within the kernel, so the issue should be addressed in
kernel space and should not be visible to userspace. So that users do not
have to implement anything new, I want to fix the root cause by
transferring sockets automatically, so that users need not care about the
kernel implementation or connection draining.

Moreover, I would have preferred not to mix eBPF into the issue. But there
may be cases where different applications listen on the same port and eBPF
routes packets to each of them by some rules. In such cases, redistributing
sockets against the user's intention would break the application. This
patchset will work in many cases, but to handle such cases, I added the
eBPF part.

