linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>
Cc: Benjamin Herrenschmidt <benh@amazon.com>,
	Kuniyuki Iwashima <kuniyu@amazon.co.jp>,
	Kuniyuki Iwashima <kuni1840@gmail.com>, <bpf@vger.kernel.org>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT.
Date: Tue, 17 Nov 2020 18:40:15 +0900	[thread overview]
Message-ID: <20201117094023.3685-1-kuniyu@amazon.co.jp> (raw)

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation. When a SYN packet is received, the connection is tied to a
listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners could accept such connections.

This situation can happen when various server management tools restart
server (such as nginx) processes. For instance, when we change nginx
configurations and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, resulting in
in-flight ACK of 3WHS is responded by RST.

As a workaround for this issue, we can do connection draining by eBPF:

  1. Before closing a listener, stop routing SYN packets to it.
  2. Wait enough time for requests to complete 3WHS.
  3. Accept connections until EAGAIN, then close the listener.

Although this approach seems to work well, EAGAIN has nothing to do with
how many requests are still during 3WHS. Thus, we have to know the number
of such requests by counting SYN packets by eBPF to complete connection
draining.

  1. Start counting SYN packets and accept syscalls using eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.

In cases that eBPF is used only for connection draining, it seems a bit
expensive. Moreover, there is some situation that we cannot modify and
build a server program to implement the workaround. This patchset
introduces a new sysctl option to free userland programs from the kernel
issue. If we enable net.ipv4.tcp_migrate_req before creating a reuseport
group, we can redistribute requests and connections from a listener to
others in the same reuseport group at close() or shutdown() syscalls.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
cause errors. In such a case, we have to use eBPF program to select a
specific listener or to cancel migration.

Kuniyuki Iwashima (8):
  net: Introduce net.ipv4.tcp_migrate_req.
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  tcp: Migrate TCP_NEW_SYN_RECV requests.
  bpf: Add cookie in sk_reuseport_md.
  bpf: Call bpf_run_sk_reuseport() for socket migration.
  bpf: Test BPF_PROG_TYPE_SK_REUSEPORT for socket migration.

 Documentation/networking/ip-sysctl.rst        |  15 ++
 include/linux/bpf.h                           |   1 +
 include/net/inet_connection_sock.h            |  13 ++
 include/net/netns/ipv4.h                      |   1 +
 include/net/request_sock.h                    |  13 ++
 include/net/sock_reuseport.h                  |   8 +-
 include/uapi/linux/bpf.h                      |   1 +
 net/core/filter.c                             |  34 +++-
 net/core/sock_reuseport.c                     | 110 +++++++++--
 net/ipv4/inet_connection_sock.c               |  84 ++++++++-
 net/ipv4/inet_hashtables.c                    |   9 +-
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv4/tcp_ipv4.c                           |   9 +-
 net/ipv6/tcp_ipv6.c                           |   9 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 175 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  53 ++++++
 17 files changed, 511 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

-- 
2.17.2 (Apple Git-113)


             reply	other threads:[~2020-11-17  9:41 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-17  9:40 Kuniyuki Iwashima [this message]
2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2020-11-18 23:50   ` Martin KaFai Lau
2020-11-19 22:09     ` Kuniyuki Iwashima
2020-11-20  1:53       ` Martin KaFai Lau
2020-11-21 10:13         ` Kuniyuki Iwashima
2020-11-23  0:40           ` Martin KaFai Lau
2020-11-24  9:24             ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
2020-11-19  0:11   ` Martin KaFai Lau
2020-11-19 22:10     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
2020-11-19  1:00   ` Martin KaFai Lau
2020-11-19 22:13     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT " Kuniyuki Iwashima
2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
2020-11-19 22:01   ` Kuniyuki Iwashima
2020-11-18 16:25 ` Eric Dumazet
2020-11-19 22:05   ` Kuniyuki Iwashima
2020-11-19  1:49 ` Martin KaFai Lau
2020-11-19 22:17   ` Kuniyuki Iwashima
2020-11-20  2:31     ` Martin KaFai Lau
2020-11-21 10:16       ` Kuniyuki Iwashima

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201117094023.3685-1-kuniyu@amazon.co.jp \
    --to=kuniyu@amazon.co.jp \
    --cc=ast@kernel.org \
    --cc=benh@amazon.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=kuni1840@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).