Netdev Archive on lore.kernel.org
 help / color / Atom feed
From: Martin KaFai Lau <kafai@fb.com>
To: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Cc: <ast@kernel.org>, <benh@amazon.com>, <bpf@vger.kernel.org>,
	<daniel@iogearbox.net>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <kuni1840@gmail.com>,
	<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>
Subject: Re: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
Date: Sun, 22 Nov 2020 16:40:20 -0800
Message-ID: <20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20201121101322.97015-1-kuniyu@amazon.co.jp>

On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau <kafai@fb.com>
> > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > requests from the closing listener's queue and relink them to the head of
> > > > > the new listener's queue. We do not process each request, so the migration
> > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > sockets, we will take special care in the next commit.
> > > > > 
> > > > > By default, we select the last element of socks[] as the new listener.
> > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > 
> > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |      socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > I don't think it should emphasize/claim there is a specific way that
> > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > how the application close/listen.  The userspace can not expect the
> > > > ordering of socks[] will behave in a certain way.
> > > 
> > > I've expected replacing listeners by generations as a general use case.
> > > But exactly. Users should not expect the undocumented kernel internal.
> > > 
> > > 
> > > > The primary redistribution policy has to depend on BPF which is the
> > > > policy defined by the user based on its application logic (e.g. how
> > > > its binary restart work).  The application (and bpf) knows which one
> > > > is a dying process and can avoid distributing to it.
> > > > 
> > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > prog is attached, I would even go further to call bpf to redistribute
> > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > 
> > > I also think it is just an optional fallback, but to pick out a different
> > > listener everytime, choosing the moved socket was reasonable. So the even
> > > redistribution for a specific use case is a side effect of such socket
> > > selection.
> > > 
> > > But, users should decide to use either way:
> > >   (1) let the kernel select a new listener randomly
> > >   (2) select a particular listener by eBPF
> > > 
> > > I will update the commit message like:
> > > The kernel selects a new listener randomly, but as the side effect, it can
> > > redistribute packets evenly for a specific case where an application
> > > replaces listeners by generations.
> > Since there is no feedback on sysctl, so may be something missed
> > in the lines.
> 
> I'm sorry, I have missed this point while thinking about each reply...
> 
> 
> > I don't think this migration logic should depend on a sysctl.
> > At least not when a bpf prog is attached that is capable of doing
> > migration, it is too fragile to ask user to remember to turn on
> > the sysctl before attaching the bpf prog.
> > 
> > Your use case is to primarily based on bpf prog to pick or only based
> > on kernel to do a random pick?
Again, what is your primarily use case?

> 
> I think we have to care about both cases.
> 
> I think we can always enable the migration feature if eBPF prog is not
> attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> to select a listener by some rules, along updating the kernel,
> redistributing requests without user intention can break the application.
> So, there is something needed to confirm user intension at least if eBPF
> prog is attached.
Right, something being able to tell if the bpf prog can do migration
can confirm the user intention here.  However, this will not be a
sysctl.

A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
"prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
can be used to decide if migration can be done by the bpf prog.
Although the prog->expected_attach_type has not been checked for
BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
that the risk of breaking is very small and is acceptable.

Instead of depending on !reuse_md->data to decide if it
is doing migration or not, a clearer signal should be given
to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
(and to "struct sk_reuseport_md" accordingly).  It can tell
the bpf prog that it is doing migration.  It should also tell if it is
migrating a list of established sk(s) or an individual req_sk.
Accessing "reuse_md->migration" should only be allowed for
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().

During migration, if skb is not available, an empty skb can be used.
Migration is a slow path and does not happen very often, so it will
be fine even it has to create a temp skb (or may be a static const skb
can be used, not sure but this is implementation details).

> 
> But honestly, I believe such eBPF users can follow this change and
> implement migration eBPF prog if we introduce such a breaking change.
> 
> 
> > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > change it until all the listening sockets are closed which is exactly
> > the service disruption problem this series is trying to solve here.
> 
> Oh, exactly...
> If we apply this series by live patching, we cannot enable the feature
> without service disruption.
> 
> To enable the migration feature dynamically, how about this logic?
> In this logic, we do not save the sysctl value and check it at each time.
> 
>   1. no eBPF prog attached -> ON
>   2. eBPF prog attached and sysctl is 0 -> OFF
No.  When bpf prog is attached and it clearly signals (expected_attach_type
here) it can do migration, it should not depend on anything else.  It is very
confusing to use.  When a prog is successfully loaded, verified
and attached, it is expected to run.

This sysctl essentially only disables the bpf prog with
type == BPF_PROG_TYPE_SK_REUSEPORT running at a particular point.
This is going down a path that having another sysctl in the future
to disable another bpf prog type.  If there would be a need to disable
bpf prog on a type-by-type bases, it would need a more
generic solution on the bpf side and do it in a consistent way
for all prog types.  It needs a separate and longer discussion.

All behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
should not depend on this sysctl at all .

/* Pseudo code to show the idea only.
 * Actual implementation should try to fit
 * better into the current code and should look
 * quite different from here.
 */

if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
	/* call bpf to migrate */
	action = BPF_PROG_RUN(prog, &reuse_kern);

	if (action == SK_PASS) {
		if (!reuse_kern.selected_sk)
			/* fallback to kernel random pick */
		else
			/* migrate to reuse_kern.selected_sk */
	} else {
		/* action == SK_DROP. don't do migration at all and
		 * don't fallback to kernel random pick.
		 */ 
	}
}

Going back to the sysctl, with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
do you still have a need on adding sysctl_tcp_migrate_req?
Regardless, if there is still a need,
the document for sysctl_tcp_migrate_req should be something like:
"the kernel will do a random pick when there is no bpf prog
 attached to the reuseport group...."

[ ps, my reply will be slow in this week. ]

>   3. eBPF prog attached and sysctl is 1 -> ON
> 
> So, replacing 
> 
>   if (reuse->migrate_req)
> 
> to 
> 
>   if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)


  reply index

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-17  9:40 [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2020-11-18 23:50   ` Martin KaFai Lau
2020-11-19 22:09     ` Kuniyuki Iwashima
2020-11-20  1:53       ` Martin KaFai Lau
2020-11-21 10:13         ` Kuniyuki Iwashima
2020-11-23  0:40           ` Martin KaFai Lau [this message]
2020-11-24  9:24             ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md Kuniyuki Iwashima
2020-11-19  0:11   ` Martin KaFai Lau
2020-11-19 22:10     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
2020-11-19  1:00   ` Martin KaFai Lau
2020-11-19 22:13     ` Kuniyuki Iwashima
2020-11-17  9:40 ` [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT " Kuniyuki Iwashima
2020-11-18  9:18 ` [RFC PATCH bpf-next 0/8] Socket migration for SO_REUSEPORT David Laight
2020-11-19 22:01   ` Kuniyuki Iwashima
2020-11-18 16:25 ` Eric Dumazet
2020-11-19 22:05   ` Kuniyuki Iwashima
2020-11-19  1:49 ` Martin KaFai Lau
2020-11-19 22:17   ` Kuniyuki Iwashima
2020-11-20  2:31     ` Martin KaFai Lau
2020-11-21 10:16       ` Kuniyuki Iwashima

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com \
    --to=kafai@fb.com \
    --cc=ast@kernel.org \
    --cc=benh@amazon.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=kuni1840@gmail.com \
    --cc=kuniyu@amazon.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git