From: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
To: <kafai@fb.com>
Cc: <ast@kernel.org>, <benh@amazon.com>, <bpf@vger.kernel.org>,
	<daniel@iogearbox.net>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <kuni1840@gmail.com>,
	<kuniyu@amazon.co.jp>, <linux-kernel@vger.kernel.org>,
	<netdev@vger.kernel.org>
Subject: Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
Date: Tue, 8 Dec 2020 18:02:36 +0900
Message-ID: <20201208090236.86926-1-kuniyu@amazon.co.jp>
In-Reply-To: <20201208081328.aspzklzmeznw3hob@kafai-mbp.dhcp.thefacebook.com>

From:   Martin KaFai Lau <kafai@fb.com>
Date:   Tue, 8 Dec 2020 00:13:28 -0800
> On Tue, Dec 08, 2020 at 03:27:14PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Mon, 7 Dec 2020 12:14:38 -0800
> > > On Sun, Dec 06, 2020 at 01:03:07AM +0900, Kuniyuki Iwashima wrote:
> > > > From:   Martin KaFai Lau <kafai@fb.com>
> > > > Date:   Fri, 4 Dec 2020 17:42:41 -0800
> > > > > On Tue, Dec 01, 2020 at 11:44:10PM +0900, Kuniyuki Iwashima wrote:
> > > > > [ ... ]
> > > > > > diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
> > > > > > index fd133516ac0e..60d7c1f28809 100644
> > > > > > --- a/net/core/sock_reuseport.c
> > > > > > +++ b/net/core/sock_reuseport.c
> > > > > > @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(reuseport_add_sock);
> > > > > >  
> > > > > > -void reuseport_detach_sock(struct sock *sk)
> > > > > > +struct sock *reuseport_detach_sock(struct sock *sk)
> > > > > >  {
> > > > > >  	struct sock_reuseport *reuse;
> > > > > > +	struct bpf_prog *prog;
> > > > > > +	struct sock *nsk = NULL;
> > > > > >  	int i;
> > > > > >  
> > > > > >  	spin_lock_bh(&reuseport_lock);
> > > > > > @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk)
> > > > > >  
> > > > > >  		reuse->num_socks--;
> > > > > >  		reuse->socks[i] = reuse->socks[reuse->num_socks];
> > > > > > +		prog = rcu_dereference(reuse->prog);
> > > > > Is it under rcu_read_lock() here?
> > > > 
> > > > reuseport_lock is locked in this function, and we do not modify the prog,
> > > > but is rcu_dereference_protected() preferable?
> > > > 
> > > > ---8<---
> > > > prog = rcu_dereference_protected(reuse->prog,
> > > > 				 lockdep_is_held(&reuseport_lock));
> > > > ---8<---
> > > It is not only reuse->prog.  Other things also require rcu_read_lock(),
> > > e.g. please take a look at __htab_map_lookup_elem().
> > > 
> > > The TCP_LISTEN sk (selected by bpf to be the target of the migration)
> > > is also protected by rcu.
> > 
> > Thank you, I will use rcu_read_lock() and rcu_dereference() in v3 patchset.
> > 
> > 
> > > I am surprised there is no WARNING in the test.
> > > Do you have the needed DEBUG_LOCK* config enabled?
> > 
> > Yes, DEBUG_LOCK* was 'y', but rcu_dereference() without rcu_read_lock()
> > does not show warnings...
> I would at least expect the "WARN_ON_ONCE(!rcu_read_lock_held() ...)"
> from __htab_map_lookup_elem() should fire in your test
> example in the last patch.
> 
> It is better to check the config before sending v3.

It seems ok, but I will check it again.

---8<---
[ec2-user@ip-10-0-0-124 bpf-next]$ cat .config | grep DEBUG_LOCK
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
---8<---


> > > > > > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > > > > > index 1451aa9712b0..b27241ea96bd 100644
> > > > > > --- a/net/ipv4/inet_connection_sock.c
> > > > > > +++ b/net/ipv4/inet_connection_sock.c
> > > > > > @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
> > > > > >  
> > > > > > +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
> > > > > > +{
> > > > > > +	struct request_sock_queue *old_accept_queue, *new_accept_queue;
> > > > > > +
> > > > > > +	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
> > > > > > +	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
> > > > > > +
> > > > > > +	spin_lock(&old_accept_queue->rskq_lock);
> > > > > > +	spin_lock(&new_accept_queue->rskq_lock);
> > > > > I am also not very thrilled on this double spin_lock.
> > > > > Can this be done in (or like) inet_csk_listen_stop() instead?
> > > > 
> > > > It will be possible to migrate sockets in inet_csk_listen_stop(), but I
> > > > think it is better to do it just after reuseport_detach_sock() because we
> > > > can select a different listener (almost) every time at a lower cost by
> > > > selecting the moved socket and easily passing it to
> > > > inet_csk_reqsk_queue_migrate().
> > > I don't see the "lower cost" point.  Please elaborate.
> > 
> > In reuseport_select_sock(), we pass sk_hash of the request socket to
> > reciprocal_scale() and generate a random index for socks[] to select
> > a different listener every time.
> > On the other hand, we do not have request sockets in unhash path and
> > sk_hash of the listener is always 0, so we have to generate a random number
> > in another way. In reuseport_detach_sock(), we can use the index of the
> > moved socket, but we do not have it in inet_csk_listen_stop(), so we have
> > to generate a random number in inet_csk_listen_stop().
> > I think using the index of the moved socket costs less.
> Generate a random number is not a big deal for the migration code path.
> 
> Also, I really still failed to see a particular way that the kernel
> pick will help in the migration case.  The kernel has no clue
> on how to select the right process to migrate to without
> a proper policy signal from the user.  They are all as bad as
> a random pick.  I am not sure this migration feature is
> even useful if there is no bpf prog attached to define the policy.

I think most applications start new listeners before closing the old ones.
In that case, selecting the moved socket as the new listener works well.
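
To make the heuristic concrete, here is a minimal userspace sketch of the
swap-with-last removal from the quoted reuseport_detach_sock() hunk. The
struct, the function name, and the integer listener ids are hypothetical
stand-ins. The sketch assumes that new listeners are appended to the end of
socks[], and the fallback for the case where the closing listener already
occupied the last slot is an assumption for illustration, not code taken
from the patch.

```c
#include <assert.h>

#define MAX_SOCKS 8

/* Hypothetical stand-in for the socks[] array in struct sock_reuseport. */
struct reuse_group {
	int socks[MAX_SOCKS];	/* listener ids; assumed appended in bind order */
	int num_socks;
};

/*
 * Detach listener `id` by overwriting its slot with the last entry, as the
 * quoted hunk does, and return the id of the listener to migrate to.
 * Returns -1 if `id` is not in the group or no listener remains.
 */
int detach_and_pick(struct reuse_group *g, int id)
{
	int i;

	assert(g->num_socks >= 0 && g->num_socks <= MAX_SOCKS);

	for (i = 0; i < g->num_socks; i++) {
		if (g->socks[i] != id)
			continue;

		g->num_socks--;
		g->socks[i] = g->socks[g->num_socks];

		if (g->num_socks == 0)
			return -1;

		/*
		 * Assumption: if the detached listener was itself the last
		 * entry, nothing was moved, so fall back to the new last one.
		 */
		return i == g->num_socks ? g->socks[i - 1] : g->socks[i];
	}

	return -1;
}
```

With listeners {10, 20, 30, 40} bound in that order, detaching 20 moves 40
into its slot and returns 40, the most recently added listener, without
generating a random number.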


> That said, if it is still desired to do a random pick by kernel when
> there is no bpf prog, it probably makes sense to guard it in a sysctl as
> suggested in another reply.  To keep it simple, I would also keep this
> kernel-pick consistent instead of request socket is doing something
> different from the unhash path.

Then, would the following be a better way to keep the kernel pick consistent?

  1. call reuseport_select_migrated_sock() without sk_hash from any path
  2. generate a random number in reuseport_select_migrated_sock()
  3. pass it to __reuseport_select_sock() only for select-by-hash
  (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
  5. do migration per queue in inet_csk_listen_stop() or per request in
     receive path.

I understand that consistency is appealing, but I also think a kernel pick
with a heuristic performs better than a random pick.
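
For reference, steps 2 and 3 above could look roughly like the following
userspace sketch. scale_u32() uses the same arithmetic as the kernel's
reciprocal_scale(). pick_listener_index() is a hypothetical name, and the
random value is passed in by the caller here, which is an assumption; where
the kernel would draw it from is not decided in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same arithmetic as the kernel's reciprocal_scale(): map a 32-bit value
 * into [0, ep_ro) with a multiply and a shift instead of a division.
 */
uint32_t scale_u32(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/*
 * Hypothetical helper: turn a caller-supplied random 32-bit value into an
 * index into socks[], in place of the sk_hash that the unhash path lacks.
 */
uint32_t pick_listener_index(uint32_t rnd, uint32_t num_socks)
{
	assert(num_socks > 0);
	return scale_u32(rnd, num_socks);
}
```

The result is always smaller than num_socks, so it can index socks[]
directly for the select-by-hash case.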

