From: Martin KaFai Lau <kafai@fb.com>
To: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Cc: <ast@kernel.org>, <benh@amazon.com>, <bpf@vger.kernel.org>,
	<daniel@iogearbox.net>, <davem@davemloft.net>,
	<edumazet@google.com>, <kuba@kernel.org>, <kuni1840@gmail.com>,
	<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>
Subject: Re: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.
Date: Mon, 14 Dec 2020 18:58:37 -0800
Message-ID: <20201215025837.k2cuhykmz6h46fud@kafai-mbp.dhcp.thefacebook.com>
In-Reply-To: <20201214170313.50197-1-kuniyu@amazon.co.jp>

On Tue, Dec 15, 2020 at 02:03:13AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 10 Dec 2020 10:49:15 -0800
> > On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > > > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> > > > > adds two wrapper functions for it to pass the migration type defined in the
> > > > > previous commit.
> > > > > 
> > > > >   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
> > > > >   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > > > 
> > > > > As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> > > > > requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> > > > > patch also changes the code to call reuseport_select_migrated_sock() even
> > > > > if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> > > > > from the reuseport group, we rewrite request_sock.rsk_listener and resume
> > > > > processing the request.
> > > > > 
> > > > > Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > > > > ---
> > > > >  include/net/inet_connection_sock.h | 12 +++++++++++
> > > > >  include/net/request_sock.h         | 13 ++++++++++++
> > > > >  include/net/sock_reuseport.h       |  8 +++----
> > > > >  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
> > > > >  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
> > > > >  net/ipv4/tcp_ipv4.c                |  9 ++++++--
> > > > >  net/ipv6/tcp_ipv6.c                |  9 ++++++--
> > > > >  7 files changed, 81 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > > > --- a/include/net/inet_connection_sock.h
> > > > > +++ b/include/net/inet_connection_sock.h
> > > > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > > > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > > > >  }
> > > > >  
> > > > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > > > +						 struct sock *nsk,
> > > > > +						 struct request_sock *req)
> > > > > +{
> > > > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > > > +			     req);
> > > > > +	sock_put(sk);
> > > > not sure if it is safe to do here.
> > > > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > > > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > > > safe because its sk_refcnt is not zero.
> > > 
> > > I think it is safe to call sock_put() for the old listener here.
> > > 
> > > Without this patchset, at receiving the final ACK or retransmitting
> > > SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> > > by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
> > Note that in your example (final ACK), sock_put(req->rsk_listener) is
> > _only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
> > to reach zero.
> > 
> > Here in this patch, sock_put(req->rsk_listener) is called without
> > req->rsk_refcnt reaching zero.
> > 
> > Let's say there are two cores holding two refcnts to req (one cnt for each core)
> > by looking up the req from ehash.  One of the cores does this migration and
> > sock_put(req->rsk_listener).  Another core does sock_hold(req->rsk_listener).
> > 
> > 	Core1					Core2
> > 						sock_put(req->rsk_listener)
> > 
> > 	sock_hold(req->rsk_listener)
> 
> I'm sorry for the late reply.
> 
> I missed the situation where different cores get into the NEW_SYN_RECV path,
> but this does exist.
> https://lore.kernel.org/netdev/1517977874.3715.153.camel@gmail.com/#t
> https://lore.kernel.org/netdev/1518531252.3715.178.camel@gmail.com/
> 
> 
> If close() is called for the listener and the request has the last refcount
> for it, sock_put() by Core2 frees it, so Core1 cannot proceed with freed
> listener. So, it is good to call refcount_inc_not_zero() instead of
> sock_hold(). If refcount_inc_not_zero() fails, it means that the listener
Using _inc_not_zero() usually means the caller needs rcu_read_lock().
That may have a ripple effect on other req->rsk_listener readers.

There may also be places that assume req->rsk_listener never
changes once it is assigned.  I am not sure; I have not looked closely yet.

This probably needs more thought to find a simpler solution.
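
To make the difference between sock_hold() and an _inc_not_zero() style
helper concrete, here is a minimal userspace sketch using C11 atomics.
It is not kernel code: struct fake_sock and the fake_*() helpers are
hypothetical stand-ins invented for illustration.  It only models the
counter.  It does not model the object's memory staying valid while the
counter is inspected, which is the part that needs rcu_read_lock() in
the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct sock; only the refcount is modelled. */
struct fake_sock {
	atomic_int refcnt;
};

/* Unconditional increment, in the spirit of sock_hold().  Only correct
 * if the caller already knows the count cannot be zero.
 */
static inline void fake_sock_hold(struct fake_sock *sk)
{
	atomic_fetch_add(&sk->refcnt, 1);
}

/* Conditional increment, in the spirit of refcount_inc_not_zero():
 * take a reference only if the count has not already dropped to zero.
 */
static inline bool fake_inc_not_zero(struct fake_sock *sk)
{
	int old = atomic_load(&sk->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&sk->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference; returns true if this was the last one. */
static inline bool fake_sock_put(struct fake_sock *sk)
{
	return atomic_fetch_sub(&sk->refcnt, 1) == 1;
}
```

In the two-core example above, if Core2's put is the last one, a later
fake_sock_hold() on Core1 would bump the count from 0 to 1 on an object
that is already being torn down, while fake_inc_not_zero() would fail
and let the caller back off.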

> is closed and the req->rsk_listener is changed in another place. Then, we
> can continue processing the request by rewriting sk with rsk_listener and
> calling sock_hold() for it.
> 
> Also, the migration by Core2 can be done after sock_hold() by Core1. Then
> if Core1 wins the race by removing the request from ehash,
> in inet_csk_reqsk_queue_add(), instead of sk, req->rsk_listener should be
> used as the proper listener to add the req into its queue. But if the
> rsk_listener is also TCP_CLOSE, we have to call inet_child_forget().
> 
> Moreover, we have to check the listener is freed in the beginning of
> reqsk_timer_handler() by refcount_inc_not_zero().
> 
> 
> > > And then, we do `goto lookup;` and overwrite the sk.
> > > 
> > > In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> > > reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> > > listener instead to free it properly.
> > > 
> > > ---8<---
> > > +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> > > +					    struct sk_buff *skb)
> > > +{
> > > +	struct sock *nsk;
> > > +
> > > +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> > > +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
> > There is another potential issue here.  The TCP_LISTEN nsk is protected
> > by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
> > is not under rcu_read_lock().
> > 
> > The receive path may be ok as it is in rcu.  You may need to check for
> > others.
> 
> IIUC, does this mean nsk can be NULL after the RCU grace period? If so, I will
Worse than NULL: it can be an invalid (dangling) pointer.
 
> move rcu_read_lock/unlock() from __reuseport_select_sock() to
> reuseport_select_sock() and reuseport_select_migrated_sock().
ok.

> 
> 
> > > +		return nsk;
> > > +
> > > +	return NULL;
> > > +}
> > > +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> > > ---8<---
> > > https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> > > 
> > > 
> > > > > +	sock_hold(nsk);
> > > > > +	req->rsk_listener = nsk;
> > It looks like there is another race here.  What
> > if multiple cores try to update req->rsk_listener?
> 
> I think we have to add a lock in struct request_sock, acquire it, check
> if the rsk_listener is changed or not, and then do migration. Also, if the
> listener has been changed, we have to tell the caller to use it as the new
> listener.
> 
> ---8<---
>        spin_lock(&lock)
>        if (sk != req->rsk_listener) {
>                nsk = req->rsk_listener;
>                goto out;
>        }
> 
>        // do migration
> out:
>        spin_unlock(&lock)
>        return nsk;
> ---8<---
cmpxchg may help here.
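
A rough userspace sketch of that idea, using a C11 compare-and-swap in
place of the kernel's cmpxchg().  struct fake_sock, struct fake_req and
fake_migrate() are hypothetical names made up for this illustration,
and refcount handling is left out on purpose.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct request_sock. */
struct fake_sock {
	int id;
};

struct fake_req {
	_Atomic(struct fake_sock *) listener;
};

/* Try to switch req->listener from old_sk to new_sk in one atomic step.
 *
 * Returns the listener that is installed afterwards: new_sk if this
 * caller won, otherwise whatever another core installed first.
 *
 * Assumption: the caller has already taken a reference on new_sk and
 * must drop it again if the returned pointer is not new_sk.
 */
static inline struct fake_sock *fake_migrate(struct fake_req *req,
					     struct fake_sock *old_sk,
					     struct fake_sock *new_sk)
{
	struct fake_sock *expected = old_sk;

	if (atomic_compare_exchange_strong(&req->listener, &expected, new_sk))
		return new_sk;

	/* Lost the race: 'expected' now holds the current listener. */
	return expected;
}
```

Only one of the racing cores sees the exchange succeed; the others get
back the listener that was already installed and can use it, much like
the "sk != req->rsk_listener" check in the spin_lock() version, but
without adding a lock to struct request_sock.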
