From: Martin KaFai Lau <kafai@fb.com>
To: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Cc: <ast@kernel.org>, <benh@amazon.com>, <bpf@vger.kernel.org>,
<daniel@iogearbox.net>, <davem@davemloft.net>,
<edumazet@google.com>, <eric.dumazet@gmail.com>,
<kuba@kernel.org>, <kuni1840@gmail.com>,
<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>
Subject: Re: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
Date: Wed, 9 Dec 2020 17:53:19 -0800 [thread overview]
Message-ID: <20201210015319.e6njlcwuhfpre3bn@kafai-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <20201209165719.73652-1-kuniyu@amazon.co.jp>
On Thu, Dec 10, 2020 at 01:57:19AM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> > > > I think it is a bit complex to pass the new listener from
> > > > reuseport_detach_sock() to inet_csk_listen_stop().
> > > >
> > > > __tcp_close/tcp_disconnect/tcp_abort
> > > > |-tcp_set_state
> > > > | |-unhash
> > > > | |-reuseport_detach_sock (return nsk)
> > > > |-inet_csk_listen_stop
> > > Picking the new listener does not have to be done in
> > > reuseport_detach_sock().
> > >
> > > IIUC, it is done there only because it prefers to pick
> > > the last sk from socks[] when bpf prog is not attached.
> > > This seems to get into the way of exploring other potential
> > > implementation options.
> >
> > Yes.
> > This is just idea, but we can reserve the last index of socks[] to hold the
> > last 'moved' socket in reuseport_detach_sock() and use it in
> > inet_csk_listen_stop().
> >
> >
> > > Merging the discussion on the last socks[] pick from another thread:
> > > >
> > > > I think most applications start new listeners before closing listeners, in
> > > > this case, selecting the moved socket as the new listener works well.
> > > >
> > > >
> > > > > That said, if it is still desired to do a random pick by kernel when
> > > > > there is no bpf prog, it probably makes sense to guard it in a sysctl as
> > > > > suggested in another reply. To keep it simple, I would also keep this
> > > > > kernel-pick consistent instead of request socket is doing something
> > > > > different from the unhash path.
> > > >
> > > > Then, is this way better to keep kernel-pick consistent?
> > > >
> > > > 1. call reuseport_select_migrated_sock() without sk_hash from any path
> > > > 2. generate a random number in reuseport_select_migrated_sock()
> > > > 3. pass it to __reuseport_select_sock() only for select-by-hash
> > > > (4. pass 0 as sk_hash to bpf_run_sk_reuseport not to use it)
> > > > 5. do migration per queue in inet_csk_listen_stop() or per request in
> > > > receive path.
> > > >
> > > > I understand it is beautiful to keep consistensy, but also think
> > > > the kernel-pick with heuristic performs better than random-pick.
> > > I think discussing the best kernel pick without explicit user input
> > > is going to be a dead end. There is always a case that
> > > makes this heuristic (or guess) fail. e.g. what if multiple
> > > sk(s) being closed are always the last one in the socks[]?
> > > all their child sk(s) will then be piled up at one listen sk
> > > because the last socks[] is always picked?
> >
> > There can be such a case, but it means the newly listened sockets are
> > closed earlier than old ones.
> >
> >
> > > Lets assume the last socks[] is indeed the best for all cases. Then why
> > > the in-progress req don't pick it this way? I feel the implementation
> > > is doing what is convenient at that point. And that is fine, I think
> >
> > In this patchset, I originally assumed four things:
> >
> > migration should be done
> > (i) from old to new
> > (ii) to redistribute requests evenly as possible
> > (iii) to keep the order of requests in the queue
> > (resulting in splicing queues)
> > (iv) in O(1) for scalability
> > (resulting in fix-up rsk_listener approach)
> >
> > I selected the last socket in unhash path to satisfy above four because the
> > last socket changes at every close() syscall if application closes from
> > older socket.
> >
> > But in receiving ACK or retransmitting SYN+ACK, we cannot get the last
> > 'moved' socket. Even if we reserve the last 'moved' socket in the last
> > index by the idea above, we cannot sure the last socket is changed after
> > close() for each req->listener. For example, we have listeners A, B, C, and
> > D, and then call close(A) and close(B), and receive the final ACKs for A
> > and B, then both of them are assigned to C. In this case, A for D and B for
> > C is desired. So, selecting the last socket in socks[] for incoming
> > requests cannnot realize (ii).
> >
> > This is why I selected the last moved socket in unhash path and a random
> > listener in receive path.
> >
> >
> > > for kernel-pick, it should just go for simplicity and stay with
> > > the random(/hash) pick instead of pretending the kernel knows the
> > > application must operate in a certain way. It is fine
> > > that the pick was wrong, the kernel will eventually move the
> > > childs/reqs to the survived listen sk.
> >
> > Exactly. Also the heuristic way is not fair for every application.
> >
> > After reading below idea (migrated_sk), I think random-pick is better
> > at simplicity and passing each sk.
> >
> >
> > > [ I still think the kernel should not even pick if
> > > there is no bpf prog to instruct how to pick
> > > but I am fine as long as there is a sysctl to
> > > guard this. ]
> >
> > Unless different applications listen on the same port, random-pick can save
> > connections which would be aborted. So, I would add a sysctl to do
> > migration when no eBPF prog is attached.
> >
> >
> > > I would rather focus on ensuring the bpf prog getting what it
> > > needs to make the migration pick. A few things
> > > I would like to discuss and explore:
> > > > If we splice requests like this, we do not need double lock?
> > > >
> > > > 1. lock the accept queue of the old listener
> > > > 2. unlink all requests and decrement refcount
> > > > 3. unlock
> > > > 4. update all requests with new listener
> > > I guess updating rsk_listener can be done without acquiring
> > > the lock in (5) below is because it is done under the
> > > listening_hash's bucket lock (and also the global reuseport_lock) so
> > > that the new listener will stay in TCP_LISTEN state?
> >
> > If we do migration in inet_unhash(), the lock is held, but it is not held
> > in inet_csk_listen_stop().
> >
> >
> > > I am not sure iterating the queue under these
> > > locks is a very good thing to do though. The queue may not be
> > > very long in usual setup but still let see
> > > if that can be avoided.
> >
> > I agree, lock should not be held long.
> >
> >
> > > Do you think the iteration can be done without holding
> > > bucket lock and the global reuseport_lock? inet_csk_reqsk_queue_add()
> > > is taking the rskq_lock and then check for TCP_LISTEN. May be
> > > something similar can be done also?
> >
> > I think either one is necessary at least, so if the sk_state of selected
> > listener is TCP_CLOSE (this is mostly by random-pick of kernel), then we
> > have to fall back to call inet_child_forget().
> >
> >
> > > While doing BPF_SK_REUSEPORT_MIGRATE_REQUEST,
> > > the bpf prog can pick per req and have the sk_hash.
> > > However, while doing BPF_SK_REUSEPORT_MIGRATE_QUEUE,
> > > the bpf prog currently does not have a chance to
> > > pick individually for each req/child on the queue.
> > > Since it is iterating the queue anyway, does it make
> > > sense to also call the bpf to pick for each req/child
> > > in the queue? It then can pass sk_hash (from child->sk_hash?)
> > > to the bpf prog also instead of current 0. The cost of calling
> > > bpf prog is not really that much / signficant at the
> > > migration code path. If the queue is somehow
> > > unusally long, there is already an existing
> > > cond_resched() in inet_csk_listen_stop().
> > >
> > > Then, instead of adding sk_reuseport_md->migration,
> > > it can then add sk_reuseport_md->migrate_sk.
> > > "migrate_sk = req" for in-progress req and "migrate_sk = child"
> > > for iterating acceptq. The bpf_prog can then tell what sk (req or child)
> > > it is migrating by reading migrate_sk->state. It can then also
> > > learn the 4 tuples src/dst ip/port while skb is missing.
> > > The sk_reuseport_md->sk can still point to the closed sk
> > > such that the bpf prog can learn the cookie.
> > >
> > > I suspect a few things between BPF_SK_REUSEPORT_MIGRATE_REQUEST
> > > and BPF_SK_REUSEPORT_MIGRATE_QUEUE can be folded together
> > > by doing the above. It also gives a more consistent
> > > interface for the bpf prog, no more MIGRATE_QUEUE vs MIGRATE_REQUEST.
> >
> > I think this is really nice idea. Also, I tried to implement random-pick
> > one by one in inet_csk_listen_stop() yesterday, I found a concern about how
> > to handle requests in TFO queue.
> >
> > The request can be already accepted, so passing it to eBPF prog is
> > confusing? But, redistributing randomly can affect all listeners
> > unnecessary. How should we handle such requests?
>
> I've implemented one-by-one migration only for the accept queue for now.
> In addition to the concern about TFO queue,
You meant this queue: queue->fastopenq.rskq_rst_head?
Can "req" be passed?
I did not look up the lock/race in details for that though.
> I want to discuss which should
> we pass NULL or request_sock to eBPF program as migrate_sk when selecting a
> listener for SYN ?
hmmm... not sure I understand your question.
You meant the existing lookup listener case from inet_lhash2_lookup()?
There is nothing to migrate at that point, so NULL makes sense to me.
migrate_sk's type should be PTR_TO_SOCK_COMMON_OR_NULL.
>
> ---8<---
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index a82fd4c912be..d0ddd3cb988b 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -1001,6 +1001,29 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> }
> EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
>
> +static bool inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk, struct request_sock *req)
> +{
> + struct request_sock_queue *queue = &inet_csk(nsk)->icsk_accept_queue;
> + bool migrated = false;
> +
> + spin_lock(&queue->rskq_lock);
> + if (likely(nsk->sk_state == TCP_LISTEN)) {
> + migrated = true;
> +
> + req->dl_next = NULL;
> + if (queue->rskq_accept_head == NULL)
> + WRITE_ONCE(queue->rskq_accept_head, req);
> + else
> + queue->rskq_accept_tail->dl_next = req;
> + queue->rskq_accept_tail = req;
> + sk_acceptq_added(nsk);
> + inet_csk_reqsk_queue_migrated(sk, nsk, req);
need to first resolve the question raised in patch 5 regarding
to the update on req->rsk_listener though.
> + }
> + spin_unlock(&queue->rskq_lock);
> +
> + return migrated;
> +}
> +
> struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
> struct request_sock *req, bool own_req)
> {
> @@ -1023,9 +1046,11 @@ EXPORT_SYMBOL(inet_csk_complete_hashdance);
> */
> void inet_csk_listen_stop(struct sock *sk)
> {
> + struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb);
> struct inet_connection_sock *icsk = inet_csk(sk);
> struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> struct request_sock *next, *req;
> + struct sock *nsk;
>
> /* Following specs, it would be better either to send FIN
> * (and enter FIN-WAIT-1, it is normal close)
> @@ -1043,8 +1068,19 @@ void inet_csk_listen_stop(struct sock *sk)
> WARN_ON(sock_owned_by_user(child));
> sock_hold(child);
>
> + if (reuseport_cb) {
> + nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, NULL);
> + if (nsk) {
> + if (inet_csk_reqsk_queue_migrate(sk, nsk, req))
> + goto unlock_sock;
> + else
> + sock_put(nsk);
> + }
> + }
> +
> inet_child_forget(sk, req, child);
> reqsk_put(req);
> +unlock_sock:
> bh_unlock_sock(child);
> local_bh_enable();
> sock_put(child);
> ---8<---
>
>
> > > > 5. lock the accept queue of the new listener
> > > > 6. splice requests and increment refcount
> > > > 7. unlock
> > > >
> > > > Also, I think splicing is better to keep the order of requests. Adding one
> > > > by one reverses it.
> > > It can keep the order but I think it is orthogonal here.
next prev parent reply other threads:[~2020-12-10 1:54 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-01 14:44 [PATCH v1 bpf-next 00/11] Socket migration for SO_REUSEPORT Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group Kuniyuki Iwashima
2020-12-05 1:31 ` Martin KaFai Lau
2020-12-06 4:38 ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues Kuniyuki Iwashima
2020-12-01 15:25 ` Eric Dumazet
2020-12-03 14:14 ` Kuniyuki Iwashima
2020-12-03 14:31 ` Eric Dumazet
2020-12-03 15:41 ` Kuniyuki Iwashima
2020-12-07 20:33 ` Martin KaFai Lau
2020-12-08 6:31 ` Kuniyuki Iwashima
2020-12-08 7:34 ` Martin KaFai Lau
2020-12-08 8:17 ` Kuniyuki Iwashima
2020-12-09 3:09 ` Martin KaFai Lau
2020-12-09 8:05 ` Kuniyuki Iwashima
2020-12-09 16:57 ` Kuniyuki Iwashima
2020-12-10 1:53 ` Martin KaFai Lau [this message]
2020-12-10 5:58 ` Kuniyuki Iwashima
2020-12-10 19:33 ` Martin KaFai Lau
2020-12-14 17:16 ` Kuniyuki Iwashima
2020-12-05 1:42 ` Martin KaFai Lau
2020-12-06 4:41 ` Kuniyuki Iwashima
[not found] ` <20201205160307.91179-1-kuniyu@amazon.co.jp>
2020-12-07 20:14 ` Martin KaFai Lau
2020-12-08 6:27 ` Kuniyuki Iwashima
2020-12-08 8:13 ` Martin KaFai Lau
2020-12-08 9:02 ` Kuniyuki Iwashima
2020-12-08 6:54 ` Martin KaFai Lau
2020-12-08 7:42 ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV Kuniyuki Iwashima
2020-12-01 15:30 ` Eric Dumazet
2020-12-01 14:44 ` [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests Kuniyuki Iwashima
2020-12-01 15:13 ` Eric Dumazet
2020-12-03 14:12 ` Kuniyuki Iwashima
2020-12-10 0:07 ` Martin KaFai Lau
2020-12-10 5:15 ` Kuniyuki Iwashima
2020-12-10 18:49 ` Martin KaFai Lau
2020-12-14 17:03 ` Kuniyuki Iwashima
2020-12-15 2:58 ` Martin KaFai Lau
2020-12-16 16:41 ` Kuniyuki Iwashima
2020-12-16 22:24 ` Martin KaFai Lau
2020-12-01 14:44 ` [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2020-12-02 2:04 ` Andrii Nakryiko
2020-12-02 19:19 ` Martin KaFai Lau
2020-12-03 4:24 ` Martin KaFai Lau
2020-12-03 14:16 ` Kuniyuki Iwashima
2020-12-04 5:56 ` Martin KaFai Lau
2020-12-06 4:32 ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type " Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md) Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT Kuniyuki Iwashima
2020-12-04 19:58 ` Martin KaFai Lau
2020-12-06 4:36 ` Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration Kuniyuki Iwashima
2020-12-01 14:44 ` [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE Kuniyuki Iwashima
2020-12-05 1:50 ` Martin KaFai Lau
2020-12-06 4:43 ` Kuniyuki Iwashima
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20201210015319.e6njlcwuhfpre3bn@kafai-mbp.dhcp.thefacebook.com \
--to=kafai@fb.com \
--cc=ast@kernel.org \
--cc=benh@amazon.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=eric.dumazet@gmail.com \
--cc=kuba@kernel.org \
--cc=kuni1840@gmail.com \
--cc=kuniyu@amazon.co.jp \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).