All of lore.kernel.org
 help / color / mirror / Atom feed
From: Paolo Abeni <pabeni@redhat.com>
To: dust.li@linux.alibaba.com, Jens Axboe <axboe@kernel.dk>,
	Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	io-uring@vger.kernel.org, netdev <netdev@vger.kernel.org>
Subject: Re: [PATCHSET 0/4] Add support for no-lock sockets
Date: Wed, 13 Apr 2022 09:53:59 +0200	[thread overview]
Message-ID: <4864c2c265f6986bcd7576a9066cb430fbfdde95.camel@redhat.com> (raw)
In-Reply-To: <20220413052339.GJ35207@linux.alibaba.com>

On Wed, 2022-04-13 at 13:23 +0800, dust.li wrote:
> On Tue, Apr 12, 2022 at 08:01:10PM -0600, Jens Axboe wrote:
> > On 4/12/22 7:54 PM, Eric Dumazet wrote:
> > > On Tue, Apr 12, 2022 at 6:26 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > > 
> > > > On 4/12/22 6:40 PM, Eric Dumazet wrote:
> > > > > 
> > > > > On 4/12/22 13:26, Jens Axboe wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > If we accept a connection directly, eg without installing a file
> > > > > > descriptor for it, or if we use IORING_OP_SOCKET in direct mode, then
> > > > > > we have a socket for recv/send that we can fully serialize access to.
> > > > > > 
> > > > > > With that in mind, we can feasibly skip locking on the socket for TCP
> > > > > > in that case. Some of the testing I've done has shown as much as 15%
> > > > > > of overhead in the lock_sock/release_sock part, with this change then
> > > > > > we see none.
> > > > > > 
> > > > > > Comments welcome!
> > > > > > 
> > > > > How BH handlers (including TCP timers) and io_uring are going to run
> > > > > safely ? Even if a tcp socket had one user, (private fd opened by a
> > > > > non multi-threaded program), we would still to use the spinlock.
> > > > 
> > > > But we don't even hold the spinlock over lock_sock() and release_sock(),
> > > > just the mutex. And we do check for running eg the backlog on release,
> > > > which I believe is done safely and similarly in other places too.
> > > 
> > > So lets say TCP stack receives a packet in BH handler... it proceeds
> > > using many tcp sock fields.
> > > 
> > > Then io_uring wants to read/write stuff from another cpu, while BH
> > > handler(s) is(are) not done yet,
> > > and will happily read/change many of the same fields
> > 
> > But how is that currently protected? The bh spinlock is only held
> > briefly while locking the socket, and ditto on the relase. Outside of
> > that, the owner field is used. At least as far as I can tell. I'm
> > assuming the mutex exists solely to serialize acess to eg send/recv on
> > the system call side.
> 
> Hi jens,
> 
> I personally like the idea of using iouring to improve the performance
> of the socket API.
> 
> AFAIU, the bh spinlock will be held by the BH when trying to make
> changes to those protected fields on the socket, and the userspace
> will try to hold that spinlock before it can change the sock lock
> owner field.
> 
> For example:
> in tcp_v4_rcv() we have
> 
>         bh_lock_sock_nested(sk);
>         tcp_segs_in(tcp_sk(sk), skb);
>         ret = 0;
>         if (!sock_owned_by_user(sk)) {
>                 ret = tcp_v4_do_rcv(sk, skb);
>         } else {
>                 if (tcp_add_backlog(sk, skb, &drop_reason))
>                         goto discard_and_relse;
>         }
>         bh_unlock_sock(sk);
> 
> When this is called in the BH, it will first hold the bh spinlock
> and then check the owner field, tcp_v4_do_rcv() will always been
> protected by the bh spinlock.
> 
> If the user thread tries to make changes to the socket, it first
> call lock_sock() which will also try to hold the bh spinlock, I
> think that prevent the race.
> 
>   void lock_sock_nested(struct sock *sk, int subclass)
>   {
>           /* The sk_lock has mutex_lock() semantics here. */
>           mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
> 
>           might_sleep();
>           spin_lock_bh(&sk->sk_lock.slock);
>           if (sock_owned_by_user_nocheck(sk))
>                   __lock_sock(sk);
>           sk->sk_lock.owned = 1;
>           spin_unlock_bh(&sk->sk_lock.slock);
>   }
> 
> But if we remove the spinlock in the lock_sock() when sk_no_lock
> is set to true. When the the bh spinlock is already held by the BH,
> it seems the userspace won't respect that anymore ?

Exactly, with sk_no_lock we will have the following race:

[BH/timer on CPU 0]			[ reader/writer on CPU 1]

bh_lock_sock_nested(sk);
// owned is currently 0
if (!sock_owned_by_user(sk)) {
    // modify sk state

					if (sk->sk_no_lock) {
						sk->sk_lock.owned = 1;
						smp_wmb();
   // still touching sk state
					// cuncurrently modify sk
state
					// sk is corrupted

We need both the sk spinlock and the 'owned' bit to ensure mutually
exclusive access WRT soft interrupts. 

I personally don't see any way to fix the above without the sk spinlock
- or an equivalent contended atomic operation.

Additionally these changes add relevant overhead for the !sk_no_lock
case - the additional memory barriers and conditionals - which will
impact most/all existing users.

Touching a very fundamental and internal piece of the core networking,
corrently extremelly stable, similar changes will require a very
extensive testing, comprising benchmarking for the current/!sk_no_lock
code paths with different workloads and additional self-tests.

Thanks.

Paolo


      reply	other threads:[~2022-04-13  7:54 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-12 20:26 [PATCHSET 0/4] Add support for no-lock sockets Jens Axboe
2022-04-12 20:26 ` [PATCH 1/4] net: add sock 'sk_no_lock' member Jens Axboe
2022-04-12 20:26 ` [PATCH 2/4] net: allow sk_prot->release_cb() without sock lock held Jens Axboe
2022-04-12 20:26 ` [PATCH 3/4] net: add support for socket no-lock Jens Axboe
2022-04-12 20:26 ` [PATCH 4/4] io_uring: mark accept direct socket as no-lock Jens Axboe
2022-04-13  0:40 ` [PATCHSET 0/4] Add support for no-lock sockets Eric Dumazet
2022-04-13  1:26   ` Jens Axboe
2022-04-13  1:54     ` Eric Dumazet
2022-04-13  2:01       ` Jens Axboe
2022-04-13  2:05         ` Eric Dumazet
2022-04-13  2:12           ` Jens Axboe
2022-04-13  2:19             ` Eric Dumazet
2022-04-13  2:26               ` Eric Dumazet
2022-04-13  2:27               ` Jens Axboe
2022-04-13  2:32                 ` Eric Dumazet
2022-04-13  2:38                   ` Jens Axboe
2022-04-13  5:23         ` dust.li
2022-04-13  7:53           ` Paolo Abeni [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4864c2c265f6986bcd7576a9066cb430fbfdde95.camel@redhat.com \
    --to=pabeni@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dust.li@linux.alibaba.com \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=io-uring@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.