* [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
@ 2024-03-07 22:00 Eric Dumazet
2024-03-08 8:37 ` Paolo Abeni
2024-03-11 19:39 ` patchwork-bot+netdevbpf
0 siblings, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2024-03-07 22:00 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: netdev, eric.dumazet, Eric Dumazet, Martin KaFai Lau,
Joe Stringer, Alexei Starovoitov, Willem de Bruijn,
Kuniyuki Iwashima
After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
UDP early demux no longer need to grab a refcount on the UDP socket.
This save two atomic operations per incoming packet for connected
sockets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Joe Stringer <joe@wand.net.nz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/udp.c | 5 +++--
net/ipv6/udp.c | 5 +++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
uh->source, iph->saddr, dif, sdif);
}
- if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+ if (!sk)
return 0;
skb->sk = sk;
- skb->destructor = sock_efree;
+ DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
+ skb->destructor = sock_pfree;
dst = rcu_dereference(sk->sk_rx_dst);
if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3f2249b4cd5f6a594dd9768e29f20f0d9a57faed..fad6667fad6644db8c679ae9b723ccda15edaede 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1101,11 +1101,12 @@ void udp_v6_early_demux(struct sk_buff *skb)
else
return;
- if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+ if (!sk)
return;
skb->sk = sk;
- skb->destructor = sock_efree;
+ DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
+ skb->destructor = sock_pfree;
dst = rcu_dereference(sk->sk_rx_dst);
if (dst)
--
2.44.0.278.ge034bb2e1d-goog
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-07 22:00 [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux Eric Dumazet
@ 2024-03-08 8:37 ` Paolo Abeni
2024-03-08 9:19 ` Paolo Abeni
2024-03-08 9:21 ` Eric Dumazet
2024-03-11 19:39 ` patchwork-bot+netdevbpf
1 sibling, 2 replies; 8+ messages in thread
From: Paolo Abeni @ 2024-03-08 8:37 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski
Cc: netdev, eric.dumazet, Martin KaFai Lau, Joe Stringer,
Alexei Starovoitov, Willem de Bruijn, Kuniyuki Iwashima
On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> UDP early demux no longer need to grab a refcount on the UDP socket.
>
> This save two atomic operations per incoming packet for connected
> sockets.
This reminds me of a old series:
https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/
and I'm wondering if we could reconsider such option.
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Joe Stringer <joe@wand.net.nz>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> net/ipv4/udp.c | 5 +++--
> net/ipv6/udp.c | 5 +++--
> 2 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> uh->source, iph->saddr, dif, sdif);
> }
>
> - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> + if (!sk)
> return 0;
>
> skb->sk = sk;
> - skb->destructor = sock_efree;
> + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> + skb->destructor = sock_pfree;
I *think* that the skb may escape the current rcu section if e.g. if
matches a nf dup target in the input tables.
Back then I tried to implement some debug infra to track such accesses:
https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/
which was buggy (prone to false negative). I think it can be improved
to something more reliable, perhaps I should revamp it?
I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
set.
Perhaps we could use a simple 'noop' destructor as in:
https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/
Thanks!
Paolo
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-08 8:37 ` Paolo Abeni
@ 2024-03-08 9:19 ` Paolo Abeni
2024-03-08 9:23 ` Eric Dumazet
2024-03-08 9:21 ` Eric Dumazet
1 sibling, 1 reply; 8+ messages in thread
From: Paolo Abeni @ 2024-03-08 9:19 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller, Jakub Kicinski
Cc: netdev, eric.dumazet, Martin KaFai Lau, Joe Stringer,
Alexei Starovoitov, Willem de Bruijn, Kuniyuki Iwashima
On Fri, 2024-03-08 at 09:37 +0100, Paolo Abeni wrote:
> On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> > UDP early demux no longer need to grab a refcount on the UDP socket.
> >
> > This save two atomic operations per incoming packet for connected
> > sockets.
>
> This reminds me of a old series:
>
> https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/
>
> and I'm wondering if we could reconsider such option.
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Joe Stringer <joe@wand.net.nz>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > net/ipv4/udp.c | 5 +++--
> > net/ipv6/udp.c | 5 +++--
> > 2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> > uh->source, iph->saddr, dif, sdif);
> > }
> >
> > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > + if (!sk)
> > return 0;
> >
> > skb->sk = sk;
> > - skb->destructor = sock_efree;
> > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > + skb->destructor = sock_pfree;
>
> I *think* that the skb may escape the current rcu section if e.g. if
> matches a nf dup target in the input tables.
>
> Back then I tried to implement some debug infra to track such accesses:
>
> https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/
>
> which was buggy (prone to false negative). I think it can be improved
> to something more reliable, perhaps I should revamp it?
>
> I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
> an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
> set.
>
> Perhaps we could use a simple 'noop' destructor as in:
>
> https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/
Please ignore this last part, too late I noticed we need 'sock_pfree'
to let inet_steal_sock() work as expected.
Paolo
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-08 8:37 ` Paolo Abeni
2024-03-08 9:19 ` Paolo Abeni
@ 2024-03-08 9:21 ` Eric Dumazet
2024-03-08 11:10 ` Paolo Abeni
1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2024-03-08 9:21 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, netdev, eric.dumazet,
Martin KaFai Lau, Joe Stringer, Alexei Starovoitov,
Willem de Bruijn, Kuniyuki Iwashima, Florian Westphal
On Fri, Mar 8, 2024 at 9:37 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-03-07 at 22:00 +0000, Eric Dumazet wrote:
> > After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> > and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> > UDP early demux no longer need to grab a refcount on the UDP socket.
> >
> > This save two atomic operations per incoming packet for connected
> > sockets.
>
> This reminds me of a old series:
>
> https://lore.kernel.org/netdev/cover.1506114055.git.pabeni@redhat.com/
>
> and I'm wondering if we could reconsider such option.
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: Joe Stringer <joe@wand.net.nz>
> > Cc: Alexei Starovoitov <ast@kernel.org>
> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > net/ipv4/udp.c | 5 +++--
> > net/ipv6/udp.c | 5 +++--
> > 2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> > uh->source, iph->saddr, dif, sdif);
> > }
> >
> > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > + if (!sk)
> > return 0;
> >
> > skb->sk = sk;
> > - skb->destructor = sock_efree;
> > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > + skb->destructor = sock_pfree;
>
> I *think* that the skb may escape the current rcu section if e.g. if
> matches a nf dup target in the input tables.
You mean the netfilter queueing stuff perhaps ?
This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt):
if (skb_sk_is_prefetched(skb)) {
struct sock *sk = skb->sk;
if (!sk_is_refcounted(sk)) {
if (!refcount_inc_not_zero(&sk->sk_refcnt))
return -ENOTCONN;
/* drop refcount on skb_orphan */
skb->destructor = sock_edemux;
}
}
I would think a duplicate can not duplicate skb->sk in general, or must also
attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor.
>
> Back then I tried to implement some debug infra to track such accesses:
>
> https://lore.kernel.org/lkml/cover.1507294365.git.pabeni@redhat.com/
>
> which was buggy (prone to false negative). I think it can be improved
> to something more reliable, perhaps I should revamp it?
>
> I'm also wondering if the DEBUG_NET_WARN_ON_ONCE is worthy?!? the sk is
> an hashed UDP socket so is a full sock and has the bit SOCK_RCU_FREE
> set.
This was mostly to catch any future issues and related to my use of sock_pfree()
DEBUG_NET_WARN_ON_ONCE() is a nop, unless you compile a DEV kernel.
>
> Perhaps we could use a simple 'noop' destructor as in:
>
> https://lore.kernel.org/netdev/b16163e3a4fa4d772edeabd8743acb4a07206bb9.1506114055.git.pabeni@redhat.com/
>
I think we need sock_pfree() for inet_steal_sock(), I might be wrong.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-08 9:19 ` Paolo Abeni
@ 2024-03-08 9:23 ` Eric Dumazet
0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2024-03-08 9:23 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, netdev, eric.dumazet,
Martin KaFai Lau, Joe Stringer, Alexei Starovoitov,
Willem de Bruijn, Kuniyuki Iwashima
On Fri, Mar 8, 2024 at 10:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
> Please ignore this last part, too late I noticed we need 'sock_pfree'
> to let inet_steal_sock() work as expected.
Yes, this is skb_sk_is_prefetched() part.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-08 9:21 ` Eric Dumazet
@ 2024-03-08 11:10 ` Paolo Abeni
2024-03-08 12:40 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Paolo Abeni @ 2024-03-08 11:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, netdev, eric.dumazet,
Martin KaFai Lau, Joe Stringer, Alexei Starovoitov,
Willem de Bruijn, Kuniyuki Iwashima, Florian Westphal
On Fri, 2024-03-08 at 10:21 +0100, Eric Dumazet wrote:
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index a8acea17b4e5344d022ae8f8eb674d1a36f8035a..e43ad1d846bdc2ddf5767606b78bbd055f692aa8 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -2570,11 +2570,12 @@ int udp_v4_early_demux(struct sk_buff *skb)
> > > uh->source, iph->saddr, dif, sdif);
> > > }
> > >
> > > - if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
> > > + if (!sk)
> > > return 0;
> > >
> > > skb->sk = sk;
> > > - skb->destructor = sock_efree;
> > > + DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
> > > + skb->destructor = sock_pfree;
> >
> > I *think* that the skb may escape the current rcu section if e.g. if
> > matches a nf dup target in the input tables.
>
> You mean the netfilter queueing stuff perhaps ?
>
> This is already safe, it uses a refcount_inc_not_zero(&sk->sk_refcnt):
>
> if (skb_sk_is_prefetched(skb)) {
> struct sock *sk = skb->sk;
>
> if (!sk_is_refcounted(sk)) {
> if (!refcount_inc_not_zero(&sk->sk_refcnt))
> return -ENOTCONN;
>
> /* drop refcount on skb_orphan */
> skb->destructor = sock_edemux;
> }
> }
>
> I would think a duplicate can not duplicate skb->sk in general, or must also
> attempt an refcount_inc_not_zero(&sk->sk_refcnt) and use a related destructor.
Right, looks safe.
Acked-by: Paolo Abeni <pabeni@redhat.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-08 11:10 ` Paolo Abeni
@ 2024-03-08 12:40 ` Eric Dumazet
0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2024-03-08 12:40 UTC (permalink / raw)
To: Paolo Abeni
Cc: David S . Miller, Jakub Kicinski, netdev, eric.dumazet,
Martin KaFai Lau, Joe Stringer, Alexei Starovoitov,
Willem de Bruijn, Kuniyuki Iwashima, Florian Westphal
On Fri, Mar 8, 2024 at 12:11 PM Paolo Abeni <pabeni@redhat.com> wrote:
> Right, looks safe.
>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
Thanks !
BTW, I was thinking (for next cycle) to let users (or TCP stack with
ad-hoc heuristics )
opt-in TCP sockets to SOCK_RCU_FREE.
This would avoid the refcnt dance for their incoming packets.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux
2024-03-07 22:00 [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux Eric Dumazet
2024-03-08 8:37 ` Paolo Abeni
@ 2024-03-11 19:39 ` patchwork-bot+netdevbpf
1 sibling, 0 replies; 8+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-03-11 19:39 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, netdev, eric.dumazet, kafai, joe, ast,
willemdebruijn.kernel, kuniyu
Hello:
This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:
On Thu, 7 Mar 2024 22:00:16 +0000 you wrote:
> After commits ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
> and 7ae215d23c12 ("bpf: Don't refcount LISTEN sockets in sk_assign()")
> UDP early demux no longer need to grab a refcount on the UDP socket.
>
> This save two atomic operations per incoming packet for connected
> sockets.
>
> [...]
Here is the summary with links:
- [net-next] udp: no longer touch sk->sk_refcnt in early demux
https://git.kernel.org/netdev/net-next/c/08842c43d016
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-03-11 19:39 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-07 22:00 [PATCH net-next] udp: no longer touch sk->sk_refcnt in early demux Eric Dumazet
2024-03-08 8:37 ` Paolo Abeni
2024-03-08 9:19 ` Paolo Abeni
2024-03-08 9:23 ` Eric Dumazet
2024-03-08 9:21 ` Eric Dumazet
2024-03-08 11:10 ` Paolo Abeni
2024-03-08 12:40 ` Eric Dumazet
2024-03-11 19:39 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).