* [PATCH net-next 0/4] udp: small changes on receive path
@ 2024-03-28 14:40 Eric Dumazet
  2024-03-28 14:40 ` [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb() Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Eric Dumazet @ 2024-03-28 14:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

This series is based on an observation I made on the UDP receive path.

The sock_def_readable() costs are pretty high, especially when
epoll is used to generate EPOLLIN events.

First patch annotates races on sk->sk_rcvbuf reads.

Second patch replaces an atomic_add_return() with a less
expensive atomic_add().

Third patch avoids calling sock_def_readable() when possible.

Fourth patch adds sk_wake_async_rcu() to get better inlining
and code generation.
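
For context, a minimal sketch of the kind of epoll-driven UDP receiver
whose wakeup path these patches trim (not part of the series; the port,
buffer sizes and lack of error handling are illustrative only):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int ep = epoll_create1(0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(5555),	/* arbitrary port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	struct epoll_event out[16];
	char buf[2048];

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

	for (;;) {
		int n = epoll_wait(ep, out, 16, -1);

		/* Every datagram enqueued while we are busy here may run
		 * sock_def_readable() -> ep_poll_callback() in the kernel;
		 * the third patch can skip that work when the receive
		 * queue was already non-empty.
		 */
		for (int i = 0; i < n; i++)
			while (recv(out[i].data.fd, buf, sizeof(buf),
				    MSG_DONTWAIT) > 0)
				;
	}
}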

Eric Dumazet (4):
  udp: annotate data-race in __udp_enqueue_schedule_skb()
  udp: relax atomic operation on sk->sk_rmem_alloc
  udp: avoid calling sock_def_readable() if possible
  net: add sk_wake_async_rcu() helper

 crypto/af_alg.c      |  4 ++--
 include/net/sock.h   |  6 ++++++
 net/atm/common.c     |  2 +-
 net/core/sock.c      |  8 ++++----
 net/dccp/output.c    |  2 +-
 net/ipv4/udp.c       | 32 ++++++++++++++++++--------------
 net/iucv/af_iucv.c   |  2 +-
 net/rxrpc/af_rxrpc.c |  2 +-
 net/sctp/socket.c    |  2 +-
 net/smc/smc_rx.c     |  4 ++--
 net/unix/af_unix.c   |  2 +-
 11 files changed, 38 insertions(+), 28 deletions(-)

-- 
2.44.0.396.g6e790dbe36-goog



* [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb()
  2024-03-28 14:40 [PATCH net-next 0/4] udp: small changes on receive path Eric Dumazet
@ 2024-03-28 14:40 ` Eric Dumazet
  2024-03-28 23:52   ` Willem de Bruijn
  2024-03-28 14:40 ` [PATCH net-next 2/4] udp: relax atomic operation on sk->sk_rmem_alloc Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2024-03-28 14:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

sk->sk_rcvbuf is read locklessly twice, while other threads
could change its value.

Use a READ_ONCE() to annotate the race.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 661d0e0d273f616ad82746b69b2c76d056633017..f2736e8958187e132ef45d8e25ab2b4ea7bcbc3d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1492,13 +1492,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, err = -ENOMEM;
 	spinlock_t *busy = NULL;
-	int size;
+	int size, rcvbuf;
 
-	/* try to avoid the costly atomic add/sub pair when the receive
-	 * queue is full; always allow at least a packet
+	/* Immediately drop when the receive queue is full.
+	 * Always allow at least one packet.
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem > sk->sk_rcvbuf)
+	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
+	if (rmem > rcvbuf)
 		goto drop;
 
 	/* Under mem pressure, it might be helpful to help udp_recvmsg()
@@ -1507,7 +1508,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * - Less cache line misses at copyout() time
 	 * - Less work at consume_skb() (less alien page frag freeing)
 	 */
-	if (rmem > (sk->sk_rcvbuf >> 1)) {
+	if (rmem > (rcvbuf >> 1)) {
 		skb_condense(skb);
 
 		busy = busylock_acquire(sk);
-- 
2.44.0.396.g6e790dbe36-goog
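
For readers less familiar with these annotations, a rough sketch of the
general pattern being applied (illustrative only; the writer side is an
assumption about where sk_rcvbuf updates come from, not part of this
patch):

	/* Writer, e.g. handling setsockopt(SO_RCVBUF) with the socket
	 * locked: the store is annotated so lockless readers cannot see
	 * a torn value.
	 */
	WRITE_ONCE(sk->sk_rcvbuf, val);

	/* Lockless reader (this patch): take one annotated snapshot and
	 * reuse it, so the two comparisons in __udp_enqueue_schedule_skb()
	 * cannot observe two different values, and KCSAN knows the race
	 * is intentional.
	 */
	rcvbuf = READ_ONCE(sk->sk_rcvbuf);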



* [PATCH net-next 2/4] udp: relax atomic operation on sk->sk_rmem_alloc
  2024-03-28 14:40 [PATCH net-next 0/4] udp: small changes on receive path Eric Dumazet
  2024-03-28 14:40 ` [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb() Eric Dumazet
@ 2024-03-28 14:40 ` Eric Dumazet
  2024-03-28 14:40 ` [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2024-03-28 14:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

atomic_add_return() is more expensive than atomic_add()
and seems overkill in the UDP rx fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f2736e8958187e132ef45d8e25ab2b4ea7bcbc3d..d2fa9755727ce034c2b4bca82bd9e72130d588e6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1516,12 +1516,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	size = skb->truesize;
 	udp_set_dev_scratch(skb);
 
-	/* we drop only if the receive buf is full and the receive
-	 * queue contains some other skb
-	 */
-	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if (rmem > (size + (unsigned int)sk->sk_rcvbuf))
-		goto uncharge_drop;
+	atomic_add(size, &sk->sk_rmem_alloc);
 
 	spin_lock(&list->lock);
 	err = udp_rmem_schedule(sk, size);
-- 
2.44.0.396.g6e790dbe36-goog
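
To make the cost difference concrete, a userspace analogue using C11
atomics (an illustration of the ordering semantics, not kernel code):

#include <stdatomic.h>

static atomic_int rmem;

/* Models atomic_add(): no return value, no ordering requirement, so it
 * can be a single unordered read-modify-write on many architectures.
 */
void charge_relaxed(int size)
{
	atomic_fetch_add_explicit(&rmem, size, memory_order_relaxed);
}

/* Models atomic_add_return(): the new value is needed and the operation
 * is fully ordered, which typically implies an extra memory barrier
 * (noticeable e.g. on arm64).
 */
int charge_ordered(int size)
{
	return atomic_fetch_add_explicit(&rmem, size,
					 memory_order_seq_cst) + size;
}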



* [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible
  2024-03-28 14:40 [PATCH net-next 0/4] udp: small changes on receive path Eric Dumazet
  2024-03-28 14:40 ` [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb() Eric Dumazet
  2024-03-28 14:40 ` [PATCH net-next 2/4] udp: relax atomic operation on sk->sk_rmem_alloc Eric Dumazet
@ 2024-03-28 14:40 ` Eric Dumazet
  2024-03-28 23:58   ` Willem de Bruijn
  2024-03-29 10:22   ` Paolo Abeni
  2024-03-28 14:40 ` [PATCH net-next 4/4] net: add sk_wake_async_rcu() helper Eric Dumazet
  2024-03-29 22:10 ` [PATCH net-next 0/4] udp: small changes on receive path patchwork-bot+netdevbpf
  4 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2024-03-28 14:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

sock_def_readable() is quite expensive (particularly
when ep_poll_callback() is in the picture).

We must call sk->sk_data_ready() when :

- receive queue was empty, or
- SO_PEEK_OFF is enabled on the socket, or
- sk->sk_data_ready is not sock_def_readable.

We still need to call sk_wake_async().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d2fa9755727ce034c2b4bca82bd9e72130d588e6..5dfbe4499c0f89f94af9ee1fb64559dd672c1439 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1492,6 +1492,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, err = -ENOMEM;
 	spinlock_t *busy = NULL;
+	bool becomes_readable;
 	int size, rcvbuf;
 
 	/* Immediately drop when the receive queue is full.
@@ -1532,12 +1533,19 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	sock_skb_set_dropcount(sk, skb);
 
+	becomes_readable = skb_queue_empty(list);
 	__skb_queue_tail(list, skb);
 	spin_unlock(&list->lock);
 
-	if (!sock_flag(sk, SOCK_DEAD))
-		INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
-
+	if (!sock_flag(sk, SOCK_DEAD)) {
+		if (becomes_readable ||
+		    sk->sk_data_ready != sock_def_readable ||
+		    READ_ONCE(sk->sk_peek_off) >= 0)
+			INDIRECT_CALL_1(sk->sk_data_ready,
+					sock_def_readable, sk);
+		else
+			sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	}
 	busylock_release(busy);
 	return 0;
 
-- 
2.44.0.396.g6e790dbe36-goog
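
For reference, the READ_ONCE(sk->sk_peek_off) >= 0 test corresponds to a
socket on which SO_PEEK_OFF has been enabled; a minimal userspace sketch
of that configuration (function name and buffer size are placeholders):

#include <sys/socket.h>

/* Assumes 'fd' is a bound UDP socket. */
static void enable_peek_off(int fd)
{
	int off = 0;
	char buf[2048];

	/* After this, MSG_PEEK reads start at the per-socket peek offset
	 * and advance it, instead of always peeking at the head of the
	 * receive queue.
	 */
	setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	recv(fd, buf, sizeof(buf), MSG_PEEK);
}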



* [PATCH net-next 4/4] net: add sk_wake_async_rcu() helper
  2024-03-28 14:40 [PATCH net-next 0/4] udp: small changes on receive path Eric Dumazet
                   ` (2 preceding siblings ...)
  2024-03-28 14:40 ` [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible Eric Dumazet
@ 2024-03-28 14:40 ` Eric Dumazet
  2024-03-29 22:10 ` [PATCH net-next 0/4] udp: small changes on receive path patchwork-bot+netdevbpf
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2024-03-28 14:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

While looking at UDP receive performance, I saw sk_wake_async()
was no longer inlined.

This matters at least on AMD Zen1-4 platforms (see SRSO).

This might be because rcu_read_lock() and rcu_read_unlock()
are no longer nops in recent kernels?

Add an sk_wake_async_rcu() variant, which must be called from
contexts already holding the rcu read lock.

As SOCK_FASYNC is deprecated nowadays, use unlikely()
to give a hint to the compiler.

sk_wake_async_rcu() is properly inlined from
__udp_enqueue_schedule_skb() and sock_def_readable().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 crypto/af_alg.c      | 4 ++--
 include/net/sock.h   | 6 ++++++
 net/atm/common.c     | 2 +-
 net/core/sock.c      | 8 ++++----
 net/dccp/output.c    | 2 +-
 net/ipv4/udp.c       | 2 +-
 net/iucv/af_iucv.c   | 2 +-
 net/rxrpc/af_rxrpc.c | 2 +-
 net/sctp/socket.c    | 2 +-
 net/smc/smc_rx.c     | 4 ++--
 net/unix/af_unix.c   | 2 +-
 11 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 68cc9290cabe9a9f8a264908466897f2f93e039d..5bc6d0fa7498df30fdf002ec7bcfb46ed4344e8c 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -847,7 +847,7 @@ void af_alg_wmem_wakeup(struct sock *sk)
 		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN |
 							   EPOLLRDNORM |
 							   EPOLLRDBAND);
-	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_IN);
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(af_alg_wmem_wakeup);
@@ -914,7 +914,7 @@ static void af_alg_data_wakeup(struct sock *sk)
 		wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT |
 							   EPOLLRDNORM |
 							   EPOLLRDBAND);
-	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	rcu_read_unlock();
 }
 
diff --git a/include/net/sock.h b/include/net/sock.h
index b5e00702acc1f037df7eb8ad085d00e0b18079a8..38adc3970500f4ae1b8d5ade343c5fbe1d04e085 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2506,6 +2506,12 @@ static inline void sk_wake_async(const struct sock *sk, int how, int band)
 	}
 }
 
+static inline void sk_wake_async_rcu(const struct sock *sk, int how, int band)
+{
+	if (unlikely(sock_flag(sk, SOCK_FASYNC)))
+		sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
+}
+
 /* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
  * need sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak.
  * Note: for send buffers, TCP works better if we can build two skbs at
diff --git a/net/atm/common.c b/net/atm/common.c
index 2a1ec014e901d6549732e7bce35bce6a9eb467e0..9b75699992ff9244470c143433f444fb9d46c3b2 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -116,7 +116,7 @@ static void vcc_write_space(struct sock *sk)
 		if (skwq_has_sleeper(wq))
 			wake_up_interruptible(&wq->wait);
 
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 
 	rcu_read_unlock();
diff --git a/net/core/sock.c b/net/core/sock.c
index 43bf3818c19e829b47d3989d36e2e1b3bf985438..b9203fccaf1e29ba8e5f48b44987abb79f28fc60 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3338,7 +3338,7 @@ static void sock_def_error_report(struct sock *sk)
 	wq = rcu_dereference(sk->sk_wq);
 	if (skwq_has_sleeper(wq))
 		wake_up_interruptible_poll(&wq->wait, EPOLLERR);
-	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
+	sk_wake_async_rcu(sk, SOCK_WAKE_IO, POLL_ERR);
 	rcu_read_unlock();
 }
 
@@ -3353,7 +3353,7 @@ void sock_def_readable(struct sock *sk)
 	if (skwq_has_sleeper(wq))
 		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
 						EPOLLRDNORM | EPOLLRDBAND);
-	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_IN);
 	rcu_read_unlock();
 }
 
@@ -3373,7 +3373,7 @@ static void sock_def_write_space(struct sock *sk)
 						EPOLLWRNORM | EPOLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 
 	rcu_read_unlock();
@@ -3398,7 +3398,7 @@ static void sock_def_write_space_wfree(struct sock *sk)
 						EPOLLWRNORM | EPOLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 }
 
diff --git a/net/dccp/output.c b/net/dccp/output.c
index fd2eb148d24de4d1b9e40c6721577ed7f11b5a6c..5c2e24f3c39b7ff4ee1d5d96d5e406c96609a022 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -204,7 +204,7 @@ void dccp_write_space(struct sock *sk)
 		wake_up_interruptible(&wq->wait);
 	/* Should agree with poll, otherwise some programs break */
 	if (sock_writeable(sk))
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 
 	rcu_read_unlock();
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5dfbe4499c0f89f94af9ee1fb64559dd672c1439..4119e74fee02b3930075fe5b00c0fc753a620149 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1544,7 +1544,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 			INDIRECT_CALL_1(sk->sk_data_ready,
 					sock_def_readable, sk);
 		else
-			sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+			sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_IN);
 	}
 	busylock_release(busy);
 	return 0;
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 7c8c3adcac6e94379360ef6e609c48e3b396ceaa..c951bb9cc2e044249ff7e4f86470b4035d60aeaa 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -184,7 +184,7 @@ static void iucv_sock_wake_msglim(struct sock *sk)
 	wq = rcu_dereference(sk->sk_wq);
 	if (skwq_has_sleeper(wq))
 		wake_up_interruptible_all(&wq->wait);
-	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+	sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	rcu_read_unlock();
 }
 
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 5222bc97d192e05e2169dcf5f548fdeb98e6b07b..f4844683e12039d636253cb06f622468593487eb 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -65,7 +65,7 @@ static void rxrpc_write_space(struct sock *sk)
 
 		if (skwq_has_sleeper(wq))
 			wake_up_interruptible(&wq->wait);
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 	rcu_read_unlock();
 }
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c67679a41044fc8e801d175b235249f2c8b99dc0..e416b6d3d2705286d3e5af18b2314bceacfb98b1 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -9276,7 +9276,7 @@ void sctp_data_ready(struct sock *sk)
 	if (skwq_has_sleeper(wq))
 		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN |
 						EPOLLRDNORM | EPOLLRDBAND);
-	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_IN);
 	rcu_read_unlock();
 }
 
diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
index 9a2f3638d161d2ff7d7261835a5b13be63b11701..f0cbe77a80b44046b880e5a7107f535507c76c7c 100644
--- a/net/smc/smc_rx.c
+++ b/net/smc/smc_rx.c
@@ -42,10 +42,10 @@ static void smc_rx_wake_up(struct sock *sk)
 	if (skwq_has_sleeper(wq))
 		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
 						EPOLLRDNORM | EPOLLRDBAND);
-	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_IN);
 	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
 	    (sk->sk_state == SMC_CLOSED))
-		sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP);
+		sk_wake_async_rcu(sk, SOCK_WAKE_WAITD, POLL_HUP);
 	rcu_read_unlock();
 }
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5b41e2321209ae0a17ac97d7214eefd252ec0180..ee382cf55f2016d19e600b6fde75da12b53bea09 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -546,7 +546,7 @@ static void unix_write_space(struct sock *sk)
 		if (skwq_has_sleeper(wq))
 			wake_up_interruptible_sync_poll(&wq->wait,
 				EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND);
-		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 	rcu_read_unlock();
 }
-- 
2.44.0.396.g6e790dbe36-goog
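
For comparison, the pre-existing helper looks roughly like this
(simplified; the point is the RCU read-side section that the new variant
can drop because its callers already hold it):

static inline void sk_wake_async(const struct sock *sk, int how, int band)
{
	if (sock_flag(sk, SOCK_FASYNC)) {
		rcu_read_lock();
		sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
		rcu_read_unlock();
	}
}

Converted callers such as sock_def_readable(), which takes
rcu_read_lock() itself, already run inside an RCU read-side section, so
the nested section, and its cost when rcu_read_lock() is not a nop, goes
away in sk_wake_async_rcu().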



* Re: [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb()
  2024-03-28 14:40 ` [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb() Eric Dumazet
@ 2024-03-28 23:52   ` Willem de Bruijn
  2024-03-29  0:05     ` Willem de Bruijn
  0 siblings, 1 reply; 12+ messages in thread
From: Willem de Bruijn @ 2024-03-28 23:52 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

Eric Dumazet wrote:
> sk->sk_rcvbuf is read locklessly twice, while other threads
> could change its value.
> 
> Use a READ_ONCE() to annotate the race.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/udp.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 661d0e0d273f616ad82746b69b2c76d056633017..f2736e8958187e132ef45d8e25ab2b4ea7bcbc3d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1492,13 +1492,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	struct sk_buff_head *list = &sk->sk_receive_queue;
>  	int rmem, err = -ENOMEM;
>  	spinlock_t *busy = NULL;
> -	int size;
> +	int size, rcvbuf;
>  
> -	/* try to avoid the costly atomic add/sub pair when the receive
> -	 * queue is full; always allow at least a packet
> +	/* Immediately drop when the receive queue is full.
> +	 * Always allow at least one packet.
>  	 */
>  	rmem = atomic_read(&sk->sk_rmem_alloc);
> -	if (rmem > sk->sk_rcvbuf)
> +	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> +	if (rmem > rcvbuf)
>  		goto drop;
>  
>  	/* Under mem pressure, it might be helpful to help udp_recvmsg()
> @@ -1507,7 +1508,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	 * - Less cache line misses at copyout() time
>  	 * - Less work at consume_skb() (less alien page frag freeing)
>  	 */
> -	if (rmem > (sk->sk_rcvbuf >> 1)) {
> +	if (rmem > (rcvbuf >> 1)) {
>  		skb_condense(skb);
>  
>  		busy = busylock_acquire(sk);

There's a third read in this function:

        /* we drop only if the receive buf is full and the receive
         * queue contains some other skb
         */
        rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
        if (rmem > (size + (unsigned int)sk->sk_rcvbuf))
                goto uncharge_drop;

Another READ_ONCE() if the intent is not to use the locally cached copy?


* Re: [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible
  2024-03-28 14:40 ` [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible Eric Dumazet
@ 2024-03-28 23:58   ` Willem de Bruijn
  2024-03-29 10:22   ` Paolo Abeni
  1 sibling, 0 replies; 12+ messages in thread
From: Willem de Bruijn @ 2024-03-28 23:58 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

Eric Dumazet wrote:
> sock_def_readable() is quite expensive (particularly
> when ep_poll_callback() is in the picture).
> 
> We must call sk->sk_data_ready() when :
> 
> - receive queue was empty, or
> - SO_PEEK_OFF is enabled on the socket, or
> - sk->sk_data_ready is not sock_def_readable.
> 
> We still need to call sk_wake_async().
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Willem de Bruijn <willemb@google.com>


* Re: [PATCH net-next 1/4] udp: annotate data-race in __udp_enqueue_schedule_skb()
  2024-03-28 23:52   ` Willem de Bruijn
@ 2024-03-29  0:05     ` Willem de Bruijn
  0 siblings, 0 replies; 12+ messages in thread
From: Willem de Bruijn @ 2024-03-29  0:05 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Dumazet, David S . Miller, Jakub Kicinski,
	Paolo Abeni
  Cc: Willem de Bruijn, netdev, eric.dumazet, Eric Dumazet

Willem de Bruijn wrote:
> Eric Dumazet wrote:
> > sk->sk_rcvbuf is read locklessly twice, while other threads
> > could change its value.
> > 
> > Use a READ_ONCE() to annotate the race.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/ipv4/udp.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 661d0e0d273f616ad82746b69b2c76d056633017..f2736e8958187e132ef45d8e25ab2b4ea7bcbc3d 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1492,13 +1492,14 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >  	struct sk_buff_head *list = &sk->sk_receive_queue;
> >  	int rmem, err = -ENOMEM;
> >  	spinlock_t *busy = NULL;
> > -	int size;
> > +	int size, rcvbuf;
> >  
> > -	/* try to avoid the costly atomic add/sub pair when the receive
> > -	 * queue is full; always allow at least a packet
> > +	/* Immediately drop when the receive queue is full.
> > +	 * Always allow at least one packet.
> >  	 */
> >  	rmem = atomic_read(&sk->sk_rmem_alloc);
> > -	if (rmem > sk->sk_rcvbuf)
> > +	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > +	if (rmem > rcvbuf)
> >  		goto drop;
> >  
> >  	/* Under mem pressure, it might be helpful to help udp_recvmsg()
> > @@ -1507,7 +1508,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >  	 * - Less cache line misses at copyout() time
> >  	 * - Less work at consume_skb() (less alien page frag freeing)
> >  	 */
> > -	if (rmem > (sk->sk_rcvbuf >> 1)) {
> > +	if (rmem > (rcvbuf >> 1)) {
> >  		skb_condense(skb);
> >  
> >  		busy = busylock_acquire(sk);
> 
> There's a third read in this function:

But you remove that in the next patch. Ok.



* Re: [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible
  2024-03-28 14:40 ` [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible Eric Dumazet
  2024-03-28 23:58   ` Willem de Bruijn
@ 2024-03-29 10:22   ` Paolo Abeni
  2024-03-29 10:52     ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2024-03-29 10:22 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski
  Cc: Willem de Bruijn, netdev, eric.dumazet

On Thu, 2024-03-28 at 14:40 +0000, Eric Dumazet wrote:
> sock_def_readable() is quite expensive (particularly
> when ep_poll_callback() is in the picture).
> 
> We must call sk->sk_data_ready() when :
> 
> - receive queue was empty, or
> - SO_PEEK_OFF is enabled on the socket, or
> - sk->sk_data_ready is not sock_def_readable.
> 
> We still need to call sk_wake_async().
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/udp.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index d2fa9755727ce034c2b4bca82bd9e72130d588e6..5dfbe4499c0f89f94af9ee1fb64559dd672c1439 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1492,6 +1492,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	struct sk_buff_head *list = &sk->sk_receive_queue;
>  	int rmem, err = -ENOMEM;
>  	spinlock_t *busy = NULL;
> +	bool becomes_readable;
>  	int size, rcvbuf;
>  
>  	/* Immediately drop when the receive queue is full.
> @@ -1532,12 +1533,19 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	 */
>  	sock_skb_set_dropcount(sk, skb);
>  
> +	becomes_readable = skb_queue_empty(list);
>  	__skb_queue_tail(list, skb);
>  	spin_unlock(&list->lock);
>  
> -	if (!sock_flag(sk, SOCK_DEAD))
> -		INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
> -
> +	if (!sock_flag(sk, SOCK_DEAD)) {
> +		if (becomes_readable ||
> +		    sk->sk_data_ready != sock_def_readable ||
> +		    READ_ONCE(sk->sk_peek_off) >= 0)
> +			INDIRECT_CALL_1(sk->sk_data_ready,
> +					sock_def_readable, sk);
> +		else
> +			sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> +	}

I understood this change showed no performance benefit???

I guess the atomic_add_return() MB was hiding some/most of
sock_def_readable() cost?

Thanks!

Paolo



* Re: [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible
  2024-03-29 10:22   ` Paolo Abeni
@ 2024-03-29 10:52     ` Eric Dumazet
  2024-03-29 11:24       ` Paolo Abeni
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2024-03-29 10:52 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David S . Miller, Jakub Kicinski, Willem de Bruijn, netdev, eric.dumazet

On Fri, Mar 29, 2024 at 11:22 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2024-03-28 at 14:40 +0000, Eric Dumazet wrote:
> > sock_def_readable() is quite expensive (particularly
> > when ep_poll_callback() is in the picture).
> >
> > We must call sk->sk_data_ready() when :
> >
> > - receive queue was empty, or
> > - SO_PEEK_OFF is enabled on the socket, or
> > - sk->sk_data_ready is not sock_def_readable.
> >
> > We still need to call sk_wake_async().
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/ipv4/udp.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index d2fa9755727ce034c2b4bca82bd9e72130d588e6..5dfbe4499c0f89f94af9ee1fb64559dd672c1439 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1492,6 +1492,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >       struct sk_buff_head *list = &sk->sk_receive_queue;
> >       int rmem, err = -ENOMEM;
> >       spinlock_t *busy = NULL;
> > +     bool becomes_readable;
> >       int size, rcvbuf;
> >
> >       /* Immediately drop when the receive queue is full.
> > @@ -1532,12 +1533,19 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >        */
> >       sock_skb_set_dropcount(sk, skb);
> >
> > +     becomes_readable = skb_queue_empty(list);
> >       __skb_queue_tail(list, skb);
> >       spin_unlock(&list->lock);
> >
> > -     if (!sock_flag(sk, SOCK_DEAD))
> > -             INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
> > -
> > +     if (!sock_flag(sk, SOCK_DEAD)) {
> > +             if (becomes_readable ||
> > +                 sk->sk_data_ready != sock_def_readable ||
> > +                 READ_ONCE(sk->sk_peek_off) >= 0)
> > +                     INDIRECT_CALL_1(sk->sk_data_ready,
> > +                                     sock_def_readable, sk);
> > +             else
> > +                     sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> > +     }
>
> I understood this change showed no performance benefit???
>
> I guess the atomic_add_return() MB was hiding some/most of
> sock_def_readable() cost?

It did show benefits in the epoll case, because ep_poll_callback() is
very expensive.

I think you are referring to a prior discussion we had while still
using netperf tests, which do not use epoll.

Eliminating sock_def_readable() avoided the smp_mb() we have in
wq_has_sleeper(), and this was not a convincing win: the apparent cost
of this smp_mb() was high at moderate traffic, but gradually became
small once the cpu was fully utilized.

The atomic_add_return() cost is orthogonal (I see it mostly on ARM64 platforms)
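
For reference, the barrier being discussed sits in wq_has_sleeper(),
roughly (simplified from include/linux/wait.h):

static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
	/* Paired with a barrier on the side that adds itself to the wait
	 * queue; paid on every wakeup attempt, whether or not a sleeper
	 * is present.
	 */
	smp_mb();
	return waitqueue_active(wq_head);
}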


* Re: [PATCH net-next 3/4] udp: avoid calling sock_def_readable() if possible
  2024-03-29 10:52     ` Eric Dumazet
@ 2024-03-29 11:24       ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2024-03-29 11:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Willem de Bruijn, netdev, eric.dumazet

On Fri, 2024-03-29 at 11:52 +0100, Eric Dumazet wrote:
> On Fri, Mar 29, 2024 at 11:22 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Thu, 2024-03-28 at 14:40 +0000, Eric Dumazet wrote:
> > > sock_def_readable() is quite expensive (particularly
> > > when ep_poll_callback() is in the picture).
> > > 
> > > We must call sk->sk_data_ready() when :
> > > 
> > > - receive queue was empty, or
> > > - SO_PEEK_OFF is enabled on the socket, or
> > > - sk->sk_data_ready is not sock_def_readable.
> > > 
> > > We still need to call sk_wake_async().
> > > 
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  net/ipv4/udp.c | 14 +++++++++++---
> > >  1 file changed, 11 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > index d2fa9755727ce034c2b4bca82bd9e72130d588e6..5dfbe4499c0f89f94af9ee1fb64559dd672c1439 100644
> > > --- a/net/ipv4/udp.c
> > > +++ b/net/ipv4/udp.c
> > > @@ -1492,6 +1492,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > >       struct sk_buff_head *list = &sk->sk_receive_queue;
> > >       int rmem, err = -ENOMEM;
> > >       spinlock_t *busy = NULL;
> > > +     bool becomes_readable;
> > >       int size, rcvbuf;
> > > 
> > >       /* Immediately drop when the receive queue is full.
> > > @@ -1532,12 +1533,19 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > >        */
> > >       sock_skb_set_dropcount(sk, skb);
> > > 
> > > +     becomes_readable = skb_queue_empty(list);
> > >       __skb_queue_tail(list, skb);
> > >       spin_unlock(&list->lock);
> > > 
> > > -     if (!sock_flag(sk, SOCK_DEAD))
> > > -             INDIRECT_CALL_1(sk->sk_data_ready, sock_def_readable, sk);
> > > -
> > > +     if (!sock_flag(sk, SOCK_DEAD)) {
> > > +             if (becomes_readable ||
> > > +                 sk->sk_data_ready != sock_def_readable ||
> > > +                 READ_ONCE(sk->sk_peek_off) >= 0)
> > > +                     INDIRECT_CALL_1(sk->sk_data_ready,
> > > +                                     sock_def_readable, sk);
> > > +             else
> > > +                     sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> > > +     }
> > 
> > I understood this change showed no performance benefit???
> > 
> > I guess the atomic_add_return() MB was hiding some/most of
> > sock_def_readable() cost?
> 
> It did show benefits in the epoll case, because ep_poll_callback() is
> very expensive.
> 
> I think you are referring to a prior discussion we had while still
> using netperf tests, which do not use epoll.

Indeed.

> Eliminating sock_def_readable() avoided the smp_mb() we have in
> wq_has_sleeper(), and this was not a convincing win: the apparent cost
> of this smp_mb() was high at moderate traffic, but gradually became
> small once the cpu was fully utilized.
> 
> The atomic_add_return() cost is orthogonal (I see it mostly on ARM64 platforms)

Thanks for the additional details.

FTR, I guessed that (part of) the atomic_add_return() cost comes from
the implied additional barrier (compared to plain atomic_add()) and the
barrier in sock_def_readable() was relatively cheap in the presence of
the previous one, becoming more visible after moving to atomic_add().

In any case LGTM, thanks!

Acked-by: Paolo Abeni <pabeni@redhat.com>



* Re: [PATCH net-next 0/4] udp: small changes on receive path
  2024-03-28 14:40 [PATCH net-next 0/4] udp: small changes on receive path Eric Dumazet
                   ` (3 preceding siblings ...)
  2024-03-28 14:40 ` [PATCH net-next 4/4] net: add sk_wake_async_rcu() helper Eric Dumazet
@ 2024-03-29 22:10 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-03-29 22:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuba, pabeni, willemb, netdev, eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 28 Mar 2024 14:40:28 +0000 you wrote:
> This series is based on an observation I made on the UDP receive path.
> 
> The sock_def_readable() costs are pretty high, especially when
> epoll is used to generate EPOLLIN events.
> 
> First patch annotates races on sk->sk_rcvbuf reads.
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] udp: annotate data-race in __udp_enqueue_schedule_skb()
    https://git.kernel.org/netdev/net-next/c/605579699513
  - [net-next,2/4] udp: relax atomic operation on sk->sk_rmem_alloc
    https://git.kernel.org/netdev/net-next/c/6a1f12dd85a8
  - [net-next,3/4] udp: avoid calling sock_def_readable() if possible
    https://git.kernel.org/netdev/net-next/c/612b1c0dec5b
  - [net-next,4/4] net: add sk_wake_async_rcu() helper
    https://git.kernel.org/netdev/net-next/c/1abe267f173e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



