netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 net 0/3] Report RCU QS for busy network kthreads
@ 2024-03-15 19:55 Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
	Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa,
	linux-kernel, rcu, bpf, kernel-team, Joel Fernandes,
	Paul E. McKenney, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically will do:

* polling a batch of packets
* if there are more work, call cond_resched to allow scheduling
* continue to poll more packets when rx queue is not empty

We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched is insufficient for RCU tasks to reach
quiescent states. This at least affects NAPI threads, napi_busy_loop, and
also cpumap kthread for now.

By reporting RCU QSes in these kthreads periodically before
cond_resched, the blocked RCU waiters can correctly progress. Instead of
just reporting QS for RCU tasks, these code share the same concern as
noted in the commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched
and RCU-preempt when safe"). So report a consolidated QS for safety.

It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when setting the polling interval to as
high as 2ms, which is far larger than recommended 50us-100us in the
documentation. So napi_busy_loop is left untouched.

V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/
V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t

changes since v3:
 * fixed kernel-doc errors

changes since v2:
 * created a helper in rcu header to abstract the behavior
 * fixed cpumap kthread in addition

changes since v1:
 * disable preemption first as Paul McKenney suggested

Yan Zhai (3):
  rcu: add a helper to report consolidated flavor QS
  net: report RCU QS on threaded NAPI repolling
  bpf: report RCU QS in cpumap kthread

 include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
 kernel/bpf/cpumap.c      |  3 +++
 net/core/dev.c           |  3 +++
 3 files changed, 30 insertions(+)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
@ 2024-03-15 19:55 ` Yan Zhai
  2024-03-16  5:40   ` Paul E. McKenney
  2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai
  2 siblings, 1 reply; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
	Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa,
	linux-kernel, rcu, bpf, kernel-team, Joel Fernandes,
	Paul E. McKenney, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

There are several scenario in network processing that can run
extensively under heavy traffic. In such situation, RCU synchronization
might not observe desired quiescent states for indefinitely long period.
Create a helper to safely raise the desired RCU quiescent states for
such scenario.

Currently the frequency is locked at HZ/10, i.e. 100ms, which is
sufficient to address existing problems around RCU tasks. It's unclear
yet if there is any future scenario for it to be further tuned down.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
v3->v4: comment fixup

---
 include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0746b1b0b663..da224706323e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -247,6 +247,30 @@ do { \
 	cond_resched(); \
 } while (0)
 
+/**
+ * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
+ * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
+ *
+ * This helper is for network processing in non-RT kernels, where there could
+ * be busy polling threads that block RCU synchronization indefinitely.  In
+ * such context, simply calling cond_resched is insufficient, so give it a
+ * stronger push to eliminate all potential blockage of all RCU types.
+ *
+ * NOTE: unless absolutely sure, this helper should in general be called
+ * outside of bh lock section to avoid reporting a surprising QS to updaters,
+ * who could be expecting RCU read critical section to end at local_bh_enable().
+ */
+#define rcu_softirq_qs_periodic(old_ts) \
+do { \
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
+	    time_after(jiffies, (old_ts) + HZ / 10)) { \
+		preempt_disable(); \
+		rcu_softirq_qs(); \
+		preempt_enable(); \
+		(old_ts) = jiffies; \
+	} \
+} while (0)
+
 /*
  * Infrastructure to implement the synchronize_() primitives in
  * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling
  2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
@ 2024-03-15 19:55 ` Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai
  2 siblings, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
	Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa,
	linux-kernel, rcu, bpf, kernel-team, Joel Fernandes,
	Paul E. McKenney, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

NAPI threads can keep polling packets under load. Currently it is only
calling cond_resched() before repolling, but it is not sufficient to
clear out the holdout of RCU tasks, which prevent BPF tracing programs
from detaching for long period. This can be reproduced easily with
following set up:

ip netns add test1
ip netns add test2

ip -n test1 link add veth1 type veth peer name veth2 netns test2

ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up

ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo

ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2

for i in `seq 10 210`; do
 for j in `seq 10 210`; do
    ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
 done
done

ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then run an iperf3 client/server and a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

Report RCU quiescent states periodically will resolve the issue.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
v2->v3: abstracted the work into a RCU helper
v1->v2: moved rcu_softirq_qs out from bh critical section, and only
raise it after a second of repolling. Added some brief perf test result.

v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 76e6438f4858..6b7fc42d4b3e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6708,6 +6708,8 @@ static int napi_threaded_poll(void *data)
 	void *have;
 
 	while (!napi_thread_wait(napi)) {
+		unsigned long last_qs = jiffies;
+
 		for (;;) {
 			bool repoll = false;
 
@@ -6732,6 +6734,7 @@ static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;
 
+			rcu_softirq_qs_periodic(last_qs);
 			cond_resched();
 		}
 	}
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread
  2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai
@ 2024-03-15 19:55 ` Yan Zhai
  2 siblings, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi,
	Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa,
	linux-kernel, rcu, bpf, kernel-team, Joel Fernandes,
	Paul E. McKenney, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

When there are heavy load, cpumap kernel threads can be busy polling
packets from redirect queues and block out RCU tasks from reaching
quiescent states. It is insufficient to just call cond_resched() in such
context. Periodically raise a consolidated RCU QS before cond_resched
fixes the problem.

Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
 kernel/bpf/cpumap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index ef82ffc90cbe..8f1d390bcbde 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -262,6 +262,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
+	unsigned long last_qs = jiffies;
 
 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
@@ -287,10 +288,12 @@ static int cpu_map_kthread_run(void *data)
 			if (__ptr_ring_empty(rcpu->queue)) {
 				schedule();
 				sched = 1;
+				last_qs = jiffies;
 			} else {
 				__set_current_state(TASK_RUNNING);
 			}
 		} else {
+			rcu_softirq_qs_periodic(last_qs);
 			sched = cond_resched();
 		}
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
@ 2024-03-16  5:40   ` Paul E. McKenney
  2024-03-18 10:58     ` Mark Rutland
  2024-03-19  1:26     ` Yan Zhai
  0 siblings, 2 replies; 9+ messages in thread
From: Paul E. McKenney @ 2024-03-16  5:40 UTC (permalink / raw)
  To: Yan Zhai
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann,
	Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck,
	Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team,
	Joel Fernandes, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote:
> There are several scenario in network processing that can run
> extensively under heavy traffic. In such situation, RCU synchronization
> might not observe desired quiescent states for indefinitely long period.
> Create a helper to safely raise the desired RCU quiescent states for
> such scenario.
> 
> Currently the frequency is locked at HZ/10, i.e. 100ms, which is
> sufficient to address existing problems around RCU tasks. It's unclear
> yet if there is any future scenario for it to be further tuned down.

I suggest something like the following for the commit log:

------------------------------------------------------------------------

When under heavy load, network processing can run CPU-bound for many tens
of seconds.  Even in preemptible kernels, this can block RCU Tasks grace
periods, which can cause trace-event removal to take more than a minute,
which is unacceptably long.

This commit therefore creates a new helper function that passes
through both RCU and RCU-Tasks quiescent states every 100 milliseconds.
This hard-coded value suffices for current workloads.

------------------------------------------------------------------------

> Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>
> ---
> v3->v4: comment fixup
> 
> ---
>  include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 0746b1b0b663..da224706323e 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -247,6 +247,30 @@ do { \
>  	cond_resched(); \
>  } while (0)
>  
> +/**
> + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
> + * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
> + *
> + * This helper is for network processing in non-RT kernels, where there could
> + * be busy polling threads that block RCU synchronization indefinitely.  In
> + * such context, simply calling cond_resched is insufficient, so give it a
> + * stronger push to eliminate all potential blockage of all RCU types.
> + *
> + * NOTE: unless absolutely sure, this helper should in general be called
> + * outside of bh lock section to avoid reporting a surprising QS to updaters,
> + * who could be expecting RCU read critical section to end at local_bh_enable().
> + */

How about something like this for the kernel-doc comment?

/**
 * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
 * @old_ts: jiffies at start of processing.
 *
 * This helper is for long-running softirq handlers, such as those
 * in networking.  The caller should initialize the variable passed in
 * as @old_ts at the beginning of the softirq handler.  When invoked
 * frequently, this macro will invoke rcu_softirq_qs() every 100
 * milliseconds thereafter, which will provide both RCU and RCU-Tasks
 * quiescent states.  Note that this macro modifies its old_ts argument.
 *
 * Note that although cond_resched() provides RCU quiescent states,
 * it does not provide RCU-Tasks quiescent states.
 *
 * Because regions of code that have disabled softirq act as RCU
 * read-side critical sections, this macro should be invoked with softirq
 * (and preemption) enabled.
 *
 * This macro has no effect in CONFIG_PREEMPT_RT kernels.
 */

							Thanx, Paul

> +#define rcu_softirq_qs_periodic(old_ts) \
> +do { \
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> +		preempt_disable(); \
> +		rcu_softirq_qs(); \
> +		preempt_enable(); \
> +		(old_ts) = jiffies; \
> +	} \
> +} while (0)
> +
>  /*
>   * Infrastructure to implement the synchronize_() primitives in
>   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> -- 
> 2.30.2
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-16  5:40   ` Paul E. McKenney
@ 2024-03-18 10:58     ` Mark Rutland
  2024-03-19  2:32       ` Yan Zhai
  2024-03-19  1:26     ` Yan Zhai
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Rutland @ 2024-03-18 10:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann,
	Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck,
	Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team,
	Joel Fernandes, Toke Hoiland-Jorgensen, Alexei Starovoitov,
	Steven Rostedt, Jesper Dangaard Brouer

On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote:
> > There are several scenario in network processing that can run
> > extensively under heavy traffic. In such situation, RCU synchronization
> > might not observe desired quiescent states for indefinitely long period.
> > Create a helper to safely raise the desired RCU quiescent states for
> > such scenario.
> > 
> > Currently the frequency is locked at HZ/10, i.e. 100ms, which is
> > sufficient to address existing problems around RCU tasks. It's unclear
> > yet if there is any future scenario for it to be further tuned down.
> 
> I suggest something like the following for the commit log:
> 
> ------------------------------------------------------------------------
> 
> When under heavy load, network processing can run CPU-bound for many tens
> of seconds.  Even in preemptible kernels, this can block RCU Tasks grace
> periods, which can cause trace-event removal to take more than a minute,
> which is unacceptably long.
> 
> This commit therefore creates a new helper function that passes
> through both RCU and RCU-Tasks quiescent states every 100 milliseconds.
> This hard-coded value suffices for current workloads.

FWIW, this sounds good to me.

> 
> ------------------------------------------------------------------------
> 
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > ---
> > v3->v4: comment fixup
> > 
> > ---
> >  include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> > 
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 0746b1b0b663..da224706323e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -247,6 +247,30 @@ do { \
> >  	cond_resched(); \
> >  } while (0)
> >  
> > +/**
> > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
> > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
> > + *
> > + * This helper is for network processing in non-RT kernels, where there could
> > + * be busy polling threads that block RCU synchronization indefinitely.  In
> > + * such context, simply calling cond_resched is insufficient, so give it a
> > + * stronger push to eliminate all potential blockage of all RCU types.
> > + *
> > + * NOTE: unless absolutely sure, this helper should in general be called
> > + * outside of bh lock section to avoid reporting a surprising QS to updaters,
> > + * who could be expecting RCU read critical section to end at local_bh_enable().
> > + */
> 
> How about something like this for the kernel-doc comment?
> 
> /**
>  * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
>  * @old_ts: jiffies at start of processing.
>  *
>  * This helper is for long-running softirq handlers, such as those
>  * in networking.  The caller should initialize the variable passed in
>  * as @old_ts at the beginning of the softirq handler.  When invoked
>  * frequently, this macro will invoke rcu_softirq_qs() every 100
>  * milliseconds thereafter, which will provide both RCU and RCU-Tasks
>  * quiescent states.  Note that this macro modifies its old_ts argument.
>  *
>  * Note that although cond_resched() provides RCU quiescent states,
>  * it does not provide RCU-Tasks quiescent states.
>  *
>  * Because regions of code that have disabled softirq act as RCU
>  * read-side critical sections, this macro should be invoked with softirq
>  * (and preemption) enabled.
>  *
>  * This macro has no effect in CONFIG_PREEMPT_RT kernels.
>  */

Considering the note about cond_resched(), does does cond_resched() actually
provide an RCU quiescent state for fully-preemptible kernels? IIUC for those
cond_resched() expands to:

	__might_resched();
	klp_sched_try_switch()

... and AFAICT neither reports an RCU quiescent state.

So maybe it's worth dropping the note?

Seperately, what's the rationale for not doing this on PREEMPT_RT? Does that
avoid the problem through other means, or are people just not running effected
workloads on that?

Mark.

> 
> 							Thanx, Paul
> 
> > +#define rcu_softirq_qs_periodic(old_ts) \
> > +do { \
> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> > +		preempt_disable(); \
> > +		rcu_softirq_qs(); \
> > +		preempt_enable(); \
> > +		(old_ts) = jiffies; \
> > +	} \
> > +} while (0)
> > +
> >  /*
> >   * Infrastructure to implement the synchronize_() primitives in
> >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > -- 
> > 2.30.2
> > 
> > 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-16  5:40   ` Paul E. McKenney
  2024-03-18 10:58     ` Mark Rutland
@ 2024-03-19  1:26     ` Yan Zhai
  1 sibling, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-19  1:26 UTC (permalink / raw)
  To: paulmck
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann,
	Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck,
	Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team,
	Joel Fernandes, Toke Høiland-Jørgensen,
	Alexei Starovoitov, Steven Rostedt, mark.rutland,
	Jesper Dangaard Brouer

On Sat, Mar 16, 2024 at 12:41 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote:
> > There are several scenario in network processing that can run
> > extensively under heavy traffic. In such situation, RCU synchronization
> > might not observe desired quiescent states for indefinitely long period.
> > Create a helper to safely raise the desired RCU quiescent states for
> > such scenario.
> >
> > Currently the frequency is locked at HZ/10, i.e. 100ms, which is
> > sufficient to address existing problems around RCU tasks. It's unclear
> > yet if there is any future scenario for it to be further tuned down.
>
> I suggest something like the following for the commit log:
>
> ------------------------------------------------------------------------
>
> When under heavy load, network processing can run CPU-bound for many tens
> of seconds.  Even in preemptible kernels, this can block RCU Tasks grace
> periods, which can cause trace-event removal to take more than a minute,
> which is unacceptably long.
>
> This commit therefore creates a new helper function that passes
> through both RCU and RCU-Tasks quiescent states every 100 milliseconds.
> This hard-coded value suffices for current workloads.
>
> ------------------------------------------------------------------------
>
> > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > ---
> > v3->v4: comment fixup
> >
> > ---
> >  include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 0746b1b0b663..da224706323e 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -247,6 +247,30 @@ do { \
> >       cond_resched(); \
> >  } while (0)
> >
> > +/**
> > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
> > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
> > + *
> > + * This helper is for network processing in non-RT kernels, where there could
> > + * be busy polling threads that block RCU synchronization indefinitely.  In
> > + * such context, simply calling cond_resched is insufficient, so give it a
> > + * stronger push to eliminate all potential blockage of all RCU types.
> > + *
> > + * NOTE: unless absolutely sure, this helper should in general be called
> > + * outside of bh lock section to avoid reporting a surprising QS to updaters,
> > + * who could be expecting RCU read critical section to end at local_bh_enable().
> > + */
>
> How about something like this for the kernel-doc comment?
>
> /**
>  * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
>  * @old_ts: jiffies at start of processing.
>  *
>  * This helper is for long-running softirq handlers, such as those
>  * in networking.  The caller should initialize the variable passed in
>  * as @old_ts at the beginning of the softirq handler.  When invoked
>  * frequently, this macro will invoke rcu_softirq_qs() every 100
>  * milliseconds thereafter, which will provide both RCU and RCU-Tasks
>  * quiescent states.  Note that this macro modifies its old_ts argument.
>  *
>  * Note that although cond_resched() provides RCU quiescent states,
>  * it does not provide RCU-Tasks quiescent states.
>  *
>  * Because regions of code that have disabled softirq act as RCU
>  * read-side critical sections, this macro should be invoked with softirq
>  * (and preemption) enabled.
>  *
>  * This macro has no effect in CONFIG_PREEMPT_RT kernels.
>  */
>
It would be more accurate this way, I like it. Thanks!

Yan

>                                                         Thanx, Paul
>
> > +#define rcu_softirq_qs_periodic(old_ts) \
> > +do { \
> > +     if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > +         time_after(jiffies, (old_ts) + HZ / 10)) { \
> > +             preempt_disable(); \
> > +             rcu_softirq_qs(); \
> > +             preempt_enable(); \
> > +             (old_ts) = jiffies; \
> > +     } \
> > +} while (0)
> > +
> >  /*
> >   * Infrastructure to implement the synchronize_() primitives in
> >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > --
> > 2.30.2
> >
> >

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-18 10:58     ` Mark Rutland
@ 2024-03-19  2:32       ` Yan Zhai
  2024-03-19  2:39         ` Yan Zhai
  0 siblings, 1 reply; 9+ messages in thread
From: Yan Zhai @ 2024-03-19  2:32 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Paul E. McKenney, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman,
	Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang,
	Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf,
	kernel-team, Joel Fernandes, Toke Hoiland-Jorgensen,
	Alexei Starovoitov, Steven Rostedt, Jesper Dangaard Brouer

On Mon, Mar 18, 2024 at 5:59 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote:
> > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote:
> > > There are several scenario in network processing that can run
> > > extensively under heavy traffic. In such situation, RCU synchronization
> > > might not observe desired quiescent states for indefinitely long period.
> > > Create a helper to safely raise the desired RCU quiescent states for
> > > such scenario.
> > >
> > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is
> > > sufficient to address existing problems around RCU tasks. It's unclear
> > > yet if there is any future scenario for it to be further tuned down.
> >
> > I suggest something like the following for the commit log:
> >
> > ------------------------------------------------------------------------
> >
> > When under heavy load, network processing can run CPU-bound for many tens
> > of seconds.  Even in preemptible kernels, this can block RCU Tasks grace
> > periods, which can cause trace-event removal to take more than a minute,
> > which is unacceptably long.
> >
> > This commit therefore creates a new helper function that passes
> > through both RCU and RCU-Tasks quiescent states every 100 milliseconds.
> > This hard-coded value suffices for current workloads.
>
> FWIW, this sounds good to me.
>
> >
> > ------------------------------------------------------------------------
> >
> > > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > > ---
> > > v3->v4: comment fixup
> > >
> > > ---
> > >  include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
> > >  1 file changed, 24 insertions(+)
> > >
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 0746b1b0b663..da224706323e 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -247,6 +247,30 @@ do { \
> > >     cond_resched(); \
> > >  } while (0)
> > >
> > > +/**
> > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
> > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
> > > + *
> > > + * This helper is for network processing in non-RT kernels, where there could
> > > + * be busy polling threads that block RCU synchronization indefinitely.  In
> > > + * such context, simply calling cond_resched is insufficient, so give it a
> > > + * stronger push to eliminate all potential blockage of all RCU types.
> > > + *
> > > + * NOTE: unless absolutely sure, this helper should in general be called
> > > + * outside of bh lock section to avoid reporting a surprising QS to updaters,
> > > + * who could be expecting RCU read critical section to end at local_bh_enable().
> > > + */
> >
> > How about something like this for the kernel-doc comment?
> >
> > /**
> >  * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> >  * @old_ts: jiffies at start of processing.
> >  *
> >  * This helper is for long-running softirq handlers, such as those
> >  * in networking.  The caller should initialize the variable passed in
> >  * as @old_ts at the beginning of the softirq handler.  When invoked
> >  * frequently, this macro will invoke rcu_softirq_qs() every 100
> >  * milliseconds thereafter, which will provide both RCU and RCU-Tasks
> >  * quiescent states.  Note that this macro modifies its old_ts argument.
> >  *
> >  * Note that although cond_resched() provides RCU quiescent states,
> >  * it does not provide RCU-Tasks quiescent states.
> >  *
> >  * Because regions of code that have disabled softirq act as RCU
> >  * read-side critical sections, this macro should be invoked with softirq
> >  * (and preemption) enabled.
> >  *
> >  * This macro has no effect in CONFIG_PREEMPT_RT kernels.
> >  */
>
> Considering the note about cond_resched(), does does cond_resched() actually
> provide an RCU quiescent state for fully-preemptible kernels? IIUC for those
> cond_resched() expands to:
>
>         __might_resched();
>         klp_sched_try_switch()
>
> ... and AFAICT neither reports an RCU quiescent state.
>
> So maybe it's worth dropping the note?
>
> Seperately, what's the rationale for not doing this on PREEMPT_RT? Does that
> avoid the problem through other means, or are people just not running effected
> workloads on that?
>
It's a bit anti-intuition but yes the RT kernel avoids the problem.
This is because "schedule()" reports task RCU QS actually, and on RT
kernel cond_resched() call won't call "__cond_resched()" or
"__schedule(PREEMPT)" as you already pointed out, which would clear
need-resched flag. This then allows "schedule()" to be called on hard
IRQ exit time by time.

Yan

> Mark.
>
> >
> >                                                       Thanx, Paul
> >
> > > +#define rcu_softirq_qs_periodic(old_ts) \
> > > +do { \
> > > +   if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > > +       time_after(jiffies, (old_ts) + HZ / 10)) { \
> > > +           preempt_disable(); \
> > > +           rcu_softirq_qs(); \
> > > +           preempt_enable(); \
> > > +           (old_ts) = jiffies; \
> > > +   } \
> > > +} while (0)
> > > +
> > >  /*
> > >   * Infrastructure to implement the synchronize_() primitives in
> > >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > > --
> > > 2.30.2
> > >
> > >

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS
  2024-03-19  2:32       ` Yan Zhai
@ 2024-03-19  2:39         ` Yan Zhai
  0 siblings, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-19  2:39 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Paul E. McKenney, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman,
	Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang,
	Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf,
	kernel-team, Joel Fernandes, Toke Hoiland-Jorgensen,
	Alexei Starovoitov, Steven Rostedt, Jesper Dangaard Brouer

On Mon, Mar 18, 2024 at 9:32 PM Yan Zhai <yan@cloudflare.com> wrote:
>
> On Mon, Mar 18, 2024 at 5:59 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote:
> > > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote:
> > > > There are several scenario in network processing that can run
> > > > extensively under heavy traffic. In such situation, RCU synchronization
> > > > might not observe desired quiescent states for indefinitely long period.
> > > > Create a helper to safely raise the desired RCU quiescent states for
> > > > such scenario.
> > > >
> > > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is
> > > > sufficient to address existing problems around RCU tasks. It's unclear
> > > > yet if there is any future scenario for it to be further tuned down.
> > >
> > > I suggest something like the following for the commit log:
> > >
> > > ------------------------------------------------------------------------
> > >
> > > When under heavy load, network processing can run CPU-bound for many tens
> > > of seconds.  Even in preemptible kernels, this can block RCU Tasks grace
> > > periods, which can cause trace-event removal to take more than a minute,
> > > which is unacceptably long.
> > >
> > > This commit therefore creates a new helper function that passes
> > > through both RCU and RCU-Tasks quiescent states every 100 milliseconds.
> > > This hard-coded value suffices for current workloads.
> >
> > FWIW, this sounds good to me.
> >
> > >
> > > ------------------------------------------------------------------------
> > >
> > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org>
> > > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com>
> > > > ---
> > > > v3->v4: comment fixup
> > > >
> > > > ---
> > > >  include/linux/rcupdate.h | 24 ++++++++++++++++++++++++
> > > >  1 file changed, 24 insertions(+)
> > > >
> > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > index 0746b1b0b663..da224706323e 100644
> > > > --- a/include/linux/rcupdate.h
> > > > +++ b/include/linux/rcupdate.h
> > > > @@ -247,6 +247,30 @@ do { \
> > > >     cond_resched(); \
> > > >  } while (0)
> > > >
> > > > +/**
> > > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states
> > > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro.
> > > > + *
> > > > + * This helper is for network processing in non-RT kernels, where there could
> > > > + * be busy polling threads that block RCU synchronization indefinitely.  In
> > > > + * such context, simply calling cond_resched is insufficient, so give it a
> > > > + * stronger push to eliminate all potential blockage of all RCU types.
> > > > + *
> > > > + * NOTE: unless absolutely sure, this helper should in general be called
> > > > + * outside of bh lock section to avoid reporting a surprising QS to updaters,
> > > > + * who could be expecting RCU read critical section to end at local_bh_enable().
> > > > + */
> > >
> > > How about something like this for the kernel-doc comment?
> > >
> > > /**
> > >  * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states
> > >  * @old_ts: jiffies at start of processing.
> > >  *
> > >  * This helper is for long-running softirq handlers, such as those
> > >  * in networking.  The caller should initialize the variable passed in
> > >  * as @old_ts at the beginning of the softirq handler.  When invoked
> > >  * frequently, this macro will invoke rcu_softirq_qs() every 100
> > >  * milliseconds thereafter, which will provide both RCU and RCU-Tasks
> > >  * quiescent states.  Note that this macro modifies its old_ts argument.
> > >  *
> > >  * Note that although cond_resched() provides RCU quiescent states,
> > >  * it does not provide RCU-Tasks quiescent states.
> > >  *
> > >  * Because regions of code that have disabled softirq act as RCU
> > >  * read-side critical sections, this macro should be invoked with softirq
> > >  * (and preemption) enabled.
> > >  *
> > >  * This macro has no effect in CONFIG_PREEMPT_RT kernels.
> > >  */
> >
> > Considering the note about cond_resched(), does does cond_resched() actually
> > provide an RCU quiescent state for fully-preemptible kernels? IIUC for those
> > cond_resched() expands to:
> >
> >         __might_resched();
> >         klp_sched_try_switch()
> >
> > ... and AFAICT neither reports an RCU quiescent state.
> >
> > So maybe it's worth dropping the note?
> >
> > Seperately, what's the rationale for not doing this on PREEMPT_RT? Does that
> > avoid the problem through other means, or are people just not running effected
> > workloads on that?
> >
> It's a bit anti-intuition but yes the RT kernel avoids the problem.
> This is because "schedule()" reports task RCU QS actually, and on RT
> kernel cond_resched() call won't call "__cond_resched()" or
> "__schedule(PREEMPT)" as you already pointed out, which would clear
> need-resched flag. This then allows "schedule()" to be called on hard
> IRQ exit time by time.
>

And these are excellent questions that I should originally include in
the comment. Thanks for bringing it up.
Let me send another version tomorrow, allowing more thoughts on this if any.

thanks
Yan

> Yan
>
> > Mark.
> >
> > >
> > >                                                       Thanx, Paul
> > >
> > > > +#define rcu_softirq_qs_periodic(old_ts) \
> > > > +do { \
> > > > +   if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > > > +       time_after(jiffies, (old_ts) + HZ / 10)) { \
> > > > +           preempt_disable(); \
> > > > +           rcu_softirq_qs(); \
> > > > +           preempt_enable(); \
> > > > +           (old_ts) = jiffies; \
> > > > +   } \
> > > > +} while (0)
> > > > +
> > > >  /*
> > > >   * Infrastructure to implement the synchronize_() primitives in
> > > >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > > > --
> > > > 2.30.2
> > > >
> > > >

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-03-19  2:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
2024-03-16  5:40   ` Paul E. McKenney
2024-03-18 10:58     ` Mark Rutland
2024-03-19  2:32       ` Yan Zhai
2024-03-19  2:39         ` Yan Zhai
2024-03-19  1:26     ` Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).