* [PATCH v4 net 0/3] Report RCU QS for busy network kthreads @ 2024-03-15 19:55 Yan Zhai 2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer This changeset fixes a common problem for busy networking kthreads. These threads, e.g. NAPI threads, typically do the following: * poll a batch of packets * if there is more work, call cond_resched() to allow scheduling * continue to poll more packets while the rx queue is not empty We observed this being a problem in production, since it can block RCU tasks from making progress under heavy load. Investigation indicates that just calling cond_resched() is insufficient for RCU tasks to reach quiescent states. This at least affects NAPI threads, napi_busy_loop, and the cpumap kthread for now. By reporting RCU QSes in these kthreads periodically before cond_resched(), the blocked RCU waiters can correctly progress. Instead of just reporting a QS for RCU tasks, this code shares the same concern as noted in commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe"), so report a consolidated QS for safety. It is worth noting that, although this problem is reproducible in napi_busy_loop, it only shows up when setting the polling interval as high as 2ms, which is far larger than the recommended 50us-100us in the documentation. So napi_busy_loop is left untouched. 
V3: https://lore.kernel.org/lkml/20240314145459.7b3aedf1@kernel.org/t/ V2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/ V1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t changes since v3: * fixed kernel-doc errors changes since v2: * created a helper in rcu header to abstract the behavior * fixed cpumap kthread in addition changes since v1: * disable preemption first as Paul McKenney suggested Yan Zhai (3): rcu: add a helper to report consolidated flavor QS net: report RCU QS on threaded NAPI repolling bpf: report RCU QS in cpumap kthread include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ kernel/bpf/cpumap.c | 3 +++ net/core/dev.c | 3 +++ 3 files changed, 30 insertions(+) -- 2.30.2 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai @ 2024-03-15 19:55 ` Yan Zhai 2024-03-16 5:40 ` Paul E. McKenney 2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai 2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai 2 siblings, 1 reply; 9+ messages in thread From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw) To: netdev Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Paul E. McKenney, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer There are several scenarios in network processing that can run extensively under heavy traffic. In such situations, RCU synchronization might not observe desired quiescent states for an indefinitely long period. Create a helper to safely raise the desired RCU quiescent states for such scenarios. Currently the frequency is locked at HZ/10, i.e. 100ms, which is sufficient to address existing problems around RCU tasks. It is unclear yet whether any future scenario will require it to be tuned further down. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> --- v3->v4: comment fixup --- include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 0746b1b0b663..da224706323e 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -247,6 +247,30 @@ do { \ cond_resched(); \ } while (0) +/** + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states + * @old_ts: last jiffies when QS was reported. 
Might be modified in the macro. + * + * This helper is for network processing in non-RT kernels, where there could + * be busy polling threads that block RCU synchronization indefinitely. In + * such context, simply calling cond_resched is insufficient, so give it a + * stronger push to eliminate all potential blockage of all RCU types. + * + * NOTE: unless absolutely sure, this helper should in general be called + * outside of bh lock section to avoid reporting a surprising QS to updaters, + * who could be expecting RCU read critical section to end at local_bh_enable(). + */ +#define rcu_softirq_qs_periodic(old_ts) \ +do { \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ + time_after(jiffies, (old_ts) + HZ / 10)) { \ + preempt_disable(); \ + rcu_softirq_qs(); \ + preempt_enable(); \ + (old_ts) = jiffies; \ + } \ +} while (0) + /* * Infrastructure to implement the synchronize_() primitives in * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. -- 2.30.2 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai @ 2024-03-16 5:40 ` Paul E. McKenney 2024-03-18 10:58 ` Mark Rutland 2024-03-19 1:26 ` Yan Zhai 0 siblings, 2 replies; 9+ messages in thread From: Paul E. McKenney @ 2024-03-16 5:40 UTC (permalink / raw) To: Yan Zhai Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote: > There are several scenario in network processing that can run > extensively under heavy traffic. In such situation, RCU synchronization > might not observe desired quiescent states for indefinitely long period. > Create a helper to safely raise the desired RCU quiescent states for > such scenario. > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is > sufficient to address existing problems around RCU tasks. It's unclear > yet if there is any future scenario for it to be further tuned down. I suggest something like the following for the commit log: ------------------------------------------------------------------------ When under heavy load, network processing can run CPU-bound for many tens of seconds. Even in preemptible kernels, this can block RCU Tasks grace periods, which can cause trace-event removal to take more than a minute, which is unacceptably long. This commit therefore creates a new helper function that passes through both RCU and RCU-Tasks quiescent states every 100 milliseconds. This hard-coded value suffices for current workloads. ------------------------------------------------------------------------ > Suggested-by: Paul E. 
McKenney <paulmck@kernel.org> > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > Signed-off-by: Yan Zhai <yan@cloudflare.com> > --- > v3->v4: comment fixup > > --- > include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 0746b1b0b663..da224706323e 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -247,6 +247,30 @@ do { \ > cond_resched(); \ > } while (0) > > +/** > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro. > + * > + * This helper is for network processing in non-RT kernels, where there could > + * be busy polling threads that block RCU synchronization indefinitely. In > + * such context, simply calling cond_resched is insufficient, so give it a > + * stronger push to eliminate all potential blockage of all RCU types. > + * > + * NOTE: unless absolutely sure, this helper should in general be called > + * outside of bh lock section to avoid reporting a surprising QS to updaters, > + * who could be expecting RCU read critical section to end at local_bh_enable(). > + */ How about something like this for the kernel-doc comment? /** * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states * @old_ts: jiffies at start of processing. * * This helper is for long-running softirq handlers, such as those * in networking. The caller should initialize the variable passed in * as @old_ts at the beginning of the softirq handler. When invoked * frequently, this macro will invoke rcu_softirq_qs() every 100 * milliseconds thereafter, which will provide both RCU and RCU-Tasks * quiescent states. Note that this macro modifies its old_ts argument. * * Note that although cond_resched() provides RCU quiescent states, * it does not provide RCU-Tasks quiescent states. 
* * Because regions of code that have disabled softirq act as RCU * read-side critical sections, this macro should be invoked with softirq * (and preemption) enabled. * * This macro has no effect in CONFIG_PREEMPT_RT kernels. */ Thanx, Paul > +#define rcu_softirq_qs_periodic(old_ts) \ > +do { \ > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > + preempt_disable(); \ > + rcu_softirq_qs(); \ > + preempt_enable(); \ > + (old_ts) = jiffies; \ > + } \ > +} while (0) > + > /* > * Infrastructure to implement the synchronize_() primitives in > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > -- > 2.30.2 > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-16 5:40 ` Paul E. McKenney @ 2024-03-18 10:58 ` Mark Rutland 2024-03-19 2:32 ` Yan Zhai 2024-03-19 1:26 ` Yan Zhai 1 sibling, 1 reply; 9+ messages in thread From: Mark Rutland @ 2024-03-18 10:58 UTC (permalink / raw) To: Paul E. McKenney Cc: Yan Zhai, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Hoiland-Jorgensen, Alexei Starovoitov, Steven Rostedt, Jesper Dangaard Brouer On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote: > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote: > > There are several scenario in network processing that can run > > extensively under heavy traffic. In such situation, RCU synchronization > > might not observe desired quiescent states for indefinitely long period. > > Create a helper to safely raise the desired RCU quiescent states for > > such scenario. > > > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is > > sufficient to address existing problems around RCU tasks. It's unclear > > yet if there is any future scenario for it to be further tuned down. > > I suggest something like the following for the commit log: > > ------------------------------------------------------------------------ > > When under heavy load, network processing can run CPU-bound for many tens > of seconds. Even in preemptible kernels, this can block RCU Tasks grace > periods, which can cause trace-event removal to take more than a minute, > which is unacceptably long. > > This commit therefore creates a new helper function that passes > through both RCU and RCU-Tasks quiescent states every 100 milliseconds. > This hard-coded value suffices for current workloads. FWIW, this sounds good to me. 
> > ------------------------------------------------------------------------ > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > --- > > v3->v4: comment fixup > > > > --- > > include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ > > 1 file changed, 24 insertions(+) > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index 0746b1b0b663..da224706323e 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,30 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro. > > + * > > + * This helper is for network processing in non-RT kernels, where there could > > + * be busy polling threads that block RCU synchronization indefinitely. In > > + * such context, simply calling cond_resched is insufficient, so give it a > > + * stronger push to eliminate all potential blockage of all RCU types. > > + * > > + * NOTE: unless absolutely sure, this helper should in general be called > > + * outside of bh lock section to avoid reporting a surprising QS to updaters, > > + * who could be expecting RCU read critical section to end at local_bh_enable(). > > + */ > > How about something like this for the kernel-doc comment? > > /** > * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > * @old_ts: jiffies at start of processing. > * > * This helper is for long-running softirq handlers, such as those > * in networking. The caller should initialize the variable passed in > * as @old_ts at the beginning of the softirq handler. When invoked > * frequently, this macro will invoke rcu_softirq_qs() every 100 > * milliseconds thereafter, which will provide both RCU and RCU-Tasks > * quiescent states. 
Note that this macro modifies its old_ts argument. > * > * Note that although cond_resched() provides RCU quiescent states, > * it does not provide RCU-Tasks quiescent states. > * > * Because regions of code that have disabled softirq act as RCU > * read-side critical sections, this macro should be invoked with softirq > * (and preemption) enabled. > * > * This macro has no effect in CONFIG_PREEMPT_RT kernels. > */ Considering the note about cond_resched(), does cond_resched() actually provide an RCU quiescent state for fully-preemptible kernels? IIUC for those cond_resched() expands to: __might_resched(); klp_sched_try_switch() ... and AFAICT neither reports an RCU quiescent state. So maybe it's worth dropping the note? Separately, what's the rationale for not doing this on PREEMPT_RT? Does that avoid the problem through other means, or are people just not running affected workloads on that? Mark. > > Thanx, Paul > > > +#define rcu_softirq_qs_periodic(old_ts) \ > > +do { \ > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > + preempt_disable(); \ > > + rcu_softirq_qs(); \ > > + preempt_enable(); \ > > + (old_ts) = jiffies; \ > > + } \ > > +} while (0) > > + > > /* > > * Infrastructure to implement the synchronize_() primitives in > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > -- > > 2.30.2 > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-18 10:58 ` Mark Rutland @ 2024-03-19 2:32 ` Yan Zhai 2024-03-19 2:39 ` Yan Zhai 0 siblings, 1 reply; 9+ messages in thread From: Yan Zhai @ 2024-03-19 2:32 UTC (permalink / raw) To: Mark Rutland Cc: Paul E. McKenney, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Hoiland-Jorgensen, Alexei Starovoitov, Steven Rostedt, Jesper Dangaard Brouer On Mon, Mar 18, 2024 at 5:59 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote: > > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote: > > > There are several scenario in network processing that can run > > > extensively under heavy traffic. In such situation, RCU synchronization > > > might not observe desired quiescent states for indefinitely long period. > > > Create a helper to safely raise the desired RCU quiescent states for > > > such scenario. > > > > > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is > > > sufficient to address existing problems around RCU tasks. It's unclear > > > yet if there is any future scenario for it to be further tuned down. > > > > I suggest something like the following for the commit log: > > > > ------------------------------------------------------------------------ > > > > When under heavy load, network processing can run CPU-bound for many tens > > of seconds. Even in preemptible kernels, this can block RCU Tasks grace > > periods, which can cause trace-event removal to take more than a minute, > > which is unacceptably long. > > > > This commit therefore creates a new helper function that passes > > through both RCU and RCU-Tasks quiescent states every 100 milliseconds. 
> > This hard-coded value suffices for current workloads. > > FWIW, this sounds good to me. > > > > > ------------------------------------------------------------------------ > > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > > --- > > > v3->v4: comment fixup > > > > > > --- > > > include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ > > > 1 file changed, 24 insertions(+) > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index 0746b1b0b663..da224706323e 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -247,6 +247,30 @@ do { \ > > > cond_resched(); \ > > > } while (0) > > > > > > +/** > > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states > > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro. > > > + * > > > + * This helper is for network processing in non-RT kernels, where there could > > > + * be busy polling threads that block RCU synchronization indefinitely. In > > > + * such context, simply calling cond_resched is insufficient, so give it a > > > + * stronger push to eliminate all potential blockage of all RCU types. > > > + * > > > + * NOTE: unless absolutely sure, this helper should in general be called > > > + * outside of bh lock section to avoid reporting a surprising QS to updaters, > > > + * who could be expecting RCU read critical section to end at local_bh_enable(). > > > + */ > > > > How about something like this for the kernel-doc comment? > > > > /** > > * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > * @old_ts: jiffies at start of processing. > > * > > * This helper is for long-running softirq handlers, such as those > > * in networking. The caller should initialize the variable passed in > > * as @old_ts at the beginning of the softirq handler. 
When invoked > > * frequently, this macro will invoke rcu_softirq_qs() every 100 > > * milliseconds thereafter, which will provide both RCU and RCU-Tasks > > * quiescent states. Note that this macro modifies its old_ts argument. > > * > > * Note that although cond_resched() provides RCU quiescent states, > > * it does not provide RCU-Tasks quiescent states. > > * > > * Because regions of code that have disabled softirq act as RCU > > * read-side critical sections, this macro should be invoked with softirq > > * (and preemption) enabled. > > * > > * This macro has no effect in CONFIG_PREEMPT_RT kernels. > > */ > > Considering the note about cond_resched(), does does cond_resched() actually > provide an RCU quiescent state for fully-preemptible kernels? IIUC for those > cond_resched() expands to: > > __might_resched(); > klp_sched_try_switch() > > ... and AFAICT neither reports an RCU quiescent state. > > So maybe it's worth dropping the note? > > Seperately, what's the rationale for not doing this on PREEMPT_RT? Does that > avoid the problem through other means, or are people just not running effected > workloads on that? > It's a bit counterintuitive, but yes, the RT kernel avoids the problem. This is because "schedule()" actually reports a task RCU QS, and on an RT kernel a cond_resched() call won't call "__cond_resched()" or "__schedule(PREEMPT)", as you already pointed out, which would otherwise clear the need-resched flag. This then allows "schedule()" to be called on hard IRQ exit from time to time. 
> > > -- > > > 2.30.2 > > > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-19 2:32 ` Yan Zhai @ 2024-03-19 2:39 ` Yan Zhai 0 siblings, 0 replies; 9+ messages in thread From: Yan Zhai @ 2024-03-19 2:39 UTC (permalink / raw) To: Mark Rutland Cc: Paul E. McKenney, netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Hoiland-Jorgensen, Alexei Starovoitov, Steven Rostedt, Jesper Dangaard Brouer On Mon, Mar 18, 2024 at 9:32 PM Yan Zhai <yan@cloudflare.com> wrote: > > On Mon, Mar 18, 2024 at 5:59 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Fri, Mar 15, 2024 at 10:40:56PM -0700, Paul E. McKenney wrote: > > > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote: > > > > There are several scenario in network processing that can run > > > > extensively under heavy traffic. In such situation, RCU synchronization > > > > might not observe desired quiescent states for indefinitely long period. > > > > Create a helper to safely raise the desired RCU quiescent states for > > > > such scenario. > > > > > > > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is > > > > sufficient to address existing problems around RCU tasks. It's unclear > > > > yet if there is any future scenario for it to be further tuned down. > > > > > > I suggest something like the following for the commit log: > > > > > > ------------------------------------------------------------------------ > > > > > > When under heavy load, network processing can run CPU-bound for many tens > > > of seconds. Even in preemptible kernels, this can block RCU Tasks grace > > > periods, which can cause trace-event removal to take more than a minute, > > > which is unacceptably long. 
> > > > > > This commit therefore creates a new helper function that passes > > > through both RCU and RCU-Tasks quiescent states every 100 milliseconds. > > > This hard-coded value suffices for current workloads. > > > > FWIW, this sounds good to me. > > > > > > > > ------------------------------------------------------------------------ > > > > > > > Suggested-by: Paul E. McKenney <paulmck@kernel.org> > > > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > > > --- > > > > v3->v4: comment fixup > > > > > > > > --- > > > > include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ > > > > 1 file changed, 24 insertions(+) > > > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > index 0746b1b0b663..da224706323e 100644 > > > > --- a/include/linux/rcupdate.h > > > > +++ b/include/linux/rcupdate.h > > > > @@ -247,6 +247,30 @@ do { \ > > > > cond_resched(); \ > > > > } while (0) > > > > > > > > +/** > > > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states > > > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro. > > > > + * > > > > + * This helper is for network processing in non-RT kernels, where there could > > > > + * be busy polling threads that block RCU synchronization indefinitely. In > > > > + * such context, simply calling cond_resched is insufficient, so give it a > > > > + * stronger push to eliminate all potential blockage of all RCU types. > > > > + * > > > > + * NOTE: unless absolutely sure, this helper should in general be called > > > > + * outside of bh lock section to avoid reporting a surprising QS to updaters, > > > > + * who could be expecting RCU read critical section to end at local_bh_enable(). > > > > + */ > > > > > > How about something like this for the kernel-doc comment? 
> > > > > > /** > > > * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > > > * @old_ts: jiffies at start of processing. > > > * > > > * This helper is for long-running softirq handlers, such as those > > > * in networking. The caller should initialize the variable passed in > > > * as @old_ts at the beginning of the softirq handler. When invoked > > > * frequently, this macro will invoke rcu_softirq_qs() every 100 > > > * milliseconds thereafter, which will provide both RCU and RCU-Tasks > > > * quiescent states. Note that this macro modifies its old_ts argument. > > > * > > > * Note that although cond_resched() provides RCU quiescent states, > > > * it does not provide RCU-Tasks quiescent states. > > > * > > > * Because regions of code that have disabled softirq act as RCU > > > * read-side critical sections, this macro should be invoked with softirq > > > * (and preemption) enabled. > > > * > > > * This macro has no effect in CONFIG_PREEMPT_RT kernels. > > > */ > > > > Considering the note about cond_resched(), does does cond_resched() actually > > provide an RCU quiescent state for fully-preemptible kernels? IIUC for those > > cond_resched() expands to: > > > > __might_resched(); > > klp_sched_try_switch() > > > > ... and AFAICT neither reports an RCU quiescent state. > > > > So maybe it's worth dropping the note? > > > > Seperately, what's the rationale for not doing this on PREEMPT_RT? Does that > > avoid the problem through other means, or are people just not running effected > > workloads on that? > > > It's a bit anti-intuition but yes the RT kernel avoids the problem. > This is because "schedule()" reports task RCU QS actually, and on RT > kernel cond_resched() call won't call "__cond_resched()" or > "__schedule(PREEMPT)" as you already pointed out, which would clear > need-resched flag. This then allows "schedule()" to be called on hard > IRQ exit time by time. 
> And these are excellent questions that I should have originally included in the comment. Thanks for bringing it up. Let me send another version tomorrow, to allow more thoughts on this, if any. thanks Yan > Yan > > > Mark. > > > > > > Thanx, Paul > > > > > > > +#define rcu_softirq_qs_periodic(old_ts) \ > > > > +do { \ > > > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ > > > > + time_after(jiffies, (old_ts) + HZ / 10)) { \ > > > > + preempt_disable(); \ > > > > + rcu_softirq_qs(); \ > > > > + preempt_enable(); \ > > > > + (old_ts) = jiffies; \ > > > > + } \ > > > > +} while (0) > > > > + > > > > /* > > > > * Infrastructure to implement the synchronize_() primitives in > > > > * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. > > > > -- > > > > 2.30.2 > > > > > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS 2024-03-16 5:40 ` Paul E. McKenney 2024-03-18 10:58 ` Mark Rutland @ 2024-03-19 1:26 ` Yan Zhai 1 sibling, 0 replies; 9+ messages in thread From: Yan Zhai @ 2024-03-19 1:26 UTC (permalink / raw) To: paulmck Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li, Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu, bpf, kernel-team, Joel Fernandes, Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt, mark.rutland, Jesper Dangaard Brouer On Sat, Mar 16, 2024 at 12:41 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Fri, Mar 15, 2024 at 12:55:03PM -0700, Yan Zhai wrote: > > There are several scenario in network processing that can run > > extensively under heavy traffic. In such situation, RCU synchronization > > might not observe desired quiescent states for indefinitely long period. > > Create a helper to safely raise the desired RCU quiescent states for > > such scenario. > > > > Currently the frequency is locked at HZ/10, i.e. 100ms, which is > > sufficient to address existing problems around RCU tasks. It's unclear > > yet if there is any future scenario for it to be further tuned down. > > I suggest something like the following for the commit log: > > ------------------------------------------------------------------------ > > When under heavy load, network processing can run CPU-bound for many tens > of seconds. Even in preemptible kernels, this can block RCU Tasks grace > periods, which can cause trace-event removal to take more than a minute, > which is unacceptably long. > > This commit therefore creates a new helper function that passes > through both RCU and RCU-Tasks quiescent states every 100 milliseconds. > This hard-coded value suffices for current workloads. > > ------------------------------------------------------------------------ > > > Suggested-by: Paul E. 
McKenney <paulmck@kernel.org> > > Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> > > Signed-off-by: Yan Zhai <yan@cloudflare.com> > > --- > > v3->v4: comment fixup > > > > --- > > include/linux/rcupdate.h | 24 ++++++++++++++++++++++++ > > 1 file changed, 24 insertions(+) > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index 0746b1b0b663..da224706323e 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -247,6 +247,30 @@ do { \ > > cond_resched(); \ > > } while (0) > > > > +/** > > + * rcu_softirq_qs_periodic - Periodically report consolidated quiescent states > > + * @old_ts: last jiffies when QS was reported. Might be modified in the macro. > > + * > > + * This helper is for network processing in non-RT kernels, where there could > > + * be busy polling threads that block RCU synchronization indefinitely. In > > + * such context, simply calling cond_resched is insufficient, so give it a > > + * stronger push to eliminate all potential blockage of all RCU types. > > + * > > + * NOTE: unless absolutely sure, this helper should in general be called > > + * outside of bh lock section to avoid reporting a surprising QS to updaters, > > + * who could be expecting RCU read critical section to end at local_bh_enable(). > > + */ > > How about something like this for the kernel-doc comment? > > /** > * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states > * @old_ts: jiffies at start of processing. > * > * This helper is for long-running softirq handlers, such as those > * in networking. The caller should initialize the variable passed in > * as @old_ts at the beginning of the softirq handler. When invoked > * frequently, this macro will invoke rcu_softirq_qs() every 100 > * milliseconds thereafter, which will provide both RCU and RCU-Tasks > * quiescent states. Note that this macro modifies its old_ts argument. 
>  *
>  * Note that although cond_resched() provides RCU quiescent states,
>  * it does not provide RCU-Tasks quiescent states.
>  *
>  * Because regions of code that have disabled softirq act as RCU
>  * read-side critical sections, this macro should be invoked with softirq
>  * (and preemption) enabled.
>  *
>  * This macro has no effect in CONFIG_PREEMPT_RT kernels.
>  */
>

It would be more accurate this way; I like it. Thanks!

Yan

>							Thanx, Paul
>
> > +#define rcu_softirq_qs_periodic(old_ts) \
> > +do { \
> > +	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \
> > +	    time_after(jiffies, (old_ts) + HZ / 10)) { \
> > +		preempt_disable(); \
> > +		rcu_softirq_qs(); \
> > +		preempt_enable(); \
> > +		(old_ts) = jiffies; \
> > +	} \
> > +} while (0)
> > +
> >  /*
> >   * Infrastructure to implement the synchronize_() primitives in
> >   * TREE_RCU and rcu_barrier_() primitives in TINY_RCU.
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling
  2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
@ 2024-03-15 19:55 ` Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai
  2 siblings, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li,
    Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu,
    bpf, kernel-team, Joel Fernandes, Paul E. McKenney,
    Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt,
    mark.rutland, Jesper Dangaard Brouer

NAPI threads can keep polling packets under load. Currently they only
call cond_resched() before repolling, but that is not sufficient to
clear out the holdout of RCU tasks, which prevents BPF tracing programs
from detaching for a long period.
This can be reproduced easily with the following setup:

ip netns add test1
ip netns add test2
ip -n test1 link add veth1 type veth peer name veth2 netns test2
ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up
ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo
ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2
for i in `seq 10 210`; do
  for j in `seq 10 210`; do
    ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
  done
done
ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then running an iperf3 client/server plus a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

Reporting RCU quiescent states periodically resolves the issue.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
v2->v3: abstracted the work into a RCU helper
v1->v2: moved rcu_softirq_qs out from bh critical section, and only
  raise it after a second of repolling. Added some brief perf test result.
v2: https://lore.kernel.org/bpf/ZeFPz4D121TgvCje@debian.debian/
v1: https://lore.kernel.org/lkml/Zd4DXTyCf17lcTfq@debian.debian/#t
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 76e6438f4858..6b7fc42d4b3e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6708,6 +6708,8 @@ static int napi_threaded_poll(void *data)
 	void *have;

 	while (!napi_thread_wait(napi)) {
+		unsigned long last_qs = jiffies;
+
 		for (;;) {
 			bool repoll = false;

@@ -6732,6 +6734,7 @@ static int napi_threaded_poll(void *data)
 			if (!repoll)
 				break;

+			rcu_softirq_qs_periodic(last_qs);
 			cond_resched();
 		}
 	}
--
2.30.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread
  2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
  2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai
@ 2024-03-15 19:55 ` Yan Zhai
  2 siblings, 0 replies; 9+ messages in thread
From: Yan Zhai @ 2024-03-15 19:55 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jiri Pirko, Simon Horman, Daniel Borkmann, Lorenzo Bianconi, Coco Li,
    Wei Wang, Alexander Duyck, Hannes Frederic Sowa, linux-kernel, rcu,
    bpf, kernel-team, Joel Fernandes, Paul E. McKenney,
    Toke Høiland-Jørgensen, Alexei Starovoitov, Steven Rostedt,
    mark.rutland, Jesper Dangaard Brouer

Under heavy load, cpumap kernel threads can be busy polling packets from
redirect queues, blocking RCU tasks from reaching quiescent states. It
is insufficient to just call cond_resched() in such a context.
Periodically raising a consolidated RCU QS before cond_resched fixes the
problem.
Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
 kernel/bpf/cpumap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index ef82ffc90cbe..8f1d390bcbde 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -262,6 +262,7 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
+	unsigned long last_qs = jiffies;

 	complete(&rcpu->kthread_running);
 	set_current_state(TASK_INTERRUPTIBLE);
@@ -287,10 +288,12 @@ static int cpu_map_kthread_run(void *data)
 		if (__ptr_ring_empty(rcpu->queue)) {
 			schedule();
 			sched = 1;
+			last_qs = jiffies;
 		} else {
 			__set_current_state(TASK_RUNNING);
 		}
 	} else {
+		rcu_softirq_qs_periodic(last_qs);
 		sched = cond_resched();
 	}
--
2.30.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-03-19  2:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-15 19:55 [PATCH v4 net 0/3] Report RCU QS for busy network kthreads Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 1/3] rcu: add a helper to report consolidated flavor QS Yan Zhai
2024-03-16  5:40   ` Paul E. McKenney
2024-03-18 10:58     ` Mark Rutland
2024-03-19  2:32       ` Yan Zhai
2024-03-19  2:39         ` Yan Zhai
2024-03-19  1:26     ` Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 2/3] net: report RCU QS on threaded NAPI repolling Yan Zhai
2024-03-15 19:55 ` [PATCH v4 net 3/3] bpf: report RCU QS in cpumap kthread Yan Zhai