* slow sync rcu_tasks_trace
@ 2020-09-09  2:34 Alexei Starovoitov
  2020-09-09 11:38 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-09  2:34 UTC (permalink / raw)
  To: bpf, Daniel Borkmann, Kernel Team, Paul E. McKenney

Hi Paul,

Looks like sync rcu_tasks_trace got slower or we simply didn't notice
it earlier.

In selftests/bpf try:
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real    1m17.082s
user    0m0.145s
sys    0m1.369s

But with the following hack:
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 7dd523a7e32d..c417b817ec5d 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -217,7 +217,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
         * programs finish executing.
         * Wait for these two grace periods together.
         */
-       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+//     synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);

I see:
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real    0m1.588s
user    0m0.131s
sys    0m1.342s

It takes an extra minute to do 40 sync rcu_tasks_trace calls.
It means that every sync takes more than a second.
That feels excessive.
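(Back of the envelope: (77.1 - 1.6) seconds / 40 calls is roughly 1.9 seconds
per synchronize call.)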

Doing:
-       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+       synchronize_rcu();
is also fast:
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real    0m2.089s
user    0m0.139s
sys    0m1.282s

sync rcu_tasks() is fast too:
-       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
+       synchronize_rcu_tasks();
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real    0m2.209s
user    0m0.117s
sys    0m1.344s

so it's really something going on with sync rcu_tasks_trace.
Could you please take a look?


* Re: slow sync rcu_tasks_trace
  2020-09-09  2:34 slow sync rcu_tasks_trace Alexei Starovoitov
@ 2020-09-09 11:38 ` Paul E. McKenney
  2020-09-09 15:10   ` Jiri Olsa
  2020-09-09 17:12   ` Alexei Starovoitov
  0 siblings, 2 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 11:38 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> Hi Paul,
> 
> Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> it earlier.
> 
> In selftests/bpf try:
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real    1m17.082s
> user    0m0.145s
> sys    0m1.369s
> 
> But with the following hack:
> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> index 7dd523a7e32d..c417b817ec5d 100644
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -217,7 +217,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
>          * programs finish executing.
>          * Wait for these two grace periods together.
>          */
> -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> +//     synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> 
> I see:
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real    0m1.588s
> user    0m0.131s
> sys    0m1.342s
> 
> It takes an extra minute to do 40 sync rcu_tasks_trace calls.
> It means that every sync takes more than a second.
> That feels excessive.
> 
> Doing:
> -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> +       synchronize_rcu();
> is also fast:
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real    0m2.089s
> user    0m0.139s
> sys    0m1.282s
> 
> sync rcu_tasks() is fast too:
> -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> +       synchronize_rcu_tasks();
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real    0m2.209s
> user    0m0.117s
> sys    0m1.344s
> 
> so it's really something going on with sync rcu_tasks_trace.
> Could you please take a look?

I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.

Otherwise (or alternatively), could you please try booting with
rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
second on a HZ=1000 system, which on a busy system could easily result
in the grace-period delays that you are seeing.  The value of this
kernel boot parameter does interact with the tasklist-scan backoffs,
so its effect will not likely be linear.
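
Concretely, the two experiments would be something like this (illustrative
only):

# first experiment: .config fragment
CONFIG_RCU_EXPERT=y
CONFIG_TASKS_TRACE_RCU_READ_MB=y

# second experiment: kernel command line addition
# (50 jiffies is 50 ms at HZ=1000, down from the 500-jiffy default)
rcupdate.rcu_task_ipi_delay=50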

Do either of those approaches help?

							Thanx, Paul


* Re: slow sync rcu_tasks_trace
  2020-09-09 11:38 ` Paul E. McKenney
@ 2020-09-09 15:10   ` Jiri Olsa
  2020-09-09 17:02     ` Paul E. McKenney
  2020-09-09 17:12   ` Alexei Starovoitov
  1 sibling, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2020-09-09 15:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Alexei Starovoitov, bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > Hi Paul,
> > 
> > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > it earlier.
> > 
> > In selftests/bpf try:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real    1m17.082s
> > user    0m0.145s
> > sys    0m1.369s
> > 
> > But with the following hack:
> > diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> > index 7dd523a7e32d..c417b817ec5d 100644
> > --- a/kernel/bpf/trampoline.c
> > +++ b/kernel/bpf/trampoline.c
> > @@ -217,7 +217,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
> >          * programs finish executing.
> >          * Wait for these two grace periods together.
> >          */
> > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > +//     synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > 
> > I see:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real    0m1.588s
> > user    0m0.131s
> > sys    0m1.342s
> > 
> > It takes an extra minute to do 40 sync rcu_tasks_trace calls.
> > It means that every sync takes more than a second.
> > That feels excessive.
> > 
> > Doing:
> > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > +       synchronize_rcu();
> > is also fast:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real    0m2.089s
> > user    0m0.139s
> > sys    0m1.282s
> > 
> > sync rcu_tasks() is fast too:
> > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > +       synchronize_rcu_tasks();
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real    0m2.209s
> > user    0m0.117s
> > sys    0m1.344s
> > 
> > so it's really something going on with sync rcu_tasks_trace.
> > Could you please take a look?
> 
> I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.

hi,
I noticed the slowdown as well, and adding CONFIG_TASKS_TRACE_RCU_READ_MB=y
speeds it up for me

thanks,
jirka

> 
> Otherwise (or alternatively), could you please try booting with
> rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> second on a HZ=1000 system, which on a busy system could easily result
> in the grace-period delays that you are seeing.  The value of this
> kernel boot parameter does interact with the tasklist-scan backoffs,
> so its effect will not likely be linear.
> 
> Do either of those approaches help?
> 
> 							Thanx, Paul
> 



* Re: slow sync rcu_tasks_trace
  2020-09-09 15:10   ` Jiri Olsa
@ 2020-09-09 17:02     ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 17:02 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Alexei Starovoitov, bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 05:10:53PM +0200, Jiri Olsa wrote:
> On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > > Hi Paul,
> > > 
> > > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > > it earlier.
> > > 
> > > In selftests/bpf try:
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real    1m17.082s
> > > user    0m0.145s
> > > sys    0m1.369s
> > > 
> > > But with the following hack:
> > > diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
> > > index 7dd523a7e32d..c417b817ec5d 100644
> > > --- a/kernel/bpf/trampoline.c
> > > +++ b/kernel/bpf/trampoline.c
> > > @@ -217,7 +217,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
> > >          * programs finish executing.
> > >          * Wait for these two grace periods together.
> > >          */
> > > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > +//     synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > 
> > > I see:
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real    0m1.588s
> > > user    0m0.131s
> > > sys    0m1.342s
> > > 
> > > It takes an extra minute to do 40 sync rcu_tasks_trace calls.
> > > It means that every sync takes more than a second.
> > > That feels excessive.
> > > 
> > > Doing:
> > > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > +       synchronize_rcu();
> > > is also fast:
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real    0m2.089s
> > > user    0m0.139s
> > > sys    0m1.282s
> > > 
> > > sync rcu_tasks() is fast too:
> > > -       synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > +       synchronize_rcu_tasks();
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real    0m2.209s
> > > user    0m0.117s
> > > sys    0m1.344s
> > > 
> > > so it's really something going on with sync rcu_tasks_trace.
> > > Could you please take a look?
> > 
> > I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> > If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.
> 
> hi,
> I noticed the slowdown as well, and adding CONFIG_TASKS_TRACE_RCU_READ_MB=y
> speeds it up for me

Thank you for testing this!  This will most likely also degrade read-side
performance beyond what is reasonable.  So could you please also try
the kernel boot parameter called out below?

Nevertheless, the fact that this fixes things does mean that a solution
exists.  Now to close in on it.  ;-)

(For example, it might be necessary to provide per-flavor tasklist
scan backoffs and/or it might be necessary to adjust the default for
rcupdate.rcu_task_ipi_delay=50.)

							Thanx, Paul

> thanks,
> jirka
> 
> > 
> > Otherwise (or alternatively), could you please try booting with
> > rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> > second on a HZ=1000 system, which on a busy system could easily result
> > in the grace-period delays that you are seeing.  The value of this
> > kernel boot parameter does interact with the tasklist-scan backoffs,
> > so its effect will not likely be linear.
> > 
> > Do either of those approaches help?
> > 
> > 							Thanx, Paul
> > 
> 


* Re: slow sync rcu_tasks_trace
  2020-09-09 11:38 ` Paul E. McKenney
  2020-09-09 15:10   ` Jiri Olsa
@ 2020-09-09 17:12   ` Alexei Starovoitov
  2020-09-09 17:35     ` Paul E. McKenney
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-09 17:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > Hi Paul,
> > 
> > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > it earlier.
> > 
> > In selftests/bpf try:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real    1m17.082s
> > user    0m0.145s
> > sys    0m1.369s
> > 
> > so it's really something going on with sync rcu_tasks_trace.
> > Could you please take a look?
> 
> I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.

I've added
CONFIG_RCU_EXPERT=y
CONFIG_TASKS_TRACE_RCU_READ_MB=y

and it helped:

time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real	0m8.924s
user	0m0.138s
sys	0m1.408s

But this is still bad. It's 4 times slower vs rcu_tasks
and isn't really usable for bpf, since it adds memory barriers exactly
where we need them removed.
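
(For context, the tasks-trace readers sit directly on the program-invocation
fast path. Purely as an illustration, with run_sleepable_prog() standing in
for the real prog invocation:

	rcu_read_lock_trace();
	ret = run_sleepable_prog(prog, ctx);	/* may sleep, e.g. copy_from_user() */
	rcu_read_unlock_trace();

so with CONFIG_TASKS_TRACE_RCU_READ_MB=y every invocation pays for the extra
ordering inside the lock/unlock pair.)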

In the default configuration rcu_tasks_trace is 40 (!) times slower than rcu_tasks.
This huge difference in sync times concerns me a lot.
If bpf has to use memory barriers in rcu_read_lock_trace
and still be 4 times slower than rcu_tasks in the best case,
then there is not much point in rcu_tasks_trace.
Converting everything to srcu would be better, but I really hope
you can find a solution to this tasks_trace issue.

> Otherwise (or alternatively), could you please try booting with
> rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> second on a HZ=1000 system, which on a busy system could easily result
> in the grace-period delays that you are seeing.  The value of this
> kernel boot parameter does interact with the tasklist-scan backoffs,
> so its effect will not likely be linear.

The tests were run on a freshly booted VM with 4 CPUs. The VM is idle.
The host is idle too.

Adding rcupdate.rcu_task_ipi_delay=50 boot param sort-of helped:
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real	0m25.890s
user	0m0.124s
sys	0m1.507s
It is still awful.

From "perf report" there is little time spend in the kernel. The kernel is
waiting on something. I thought in theory the rcu_tasks_trace should have been
faster on update side vs rcu_tasks ? Could it be a bug somewhere and some
missing wakeup? It doesn't feel that it works as intended. Whatever it is
please try to reproduce it to remove me as a middle man.


* Re: slow sync rcu_tasks_trace
  2020-09-09 17:12   ` Alexei Starovoitov
@ 2020-09-09 17:35     ` Paul E. McKenney
  2020-09-09 18:04       ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 17:35 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 10:12:28AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > > Hi Paul,
> > > 
> > > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > > it earlier.
> > > 
> > > In selftests/bpf try:
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real    1m17.082s
> > > user    0m0.145s
> > > sys    0m1.369s
> > > 
> > > so it's really something going on with sync rcu_tasks_trace.
> > > Could you please take a look?
> > 
> > I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> > If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.
> 
> I've added
> CONFIG_RCU_EXPERT=y
> CONFIG_TASKS_TRACE_RCU_READ_MB=y
> 
> and it helped:
> 
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real	0m8.924s
> user	0m0.138s
> sys	0m1.408s
> 
> But this is still bad. It's 4 times slower vs rcu_tasks
> and isn't really usable for bpf, since it adds memory barriers exactly
> where we need them removed.
> 
> In the default configuration rcu_tasks_trace is 40! times slower than rcu_tasks.
> This huge difference in sync times concerns me a lot.
> If bpf has to use memory barriers in rcu_read_lock_trace
> and still be 4 times slower than rcu_tasks in the best case
> then there is no much point in rcu_tasks_trace.
> Converting everything to srcu would be better, but I really hope
> you can find a solution to this tasks_trace issue.
> 
> > Otherwise (or alternatively), could you please try booting with
> > rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> > second on a HZ=1000 system, which on a busy system could easily result
> > in the grace-period delays that you are seeing.  The value of this
> > kernel boot parameter does interact with the tasklist-scan backoffs,
> > so its effect will not likely be linear.
> 
> The tests were run on freshly booted VM with 4 cpus. The VM is idle.
> The host is idle too.
> 
> Adding rcupdate.rcu_task_ipi_delay=50 boot param sort-of helped:
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real	0m25.890s
> user	0m0.124s
> sys	0m1.507s
> It is still awful.
> 
> >From "perf report" there is little time spend in the kernel. The kernel is
> waiting on something. I thought in theory the rcu_tasks_trace should have been
> faster on update side vs rcu_tasks ? Could it be a bug somewhere and some
> missing wakeup? It doesn't feel that it works as intended. Whatever it is
> please try to reproduce it to remove me as a middle man.

On it.

To be fair, I was designing for a nominal one-second grace period,
which was also the rough goal for rcu_tasks.

When do you need this by?

Left to myself, I will aim for the merge window after the upcoming one,
and then backport to the prior -stable versions having RCU tasks trace.

							Thanx, Paul


* Re: slow sync rcu_tasks_trace
  2020-09-09 17:35     ` Paul E. McKenney
@ 2020-09-09 18:04       ` Alexei Starovoitov
  2020-09-09 19:39         ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-09 18:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 10:35:12AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 10:12:28AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > > > Hi Paul,
> > > > 
> > > > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > > > it earlier.
> > > > 
> > > > In selftests/bpf try:
> > > > time ./test_progs -t trampoline_count
> > > > #101 trampoline_count:OK
> > > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > > 
> > > > real    1m17.082s
> > > > user    0m0.145s
> > > > sys    0m1.369s
> > > > 
> > > > so it's really something going on with sync rcu_tasks_trace.
> > > > Could you please take a look?
> > > 
> > > I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> > > If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.
> > 
> > I've added
> > CONFIG_RCU_EXPERT=y
> > CONFIG_TASKS_TRACE_RCU_READ_MB=y
> > 
> > and it helped:
> > 
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real	0m8.924s
> > user	0m0.138s
> > sys	0m1.408s
> > 
> > But this is still bad. It's 4 times slower vs rcu_tasks
> > and isn't really usable for bpf, since it adds memory barriers exactly
> > where we need them removed.
> > 
> > In the default configuration rcu_tasks_trace is 40! times slower than rcu_tasks.
> > This huge difference in sync times concerns me a lot.
> > If bpf has to use memory barriers in rcu_read_lock_trace
> > and still be 4 times slower than rcu_tasks in the best case
> > then there is no much point in rcu_tasks_trace.
> > Converting everything to srcu would be better, but I really hope
> > you can find a solution to this tasks_trace issue.
> > 
> > > Otherwise (or alternatively), could you please try booting with
> > > rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> > > second on a HZ=1000 system, which on a busy system could easily result
> > > in the grace-period delays that you are seeing.  The value of this
> > > kernel boot parameter does interact with the tasklist-scan backoffs,
> > > so its effect will not likely be linear.
> > 
> > The tests were run on freshly booted VM with 4 cpus. The VM is idle.
> > The host is idle too.
> > 
> > Adding rcupdate.rcu_task_ipi_delay=50 boot param sort-of helped:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real	0m25.890s
> > user	0m0.124s
> > sys	0m1.507s
> > It is still awful.
> > 
> > >From "perf report" there is little time spend in the kernel. The kernel is
> > waiting on something. I thought in theory the rcu_tasks_trace should have been
> > faster on update side vs rcu_tasks ? Could it be a bug somewhere and some
> > missing wakeup? It doesn't feel that it works as intended. Whatever it is
> > please try to reproduce it to remove me as a middle man.
> 
> On it.
> 
> To be fair, I was designing for a nominal one-second grace period,
> which was also the rough goal for rcu_tasks.
> 
> When do you need this by?
> 
> Left to myself, I will aim for the merge window after the upcoming one,
> and then backport to the prior -stable versions having RCU tasks trace.

That would be too late.
We would have to disable sleepable bpf progs or convert them to srcu.
bcc/bpftrace have a limit of 1000 probes for regexes to make sure
these tools don't add too many kprobes to the kernel at once.
Right now fentry/fexit/freplace are using trampoline which does
synchronize_rcu_tasks(). My measurements show that it's roughly
equal to synchronize_rcu() on an idle box and perfectly capable of
being a replacement for kprobe-based attaching.
It's not uncommon to attach a hundred kprobes or fentry probes at
start time. So the bpf trampoline has to be able to do 1000 in a second.
And it was the case before sleepable got added to the trampoline.
Now it's doing:
synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
and it's causing this massive slowdown which makes bpf trampoline
pretty much unusable and everything that builds on top suffers.
I can add a counter of sleepable progs to trampoline and do
either sync rcu_tasks or sync_mult(tasks, tasks_trace),
but we've discussed exactly that idea a few months back and concluded that
rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
bother with the counter. I can still add it, but slow rcu_tasks_trace
means that sleepable progs are not usable due to slow startup time,
so we have to do something about sleepable anyway.
So "when do you need this by?" the answer is asap.
I'm considering such changes to be a bugfix, not a feature.
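
(To make the counter idea concrete, in bpf_trampoline_update() it would look
roughly like this; sleepable_cnt is a hypothetical field, shown only to
sketch the shape:

	/* Only pay for the tasks-trace grace period when sleepable
	 * programs are actually attached to this trampoline.
	 */
	if (tr->sleepable_cnt)
		synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
	else
		synchronize_rcu_tasks();
)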


* Re: slow sync rcu_tasks_trace
  2020-09-09 18:04       ` Alexei Starovoitov
@ 2020-09-09 19:39         ` Paul E. McKenney
  2020-09-09 19:48           ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 19:39 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 11:04:18AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 09, 2020 at 10:35:12AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 09, 2020 at 10:12:28AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > > > > it earlier.
> > > > > 
> > > > > In selftests/bpf try:
> > > > > time ./test_progs -t trampoline_count
> > > > > #101 trampoline_count:OK
> > > > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > > > 
> > > > > real    1m17.082s
> > > > > user    0m0.145s
> > > > > sys    0m1.369s
> > > > > 
> > > > > so it's really something going on with sync rcu_tasks_trace.
> > > > > Could you please take a look?
> > > > 
> > > > I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> > > > If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.
> > > 
> > > I've added
> > > CONFIG_RCU_EXPERT=y
> > > CONFIG_TASKS_TRACE_RCU_READ_MB=y
> > > 
> > > and it helped:
> > > 
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real	0m8.924s
> > > user	0m0.138s
> > > sys	0m1.408s
> > > 
> > > But this is still bad. It's 4 times slower vs rcu_tasks
> > > and isn't really usable for bpf, since it adds memory barriers exactly
> > > where we need them removed.
> > > 
> > > In the default configuration rcu_tasks_trace is 40! times slower than rcu_tasks.
> > > This huge difference in sync times concerns me a lot.
> > > If bpf has to use memory barriers in rcu_read_lock_trace
> > > and still be 4 times slower than rcu_tasks in the best case
> > > then there is no much point in rcu_tasks_trace.
> > > Converting everything to srcu would be better, but I really hope
> > > you can find a solution to this tasks_trace issue.
> > > 
> > > > Otherwise (or alternatively), could you please try booting with
> > > > rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> > > > second on a HZ=1000 system, which on a busy system could easily result
> > > > in the grace-period delays that you are seeing.  The value of this
> > > > kernel boot parameter does interact with the tasklist-scan backoffs,
> > > > so its effect will not likely be linear.
> > > 
> > > The tests were run on freshly booted VM with 4 cpus. The VM is idle.
> > > The host is idle too.
> > > 
> > > Adding rcupdate.rcu_task_ipi_delay=50 boot param sort-of helped:
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real	0m25.890s
> > > user	0m0.124s
> > > sys	0m1.507s
> > > It is still awful.
> > > 
> > > >From "perf report" there is little time spend in the kernel. The kernel is
> > > waiting on something. I thought in theory the rcu_tasks_trace should have been
> > > faster on update side vs rcu_tasks ? Could it be a bug somewhere and some
> > > missing wakeup? It doesn't feel that it works as intended. Whatever it is
> > > please try to reproduce it to remove me as a middle man.
> > 
> > On it.
> > 
> > To be fair, I was designing for a nominal one-second grace period,
> > which was also the rough goal for rcu_tasks.
> > 
> > When do you need this by?
> > 
> > Left to myself, I will aim for the merge window after the upcoming one,
> > and then backport to the prior -stable versions having RCU tasks trace.
> 
> That would be too late.
> We would have to disable sleepable bpf progs or convert them to srcu.
> bcc/bpftrace have a limit of 1000 probes for regexes to make sure
> these tools don't add too many kprobes to the kernel at once.
> Right now fentry/fexit/freplace are using trampoline which does
> synchronize_rcu_tasks(). My measurements show that it's roughly
> equal to synchronize_rcu() on idle box and perfectly capable to
> be a replacement for kprobe based attaching.
> It's not uncommon to attach a hundred kprobes or fentry probes at
> a start time. So bpf trampoline has to be able to do 1000 in a second.
> And it was the case before sleepable got added to the trampoline.
> Now it's doing:
> synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> and it's causing this massive slowdown which makes bpf trampoline
> pretty much unusable and everything that builds on top suffers.
> I can add a counter of sleepable progs to trampoline and do
> either sync rcu_tasks or sync_mult(tasks, tasks_trace),
> but we've discussed exactly that idea few months back and concluded that
> rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
> bother with the counter. I can still add it, but slow rcu_tasks_trace
> means that sleepable progs are not usable due to slow startup time,
> so have to do something with sleepable anyway.
> So "when do you need this by?" the answer is asap.
> I'm considering such changes to be a bugfix, not a feture.

Got it.

With the patch below, I am able to reproduce this issue, as expected.

My plan is to try the following:

1.	Parameterize the backoff sequence so that RCU Tasks Trace
	uses faster rechecking than does RCU Tasks.  Experiment as
	needed to arrive at a good backoff value.

2.	If the tasks-list scan turns out to be a tighter bottleneck 
	than the backoff waits, look into parallelizing this scan.
	(This seems unlikely, but the fact remains that RCU Tasks
	Trace must do a bit more work per task than RCU Tasks.)

3.	If these two approaches still don't get the update-side
	latency where it needs to be, improvise.

The exact path into mainline will of course depend on how far down this
list I must go, but first to get a solution.

							Thanx, Paul

------------------------------------------------------------------------

commit 1b5b6a341cc17b5f236bceca3d1cfb23e39176b5
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Sep 9 12:27:03 2020 -0700

    rcuscale: Add RCU Tasks Trace
    
    This commit adds the ability to test performance and scalability of RCU
    Tasks Trace updaters.
    
    Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 2819b95..c42f240 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -38,6 +38,7 @@
 #include <asm/byteorder.h>
 #include <linux/torture.h>
 #include <linux/vmalloc.h>
+#include <linux/rcupdate_trace.h>
 
 #include "rcu.h"
 
@@ -294,6 +295,35 @@ static struct rcu_scale_ops tasks_ops = {
 	.name		= "tasks"
 };
 
+/*
+ * Definitions for RCU-tasks-trace scalability testing.
+ */
+
+static int tasks_trace_scale_read_lock(void)
+{
+	rcu_read_lock_trace();
+	return 0;
+}
+
+static void tasks_trace_scale_read_unlock(int idx)
+{
+	rcu_read_unlock_trace();
+}
+
+static struct rcu_scale_ops tasks_tracing_ops = {
+	.ptype		= RCU_TASKS_FLAVOR,
+	.init		= rcu_sync_scale_init,
+	.readlock	= tasks_trace_scale_read_lock,
+	.readunlock	= tasks_trace_scale_read_unlock,
+	.get_gp_seq	= rcu_no_completed,
+	.gp_diff	= rcu_seq_diff,
+	.async		= call_rcu_tasks_trace,
+	.gp_barrier	= rcu_barrier_tasks_trace,
+	.sync		= synchronize_rcu_tasks_trace,
+	.exp_sync	= synchronize_rcu_tasks_trace,
+	.name		= "tasks-tracing"
+};
+
 static unsigned long rcuscale_seq_diff(unsigned long new, unsigned long old)
 {
 	if (!cur_ops->gp_diff)
@@ -754,7 +784,7 @@ rcu_scale_init(void)
 	long i;
 	int firsterr = 0;
 	static struct rcu_scale_ops *scale_ops[] = {
-		&rcu_ops, &srcu_ops, &srcud_ops, &tasks_ops,
+		&rcu_ops, &srcu_ops, &srcud_ops, &tasks_ops, &tasks_tracing_ops
 	};
 
 	if (!torture_init_begin(scale_type, verbose))
diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/CFcommon b/tools/testing/selftests/rcutorture/configs/rcuscale/CFcommon
index 87caa0e..90942bb 100644
--- a/tools/testing/selftests/rcutorture/configs/rcuscale/CFcommon
+++ b/tools/testing/selftests/rcutorture/configs/rcuscale/CFcommon
@@ -1,2 +1,5 @@
 CONFIG_RCU_SCALE_TEST=y
 CONFIG_PRINTK_TIME=y
+CONFIG_TASKS_RCU_GENERIC=y
+CONFIG_TASKS_RCU=y
+CONFIG_TASKS_TRACE_RCU=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
new file mode 100644
index 0000000..4255490
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
@@ -0,0 +1,18 @@
+CONFIG_SMP=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=n
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=y
+CONFIG_NO_HZ_FULL=n
+CONFIG_RCU_FAST_NO_HZ=n
+CONFIG_HOTPLUG_CPU=n
+CONFIG_SUSPEND=n
+CONFIG_HIBERNATION=n
+CONFIG_RCU_NOCB_CPU=n
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_PROVE_LOCKING=n
+CONFIG_RCU_BOOST=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_TRACE=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01.boot b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01.boot
new file mode 100644
index 0000000..af0aff1
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01.boot
@@ -0,0 +1 @@
+rcuscale.scale_type=tasks-tracing
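
(These would be exercised via the usual rcutorture scripts, presumably along
the lines of "tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale
--configs TRACE01"; the exact invocation is assumed here, not part of the
patch.)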


* Re: slow sync rcu_tasks_trace
  2020-09-09 19:39         ` Paul E. McKenney
@ 2020-09-09 19:48           ` Alexei Starovoitov
  2020-09-09 21:04             ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-09 19:48 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> > > 
> > > When do you need this by?
> > > 
> > > Left to myself, I will aim for the merge window after the upcoming one,
> > > and then backport to the prior -stable versions having RCU tasks trace.
> > 
> > That would be too late.
> > We would have to disable sleepable bpf progs or convert them to srcu.
> > bcc/bpftrace have a limit of 1000 probes for regexes to make sure
> > these tools don't add too many kprobes to the kernel at once.
> > Right now fentry/fexit/freplace are using trampoline which does
> > synchronize_rcu_tasks(). My measurements show that it's roughly
> > equal to synchronize_rcu() on idle box and perfectly capable to
> > be a replacement for kprobe based attaching.
> > It's not uncommon to attach a hundred kprobes or fentry probes at
> > a start time. So bpf trampoline has to be able to do 1000 in a second.
> > And it was the case before sleepable got added to the trampoline.
> > Now it's doing:
> > synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > and it's causing this massive slowdown which makes bpf trampoline
> > pretty much unusable and everything that builds on top suffers.
> > I can add a counter of sleepable progs to trampoline and do
> > either sync rcu_tasks or sync_mult(tasks, tasks_trace),
> > but we've discussed exactly that idea few months back and concluded that
> > rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
> > bother with the counter. I can still add it, but slow rcu_tasks_trace
> > means that sleepable progs are not usable due to slow startup time,
> > so have to do something with sleepable anyway.
> > So "when do you need this by?" the answer is asap.
> > I'm considering such changes to be a bugfix, not a feture.
> 
> Got it.
> 
> With the patch below, I am able to reproduce this issue, as expected.

I think your test is more stressful than mine.
test_progs -t trampoline_count
doesn't run the sleepable progs. So there is no lock/unlock_trace at all.
It's updating the trampoline and doing sync_mult(), that's all.

> My plan is to try the following:
> 
> 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> 	uses faster rechecking than does RCU Tasks.  Experiment as
> 	needed to arrive at a good backoff value.
> 
> 2.	If the tasks-list scan turns out to be a tighter bottleneck 
> 	than the backoff waits, look into parallelizing this scan.
> 	(This seems unlikely, but the fact remains that RCU Tasks
> 	Trace must do a bit more work per task than RCU Tasks.)
> 
> 3.	If these two approaches, still don't get the update-side
> 	latency where it needs to be, improvise.
> 
> The exact path into mainline will of course depend on how far down this
> list I must go, but first to get a solution.

I think there is a case 4: nothing is inside an rcu_trace critical section.
I would expect a single IPI to confirm that.


* Re: slow sync rcu_tasks_trace
  2020-09-09 19:48           ` Alexei Starovoitov
@ 2020-09-09 21:04             ` Paul E. McKenney
  2020-09-09 21:22               ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 21:04 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > When do you need this by?
> > > > 
> > > > Left to myself, I will aim for the merge window after the upcoming one,
> > > > and then backport to the prior -stable versions having RCU tasks trace.
> > > 
> > > That would be too late.
> > > We would have to disable sleepable bpf progs or convert them to srcu.
> > > bcc/bpftrace have a limit of 1000 probes for regexes to make sure
> > > these tools don't add too many kprobes to the kernel at once.
> > > Right now fentry/fexit/freplace are using trampoline which does
> > > synchronize_rcu_tasks(). My measurements show that it's roughly
> > > equal to synchronize_rcu() on idle box and perfectly capable to
> > > be a replacement for kprobe based attaching.
> > > It's not uncommon to attach a hundred kprobes or fentry probes at
> > > a start time. So bpf trampoline has to be able to do 1000 in a second.
> > > And it was the case before sleepable got added to the trampoline.
> > > Now it's doing:
> > > synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > and it's causing this massive slowdown which makes bpf trampoline
> > > pretty much unusable and everything that builds on top suffers.
> > > I can add a counter of sleepable progs to trampoline and do
> > > either sync rcu_tasks or sync_mult(tasks, tasks_trace),
> > > but we've discussed exactly that idea few months back and concluded that
> > > rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
> > > bother with the counter. I can still add it, but slow rcu_tasks_trace
> > > means that sleepable progs are not usable due to slow startup time,
> > > so have to do something with sleepable anyway.
> > > So "when do you need this by?" the answer is asap.
> > > I'm considering such changes to be a bugfix, not a feture.
> > 
> > Got it.
> > 
> > With the patch below, I am able to reproduce this issue, as expected.
> 
> I think your tests is more stressful than mine.
> test_progs -t trampoline_count
> doesn't run the sleepable progs. So there is no lock/unlock_trace at all.
> It's updating trampoline and doing sync_mult() that's all.
> 
> > My plan is to try the following:
> > 
> > 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> > 	uses faster rechecking than does RCU Tasks.  Experiment as
> > 	needed to arrive at a good backoff value.
> > 
> > 2.	If the tasks-list scan turns out to be a tighter bottleneck 
> > 	than the backoff waits, look into parallelizing this scan.
> > 	(This seems unlikely, but the fact remains that RCU Tasks
> > 	Trace must do a bit more work per task than RCU Tasks.)
> > 
> > 3.	If these two approaches, still don't get the update-side
> > 	latency where it needs to be, improvise.
> > 
> > The exact path into mainline will of course depend on how far down this
> > list I must go, but first to get a solution.
> 
> I think there is a case of 4. Nothing is inside rcu_trace critical section.
> I would expect single ipi would confirm that.

Unless the task moves, yes.  So a single IPI should suffice in the
common case.

							Thanx, Paul


* Re: slow sync rcu_tasks_trace
  2020-09-09 21:04             ` Paul E. McKenney
@ 2020-09-09 21:22               ` Paul E. McKenney
  2020-09-10  5:27                 ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-09 21:22 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > When do you need this by?
> > > > > 
> > > > > Left to myself, I will aim for the merge window after the upcoming one,
> > > > > and then backport to the prior -stable versions having RCU tasks trace.
> > > > 
> > > > That would be too late.
> > > > We would have to disable sleepable bpf progs or convert them to srcu.
> > > > bcc/bpftrace have a limit of 1000 probes for regexes to make sure
> > > > these tools don't add too many kprobes to the kernel at once.
> > > > Right now fentry/fexit/freplace are using trampoline which does
> > > > synchronize_rcu_tasks(). My measurements show that it's roughly
> > > > equal to synchronize_rcu() on idle box and perfectly capable to
> > > > be a replacement for kprobe based attaching.
> > > > It's not uncommon to attach a hundred kprobes or fentry probes at
> > > > a start time. So bpf trampoline has to be able to do 1000 in a second.
> > > > And it was the case before sleepable got added to the trampoline.
> > > > Now it's doing:
> > > > synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
> > > > and it's causing this massive slowdown which makes bpf trampoline
> > > > pretty much unusable and everything that builds on top suffers.
> > > > I can add a counter of sleepable progs to trampoline and do
> > > > either sync rcu_tasks or sync_mult(tasks, tasks_trace),
> > > > but we've discussed exactly that idea few months back and concluded that
> > > > rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
> > > > bother with the counter. I can still add it, but slow rcu_tasks_trace
> > > > means that sleepable progs are not usable due to slow startup time,
> > > > so have to do something with sleepable anyway.
> > > > So "when do you need this by?" the answer is asap.
> > > > I'm considering such changes to be a bugfix, not a feture.
> > > 
> > > Got it.
> > > 
> > > With the patch below, I am able to reproduce this issue, as expected.
> > 
> > I think your tests is more stressful than mine.
> > test_progs -t trampoline_count
> > doesn't run the sleepable progs. So there is no lock/unlock_trace at all.
> > It's updating trampoline and doing sync_mult() that's all.
> > 
> > > My plan is to try the following:
> > > 
> > > 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> > > 	uses faster rechecking than does RCU Tasks.  Experiment as
> > > 	needed to arrive at a good backoff value.
> > > 
> > > 2.	If the tasks-list scan turns out to be a tighter bottleneck 
> > > 	than the backoff waits, look into parallelizing this scan.
> > > 	(This seems unlikely, but the fact remains that RCU Tasks
> > > 	Trace must do a bit more work per task than RCU Tasks.)
> > > 
> > > 3.	If these two approaches, still don't get the update-side
> > > 	latency where it needs to be, improvise.
> > > 
> > > The exact path into mainline will of course depend on how far down this
> > > list I must go, but first to get a solution.
> > 
> > I think there is a case of 4. Nothing is inside rcu_trace critical section.
> > I would expect single ipi would confirm that.
> 
> Unless the task moves, yes.  So a single IPI should suffice in the
> common case.

And what I am doing now is checking code paths.

							Thanx, Paul


* Re: slow sync rcu_tasks_trace
  2020-09-09 21:22               ` Paul E. McKenney
@ 2020-09-10  5:27                 ` Paul E. McKenney
  2020-09-10 18:33                   ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-10  5:27 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
> > > On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > > > My plan is to try the following:
> > > > 
> > > > 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> > > > 	uses faster rechecking than does RCU Tasks.  Experiment as
> > > > 	needed to arrive at a good backoff value.
> > > > 
> > > > 2.	If the tasks-list scan turns out to be a tighter bottleneck 
> > > > 	than the backoff waits, look into parallelizing this scan.
> > > > 	(This seems unlikely, but the fact remains that RCU Tasks
> > > > 	Trace must do a bit more work per task than RCU Tasks.)
> > > > 
> > > > 3.	If these two approaches, still don't get the update-side
> > > > 	latency where it needs to be, improvise.
> > > > 
> > > > The exact path into mainline will of course depend on how far down this
> > > > list I must go, but first to get a solution.
> > > 
> > > I think there is a case of 4. Nothing is inside rcu_trace critical section.
> > > I would expect single ipi would confirm that.
> > 
> > Unless the task moves, yes.  So a single IPI should suffice in the
> > common case.
> 
> And what I am doing now is checking code paths.

And the following diff from a set of three patches gets my average
RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
almost a 50x improvement from earlier today.

These are still quite rough and not yet suited for production use, but
I will be testing.  If that goes well, I hope to send a more polished
set of patches by end of day tomorrow, Pacific Time.  But if you get a
chance to test them, I would value any feedback that you might have.

These patches do not require hand-tuning; they instead adjust the
behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
adjusts according to CONFIG_PREEMPT_RT.  So you should get the desired
latency reductions "out of the box", again, without tuning.

							Thanx, Paul

-----------------------------------------------------------------------

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 978508e..a0eaed5 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -28,6 +28,8 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @gp_state: Grace period's most recent state transition (debugging).
+ * @gp_sleep: Per-grace-period sleep to prevent CPU-bound looping.
+ * @init_fract: Initial backoff sleep interval.
  * @gp_jiffies: Time of last @gp_state transition.
  * @gp_start: Most recent grace-period start in jiffies.
  * @n_gps: Number of grace periods completed since boot.
@@ -48,6 +50,8 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	int gp_state;
+	int gp_sleep;
+	int init_fract;
 	unsigned long gp_jiffies;
 	unsigned long gp_start;
 	unsigned long n_gps;
@@ -81,7 +85,7 @@ static struct rcu_tasks rt_name =					\
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);
 
 /* Avoid IPIing CPUs early in the grace period. */
-#define RCU_TASK_IPI_DELAY (HZ / 2)
+#define RCU_TASK_IPI_DELAY (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) ? HZ / 2 : 0)
 static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
 module_param(rcu_task_ipi_delay, int, 0644);
 
@@ -231,7 +235,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			cond_resched();
 		}
 		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_idle(HZ/10);
+		schedule_timeout_idle(rtp->gp_sleep);
 
 		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 	}
@@ -329,8 +333,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	 */
 	lastreport = jiffies;
 
-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
+	// Start off with initial wait and slowly back off to 1 HZ wait.
+	fract = rtp->init_fract;
+	if (fract > HZ)
+		fract = HZ;
 
 	for (;;) {
 		bool firstreport;
@@ -553,6 +559,8 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);
 
 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.gp_sleep = HZ / 10;
+	rcu_tasks.init_fract = 10;
 	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
 	rcu_tasks.pertask_func = rcu_tasks_pertask;
 	rcu_tasks.postscan_func = rcu_tasks_postscan;
@@ -685,6 +693,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);
 
 static int __init rcu_spawn_tasks_rude_kthread(void)
 {
+	rcu_tasks_rude.gp_sleep = HZ / 10;
 	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
 	return 0;
 }
@@ -911,7 +920,8 @@ static void trc_wait_for_one_reader(struct task_struct *t,
 
 	// If currently running, send an IPI, either way, add to list.
 	trc_add_holdout(t, bhp);
-	if (task_curr(t) && time_after(jiffies, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
+	if (task_curr(t) &&
+	    time_after(jiffies + 1, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
 		// The task is currently running, so try IPIing it.
 		cpu = task_cpu(t);
 
@@ -1163,6 +1173,17 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
 
 static int __init rcu_spawn_tasks_trace_kthread(void)
 {
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) {
+		rcu_tasks_trace.gp_sleep = HZ / 10;
+		rcu_tasks_trace.init_fract = 10;
+	} else {
+		rcu_tasks_trace.gp_sleep = HZ / 200;
+		if (rcu_tasks_trace.gp_sleep <= 0)
+			rcu_tasks_trace.gp_sleep = 1;
+		rcu_tasks_trace.init_fract = HZ / 5;
+		if (rcu_tasks_trace.init_fract <= 0)
+			rcu_tasks_trace.init_fract = 1;
+	}
 	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
 	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
 	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
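
(Concretely, for the CONFIG_TASKS_TRACE_RCU_READ_MB=n case on a HZ=1000
system: the per-grace-period sleep drops from HZ/10 (100 ms) to HZ/200 (5 ms),
the initial holdout-scan backoff likewise drops from 100 ms to 5 ms, and
RCU_TASK_IPI_DELAY becomes 0, so the IPIs are no longer held off for the
first half second of the grace period.)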


* Re: slow sync rcu_tasks_trace
  2020-09-10  5:27                 ` Paul E. McKenney
@ 2020-09-10 18:33                   ` Alexei Starovoitov
  2020-09-10 18:51                     ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-10 18:33 UTC (permalink / raw)
  To: paulmck, Alexei Starovoitov; +Cc: bpf, Daniel Borkmann, Kernel Team

On 9/9/20 10:27 PM, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
>> On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
>>> On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
>>>> On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
>>>>> My plan is to try the following:
>>>>>
>>>>> 1.	Parameterize the backoff sequence so that RCU Tasks Trace
>>>>> 	uses faster rechecking than does RCU Tasks.  Experiment as
>>>>> 	needed to arrive at a good backoff value.
>>>>>
>>>>> 2.	If the tasks-list scan turns out to be a tighter bottleneck
>>>>> 	than the backoff waits, look into parallelizing this scan.
>>>>> 	(This seems unlikely, but the fact remains that RCU Tasks
>>>>> 	Trace must do a bit more work per task than RCU Tasks.)
>>>>>
>>>>> 3.	If these two approaches, still don't get the update-side
>>>>> 	latency where it needs to be, improvise.
>>>>>
>>>>> The exact path into mainline will of course depend on how far down this
>>>>> list I must go, but first to get a solution.
>>>>
>>>> I think there is a case of 4. Nothing is inside rcu_trace critical section.
>>>> I would expect single ipi would confirm that.
>>>
>>> Unless the task moves, yes.  So a single IPI should suffice in the
>>> common case.
>>
>> And what I am doing now is checking code paths.
> 
> And the following diff from a set of three patches gets my average
> RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
> almost a 50x improvement from earlier today.
> 
> These are still quite rough and not yet suited for production use, but
> I will be testing.  If that goes well, I hope to send a more polished
> set of patches by end of day tomorrow, Pacific Time.  But if you get a
> chance to test them, I would value any feedback that you might have.
> 
> These patches do not require hand-tuning, they instead adjust the
> behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
> adjusts according to CONFIG_PREEMPT_RT.  So you should get the desired
> latency reductions "out of the box", again, without tuning.

Great. Confirming improvement :)

time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

real	0m2.897s
user	0m0.128s
sys	0m1.527s

This is without CONFIG_TASKS_TRACE_RCU_READ_MB, of course.


* Re: slow sync rcu_tasks_trace
  2020-09-10 18:33                   ` Alexei Starovoitov
@ 2020-09-10 18:51                     ` Paul E. McKenney
  2020-09-10 19:04                       ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-10 18:51 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, bpf, Daniel Borkmann, Kernel Team

On Thu, Sep 10, 2020 at 11:33:58AM -0700, Alexei Starovoitov wrote:
> On 9/9/20 10:27 PM, Paul E. McKenney wrote:
> > On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
> > > > > On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
> > > > > > My plan is to try the following:
> > > > > > 
> > > > > > 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> > > > > > 	uses faster rechecking than does RCU Tasks.  Experiment as
> > > > > > 	needed to arrive at a good backoff value.
> > > > > > 
> > > > > > 2.	If the tasks-list scan turns out to be a tighter bottleneck
> > > > > > 	than the backoff waits, look into parallelizing this scan.
> > > > > > 	(This seems unlikely, but the fact remains that RCU Tasks
> > > > > > 	Trace must do a bit more work per task than RCU Tasks.)
> > > > > > 
> > > > > > 3.	If these two approaches, still don't get the update-side
> > > > > > 	latency where it needs to be, improvise.
> > > > > > 
> > > > > > The exact path into mainline will of course depend on how far down this
> > > > > > list I must go, but first to get a solution.
> > > > > 
> > > > > I think there is a case of 4. Nothing is inside rcu_trace critical section.
> > > > > I would expect single ipi would confirm that.
> > > > 
> > > > Unless the task moves, yes.  So a single IPI should suffice in the
> > > > common case.
> > > 
> > > And what I am doing now is checking code paths.
> > 
> > And the following diff from a set of three patches gets my average
> > RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
> > almost a 50x improvement from earlier today.
> > 
> > These are still quite rough and not yet suited for production use, but
> > I will be testing.  If that goes well, I hope to send a more polished
> > set of patches by end of day tomorrow, Pacific Time.  But if you get a
> > chance to test them, I would value any feedback that you might have.
> > 
> > These patches do not require hand-tuning, they instead adjust the
> > behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
> > adjusts according to CONFIG_PREEMPT_RT.  So you should get the desired
> > latency reductions "out of the box", again, without tuning.
> 
> Great. Confirming improvement :)
> 
> time ./test_progs -t trampoline_count
> #101 trampoline_count:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> real	0m2.897s
> user	0m0.128s
> sys	0m1.527s
> 
> This is without CONFIG_TASKS_TRACE_RCU_READ_MB, of course.

Good to hear, thank you!

Or is more required?  I can tweak to get more.  There is never a free
lunch, though, and in this case the downside of further tweaking would
be greater CPU overhead.  Alternatively, I could just as easily tweak
it to be slower, thereby reducing the CPU overhead.

If I don't hear otherwise, I will assume that the current settings
work fine.

Of course, if people start removing thousands of BPF programs at one go,
I suspect that it will be necessary to provide a bulk-removal operation,
similar to some of the bulk-configuration-change operations provided by
networking.  The idea is to have a single RCU Tasks Trace grace period
cover all of the thousands of BPF removal operations.
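
As a very rough sketch of that idea (the struct, list, and helper names
below are made up for this email; this is not existing or proposed
kernel code), removals could queue their trampoline image and a single
combined tasks + tasks-trace wait would then be amortized over
everything queued:

struct tramp_free_work {
	struct list_head list;
	void *image;			/* trampoline text to free */
};

static LIST_HEAD(tramp_free_list);
static DEFINE_MUTEX(tramp_free_mutex);

static void tramp_free_flush(void)
{
	struct tramp_free_work *w, *tmp;
	LIST_HEAD(todo);

	mutex_lock(&tramp_free_mutex);
	list_splice_init(&tramp_free_list, &todo);
	mutex_unlock(&tramp_free_mutex);

	/* One grace-period wait covers every queued removal. */
	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);

	list_for_each_entry_safe(w, tmp, &todo, list) {
		bpf_jit_free_exec(w->image);	/* free the trampoline text */
		kfree(w);
	}
}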

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow sync rcu_tasks_trace
  2020-09-10 18:51                     ` Paul E. McKenney
@ 2020-09-10 19:04                       ` Alexei Starovoitov
  2020-09-10 20:24                         ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-10 19:04 UTC (permalink / raw)
  To: paulmck; +Cc: Alexei Starovoitov, bpf, Daniel Borkmann, Kernel Team

On 9/10/20 11:51 AM, Paul E. McKenney wrote:
> On Thu, Sep 10, 2020 at 11:33:58AM -0700, Alexei Starovoitov wrote:
>> On 9/9/20 10:27 PM, Paul E. McKenney wrote:
>>> On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
>>>> On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
>>>>> On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
>>>>>> On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
>>>
>>> [ . . . ]
>>>
>>>>>>> My plan is to try the following:
>>>>>>>
>>>>>>> 1.	Parameterize the backoff sequence so that RCU Tasks Trace
>>>>>>> 	uses faster rechecking than does RCU Tasks.  Experiment as
>>>>>>> 	needed to arrive at a good backoff value.
>>>>>>>
>>>>>>> 2.	If the tasks-list scan turns out to be a tighter bottleneck
>>>>>>> 	than the backoff waits, look into parallelizing this scan.
>>>>>>> 	(This seems unlikely, but the fact remains that RCU Tasks
>>>>>>> 	Trace must do a bit more work per task than RCU Tasks.)
>>>>>>>
>>>>>>> 3.	If these two approaches still don't get the update-side
>>>>>>> 	latency where it needs to be, improvise.
>>>>>>>
>>>>>>> The exact path into mainline will of course depend on how far down this
>>>>>>> list I must go, but first to get a solution.
>>>>>>
>>>>>> I think there is also a case 4: nothing is inside an rcu_trace critical section.
>>>>>> I would expect a single IPI would confirm that.
>>>>>
>>>>> Unless the task moves, yes.  So a single IPI should suffice in the
>>>>> common case.
>>>>
>>>> And what I am doing now is checking code paths.
>>>
>>> And the following diff from a set of three patches gets my average
>>> RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
>>> almost a 50x improvement from earlier today.
>>>
>>> These are still quite rough and not yet suited for production use, but
>>> I will be testing.  If that goes well, I hope to send a more polished
>>> set of patches by end of day tomorrow, Pacific Time.  But if you get a
>>> chance to test them, I would value any feedback that you might have.
>>>
>>> These patches do not require hand-tuning, they instead adjust the
>>> behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
>>> adjusts according to CONFIG_PREEMPT_RT.  So you should get the desired
>>> latency reductions "out of the box", again, without tuning.
>>
>> Great. Confirming improvement :)
>>
>> time ./test_progs -t trampoline_count
>> #101 trampoline_count:OK
>> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>>
>> real	0m2.897s
>> user	0m0.128s
>> sys	0m1.527s
>>
>> This is without CONFIG_TASKS_TRACE_RCU_READ_MB, of course.
> 
> Good to hear, thank you!
> 
> Is the current latency good enough, or is more required?  I can tweak to
> get more.  There is never a free
> lunch, though, and in this case the downside of further tweaking would
> be greater CPU overhead.  Alternatively, I could just as easily tweak
> it to be slower, thereby reducing the CPU overhead.
> 
> If I don't hear otherwise, I will assume that the current settings
> work fine.

Now it looks like sync rcu_tasks_trace is not slower than rcu_tasks,
so it would only make sense to accelerate both at the same time.
I think for now it's good.

> Of course, if people start removing thousands of BPF programs at one go,
> I suspect that it will be necessary to provide a bulk-removal operation,
> similar to some of the bulk-configuration-change operations provided by
> networking.  The idea is to have a single RCU Tasks Trace grace period
> cover all of the thousands of BPF removal operations.

A bulk API won't really work for user space.
There is no good way to coordinate attaching different progs (or the
same prog) to many different places.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow sync rcu_tasks_trace
  2020-09-10 19:04                       ` Alexei Starovoitov
@ 2020-09-10 20:24                         ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-10 20:24 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Alexei Starovoitov, bpf, Daniel Borkmann, Kernel Team

On Thu, Sep 10, 2020 at 12:04:32PM -0700, Alexei Starovoitov wrote:
> On 9/10/20 11:51 AM, Paul E. McKenney wrote:
> > On Thu, Sep 10, 2020 at 11:33:58AM -0700, Alexei Starovoitov wrote:
> > > On 9/9/20 10:27 PM, Paul E. McKenney wrote:
> > > > On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
> > > > > > > On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > > My plan is to try the following:
> > > > > > > > 
> > > > > > > > 1.	Parameterize the backoff sequence so that RCU Tasks Trace
> > > > > > > > 	uses faster rechecking than does RCU Tasks.  Experiment as
> > > > > > > > 	needed to arrive at a good backoff value.
> > > > > > > > 
> > > > > > > > 2.	If the tasks-list scan turns out to be a tighter bottleneck
> > > > > > > > 	than the backoff waits, look into parallelizing this scan.
> > > > > > > > 	(This seems unlikely, but the fact remains that RCU Tasks
> > > > > > > > 	Trace must do a bit more work per task than RCU Tasks.)
> > > > > > > > 
> > > > > > > > 3.	If these two approaches still don't get the update-side
> > > > > > > > 	latency where it needs to be, improvise.
> > > > > > > > 
> > > > > > > > The exact path into mainline will of course depend on how far down this
> > > > > > > > list I must go, but first to get a solution.
> > > > > > > 
> > > > > > > I think there is also a case 4: nothing is inside an rcu_trace critical section.
> > > > > > > I would expect a single IPI would confirm that.
> > > > > > 
> > > > > > Unless the task moves, yes.  So a single IPI should suffice in the
> > > > > > common case.
> > > > > 
> > > > > And what I am doing now is checking code paths.
> > > > 
> > > > And the following diff from a set of three patches gets my average
> > > > RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
> > > > almost a 50x improvement from earlier today.
> > > > 
> > > > These are still quite rough and not yet suited for production use, but
> > > > I will be testing.  If that goes well, I hope to send a more polished
> > > > set of patches by end of day tomorrow, Pacific Time.  But if you get a
> > > > chance to test them, I would value any feedback that you might have.
> > > > 
> > > > These patches do not require hand-tuning, they instead adjust the
> > > > behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
> > > > adjusts according to CONFIG_PREEMPT_RT.  So you should get the desired
> > > > latency reductions "out of the box", again, without tuning.
> > > 
> > > Great. Confirming improvement :)
> > > 
> > > time ./test_progs -t trampoline_count
> > > #101 trampoline_count:OK
> > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > 
> > > real	0m2.897s
> > > user	0m0.128s
> > > sys	0m1.527s
> > > 
> > > This is without CONFIG_TASKS_TRACE_RCU_READ_MB, of course.
> > 
> > Good to hear, thank you!
> > 
> > Is the current latency good enough, or is more required?  I can tweak to
> > get more.  There is never a free
> > lunch, though, and in this case the downside of further tweaking would
> > be greater CPU overhead.  Alternatively, I could just as easily tweak
> > it to be slower, thereby reducing the CPU overhead.
> > 
> > If I don't hear otherwise, I will assume that the current settings
> > work fine.
> 
> Now it looks like sync rcu_tasks_trace is not slower than rcu_tasks, so
> it would only make sense to accelerate both at the same time.
> I think for now it's good.

Music to my ears!

I have sent the official RFC patch series, CCing the people active on this
thread and also the BPF email list, as well as the usual RCU suspects.
Anyone else I should solicit testing/review from?

> > Of course, if people start removing thousands of BPF programs at one go,
> > I suspect that it will be necessary to provide a bulk-removal operation,
> > similar to some of the bulk-configuration-change operations provided by
> > networking.  The idea is to have a single RCU Tasks Trace grace period
> > cover all of the thousands of BPF removal operations.
> 
> A bulk API won't really work for user space.
> There is no good way to coordinate attaching different progs (or the same
> prog) to many different places.

Fair enough for now, especially unless and until it becomes a problem.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-10 20:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-09  2:34 slow sync rcu_tasks_trace Alexei Starovoitov
2020-09-09 11:38 ` Paul E. McKenney
2020-09-09 15:10   ` Jiri Olsa
2020-09-09 17:02     ` Paul E. McKenney
2020-09-09 17:12   ` Alexei Starovoitov
2020-09-09 17:35     ` Paul E. McKenney
2020-09-09 18:04       ` Alexei Starovoitov
2020-09-09 19:39         ` Paul E. McKenney
2020-09-09 19:48           ` Alexei Starovoitov
2020-09-09 21:04             ` Paul E. McKenney
2020-09-09 21:22               ` Paul E. McKenney
2020-09-10  5:27                 ` Paul E. McKenney
2020-09-10 18:33                   ` Alexei Starovoitov
2020-09-10 18:51                     ` Paul E. McKenney
2020-09-10 19:04                       ` Alexei Starovoitov
2020-09-10 20:24                         ` Paul E. McKenney
