* Observation on NOHZ_FULL
@ 2024-01-29 17:47 Joel Fernandes
  2024-01-29 20:41 ` Paul E. McKenney
  2024-01-29 20:48 ` Uladzislau Rezki
  0 siblings, 2 replies; 35+ messages in thread
From: Joel Fernandes @ 2024-01-29 17:47 UTC (permalink / raw)
  To: rcu, Cc: Frederic Weisbecker, Paul E. McKenney

Hi Guys,
Something caught my eye in [1], which a colleague pointed me to:
 - CONFIG_HZ=1000 : 14866.05 bogo ops/s
 - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s

The test in question is:
stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief

which is a CPU intensive test.

Any thoughts on what else could account for a 30% performance increase
versus non-nohz_full? (Confession: no idea whether the baseline is
nohz_idle or no nohz at all.) If it really is 30%, I may want to evaluate
nohz_full on some of our limited-CPU devices :)

Cheers,

 - Joel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2051342


* Re: Observation on NOHZ_FULL
  2024-01-29 17:47 Observation on NOHZ_FULL Joel Fernandes
@ 2024-01-29 20:41 ` Paul E. McKenney
  2024-01-29 22:16   ` Joel Fernandes
  2024-01-29 20:48 ` Uladzislau Rezki
  1 sibling, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2024-01-29 20:41 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu, Cc: Frederic Weisbecker

On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> Hi Guys,
> Something caught my eye in [1] which a colleague pointed me to
>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> 
> The test in concern is:
> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> 
> which is a CPU intensive test.
> 
> Any thoughts on what else can attribute a 30% performance increase
> versus non-nohz_full ? (Confession: No idea if the baseline is
> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> nohz_full on some of our limited-CPU devices :)

The usual questions.  ;-)

Is this repeatable?  Is it under the same conditions of temperature,
load, and so on?  Was it running on bare metal or on a guest OS?  If on a
guest OS, what was the load from other guest OSes on the same hypervisor
or on the hypervisor itself?

The bug report had "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
me wonder if someone enabled some heavy debugging option that is greatly
increasing the overhead of the scheduling-clock interrupt.

Now, if that was the case, I would expect the 250HZ number to have
three-quarters of the improvement of the nohz_full number compared
to the 1000HZ number:

17415.60-14866.05=2549.55
18505.52-14866.05=3639.47

2549.55/3639.47=0.70

OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
test have enabled?  Also, if you use tracing (or whatever) to measure
the typical duration of the scheduling-clock interrupt and related things
like softirq handlers, does it fit with these numbers?  Such a measurement
would look at how long it took to get back into userspace.

I will let you do the algebra to work out the expected duration that
would be consistent with this hypothesis.  ;-)

							Thanx, Paul

> Cheers,
> 
>  - Joel
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2051342


* Re: Observation on NOHZ_FULL
  2024-01-29 17:47 Observation on NOHZ_FULL Joel Fernandes
  2024-01-29 20:41 ` Paul E. McKenney
@ 2024-01-29 20:48 ` Uladzislau Rezki
  2024-01-29 22:20   ` Joel Fernandes
  1 sibling, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-01-29 20:48 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu, Cc: Frederic Weisbecker, Paul E. McKenney

On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> Hi Guys,
> Something caught my eye in [1] which a colleague pointed me to
>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> 
> The test in concern is:
> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> 
> which is a CPU intensive test.
> 
> Any thoughts on what else can attribute a 30% performance increase
> versus non-nohz_full ? (Confession: No idea if the baseline is
> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> nohz_full on some of our limited-CPU devices :)
> 
Timer tick overhead and its scheduler callback? If I am not missing something,
NO_HZ_FULL will disable the periodic tick when there is only one runnable task
on a CPU, so the running task benefits from not being interrupted and thus
gets more CPU time.

--
Uladzislau Rezki


* Re: Observation on NOHZ_FULL
  2024-01-29 20:41 ` Paul E. McKenney
@ 2024-01-29 22:16   ` Joel Fernandes
  2024-01-30  6:58     ` Andrea Righi
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-01-29 22:16 UTC (permalink / raw)
  To: paulmck, Joel Fernandes, andrea.righi; +Cc: rcu, Cc: Frederic Weisbecker

Hi Paul,

On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
>> Hi Guys,
>> Something caught my eye in [1] which a colleague pointed me to
>>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
>>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
>>
>> The test in concern is:
>> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
>>
>> which is a CPU intensive test.
>>
>> Any thoughts on what else can attribute a 30% performance increase
>> versus non-nohz_full ? (Confession: No idea if the baseline is
>> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
>> nohz_full on some of our limited-CPU devices :)
> 
> The usual questions.  ;-)
> 
> Is this repeatable?  Is it under the same conditions of temperature,
> load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> guest OS, what was the load from other guest OSes on the same hypervisor
> or on the hypervisor itself?
> 
> The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> me wonder if someone enabled some heavy debug that is greatly
> increasing the overhead of the scheduling-clock interrupt.
> 
> Now, if that was the case, I would expect the 250HZ number to have
> three-quarters of the improvement of the nohz_full number compared
> to the 1000HZ number:
> 17415.60-14866.05=2549.55
> 18505.52-14866.05=3639.47
> 
> 2549.55/3639.47=0.70

I wonder if the difference here could possibly also be because of the CPU idle
governor. It may behave differently at different clock rates, so perhaps it has
different overhead.

I have added trying nohz_full to my evaluation list as well. FWIW, when we
moved from 250HZ to 1000HZ, it actually improved power because the cpuidle
governor could put the CPUs into deeper idle states more quickly!

> OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> test have enabled?  Also, if you use tracing (or whatever) to measure
> the typical duration of the scheduling-clock interrupt and related things
> like softirq handlers, does it fit with these numbers?  Such a measurment
> would look at how long it took to get back into userspace.
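
Doing a quick back-of-the-envelope version of that algebra (assuming the whole
1000HZ-vs-nohz_full gap is tick-plus-softirq time, and that the bogo ops rate
scales linearly with the CPU time left over for userspace):

14866.05/18505.52 ~= 0.80, i.e. ~20% of CPU time lost to the tick path
0.20 / 1000 ticks per second ~= 200 microseconds per tick

which does look rather fat for a production kernel, so the heavy-debugging
theory seems plausible.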

Thanks for your detailed questions. I will add Andrea Righi to this thread
since he is the author of the bug report. Andrea, do you mind clarifying a few
of the things mentioned above? Also, nice to see you are using CONFIG_RCU_LAZY
for Ubuntu :)

thanks,

 - Joel


> 
>> Cheers,
>>
>>  - Joel
>>
>> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2051342
> 


* Re: Observation on NOHZ_FULL
  2024-01-29 20:48 ` Uladzislau Rezki
@ 2024-01-29 22:20   ` Joel Fernandes
  2024-01-29 22:43     ` Frederic Weisbecker
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-01-29 22:20 UTC (permalink / raw)
  To: Uladzislau Rezki, Joel Fernandes
  Cc: rcu, Cc: Frederic Weisbecker, Paul E. McKenney



On 1/29/2024 3:48 PM, Uladzislau Rezki wrote:
> On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
>> Hi Guys,
>> Something caught my eye in [1] which a colleague pointed me to
>>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
>>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
>>
>> The test in concern is:
>> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
>>
>> which is a CPU intensive test.
>>
>> Any thoughts on what else can attribute a 30% performance increase
>> versus non-nohz_full ? (Confession: No idea if the baseline is
>> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
>> nohz_full on some of our limited-CPU devices :)
>>
> Timer overhead and its scheduler-callback?

Yes, quite possible.

> If i do not miss something
> the NO_HZ_FULL will disable the timer if there is only one task on CPU
> so that running task benefits from not being interrupted thus gets more
> CPU time.

Yes, that's right. I believe it is well known that HPC-type workloads benefit
from nohz_full; however, it has led me to want to try it out for constrained
systems as well, where CPU cycles are at a premium, especially if the
improvement is anything like what the report suggests (give or take the
concerns/questions Paul raised).
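
For reference, the minimal setup I have in mind for that evaluation is roughly
the following (just a sketch; the CPU list is an example, keeping CPU 0 as the
housekeeping CPU on an 8-CPU device):

# kernel command line
nohz_full=1-7 rcu_nocbs=1-7

# pin the workload onto the isolated CPUs (7 workers for the 7 isolated CPUs)
taskset -c 1-7 stress-ng --matrix 7 --timeout 5m --metrics-brief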

Thanks,

 - Joel



* Re: Observation on NOHZ_FULL
  2024-01-29 22:20   ` Joel Fernandes
@ 2024-01-29 22:43     ` Frederic Weisbecker
  2024-01-29 22:53       ` Joel Fernandes
  2024-01-30  0:40       ` Paul E. McKenney
  0 siblings, 2 replies; 35+ messages in thread
From: Frederic Weisbecker @ 2024-01-29 22:43 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Joel Fernandes, rcu, Cc: Frederic Weisbecker,
	Paul E. McKenney

On Mon, Jan 29, 2024 at 05:20:23PM -0500, Joel Fernandes wrote:
> > If i do not miss something
> > the NO_HZ_FULL will disable the timer if there is only one task on CPU
> > so that running task benefits from not being interrupted thus gets more
> > CPU time.
> 
> Yes, that's right. I believe it is well known that HPC-type of workloads benefit
> from FULL, however it has led to want to try it out for constrained system as
> well where CPU cycles are a premium, especially if the improvement is like what
> the report suggests (give or take the concerns/questions Paul raised).

I won't be able to suggest anything related to that bogo ops calculation, but
I must add something about HPC.

I have long believed that HPC would benefit from nohz_full, but I have actually
never heard of any user doing that. The currently known users of nohz_full are
workloads that don't use the kernel once the application is launched and
implement their own stack (networking, for example), talking directly to the
device from userspace, using DPDK for instance. These use cases have extremely
low latency expectations (a single interrupt can make you lose).

HPC looks different to me, making use of syscalls and the kernel for I/O.
Nohz_full may remove timer IRQs, but it adds overhead on kernel entry, probably
making it unsuitable there. But I might be wrong.

Thanks.

> 
> Thanks,
> 
>  - Joel
> 


* Re: Observation on NOHZ_FULL
  2024-01-29 22:43     ` Frederic Weisbecker
@ 2024-01-29 22:53       ` Joel Fernandes
  2024-01-29 23:11         ` Frederic Weisbecker
  2024-01-30  0:40       ` Paul E. McKenney
  1 sibling, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-01-29 22:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Uladzislau Rezki, Joel Fernandes, rcu, Cc: Frederic Weisbecker,
	Paul E. McKenney



On 1/29/2024 5:43 PM, Frederic Weisbecker wrote:
> Le Mon, Jan 29, 2024 at 05:20:23PM -0500, Joel Fernandes a écrit :
>>> If i do not miss something
>>> the NO_HZ_FULL will disable the timer if there is only one task on CPU
>>> so that running task benefits from not being interrupted thus gets more
>>> CPU time.
>>
>> Yes, that's right. I believe it is well known that HPC-type of workloads benefit
>> from FULL, however it has led to want to try it out for constrained system as
>> well where CPU cycles are a premium, especially if the improvement is like what
>> the report suggests (give or take the concerns/questions Paul raised).
> 
> I'll be unable to suggest anything related to that Bogomips calculation but
> I must add something about HPC.
> 
> I have long believed that HPC would benefit from nohz_full but I actually never
> heard of any user of that. The current known users of nohz_full are workloads
> that don't use the kernel once the application is launched and do their own
> stack of, for example, networking, talking directly to the device from
> userspace. Using DPDK for example. These usecases are for extremely low latency
> expectations (a single interrupt can make you lose).
> 
> HPC looks to me different, making use of syscalls and kernel for I/O. Nohz_full
> may remove timer IRQs but it adds performance loss on kernel entry, making it
> probably unsuitable there. But I might be wrong.
> 

Thanks for the insights!

The kernel entry/exit overhead bit is an interesting point, as is tracking
state for RCU observation purposes. I wonder how much that overhead is in
practical cases and am curious to measure it (especially the tradeoffs between
that and the tick). Added a note to my list ;)
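
As a rough way to eyeball the entry/exit cost itself (a sketch, assuming the
installed perf is new enough to have the syscall benchmark), comparing a
nohz_full CPU against a housekeeping CPU should show the delta directly:

taskset -c 1 perf bench syscall basic   # nohz_full CPU
taskset -c 0 perf bench syscall basic   # housekeeping CPU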

thanks,

- Joel



>>
>> Thanks,
>>
>>  - Joel
>>


* Re: Observation on NOHZ_FULL
  2024-01-29 22:53       ` Joel Fernandes
@ 2024-01-29 23:11         ` Frederic Weisbecker
  2024-01-29 23:36           ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Frederic Weisbecker @ 2024-01-29 23:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Joel Fernandes, rcu, Cc: Frederic Weisbecker,
	Paul E. McKenney

On Mon, Jan 29, 2024 at 05:53:50PM -0500, Joel Fernandes wrote:
> 
> 
> On 1/29/2024 5:43 PM, Frederic Weisbecker wrote:
> > Le Mon, Jan 29, 2024 at 05:20:23PM -0500, Joel Fernandes a écrit :
> >>> If i do not miss something
> >>> the NO_HZ_FULL will disable the timer if there is only one task on CPU
> >>> so that running task benefits from not being interrupted thus gets more
> >>> CPU time.
> >>
> >> Yes, that's right. I believe it is well known that HPC-type of workloads benefit
> >> from FULL, however it has led to want to try it out for constrained system as
> >> well where CPU cycles are a premium, especially if the improvement is like what
> >> the report suggests (give or take the concerns/questions Paul raised).
> > 
> > I'll be unable to suggest anything related to that Bogomips calculation but
> > I must add something about HPC.
> > 
> > I have long believed that HPC would benefit from nohz_full but I actually never
> > heard of any user of that. The current known users of nohz_full are workloads
> > that don't use the kernel once the application is launched and do their own
> > stack of, for example, networking, talking directly to the device from
> > userspace. Using DPDK for example. These usecases are for extremely low latency
> > expectations (a single interrupt can make you lose).
> > 
> > HPC looks to me different, making use of syscalls and kernel for I/O. Nohz_full
> > may remove timer IRQs but it adds performance loss on kernel entry, making it
> > probably unsuitable there. But I might be wrong.
> > 
> 
> Thanks for the insights!
> 
> The kernel entry/exit overhead bit is an interesting point and also tracking
> state for RCU observation purposes. I wonder how much is the overhead in
> practical cases and curious to measure (especially the tradeoffs between that
> and the tick). Added note to my list ;)

And here is a summary of the added overhead with nohz_full:

* Use of RCU_NOCB
* Kernel entry/exit overhead (context_tracking/rcu/vtime)
* Unbound work reaffined to the housekeeping CPUs (so less distributed across
  CPUs): workqueues, timers, kthreads...
* Remote tick at 1Hz, run from CPU 0 (one day I would love to remove that, but
  it requires too much scheduler knowledge)

The difference should be significantly measurable, even as compared to RCU_NOCB
alone.
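
As a rough A/B sketch to separate the RCU_NOCB part out (CPU list just an
example), the same pinned workload could be compared across two boots:

rcu_nocbs=1-7    # callback offloading only
nohz_full=1-7    # full nohz_full (implies rcu_nocbs on those CPUs)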

Thanks.

> 
> thanks,
> 
> - Joel
> 
> 
> 
> >>
> >> Thanks,
> >>
> >>  - Joel
> >>


* Re: Observation on NOHZ_FULL
  2024-01-29 23:11         ` Frederic Weisbecker
@ 2024-01-29 23:36           ` Joel Fernandes
  0 siblings, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2024-01-29 23:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Uladzislau Rezki, rcu, Cc: Frederic Weisbecker,
	Paul E. McKenney

On Mon, Jan 29, 2024 at 6:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Mon, Jan 29, 2024 at 05:53:50PM -0500, Joel Fernandes a écrit :
> >
> >
> > On 1/29/2024 5:43 PM, Frederic Weisbecker wrote:
> > > Le Mon, Jan 29, 2024 at 05:20:23PM -0500, Joel Fernandes a écrit :
> > >>> If i do not miss something
> > >>> the NO_HZ_FULL will disable the timer if there is only one task on CPU
> > >>> so that running task benefits from not being interrupted thus gets more
> > >>> CPU time.
> > >>
> > >> Yes, that's right. I believe it is well known that HPC-type of workloads benefit
> > >> from FULL, however it has led to want to try it out for constrained system as
> > >> well where CPU cycles are a premium, especially if the improvement is like what
> > >> the report suggests (give or take the concerns/questions Paul raised).
> > >
> > > I'll be unable to suggest anything related to that Bogomips calculation but
> > > I must add something about HPC.
> > >
> > > I have long believed that HPC would benefit from nohz_full but I actually never
> > > heard of any user of that. The current known users of nohz_full are workloads
> > > that don't use the kernel once the application is launched and do their own
> > > stack of, for example, networking, talking directly to the device from
> > > userspace. Using DPDK for example. These usecases are for extremely low latency
> > > expectations (a single interrupt can make you lose).
> > >
> > > HPC looks to me different, making use of syscalls and kernel for I/O. Nohz_full
> > > may remove timer IRQs but it adds performance loss on kernel entry, making it
> > > probably unsuitable there. But I might be wrong.
> > >
> >
> > Thanks for the insights!
> >
> > The kernel entry/exit overhead bit is an interesting point and also tracking
> > state for RCU observation purposes. I wonder how much is the overhead in
> > practical cases and curious to measure (especially the tradeoffs between that
> > and the tick). Added note to my list ;)
>
> And here is a summary of the added overhead with nohz_full:
>
> * Use of RCU_NOCB
> * Kernel entry/exit overhead (context_tracking/rcu/vtime)
> * Unbound work reaffine (less distributed across CPUs): workqueues, timers, kthreads...
> * Remote tick every 1Hz on CPU 0 (one day I would love to remove that but that
>   requires too much scheduler knowledge)
>
> The difference should be significantly measureable, even as compared to RCU_NOCB
> alone.

Noted, thank you Frederic!
 - Joel

>
> Thanks.
>
> >
> > thanks,
> >
> > - Joel
> >
> >
> >
> > >>
> > >> Thanks,
> > >>
> > >>  - Joel
> > >>


* Re: Observation on NOHZ_FULL
  2024-01-29 22:43     ` Frederic Weisbecker
  2024-01-29 22:53       ` Joel Fernandes
@ 2024-01-30  0:40       ` Paul E. McKenney
  1 sibling, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2024-01-30  0:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Mon, Jan 29, 2024 at 11:43:08PM +0100, Frederic Weisbecker wrote:
> Le Mon, Jan 29, 2024 at 05:20:23PM -0500, Joel Fernandes a écrit :
> > > If i do not miss something
> > > the NO_HZ_FULL will disable the timer if there is only one task on CPU
> > > so that running task benefits from not being interrupted thus gets more
> > > CPU time.
> > 
> > Yes, that's right. I believe it is well known that HPC-type of workloads benefit
> > from FULL, however it has led to want to try it out for constrained system as
> > well where CPU cycles are a premium, especially if the improvement is like what
> > the report suggests (give or take the concerns/questions Paul raised).
> 
> I'll be unable to suggest anything related to that Bogomips calculation but
> I must add something about HPC.
> 
> I have long believed that HPC would benefit from nohz_full but I actually never
> heard of any user of that. The current known users of nohz_full are workloads
> that don't use the kernel once the application is launched and do their own
> stack of, for example, networking, talking directly to the device from
> userspace. Using DPDK for example. These usecases are for extremely low latency
> expectations (a single interrupt can make you lose).
> 
> HPC looks to me different, making use of syscalls and kernel for I/O. Nohz_full
> may remove timer IRQs but it adds performance loss on kernel entry, making it
> probably unsuitable there. But I might be wrong.

Some IBM clients made use of nohz_full for HPC workloads, and claimed
to see significant benefits.  (Single-digit percents, if I remember
correctly, but still...)

							Thanx, Paul


* Re: Observation on NOHZ_FULL
  2024-01-29 22:16   ` Joel Fernandes
@ 2024-01-30  6:58     ` Andrea Righi
  2024-01-30 10:17       ` Paul E. McKenney
  0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-01-30  6:58 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: paulmck, Joel Fernandes, rcu, Cc: Frederic Weisbecker

Hi Joel and Paul,

comments below.

On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> Hi Paul,
> 
> On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> >> Hi Guys,
> >> Something caught my eye in [1] which a colleague pointed me to
> >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> >>
> >> The test in concern is:
> >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> >>
> >> which is a CPU intensive test.
> >>
> >> Any thoughts on what else can attribute a 30% performance increase
> >> versus non-nohz_full ? (Confession: No idea if the baseline is
> >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> >> nohz_full on some of our limited-CPU devices :)
> > 
> > The usual questions.  ;-)
> > 
> > Is this repeatable?  Is it under the same conditions of temperature,
> > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > guest OS, what was the load from other guest OSes on the same hypervisor
> > or on the hypervisor itself?

That was the result of a quick test, so I expect it has some fuzziness
in there.

It's an average of 10 runs, on bare metal (my laptop, an 8-core 11th Gen
Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the test with
the default Ubuntu settings, which means having "power mode: balanced"
enabled. I don't know exactly what that does (I'll check how it works in
detail); I think it's using intel p-states IIRC.

Also, the system was not completely isolated (my email client was running),
but the system was mostly idle in general.

I was already planning to repeat the tests in a more "isolated"
environment and add details to the bug tracker.

> > 
> > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > me wonder if someone enabled some heavy debug that is greatly
> > increasing the overhead of the scheduling-clock interrupt.
> > 
> > Now, if that was the case, I would expect the 250HZ number to have
> > three-quarters of the improvement of the nohz_full number compared
> > to the 1000HZ number:
> >> 17415.60-14866.05=2549.55
> > 18505.52-14866.05=3639.47
> > 
> > 2549.55/3639.47=0.70
> 
> I wonder if the difference here could possibly also be because of CPU idle
> governor. It may behave differently at differently clock rates so perhaps has
> different overhead.

Could be, but, again, the balanced power mode could play a major role
here.

> 
> I have added trying nohz full to my list as well to evaluate. FWIW, when we
> moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> governor could put the CPUs in deeper idle states more quickly!

Interesting, another benefit to add to my proposal. :)

> 
> > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > test have enabled?  Also, if you use tracing (or whatever) to measure
> > the typical duration of the scheduling-clock interrupt and related things
> > like softirq handlers, does it fit with these numbers?  Such a measurment
> > would look at how long it took to get back into userspace.
> 
> Thanks for your detailed questions. I will add Andrea Righi to this list thread
> since he is the author of the bug report. Andrea do you mind clarifying a few
> things mentioned above? Also nice to see you are using CONFIG_RCU_LAZY for Ubuntu :)

Thanks for including me. Sorry that I didn't provide many details of my
tests.

And yes, I really want to see CONFIG_RCU_LAZY enabled in the stock Ubuntu
kernel, so that my laptop's battery lasts longer when I go to conferences. :)

-Andrea

> 
> thanks,
> 
>  - Joel
> 
> 
> > 
> >> Cheers,
> >>
> >>  - Joel
> >>
> >> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2051342
> > 


* Re: Observation on NOHZ_FULL
  2024-01-30  6:58     ` Andrea Righi
@ 2024-01-30 10:17       ` Paul E. McKenney
  2024-01-30 11:06         ` Uladzislau Rezki
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2024-01-30 10:17 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Joel Fernandes, Joel Fernandes, rcu, Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> Hi Joel and Paul,
> 
> comments below.
> 
> On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > >> Hi Guys,
> > >> Something caught my eye in [1] which a colleague pointed me to
> > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > >>
> > >> The test in concern is:
> > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > >>
> > >> which is a CPU intensive test.
> > >>
> > >> Any thoughts on what else can attribute a 30% performance increase
> > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > >> nohz_full on some of our limited-CPU devices :)
> > > 
> > > The usual questions.  ;-)
> > > 
> > > Is this repeatable?  Is it under the same conditions of temperature,
> > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > or on the hypervisor itself?
> 
> That was the result of a quick test, so I expect it has some fuzzyness
> in there.
> 
> It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> test with the default Ubuntu settings, that means having "power mode:
> balanced" enabled. I don't know exactly what it's doing (I'll check how
> it works in details), I think it's using intel p-states IIRC.
> 
> Also, the system was not completely isolated (my email client was
> running) but the system was mostly idle in general.
> 
> I was already planning to repeat the tests in a more "isolated"
> environment and add details to the bug tracker.
> 
> > > 
> > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > me wonder if someone enabled some heavy debug that is greatly
> > > increasing the overhead of the scheduling-clock interrupt.
> > > 
> > > Now, if that was the case, I would expect the 250HZ number to have
> > > three-quarters of the improvement of the nohz_full number compared
> > > to the 1000HZ number:
> > >> 17415.60-14866.05=2549.55
> > > 18505.52-14866.05=3639.47
> > > 
> > > 2549.55/3639.47=0.70
> > 
> > I wonder if the difference here could possibly also be because of CPU idle
> > governor. It may behave differently at differently clock rates so perhaps has
> > different overhead.
> 
> Could be, but, again, the balanced power mode could play a major role
> here.
> 
> > 
> > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > governor could put the CPUs in deeper idle states more quickly!
> 
> Interesting, another benefit to add to my proposal. :)
> 
> > 
> > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > the typical duration of the scheduling-clock interrupt and related things
> > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > would look at how long it took to get back into userspace.

Just to emphasize...

The above calculations show that your measurements are close to what you
would see if the scheduling-clock interrupts were taking much longer than
they should.  Here "scheduling-clock interrupts" includes the softirq
processing (timers, networking, RCU, ...) that piggybacks on each such
interrupt.

Although softirq makes the most sense given the amount of time that must
be consumed, for the most part softirq work is conserved, which suggests
that you should also look at the rest of the system to check whether the
reported speedup is instead due to this work simply being moved to some
other CPU.

But maybe the fat softirqs are due to some debugging option that Ubuntu
enabled, in which case checking up on the actual duration (perhaps using
some form of tracing) would provide useful information.  ;-)

							Thanx, Paul

> > Thanks for your detailed questions. I will add Andrea Righi to this list thread
> > since he is the author of the bug report. Andrea do you mind clarifying a few
> > things mentioned above? Also nice to see you are using CONFIG_RCU_LAZY for Ubuntu :)
> 
> Thanks for including me. Sorry that I didn't provide much details of my
> tests.
> 
> And yes, I really want to see CONFIG_RCU_LAZY enabled in the stock
> Ubuntu kernel, so the battery of my laptop lasts longer when I go to
> conferences. :)
> 
> -Andrea
> 
> > 
> > thanks,
> > 
> >  - Joel
> > 
> > 
> > > 
> > >> Cheers,
> > >>
> > >>  - Joel
> > >>
> > >> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2051342
> > > 


* Re: Observation on NOHZ_FULL
  2024-01-30 10:17       ` Paul E. McKenney
@ 2024-01-30 11:06         ` Uladzislau Rezki
  2024-01-30 11:27           ` Andrea Righi
  0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-01-30 11:06 UTC (permalink / raw)
  To: Paul E. McKenney, Andrea Righi
  Cc: Andrea Righi, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > Hi Joel and Paul,
> > 
> > comments below.
> > 
> > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > >> Hi Guys,
> > > >> Something caught my eye in [1] which a colleague pointed me to
> > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > >>
> > > >> The test in concern is:
> > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > >>
> > > >> which is a CPU intensive test.
> > > >>
> > > >> Any thoughts on what else can attribute a 30% performance increase
> > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > >> nohz_full on some of our limited-CPU devices :)
> > > > 
> > > > The usual questions.  ;-)
> > > > 
> > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > or on the hypervisor itself?
> > 
> > That was the result of a quick test, so I expect it has some fuzzyness
> > in there.
> > 
> > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > test with the default Ubuntu settings, that means having "power mode:
> > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > it works in details), I think it's using intel p-states IIRC.
> > 
> > Also, the system was not completely isolated (my email client was
> > running) but the system was mostly idle in general.
> > 
> > I was already planning to repeat the tests in a more "isolated"
> > environment and add details to the bug tracker.
> > 
> > > > 
> > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > me wonder if someone enabled some heavy debug that is greatly
> > > > increasing the overhead of the scheduling-clock interrupt.
> > > > 
> > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > three-quarters of the improvement of the nohz_full number compared
> > > > to the 1000HZ number:
> > > >> 17415.60-14866.05=2549.55
> > > > 18505.52-14866.05=3639.47
> > > > 
> > > > 2549.55/3639.47=0.70
> > > 
> > > I wonder if the difference here could possibly also be because of CPU idle
> > > governor. It may behave differently at differently clock rates so perhaps has
> > > different overhead.
> > 
> > Could be, but, again, the balanced power mode could play a major role
> > here.
> > 
> > > 
> > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > governor could put the CPUs in deeper idle states more quickly!
> > 
> > Interesting, another benefit to add to my proposal. :)
> > 
> > > 
> > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > the typical duration of the scheduling-clock interrupt and related things
> > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > would look at how long it took to get back into userspace.
> 
> Just to emphasize...
> 
> The above calculations show that your measurements are close to what you
> would expect if scheduling-clock interrupts took longer than one would
> expect.  Here "scheduling-clock interrupts" includes softirq processing
> (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> 
> Although softirq makes the most sense given the amount of time that must
> be consumed, for the most part softirq work is conserved.  which suggests
> that you should also at the rest of the system to check whether the
> reported speedup is instead due to this work simply being moved to some
> other CPU.
> 
> But maybe the fat softirqs are due to some debugging option that Ubuntu
> enabled.  In which case checking up on the actual duration (perhaps
> using some form of tracing) would provide useful information.  ;-)
> 
As a first step I would have a look at the perf figures to see what is going
on during a test run. For that purpose the "perf" tool can be used. As a
basic step it can be run in "top" mode:

perf top -a -g -e cycles:k

Sorry for the noise :)

--
Uladzislau Rezki


* Re: Observation on NOHZ_FULL
  2024-01-30 11:06         ` Uladzislau Rezki
@ 2024-01-30 11:27           ` Andrea Righi
  2024-01-30 11:41             ` Uladzislau Rezki
  2024-01-30 13:39             ` Paul E. McKenney
  0 siblings, 2 replies; 35+ messages in thread
From: Andrea Righi @ 2024-01-30 11:27 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > Hi Joel and Paul,
> > > 
> > > comments below.
> > > 
> > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > 
> > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > >> Hi Guys,
> > > > >> Something caught my eye in [1] which a colleague pointed me to
> > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > >>
> > > > >> The test in concern is:
> > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > >>
> > > > >> which is a CPU intensive test.
> > > > >>
> > > > >> Any thoughts on what else can attribute a 30% performance increase
> > > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > 
> > > > > The usual questions.  ;-)
> > > > > 
> > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > or on the hypervisor itself?
> > > 
> > > That was the result of a quick test, so I expect it has some fuzzyness
> > > in there.
> > > 
> > > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > test with the default Ubuntu settings, that means having "power mode:
> > > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > > it works in details), I think it's using intel p-states IIRC.
> > > 
> > > Also, the system was not completely isolated (my email client was
> > > running) but the system was mostly idle in general.
> > > 
> > > I was already planning to repeat the tests in a more "isolated"
> > > environment and add details to the bug tracker.
> > > 
> > > > > 
> > > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > 
> > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > to the 1000HZ number:
> > > > >> 17415.60-14866.05=2549.55
> > > > > 18505.52-14866.05=3639.47
> > > > > 
> > > > > 2549.55/3639.47=0.70
> > > > 
> > > > I wonder if the difference here could possibly also be because of CPU idle
> > > > governor. It may behave differently at differently clock rates so perhaps has
> > > > different overhead.
> > > 
> > > Could be, but, again, the balanced power mode could play a major role
> > > here.
> > > 
> > > > 
> > > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > governor could put the CPUs in deeper idle states more quickly!
> > > 
> > > Interesting, another benefit to add to my proposal. :)
> > > 
> > > > 
> > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > > would look at how long it took to get back into userspace.
> > 
> > Just to emphasize...
> > 
> > The above calculations show that your measurements are close to what you
> > would expect if scheduling-clock interrupts took longer than one would
> > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > 
> > Although softirq makes the most sense given the amount of time that must
> > be consumed, for the most part softirq work is conserved.  which suggests
> > that you should also at the rest of the system to check whether the
> > reported speedup is instead due to this work simply being moved to some
> > other CPU.
> > 
> > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > enabled.  In which case checking up on the actual duration (perhaps
> > using some form of tracing) would provide useful information.  ;-)
> > 
> As a first step i would have a look at perf figures what is going on
> during a test run. For such purpose the "perf" tool can be used. As a
> basic step it can be run in a "top" mode:
> 
> perf top -a -g -e cycles:k 
> 
> Sorry for the noise :)

Yep, I'm planning to do better tests and collect more info (perf,
bpftrace). Also making sure that we don't have some crazy debugging
config enabled in the Ubuntu kernel, as correctly pointed out by Paul. But
first of all I need to repeat the tests in a more isolated environment,
just to make sure we're looking at reasonable numbers here.
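
For the tracing side, I was thinking of something along these lines to get a
rough histogram of the tick handler duration (just a sketch, untested; it
assumes an x86 kernel exposing the irq_vectors:local_timer_* tracepoints):

bpftrace -e 'tracepoint:irq_vectors:local_timer_entry { @start[cpu] = nsecs; }
             tracepoint:irq_vectors:local_timer_exit /@start[cpu]/ {
                     @tick_ns = hist(nsecs - @start[cpu]); delete(@start[cpu]); }'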

Thanks,
-Andrea


* Re: Observation on NOHZ_FULL
  2024-01-30 11:27           ` Andrea Righi
@ 2024-01-30 11:41             ` Uladzislau Rezki
  2024-01-30 13:39             ` Paul E. McKenney
  1 sibling, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2024-01-30 11:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Uladzislau Rezki, Paul E. McKenney, Joel Fernandes,
	Joel Fernandes, rcu, Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 12:27:43PM +0100, Andrea Righi wrote:
> On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > > Hi Joel and Paul,
> > > > 
> > > > comments below.
> > > > 
> > > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > > >> Hi Guys,
> > > > > >> Something caught my eye in [1] which a colleague pointed me to
> > > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > > >>
> > > > > >> The test in concern is:
> > > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > > >>
> > > > > >> which is a CPU intensive test.
> > > > > >>
> > > > > >> Any thoughts on what else can attribute a 30% performance increase
> > > > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > > 
> > > > > > The usual questions.  ;-)
> > > > > > 
> > > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > > or on the hypervisor itself?
> > > > 
> > > > That was the result of a quick test, so I expect it has some fuzzyness
> > > > in there.
> > > > 
> > > > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > > test with the default Ubuntu settings, that means having "power mode:
> > > > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > > > it works in details), I think it's using intel p-states IIRC.
> > > > 
> > > > Also, the system was not completely isolated (my email client was
> > > > running) but the system was mostly idle in general.
> > > > 
> > > > I was already planning to repeat the tests in a more "isolated"
> > > > environment and add details to the bug tracker.
> > > > 
> > > > > > 
> > > > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > > 
> > > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > > to the 1000HZ number:
> > > > > >> 17415.60-14866.05=2549.55
> > > > > > 18505.52-14866.05=3639.47
> > > > > > 
> > > > > > 2549.55/3639.47=0.70
> > > > > 
> > > > > I wonder if the difference here could possibly also be because of CPU idle
> > > > > governor. It may behave differently at differently clock rates so perhaps has
> > > > > different overhead.
> > > > 
> > > > Could be, but, again, the balanced power mode could play a major role
> > > > here.
> > > > 
> > > > > 
> > > > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > > governor could put the CPUs in deeper idle states more quickly!
> > > > 
> > > > Interesting, another benefit to add to my proposal. :)
> > > > 
> > > > > 
> > > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > > > would look at how long it took to get back into userspace.
> > > 
> > > Just to emphasize...
> > > 
> > > The above calculations show that your measurements are close to what you
> > > would expect if scheduling-clock interrupts took longer than one would
> > > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > > 
> > > Although softirq makes the most sense given the amount of time that must
> > > be consumed, for the most part softirq work is conserved.  which suggests
> > > that you should also at the rest of the system to check whether the
> > > reported speedup is instead due to this work simply being moved to some
> > > other CPU.
> > > 
> > > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > > enabled.  In which case checking up on the actual duration (perhaps
> > > using some form of tracing) would provide useful information.  ;-)
> > > 
> > As a first step i would have a look at perf figures what is going on
> > during a test run. For such purpose the "perf" tool can be used. As a
> > basic step it can be run in a "top" mode:
> > 
> > perf top -a -g -e cycles:k 
> > 
> > Sorry for the noise :)
> 
> Yep, I'm planning to do better tests and collect more info (perf,
> bpftrace). Also making sure that we don't have some crazy debugging
> config enabled in the Ubuntu kernel, as correctly pointed by Paul. But
> first of all I need to repeat the tests in a more isolated environment,
> just to make sure we're looking at reasonable numbers here.
> 
That makes sense :)

--
Uladzislau Rezki


* Re: Observation on NOHZ_FULL
  2024-01-30 11:27           ` Andrea Righi
  2024-01-30 11:41             ` Uladzislau Rezki
@ 2024-01-30 13:39             ` Paul E. McKenney
  2024-02-06 17:51               ` Andrea Righi
  1 sibling, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2024-01-30 13:39 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Uladzislau Rezki, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 12:27:43PM +0100, Andrea Righi wrote:
> On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > > Hi Joel and Paul,
> > > > 
> > > > comments below.
> > > > 
> > > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > > >> Hi Guys,
> > > > > >> Something caught my eye in [1] which a colleague pointed me to
> > > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > > >>
> > > > > >> The test in concern is:
> > > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > > >>
> > > > > >> which is a CPU intensive test.
> > > > > >>
> > > > > >> Any thoughts on what else can attribute a 30% performance increase
> > > > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > > 
> > > > > > The usual questions.  ;-)
> > > > > > 
> > > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > > or on the hypervisor itself?
> > > > 
> > > > That was the result of a quick test, so I expect it has some fuzzyness
> > > > in there.
> > > > 
> > > > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > > test with the default Ubuntu settings, that means having "power mode:
> > > > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > > > it works in details), I think it's using intel p-states IIRC.
> > > > 
> > > > Also, the system was not completely isolated (my email client was
> > > > running) but the system was mostly idle in general.
> > > > 
> > > > I was already planning to repeat the tests in a more "isolated"
> > > > environment and add details to the bug tracker.
> > > > 
> > > > > > 
> > > > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > > 
> > > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > > to the 1000HZ number:
> > > > > >> 17415.60-14866.05=2549.55
> > > > > > 18505.52-14866.05=3639.47
> > > > > > 
> > > > > > 2549.55/3639.47=0.70
> > > > > 
> > > > > I wonder if the difference here could possibly also be because of CPU idle
> > > > > governor. It may behave differently at differently clock rates so perhaps has
> > > > > different overhead.
> > > > 
> > > > Could be, but, again, the balanced power mode could play a major role
> > > > here.
> > > > 
> > > > > 
> > > > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > > governor could put the CPUs in deeper idle states more quickly!
> > > > 
> > > > Interesting, another benefit to add to my proposal. :)
> > > > 
> > > > > 
> > > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > > > would look at how long it took to get back into userspace.
> > > 
> > > Just to emphasize...
> > > 
> > > The above calculations show that your measurements are close to what you
> > > would expect if scheduling-clock interrupts took longer than one would
> > > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > > 
> > > Although softirq makes the most sense given the amount of time that must
> > > be consumed, for the most part softirq work is conserved.  which suggests
> > > that you should also at the rest of the system to check whether the
> > > reported speedup is instead due to this work simply being moved to some
> > > other CPU.
> > > 
> > > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > > enabled.  In which case checking up on the actual duration (perhaps
> > > using some form of tracing) would provide useful information.  ;-)
> > > 
> > As a first step i would have a look at perf figures what is going on
> > during a test run. For such purpose the "perf" tool can be used. As a
> > basic step it can be run in a "top" mode:
> > 
> > perf top -a -g -e cycles:k 
> > 
> > Sorry for the noise :)
> 
> Yep, I'm planning to do better tests and collect more info (perf,
> bpftrace). Also making sure that we don't have some crazy debugging
> config enabled in the Ubuntu kernel, as correctly pointed by Paul. But
> first of all I need to repeat the tests in a more isolated environment,
> just to make sure we're looking at reasonable numbers here.

Very good, and looking forward to seeing what you come up with!

							Thanx, Paul


* Re: Observation on NOHZ_FULL
  2024-01-30 13:39             ` Paul E. McKenney
@ 2024-02-06 17:51               ` Andrea Righi
  2024-02-06 18:44                 ` Paul E. McKenney
  2024-02-07 15:48                 ` Joel Fernandes
  0 siblings, 2 replies; 35+ messages in thread
From: Andrea Righi @ 2024-02-06 17:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Jan 30, 2024 at 05:39:37AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 30, 2024 at 12:27:43PM +0100, Andrea Righi wrote:
> > On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > > > Hi Joel and Paul,
> > > > > 
> > > > > comments below.
> > > > > 
> > > > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > > > >> Hi Guys,
> > > > > > >> Something caught my eye in [1] which a colleague pointed me to
> > > > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > > > >>
> > > > > > >> The test in concern is:
> > > > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > > > >>
> > > > > > >> which is a CPU intensive test.
> > > > > > >>
> > > > > > >> Any thoughts on what else can attribute a 30% performance increase
> > > > > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > > > 
> > > > > > > The usual questions.  ;-)
> > > > > > > 
> > > > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > > > or on the hypervisor itself?
> > > > > 
> > > > > That was the result of a quick test, so I expect it has some fuzzyness
> > > > > in there.
> > > > > 
> > > > > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > > > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > > > test with the default Ubuntu settings, that means having "power mode:
> > > > > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > > > > it works in details), I think it's using intel p-states IIRC.
> > > > > 
> > > > > Also, the system was not completely isolated (my email client was
> > > > > running) but the system was mostly idle in general.
> > > > > 
> > > > > I was already planning to repeat the tests in a more "isolated"
> > > > > environment and add details to the bug tracker.
> > > > > 
> > > > > > > 
> > > > > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > > > 
> > > > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > > > to the 1000HZ number:
> > > > > > >> 17415.60-14866.05=2549.55
> > > > > > > 18505.52-14866.05=3639.47
> > > > > > > 
> > > > > > > 2549.55/3639.47=0.70
> > > > > > 
> > > > > > I wonder if the difference here could possibly also be because of CPU idle
> > > > > > governor. It may behave differently at differently clock rates so perhaps has
> > > > > > different overhead.
> > > > > 
> > > > > Could be, but, again, the balanced power mode could play a major role
> > > > > here.
> > > > > 
> > > > > > 
> > > > > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > > > governor could put the CPUs in deeper idle states more quickly!
> > > > > 
> > > > > Interesting, another benefit to add to my proposal. :)
> > > > > 
> > > > > > 
> > > > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > > > > would look at how long it took to get back into userspace.
> > > > 
> > > > Just to emphasize...
> > > > 
> > > > The above calculations show that your measurements are close to what you
> > > > would expect if scheduling-clock interrupts took longer than one would
> > > > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > > > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > > > 
> > > > Although softirq makes the most sense given the amount of time that must
> > > > be consumed, for the most part softirq work is conserved.  which suggests
> > > > that you should also at the rest of the system to check whether the
> > > > reported speedup is instead due to this work simply being moved to some
> > > > other CPU.
> > > > 
> > > > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > > > enabled.  In which case checking up on the actual duration (perhaps
> > > > using some form of tracing) would provide useful information.  ;-)
> > > > 
> > > As a first step i would have a look at perf figures what is going on
> > > during a test run. For such purpose the "perf" tool can be used. As a
> > > basic step it can be run in a "top" mode:
> > > 
> > > perf top -a -g -e cycles:k 
> > > 
> > > Sorry for the noise :)
> > 
> > Yep, I'm planning to do better tests and collect more info (perf,
> > bpftrace). Also making sure that we don't have some crazy debugging
> > config enabled in the Ubuntu kernel, as correctly pointed by Paul. But
> > first of all I need to repeat the tests in a more isolated environment,
> > just to make sure we're looking at reasonable numbers here.
> 
> Very good, and looking forward to seeing what you come up with!
> 
> 							Thanx, Paul

I repeated some tests in a more isolated environment and posted the
results here:
https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255

Highlights:

 - stress-ng --matrix seems too unpredictable to be used as a benchmark
   in this scenario (the bogo-ops/s are very susceptible to any kind of
   interference, so even if in the long runs NO_HZ_FULL still seems to
   provide some benefit on average, we also need to consider that there
   may be a significant error in the measurements; the standard
   deviation was pretty high)

 - fio doing short writes (in page cache) seems to perform about 2x
   better in terms of iops with nohz_full compared to the other cases,
   and about 2x slower with large IO writes (not sure why... need to
   investigate more)

 - with lazy RCU enabled, hrtimer_interrupt() takes about 2x longer to
   return compared to the other cases (is this expected?)

Let me know if you notice anything odd, if you need more information,
etc.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-06 17:51               ` Andrea Righi
@ 2024-02-06 18:44                 ` Paul E. McKenney
  2024-02-07 13:04                   ` Uladzislau Rezki
  2024-02-15  7:51                   ` Andrea Righi
  2024-02-07 15:48                 ` Joel Fernandes
  1 sibling, 2 replies; 35+ messages in thread
From: Paul E. McKenney @ 2024-02-06 18:44 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Uladzislau Rezki, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Feb 06, 2024 at 06:51:02PM +0100, Andrea Righi wrote:
> On Tue, Jan 30, 2024 at 05:39:37AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 30, 2024 at 12:27:43PM +0100, Andrea Righi wrote:
> > > On Tue, Jan 30, 2024 at 12:06:49PM +0100, Uladzislau Rezki wrote:
> > > > On Tue, Jan 30, 2024 at 02:17:22AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Jan 30, 2024 at 07:58:18AM +0100, Andrea Righi wrote:
> > > > > > Hi Joel and Paul,
> > > > > > 
> > > > > > comments below.
> > > > > > 
> > > > > > On Mon, Jan 29, 2024 at 05:16:38PM -0500, Joel Fernandes wrote:
> > > > > > > Hi Paul,
> > > > > > > 
> > > > > > > On 1/29/2024 3:41 PM, Paul E. McKenney wrote:
> > > > > > > > On Mon, Jan 29, 2024 at 05:47:39PM +0000, Joel Fernandes wrote:
> > > > > > > >> Hi Guys,
> > > > > > > >> Something caught my eye in [1] which a colleague pointed me to
> > > > > > > >>  - CONFIG_HZ=1000 : 14866.05 bogo ops/s
> > > > > > > >>  - CONFIG_HZ=1000+nohz_full : 18505.52 bogo ops/s
> > > > > > > >>
> > > > > > > >> The test in concern is:
> > > > > > > >> stress-ng --matrix $(getconf _NPROCESSORS_ONLN) --timeout 5m --metrics-brief
> > > > > > > >>
> > > > > > > >> which is a CPU intensive test.
> > > > > > > >>
> > > > > > > >> Any thoughts on what else can attribute a 30% performance increase
> > > > > > > >> versus non-nohz_full ? (Confession: No idea if the baseline is
> > > > > > > >> nohz_idle or no nohz at all). If it is 30%, I may want to evaluate
> > > > > > > >> nohz_full on some of our limited-CPU devices :)
> > > > > > > > 
> > > > > > > > The usual questions.  ;-)
> > > > > > > > 
> > > > > > > > Is this repeatable?  Is it under the same conditions of temperature,
> > > > > > > > load, and so on?  Was it running on bare metal or on a guest OS?  If on a
> > > > > > > > guest OS, what was the load from other guest OSes on the same hypervisor
> > > > > > > > or on the hypervisor itself?
> > > > > > 
> > > > > > That was the result of a quick test, so I expect it has some fuzzyness
> > > > > > in there.
> > > > > > 
> > > > > > It's an average of 10 runs, it was bare metal (my laptop, 8 cores 11th
> > > > > > Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz), *but* I wanted to run the
> > > > > > test with the default Ubuntu settings, that means having "power mode:
> > > > > > balanced" enabled. I don't know exactly what it's doing (I'll check how
> > > > > > it works in details), I think it's using intel p-states IIRC.
> > > > > > 
> > > > > > Also, the system was not completely isolated (my email client was
> > > > > > running) but the system was mostly idle in general.
> > > > > > 
> > > > > > I was already planning to repeat the tests in a more "isolated"
> > > > > > environment and add details to the bug tracker.
> > > > > > 
> > > > > > > > 
> > > > > > > > The bug report ad "CONFIG_HZ=250 : 17415.60 bogo ops/s", which makes
> > > > > > > > me wonder if someone enabled some heavy debug that is greatly
> > > > > > > > increasing the overhead of the scheduling-clock interrupt.
> > > > > > > > 
> > > > > > > > Now, if that was the case, I would expect the 250HZ number to have
> > > > > > > > three-quarters of the improvement of the nohz_full number compared
> > > > > > > > to the 1000HZ number:
> > > > > > > >> 17415.60-14866.05=2549.55
> > > > > > > > 18505.52-14866.05=3639.47
> > > > > > > > 
> > > > > > > > 2549.55/3639.47=0.70
> > > > > > > 
> > > > > > > I wonder if the difference here could possibly also be because of CPU idle
> > > > > > > governor. It may behave differently at differently clock rates so perhaps has
> > > > > > > different overhead.
> > > > > > 
> > > > > > Could be, but, again, the balanced power mode could play a major role
> > > > > > here.
> > > > > > 
> > > > > > > 
> > > > > > > I have added trying nohz full to my list as well to evaluate. FWIW, when we
> > > > > > > moved from 250HZ to 1000HZ, it actually improved power because the CPUidle
> > > > > > > governor could put the CPUs in deeper idle states more quickly!
> > > > > > 
> > > > > > Interesting, another benefit to add to my proposal. :)
> > > > > > 
> > > > > > > 
> > > > > > > > OK, 0.70 is not *that* far off of 0.75.  So what debugging does that
> > > > > > > > test have enabled?  Also, if you use tracing (or whatever) to measure
> > > > > > > > the typical duration of the scheduling-clock interrupt and related things
> > > > > > > > like softirq handlers, does it fit with these numbers?  Such a measurment
> > > > > > > > would look at how long it took to get back into userspace.
> > > > > 
> > > > > Just to emphasize...
> > > > > 
> > > > > The above calculations show that your measurements are close to what you
> > > > > would expect if scheduling-clock interrupts took longer than one would
> > > > > expect.  Here "scheduling-clock interrupts" includes softirq processing
> > > > > (timers, networking, RCU, ...)  that piggybacks on each such interrupt.
> > > > > 
> > > > > Although softirq makes the most sense given the amount of time that must
> > > > > be consumed, for the most part softirq work is conserved.  which suggests
> > > > > that you should also at the rest of the system to check whether the
> > > > > reported speedup is instead due to this work simply being moved to some
> > > > > other CPU.
> > > > > 
> > > > > But maybe the fat softirqs are due to some debugging option that Ubuntu
> > > > > enabled.  In which case checking up on the actual duration (perhaps
> > > > > using some form of tracing) would provide useful information.  ;-)
> > > > > 
> > > > As a first step i would have a look at perf figures what is going on
> > > > during a test run. For such purpose the "perf" tool can be used. As a
> > > > basic step it can be run in a "top" mode:
> > > > 
> > > > perf top -a -g -e cycles:k 
> > > > 
> > > > Sorry for the noise :)
> > > 
> > > Yep, I'm planning to do better tests and collect more info (perf,
> > > bpftrace). Also making sure that we don't have some crazy debugging
> > > config enabled in the Ubuntu kernel, as correctly pointed by Paul. But
> > > first of all I need to repeat the tests in a more isolated environment,
> > > just to make sure we're looking at reasonable numbers here.
> > 
> > Very good, and looking forward to seeing what you come up with!
> > 
> > 							Thanx, Paul
> 
> I repeated some tests in a more isolated environment and posted the
> results here:
> https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
> 
> Highlights:
> 
>  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
>    in this scenario (the bogo-ops/s are very susceptible to any kind of
>    interference, so even if in the long runs NO_HZ_FULL still seems to
>    provide some benefits looking at the average, we also need to
>    consider that there might be a significant error in the measurements,
>    standard deviation was pretty high)
> 
>  - fio doing short writes (in page cache) seems to perform like 2x
>    better in terms of iops with nohz_full, respect to the other cases
>    and it performs 2x slower with large IO writes (not sure why... need
>    to investigate more)
> 
>  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
>    return, respect to the other cases (is this expected?)

This last is surprising at first glance, but I could be missing
something.  Joel, Uladzislau, thoughts?

							Thanx, Paul

> Let me know if you notice anything odd, if you need more information,
> etc.
> 
> Thanks,
> -Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-06 18:44                 ` Paul E. McKenney
@ 2024-02-07 13:04                   ` Uladzislau Rezki
  2024-02-07 15:12                     ` Andrea Righi
  2024-02-15  7:51                   ` Andrea Righi
  1 sibling, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-07 13:04 UTC (permalink / raw)
  To: Paul E. McKenney, Andrea Righi
  Cc: Andrea Righi, Uladzislau Rezki, Joel Fernandes, Joel Fernandes,
	rcu, Cc: Frederic Weisbecker

> > 
> > I repeated some tests in a more isolated environment and posted the
> > results here:
> > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
> > 
> > Highlights:
> > 
> >  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
> >    in this scenario (the bogo-ops/s are very susceptible to any kind of
> >    interference, so even if in the long runs NO_HZ_FULL still seems to
> >    provide some benefits looking at the average, we also need to
> >    consider that there might be a significant error in the measurements,
> >    standard deviation was pretty high)
> > 
> >  - fio doing short writes (in page cache) seems to perform like 2x
> >    better in terms of iops with nohz_full, respect to the other cases
> >    and it performs 2x slower with large IO writes (not sure why... need
> >    to investigate more)
> > 
> >  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
> >    return, respect to the other cases (is this expected?)
> 
> This last is surprising at first glance, but I could be missing
> something.  Joel, Uladzislau, thoughts?
> 
Could you please share the steps for how you run the "fio" tests?

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 13:04                   ` Uladzislau Rezki
@ 2024-02-07 15:12                     ` Andrea Righi
  2024-02-07 15:49                       ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-02-07 15:12 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Wed, Feb 07, 2024 at 02:04:12PM +0100, Uladzislau Rezki wrote:
> > > 
> > > I repeated some tests in a more isolated environment and posted the
> > > results here:
> > > https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
> > > 
> > > Highlights:
> > > 
> > >  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
> > >    in this scenario (the bogo-ops/s are very susceptible to any kind of
> > >    interference, so even if in the long runs NO_HZ_FULL still seems to
> > >    provide some benefits looking at the average, we also need to
> > >    consider that there might be a significant error in the measurements,
> > >    standard deviation was pretty high)
> > > 
> > >  - fio doing short writes (in page cache) seems to perform like 2x
> > >    better in terms of iops with nohz_full, respect to the other cases
> > >    and it performs 2x slower with large IO writes (not sure why... need
> > >    to investigate more)
> > > 
> > >  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
> > >    return, respect to the other cases (is this expected?)
> > 
> > This last is surprising at first glance, but I could be missing
> > something.  Joel, Uladzislau, thoughts?
> > 
> Could you please share the steps how you run "fio" tests?

For short writes I was running something like this (on an 8-core system):

 $ fio --rw=write --bs=1M --size=32M --numjobs=8 --name=worker --time_based --runtime=300

Larger writes:

 $ fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=300

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-06 17:51               ` Andrea Righi
  2024-02-06 18:44                 ` Paul E. McKenney
@ 2024-02-07 15:48                 ` Joel Fernandes
  2024-02-07 16:31                   ` Andrea Righi
  1 sibling, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-02-07 15:48 UTC (permalink / raw)
  To: Andrea Righi, Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes, rcu, Cc: Frederic Weisbecker



On 2/6/2024 12:51 PM, Andrea Righi wrote:
>  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
>    in this scenario (the bogo-ops/s are very susceptible to any kind of
>    interference, so even if in the long runs NO_HZ_FULL still seems to
>    provide some benefits looking at the average, we also need to
>    consider that there might be a significant error in the measurements,
>    standard deviation was pretty high)
> 

Ack on the bogo-ops disclaimers, as also mentioned in the stress-ng docs. Agreed,
a better metric for perf would be helpful.

I am assuming you also have RCU_NOCB enabled for this test?

>  - fio doing short writes (in page cache) seems to perform like 2x
>    better in terms of iops with nohz_full, respect to the other cases
>    and it performs 2x slower with large IO writes (not sure why... need
>    to investigate more)

This is interesting; it could be worth counting how many kernel entries/exits
occur for large IO vs small IO. I'd imagine for large IO we have fewer syscalls
and hence lower entry/exit overhead. But if there are more interrupts for
whatever reason with large IO, then that also implies more kernel entries/exits.
As Frederic was saying, NOHZ_FULL has higher overhead on kernel entry/exit.
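
If it helps, a rough (untested) bpftrace sketch along these lines could count
those entries per second while the fio job runs; the @ map names here are just
placeholders I made up:

---
#!/usr/bin/bpftrace

/* syscall entries, device hardirqs and hrtimer expirations (the tick
 * included), counted system-wide */
tracepoint:raw_syscalls:sys_enter { @syscalls = count(); }
tracepoint:irq:irq_handler_entry { @hardirqs = count(); }
tracepoint:timer:hrtimer_expire_entry { @hrtimers = count(); }

/* print and reset the counters once per second */
interval:s:1
{
    print(@syscalls); clear(@syscalls);
    print(@hardirqs); clear(@hardirqs);
    print(@hrtimers); clear(@hrtimers);
}
---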

> 
>  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
>    return, respect to the other cases (is this expected?)

It depends on which hrtimer_interrupt() instance? There must be several in the
trace due to unrelated timers. Are you saying the worst case, or is it always 2x
more? We do queue a timer for Lazy RCU to flush the RCU work, but it is set to 10
seconds and should be canceled most of the time (it's just a guard rail). It is
possible there is lock contention on ->nocb_gp_lock which is causing the timer
handler execution to be slow. We have several trace_rcu_nocb* tracepoints,
including for the timer. Perhaps you could enable those and we can dig deeper?

Further, it is interesting to see if it is only the hrtimer_interrupt() instance
that also results in a call to do_nocb_deferred_wakeup_timer() via say function
tracing. That will confirm that it is the lazy timer that is slow for you.
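
A rough sketch of such a check (untested; it assumes
do_nocb_deferred_wakeup_timer() is not inlined and is visible to kprobes on
your build, and the map names are made up) could be to count how often that
callback fires relative to the tick:

---
#!/usr/bin/bpftrace

/* per-second rate of tick interrupts vs deferred-wakeup timer callbacks */
kprobe:hrtimer_interrupt { @hrtimer_irq = count(); }
kprobe:do_nocb_deferred_wakeup_timer { @deferred_wakeup = count(); }

interval:s:1
{
    print(@hrtimer_irq); clear(@hrtimer_irq);
    print(@deferred_wakeup); clear(@deferred_wakeup);
}
---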

The actual number of callbacks should not be causing specifically the
hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy feature does
increase the number of timer interrupts.

Further still, it depends on how long hrtimer_interrupt() takes with lazy RCU
before calling it a problem, IMO. Some numbers with units would be nice.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 15:12                     ` Andrea Righi
@ 2024-02-07 15:49                       ` Joel Fernandes
  0 siblings, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2024-02-07 15:49 UTC (permalink / raw)
  To: Andrea Righi, Uladzislau Rezki
  Cc: Paul E. McKenney, Joel Fernandes, rcu, Cc: Frederic Weisbecker



On 2/7/2024 10:12 AM, Andrea Righi wrote:
> On Wed, Feb 07, 2024 at 02:04:12PM +0100, Uladzislau Rezki wrote:
>>>>
>>>> I repeated some tests in a more isolated environment and posted the
>>>> results here:
>>>> https://discourse.ubuntu.com/t/enable-low-latency-features-in-the-generic-ubuntu-kernel-for-24-04/42255
>>>>
>>>> Highlights:
>>>>
>>>>  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
>>>>    in this scenario (the bogo-ops/s are very susceptible to any kind of
>>>>    interference, so even if in the long runs NO_HZ_FULL still seems to
>>>>    provide some benefits looking at the average, we also need to
>>>>    consider that there might be a significant error in the measurements,
>>>>    standard deviation was pretty high)
>>>>
>>>>  - fio doing short writes (in page cache) seems to perform like 2x
>>>>    better in terms of iops with nohz_full, respect to the other cases
>>>>    and it performs 2x slower with large IO writes (not sure why... need
>>>>    to investigate more)
>>>>
>>>>  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
>>>>    return, respect to the other cases (is this expected?)
>>>
>>> This last is surprising at first glance, but I could be missing
>>> something.  Joel, Uladzislau, thoughts?
>>>
>> Could you please share the steps how you run "fio" tests?
> 
> For short writes I was running something like this (on a 8 cores system):
> 
>  $ fio --rw=write --bs=1M --size=32M --numjobs=8 --name=worker --time_based --runtime=300
> 
> Larger writes:
> 
>  $ fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=300

Andrea, Thank you for providing these to us!

 - Joel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 15:48                 ` Joel Fernandes
@ 2024-02-07 16:31                   ` Andrea Righi
  2024-02-07 16:52                     ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-02-07 16:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Wed, Feb 07, 2024 at 10:48:10AM -0500, Joel Fernandes wrote:
> 
> 
> On 2/6/2024 12:51 PM, Andrea Righi wrote:
> >  - stress-ng --matrix seems quite unpredictable to be used a benchmarks
> >    in this scenario (the bogo-ops/s are very susceptible to any kind of
> >    interference, so even if in the long runs NO_HZ_FULL still seems to
> >    provide some benefits looking at the average, we also need to
> >    consider that there might be a significant error in the measurements,
> >    standard deviation was pretty high)
> > 
> 
> Ack on the bogo-ops disclaimers as also mentioned in the stress-ng docs. Agreed
> a better metric for perf is helpful.
> 
> I am assuming you also have RCU_NOCB enabled for this test?

yep, booting with rcu_nocb=all.

> 
> >  - fio doing short writes (in page cache) seems to perform like 2x
> >    better in terms of iops with nohz_full, respect to the other cases
> >    and it performs 2x slower with large IO writes (not sure why... need
> >    to investigate more)
> 
> This is interesting, it could be worth counting how many kernel entries/exits
> occur for large IO vs small IO. I'd imagine for large IO we have fewer syscalls
> and hence lower entry/exit overhead. But if there more interrupts for whatever
> reason with large IO, then that also implies more kernel entries/exits. As
> Frederic was saying, NOHZ_FULL has higher overhead on kernel entry/exit.

Good point. I was expecting to see reduced performance due to the
kernel entry/exit overhead, assuming we had more syscalls in the
short-writes case; that's why I was confused. But, as you correctly
pointed out, we may actually have more entry/exit events due to interrupts.

I'll repeat the test counting the kernel entries/exits to figure out
what's happening exactly, thanks for the hint!

> 
> > 
> >  - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
> >    return, respect to the other cases (is this expected?)
> 
> It depends on which hrtimer_interrupt() instance? There must be several in the
> trace due to unrelated timers. Are you saying the worst case or it is always 2x
> more? We do queue a timer for Lazy RCU to flush the RCU work but it is set to 10
> seconds and should be canceled most of the time (Its just a guard rail). It is
> possible there is lock contention on ->nocb_gp_lock which is causing the timer
> handler execution to be slow. We have several trace_rcu_nocb* trace points,
> including for the timer. Perhaps you could enable those and we dig deeper?

So, my plan was to (try to) measure the potential overhead that we may
introduce by enabling the new config options in the tick handler, so I
tried to trace the time distribution of the different
hrtimer_interrupt() invocations, using this:

---
#!/usr/bin/bpftrace

kprobe:hrtimer_interrupt {
    @start[tid] = nsecs;
}

kretprobe:hrtimer_interrupt /@start[tid]/ {
    @elapsed = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
---
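
(The script just runs under bpftrace for the duration of the benchmark;
stopping it with Ctrl-C prints the @elapsed histogram.)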

The time distribution is always ~2x with lazy RCU enabled, but honestly
I don't know if this metric is really relevant.

In terms of performance, the fact that we move the RCU callbacks from
softirq to kthread context and process them in batches is probably more
relevant. That surely has a bigger impact than any overhead we may add
to the tick handler.

But I'll try to enable the trace_rcu_nocb* tracepoints and see if we can
get some interesting details of what is happening.

> 
> Further, it is interesting to see if it is only the hrtimer_interrupt() instance
> that also results in a call to do_nocb_deferred_wakeup_timer() via say function
> tracing. That will confirm that it is the lazy timer that is slow for you.

ok

> 
> The actual number of callbacks should not be causing specifically the
> hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy feature does
> increase the number of timer interrupts.
> 
> Further still, it depends on how much hrtimer_interrupt() takes with lazy RCU to
> call it a problem IMO. Some numbers with units will be nice.

This is what I see (this is a single run, but the other runs are
similar); the unit is nanoseconds. With lazy RCU enabled,
hrtimer_interrupt() takes around 4K-16K ns; with lazy RCU off it takes
2K-4K ns most of the time:

 - lazy rcu off:

[1K, 2K)         88307 |@@@@@@@@@@@@                                        |
[2K, 4K)        380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)           194 |                                                    |

 - lazy rcu on:

[2K, 4K)          3094 |                                                    |
[4K, 8K)        265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)       182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[16K, 32K)        3422 |                                                    |

Again, I'm not sure if this is really a problem or not, or if it is even
a relevant metric for the overall performance, I was just curious to
understand why it is different.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 16:31                   ` Andrea Righi
@ 2024-02-07 16:52                     ` Joel Fernandes
  2024-02-07 17:05                       ` Andrea Righi
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-02-07 16:52 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Wed, Feb 7, 2024 at 11:31 AM Andrea Righi <andrea.righi@canonical.com> wrote:

> > The actual number of callbacks should not be causing specifically the
> > hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy feature does
> > increase the number of timer interrupts.
> >
> > Further still, it depends on how much hrtimer_interrupt() takes with lazy RCU to
> > call it a problem IMO. Some numbers with units will be nice.
>
> This is what I see (this is a single run, but the other runs are
> similar), unit is nanosec, with lazy RCU enabled hrtimer_interrupt()
> takes around 4K-16K ns, with lazy RCU off most of the times it takes
> 2K-4K ns:
>
>  - lazy rcu off:
>
> [1K, 2K)         88307 |@@@@@@@@@@@@                                            |
> [2K, 4K)        380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
> [4K, 8K)           194 |                                                        |
>
>  - lazy rcu on:
>
> [2K, 4K)          3094 |                                                          |
> [4K, 8K)        265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)       182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> [16K, 32K)        3422 |                                                          |
>
> Again, I'm not sure if this is really a problem or not, or if it is even
> a relevant metric for the overall performance, I was just curious to
> understand why it is different.

This is an interesting find. The number of timer interrupt executions
looks roughly the same in this histogram, so it might not be missed
cancellations or such; it is not clear to me. But it is worth
debugging, and we'll try to reproduce your results.

Some more theories from our internal RCU discussion:
- Could it be another user of RCU (call_rcu) from an unrelated hrtimer
interrupt callback that is causing a "flush" of lazy callbacks?
- What does the distribution look like for
do_nocb_deferred_wakeup_timer? That will probably have to be made
non-static to be picked up by bpftrace (if you could try that real
quick, I'd appreciate it!).
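
Once the symbol is probeable, an untested one-liner sketch like this should
give that distribution (the @t/@dur names are arbitrary):

bpftrace -e 'kprobe:do_nocb_deferred_wakeup_timer { @t[tid] = nsecs; }
    kretprobe:do_nocb_deferred_wakeup_timer /@t[tid]/ {
        @dur = hist(nsecs - @t[tid]); delete(@t[tid]); }'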

Slightly related, but one of the things we are wondering also is how
much of the overhead for your nohz-full and lazy-RCU test (on top of
baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
because of just using NOCB. Uladzislau mentioned he might run a test
for comparing along those lines as well.

Frederic, looking forward to any thoughts from you as well on this
behavior. Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 16:52                     ` Joel Fernandes
@ 2024-02-07 17:05                       ` Andrea Righi
  2024-02-08  5:54                         ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-02-07 17:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Wed, Feb 07, 2024 at 11:52:35AM -0500, Joel Fernandes wrote:
> On Wed, Feb 7, 2024 at 11:31 AM Andrea Righi <andrea.righi@canonical.com> wrote:
> 
> > > The actual number of callbacks should not be causing specifically the
> > > hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy feature does
> > > increase the number of timer interrupts.
> > >
> > > Further still, it depends on how much hrtimer_interrupt() takes with lazy RCU to
> > > call it a problem IMO. Some numbers with units will be nice.
> >
> > This is what I see (this is a single run, but the other runs are
> > similar), unit is nanosec, with lazy RCU enabled hrtimer_interrupt()
> > takes around 4K-16K ns, with lazy RCU off most of the times it takes
> > 2K-4K ns:
> >
> >  - lazy rcu off:
> >
> > [1K, 2K)         88307 |@@@@@@@@@@@@                                            |
> > [2K, 4K)        380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
> > [4K, 8K)           194 |                                                        |
> >
> >  - lazy rcu on:
> >
> > [2K, 4K)          3094 |                                                          |
> > [4K, 8K)        265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8K, 16K)       182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
> > [16K, 32K)        3422 |                                                          |
> >
> > Again, I'm not sure if this is really a problem or not, or if it is even
> > a relevant metric for the overall performance, I was just curious to
> > understand why it is different.
> 
> This is an interesting find, the number of timer interrupt executions
> looks roughly the same in this histogram so it might not be missed
> cancellations or such, so it is not clear to me. But it is worth
> debugging and we'll try to reproduce your results.
> 
> Some more theories from our internal RCU discussion:
> - Could it be another user of RCU (call_rcu) from an unrelated hrtimer
> interrupt callback that is causing a "flush" of lazy callbacks?
> - What does the distribution look like for
> do_nocb_deferred_wakeup_timer ? That will have to probably be made
> non-static to be picked up by bpftrace (If you could try that real
> quick, appreciate!).

Sure, I'll repeat the test tracing do_nocb_deferred_wakeup_timer.

> 
> Slightly related, but one of the things we are wondering also is how
> much of the overhead for your nohz-full and lazy-RCU test (on top of
> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
> because of just using NOCB. Uladsizlau mentioned he might run a test
> for comparing along those lines as well.

Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
lazy RCUs enabled (and HZ=1000), so without nohz_full.

If I enable only nohz_full=all (without rcu_nocb) I see something like
this:

[1K, 2K)           294 |                                                    |
[2K, 4K)         59568 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)           368 |                                                    |

That is roughly the baseline number of invocations divided by 8, because I
have 8 cores and only the timekeeping CPU is ticking, so that seems to make
sense.

-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-07 17:05                       ` Andrea Righi
@ 2024-02-08  5:54                         ` Joel Fernandes
  2024-02-08  6:55                           ` Andrea Righi
  0 siblings, 1 reply; 35+ messages in thread
From: Joel Fernandes @ 2024-02-08  5:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker



On 2/7/2024 12:05 PM, Andrea Righi wrote:
> On Wed, Feb 07, 2024 at 11:52:35AM -0500, Joel Fernandes wrote:
>> On Wed, Feb 7, 2024 at 11:31 AM Andrea Righi <andrea.righi@canonical.com> wrote:
>>
>>>> The actual number of callbacks should not be causing specifically the
>>>> hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy feature does
>>>> increase the number of timer interrupts.
>>>>
>>>> Further still, it depends on how much hrtimer_interrupt() takes with lazy RCU to
>>>> call it a problem IMO. Some numbers with units will be nice.
>>>
>>> This is what I see (this is a single run, but the other runs are
>>> similar), unit is nanosec, with lazy RCU enabled hrtimer_interrupt()
>>> takes around 4K-16K ns, with lazy RCU off most of the times it takes
>>> 2K-4K ns:
>>>
>>>  - lazy rcu off:
>>>
>>> [1K, 2K)         88307 |@@@@@@@@@@@@                                            |
>>> [2K, 4K)        380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
>>> [4K, 8K)           194 |                                                        |
>>>
>>>  - lazy rcu on:
>>>
>>> [2K, 4K)          3094 |                                                          |
>>> [4K, 8K)        265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [8K, 16K)       182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
>>> [16K, 32K)        3422 |                                                          |
>>>
>>> Again, I'm not sure if this is really a problem or not, or if it is even
>>> a relevant metric for the overall performance, I was just curious to
>>> understand why it is different.
>>
>> This is an interesting find, the number of timer interrupt executions
>> looks roughly the same in this histogram so it might not be missed
>> cancellations or such, so it is not clear to me. But it is worth
>> debugging and we'll try to reproduce your results.
>>
>> Some more theories from our internal RCU discussion:
>> - Could it be another user of RCU (call_rcu) from an unrelated hrtimer
>> interrupt callback that is causing a "flush" of lazy callbacks?
>> - What does the distribution look like for
>> do_nocb_deferred_wakeup_timer ? That will have to probably be made
>> non-static to be picked up by bpftrace (If you could try that real
>> quick, appreciate!).
> 
> Sure, I'll repeat the test tracing do_nocb_deferred_wakeup_timer.

Thanks, that'll be great.

>> Slightly related, but one of the things we are wondering also is how
>> much of the overhead for your nohz-full and lazy-RCU test (on top of
>> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
>> because of just using NOCB. Uladsizlau mentioned he might run a test
>> for comparing along those lines as well.
> 
> Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
> lazy RCUs enabled (and HZ=1000), so without nohz_full.
> 
> If I enable only nohz_full=all (without rcu_nocb) I see something like
> this:

Ok. I did want to mention that nohz_full implies rcu_nocb on the same CPUs as well.

It's also mentioned in the boot param docs, on the last line of the description:

        nohz_full=      [KNL,BOOT,SMP,ISOL]
                        The argument is a cpu list, as described above.
                        In kernels built with CONFIG_NO_HZ_FULL=y, set
                        the specified list of CPUs whose tick will be stopped
                        whenever possible. The boot CPU will be forced outside
                        the range to maintain the timekeeping.  Any CPUs
                        in this list will have their RCU callbacks offloaded,
                        just as if they had also been called out in the
                        rcu_nocbs= boot parameter.

(btw, I am a bit slow to respond to email due to being ill).

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-08  5:54                         ` Joel Fernandes
@ 2024-02-08  6:55                           ` Andrea Righi
  2024-02-08 12:53                             ` Uladzislau Rezki
  0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-02-08  6:55 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Thu, Feb 08, 2024 at 12:54:58AM -0500, Joel Fernandes wrote:
...
> >> Slightly related, but one of the things we are wondering also is how
> >> much of the overhead for your nohz-full and lazy-RCU test (on top of
> >> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
> >> because of just using NOCB. Uladsizlau mentioned he might run a test
> >> for comparing along those lines as well.
> > 
> > Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
> > lazy RCUs enabled (and HZ=1000), so without nohz_full.
> > 
> > If I enable only nohz_full=all (without rcu_nocb) I see something like
> > this:
> 
> Ok. I did want to mention nohz_full implies rcu_nocb on the same CPUs as well.
> 
> Its also mentioned in the boot param docs on the last line of the description:
> 
>         nohz_full=      [KNL,BOOT,SMP,ISOL]
>                         The argument is a cpu list, as described above.
>                         In kernels built with CONFIG_NO_HZ_FULL=y, set
>                         the specified list of CPUs whose tick will be stopped
>                         whenever possible. The boot CPU will be forced outside
>                         the range to maintain the timekeeping.  Any CPUs
>                         in this list will have their RCU callbacks offloaded,
>                         just as if they had also been called out in the
>                         rcu_nocbs= boot parameter.

Ah I didn't realize that, it definitely makes sense, thanks for
clarifying it.

Then basically in the results that I posted the difference is
"nohz_full=all+rcu_nocb=all" vs "rcu_nocb=all+lazy_RCU=on".

> 
> (btw, I am a bit slow to respond to email due to being ill).

Get well soon!

> 
> thanks,
> 
>  - Joel

-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-08  6:55                           ` Andrea Righi
@ 2024-02-08 12:53                             ` Uladzislau Rezki
  2024-02-08 14:51                               ` Uladzislau Rezki
  0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-08 12:53 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Joel Fernandes, Paul E. McKenney, Uladzislau Rezki,
	Joel Fernandes, rcu, Cc: Frederic Weisbecker

On Thu, Feb 08, 2024 at 07:55:37AM +0100, Andrea Righi wrote:
> On Thu, Feb 08, 2024 at 12:54:58AM -0500, Joel Fernandes wrote:
> ...
> > >> Slightly related, but one of the things we are wondering also is how
> > >> much of the overhead for your nohz-full and lazy-RCU test (on top of
> > >> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
> > >> because of just using NOCB. Uladsizlau mentioned he might run a test
> > >> for comparing along those lines as well.
> > > 
> > > Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
> > > lazy RCUs enabled (and HZ=1000), so without nohz_full.
> > > 
> > > If I enable only nohz_full=all (without rcu_nocb) I see something like
> > > this:
> > 
> > Ok. I did want to mention nohz_full implies rcu_nocb on the same CPUs as well.
> > 
> > Its also mentioned in the boot param docs on the last line of the description:
> > 
> >         nohz_full=      [KNL,BOOT,SMP,ISOL]
> >                         The argument is a cpu list, as described above.
> >                         In kernels built with CONFIG_NO_HZ_FULL=y, set
> >                         the specified list of CPUs whose tick will be stopped
> >                         whenever possible. The boot CPU will be forced outside
> >                         the range to maintain the timekeeping.  Any CPUs
> >                         in this list will have their RCU callbacks offloaded,
> >                         just as if they had also been called out in the
> >                         rcu_nocbs= boot parameter.
> 
> Ah I didn't realize that, it definitely makes sense, thanks for
> clarifying it.
> 
> Then basically in the results that I posted the difference is
> "nohz_full=all+rcu_nocb=all" vs "rcu_nocb=all+lazy_RCU=on".
> 
So, you are saying that the hrtimer_interrupt() handler takes more time in the
case of lazy + nocb + rcu_nocb=all, and that for nohz_full + rcu_nocb=all it is
faster? Could you please clarify this? I will try to measure from my side!

I have done some basic research on hrtimer_interrupt() latency on my
HW with the latest Linux kernel. I have compared the cases below:

case a: 1000HZ + lazy + nocb_all_cpus
case b: 1000HZ + nocb_all_cpus

I used "ftrace" to measure the time (in microseconds). Steps:

echo 0 > tracing_on
echo function_graph > current_tracer
echo funcgraph-proc > trace_options
echo funcgraph-abstime > trace_options
echo hrtimer_interrupt > set_ftrace_filter

fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=50&

echo 1 > tracing_on; sleep 10; echo 0 > tracing_on

The data is based on a 10-second collection:

<case a>
     6  2102 ############################################################
     8  2079 ############################################################
    10  1464 ##########################################
     7   897 ##########################
     9   625 ##################
    12   490 ##############
    13   479 ##############
    11   289 #########
     5   249 ########
    14   124 ####
    15    72 ###
    16    41 ##
    17    24 #
     4    22 #
    18    12 #
    22     2 #
    19     1 #
<case a>

<case b>
     9  1658 ############################################################
    13  1308 ################################################
    12  1224 #############################################
    10   972 ####################################
     8   703 ##########################
    14   595 ######################
    15   571 #####################
    11   525 ###################
    17   350 #############
    16   235 #########
     7   214 ########
     4    73 ###
     5    68 ###
     6    54 ##
    20     9 #
    18     9 #
    19     6 #
    33     1 #
     3     1 #
    28     1 #
    27     1 #
    25     1 #
    22     1 #
    21     1 #
<case b>

I do not see a difference; there is just a noise of 1/2/3 microseconds.

Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-08 12:53                             ` Uladzislau Rezki
@ 2024-02-08 14:51                               ` Uladzislau Rezki
  2024-02-12  0:22                                 ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-08 14:51 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrea Righi, Joel Fernandes, Paul E. McKenney, Joel Fernandes,
	rcu, Cc: Frederic Weisbecker

On Thu, Feb 08, 2024 at 01:53:58PM +0100, Uladzislau Rezki wrote:
> On Thu, Feb 08, 2024 at 07:55:37AM +0100, Andrea Righi wrote:
> > On Thu, Feb 08, 2024 at 12:54:58AM -0500, Joel Fernandes wrote:
> > ...
> > > >> Slightly related, but one of the things we are wondering also is how
> > > >> much of the overhead for your nohz-full and lazy-RCU test (on top of
> > > >> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
> > > >> because of just using NOCB. Uladsizlau mentioned he might run a test
> > > >> for comparing along those lines as well.
> > > > 
> > > > Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
> > > > lazy RCUs enabled (and HZ=1000), so without nohz_full.
> > > > 
> > > > If I enable only nohz_full=all (without rcu_nocb) I see something like
> > > > this:
> > > 
> > > Ok. I did want to mention nohz_full implies rcu_nocb on the same CPUs as well.
> > > 
> > > Its also mentioned in the boot param docs on the last line of the description:
> > > 
> > >         nohz_full=      [KNL,BOOT,SMP,ISOL]
> > >                         The argument is a cpu list, as described above.
> > >                         In kernels built with CONFIG_NO_HZ_FULL=y, set
> > >                         the specified list of CPUs whose tick will be stopped
> > >                         whenever possible. The boot CPU will be forced outside
> > >                         the range to maintain the timekeeping.  Any CPUs
> > >                         in this list will have their RCU callbacks offloaded,
> > >                         just as if they had also been called out in the
> > >                         rcu_nocbs= boot parameter.
> > 
> > Ah I didn't realize that, it definitely makes sense, thanks for
> > clarifying it.
> > 
> > Then basically in the results that I posted the difference is
> > "nohz_full=all+rcu_nocb=all" vs "rcu_nocb=all+lazy_RCU=on".
> > 
> So, you say that a hrtimer_interrupt() handler takes more time in case
> of lazy + nocb + rcu_nocb=all and for nohz_full + rcu_nocb=all it faster?
> Could you please clarify this? I will try to measure from my side!
> 
> I have done some basic research about hrtimer_interrupt() latency on my
> HW with latest Linux kernel. I have compared below cases:
> 
> case a: 1000HZ + lazy + nocb_all_cpus
> case b: 1000HZ + nocb_all_cpus
> 
> I used "ftrace" to measure time(in microseconds). Steps:
> 
> echo 0 > tracing_on
> echo function_graph > current_tracer
> echo funcgraph-proc > trace_options
> echo funcgraph-abstime > trace_options
> echo hrtimer_interrupt > set_ftrace_filter
> 
> fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=50&
> 
> echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
> 
> data is based on 10 seconds collection:
> 
> <case a>
>      6  2102 ############################################################
>      8  2079 ############################################################
>     10  1464 ##########################################
>      7   897 ##########################
>      9   625 ##################
>     12   490 ##############
>     13   479 ##############
>     11   289 #########
>      5   249 ########
>     14   124 ####
>     15    72 ###
>     16    41 ##
>     17    24 #
>      4    22 #
>     18    12 #
>     22     2 #
>     19     1 #
> <case a>
> 
> <case b>
>      9  1658 ############################################################
>     13  1308 ################################################
>     12  1224 #############################################
>     10   972 ####################################
>      8   703 ##########################
>     14   595 ######################
>     15   571 #####################
>     11   525 ###################
>     17   350 #############
>     16   235 #########
>      7   214 ########
>      4    73 ###
>      5    68 ###
>      6    54 ##
>     20     9 #
>     18     9 #
>     19     6 #
>     33     1 #
>      3     1 #
>     28     1 #
>     27     1 #
>     25     1 #
>     22     1 #
>     21     1 #
> <case b>
> 
> I do not see the difference, there is a nose of 1/2/3 microseconds diff.
> 
Let me have a further look at what we use for lazy in terms of hrtimers, though.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-08 14:51                               ` Uladzislau Rezki
@ 2024-02-12  0:22                                 ` Joel Fernandes
  2024-02-12  9:05                                   ` Uladzislau Rezki
  2024-02-12  9:44                                   ` Uladzislau Rezki
  0 siblings, 2 replies; 35+ messages in thread
From: Joel Fernandes @ 2024-02-12  0:22 UTC (permalink / raw)
  To: Uladzislau Rezki, Andrea Righi
  Cc: Paul E. McKenney, Joel Fernandes, rcu, Cc: Frederic Weisbecker



On 2/8/2024 9:51 AM, Uladzislau Rezki wrote:
> On Thu, Feb 08, 2024 at 01:53:58PM +0100, Uladzislau Rezki wrote:
>> On Thu, Feb 08, 2024 at 07:55:37AM +0100, Andrea Righi wrote:
>>> On Thu, Feb 08, 2024 at 12:54:58AM -0500, Joel Fernandes wrote:
>>> ...
>>>>>> Slightly related, but one of the things we are wondering also is how
>>>>>> much of the overhead for your nohz-full and lazy-RCU test (on top of
>>>>>> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
>>>>>> because of just using NOCB. Uladsizlau mentioned he might run a test
>>>>>> for comparing along those lines as well.
>>>>>
>>>>> Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
>>>>> lazy RCUs enabled (and HZ=1000), so without nohz_full.
>>>>>
>>>>> If I enable only nohz_full=all (without rcu_nocb) I see something like
>>>>> this:
>>>>
>>>> Ok. I did want to mention nohz_full implies rcu_nocb on the same CPUs as well.
>>>>
>>>> Its also mentioned in the boot param docs on the last line of the description:
>>>>
>>>>         nohz_full=      [KNL,BOOT,SMP,ISOL]
>>>>                         The argument is a cpu list, as described above.
>>>>                         In kernels built with CONFIG_NO_HZ_FULL=y, set
>>>>                         the specified list of CPUs whose tick will be stopped
>>>>                         whenever possible. The boot CPU will be forced outside
>>>>                         the range to maintain the timekeeping.  Any CPUs
>>>>                         in this list will have their RCU callbacks offloaded,
>>>>                         just as if they had also been called out in the
>>>>                         rcu_nocbs= boot parameter.
>>>
>>> Ah I didn't realize that, it definitely makes sense, thanks for
>>> clarifying it.
>>>
>>> Then basically in the results that I posted the difference is
>>> "nohz_full=all+rcu_nocb=all" vs "rcu_nocb=all+lazy_RCU=on".
>>>
>> So, you say that a hrtimer_interrupt() handler takes more time in case
>> of lazy + nocb + rcu_nocb=all and for nohz_full + rcu_nocb=all it faster?
>> Could you please clarify this? I will try to measure from my side!
>>
>> I have done some basic research about hrtimer_interrupt() latency on my
>> HW with latest Linux kernel. I have compared below cases:
>>
>> case a: 1000HZ + lazy + nocb_all_cpus
>> case b: 1000HZ + nocb_all_cpus
>>
>> I used "ftrace" to measure time(in microseconds). Steps:
>>
>> echo 0 > tracing_on
>> echo function_graph > current_tracer
>> echo funcgraph-proc > trace_options
>> echo funcgraph-abstime > trace_options
>> echo hrtimer_interrupt > set_ftrace_filter
>>
>> fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=50&
>>
>> echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
>>
>> data is based on 10 seconds collection:
>>
>> <case a>
>>      6  2102 ############################################################
>>      8  2079 ############################################################
>>     10  1464 ##########################################
>>      7   897 ##########################

So the first column is microseconds and the second one is the count?

>>      9   625 ##################
>>     12   490 ##############
>>     13   479 ##############
>>     11   289 #########
>>      5   249 ########
>>     14   124 ####
>>     15    72 ###
>>     16    41 ##
>>     17    24 #
>>      4    22 #
>>     18    12 #
>>     22     2 #
>>     19     1 #
>> <case a>
>>
>> <case b>
>>      9  1658 ############################################################
>>     13  1308 ################################################
>>     12  1224 #############################################

Assuming that, it does seem the "best" case is off by 3 microseconds (9 vs 6),
which still would not warrant being regarded as a bug and is possibly just in
the noise.

>>     10   972 ####################################
>>      8   703 ##########################
>>     14   595 ######################
>>     15   571 #####################
>>     11   525 ###################
>>     17   350 #############
>>     16   235 #########
>>      7   214 ########
>>      4    73 ###
>>      5    68 ###
>>      6    54 ##
>>     20     9 #
>>     18     9 #
>>     19     6 #
>>     33     1 #
>>      3     1 #
>>     28     1 #
>>     27     1 #
>>     25     1 #
>>     22     1 #
>>     21     1 #
>> <case b>
>>
>> I do not see the difference, there is a nose of 1/2/3 microseconds diff.
>>
> Let me further have a look at what we use for lazy in terms on hrtimer though.

Thanks for tracing it. Yeah, it would be nice to count how many calls to
do_nocb_deferred_wakeup() the fio test triggers. If there are only a few, then
maybe the problem with hrtimer_interrupt() is something else.

 - Joel


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-12  0:22                                 ` Joel Fernandes
@ 2024-02-12  9:05                                   ` Uladzislau Rezki
  2024-02-12  9:44                                   ` Uladzislau Rezki
  1 sibling, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-12  9:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Andrea Righi, Paul E. McKenney, Joel Fernandes,
	rcu, Cc: Frederic Weisbecker

On Sun, Feb 11, 2024 at 07:22:55PM -0500, Joel Fernandes wrote:
> 
> 
> On 2/8/2024 9:51 AM, Uladzislau Rezki wrote:
> > On Thu, Feb 08, 2024 at 01:53:58PM +0100, Uladzislau Rezki wrote:
> >> On Thu, Feb 08, 2024 at 07:55:37AM +0100, Andrea Righi wrote:
> >>> On Thu, Feb 08, 2024 at 12:54:58AM -0500, Joel Fernandes wrote:
> >>> ...
> >>>>>> Slightly related, but one of the things we are wondering also is how
> >>>>>> much of the overhead for your nohz-full and lazy-RCU test (on top of
> >>>>>> baseline - that is just CONFIG_HZ=1000 without nohz-full or nocbs) is
> >>>>>> because of just using NOCB. Uladsizlau mentioned he might run a test
> >>>>>> for comparing along those lines as well.
> >>>>>
> >>>>> Just to clarify, "lazy rcu on" results are just with rcu_nocb=all and
> >>>>> lazy RCUs enabled (and HZ=1000), so without nohz_full.
> >>>>>
> >>>>> If I enable only nohz_full=all (without rcu_nocb) I see something like
> >>>>> this:
> >>>>
> >>>> Ok. I did want to mention nohz_full implies rcu_nocb on the same CPUs as well.
> >>>>
> >>>> Its also mentioned in the boot param docs on the last line of the description:
> >>>>
> >>>>         nohz_full=      [KNL,BOOT,SMP,ISOL]
> >>>>                         The argument is a cpu list, as described above.
> >>>>                         In kernels built with CONFIG_NO_HZ_FULL=y, set
> >>>>                         the specified list of CPUs whose tick will be stopped
> >>>>                         whenever possible. The boot CPU will be forced outside
> >>>>                         the range to maintain the timekeeping.  Any CPUs
> >>>>                         in this list will have their RCU callbacks offloaded,
> >>>>                         just as if they had also been called out in the
> >>>>                         rcu_nocbs= boot parameter.
> >>>
> >>> Ah I didn't realize that, it definitely makes sense, thanks for
> >>> clarifying it.
> >>>
> >>> Then basically in the results that I posted the difference is
> >>> "nohz_full=all+rcu_nocb=all" vs "rcu_nocb=all+lazy_RCU=on".
> >>>
> >> So, you say that the hrtimer_interrupt() handler takes more time in the case
> >> of lazy + nocb + rcu_nocb=all, and that for nohz_full + rcu_nocb=all it is faster?
> >> Could you please clarify this? I will try to measure from my side!
> >>
> >> I have done some basic research about hrtimer_interrupt() latency on my
> >> HW with latest Linux kernel. I have compared below cases:
> >>
> >> case a: 1000HZ + lazy + nocb_all_cpus
> >> case b: 1000HZ + nocb_all_cpus
> >>
> >> I used "ftrace" to measure time (in microseconds). Steps:
> >>
> >> echo 0 > tracing_on
> >> echo function_graph > current_tracer
> >> echo funcgraph-proc > trace_options
> >> echo funcgraph-abstime > trace_options
> >> echo hrtimer_interrupt > set_ftrace_filter
> >>
> >> fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=50&
> >>
> >> echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
> >>
> >> data is based on 10 seconds collection:
> >>
> >> <case a>
> >>      6  2102 ############################################################
> >>      8  2079 ############################################################
> >>     10  1464 ##########################################
> >>      7   897 ##########################
> 
> So first column is microseconds and second one is count?
> 
> >>      9   625 ##################
> >>     12   490 ##############
> >>     13   479 ##############
> >>     11   289 #########
> >>      5   249 ########
> >>     14   124 ####
> >>     15    72 ###
> >>     16    41 ##
> >>     17    24 #
> >>      4    22 #
> >>     18    12 #
> >>     22     2 #
> >>     19     1 #
> >> <case a>
> >>
> >> <case b>
> >>      9  1658 ############################################################
> >>     13  1308 ################################################
> >>     12  1224 #############################################
> 
> Assuming that, it does seem the "best" case is off by 3 microseconds (9 vs 6),
> which still would not warrant being regarded as a bug and is possibly just noise.
> 
> >>     10   972 ####################################
> >>      8   703 ##########################
> >>     14   595 ######################
> >>     15   571 #####################
> >>     11   525 ###################
> >>     17   350 #############
> >>     16   235 #########
> >>      7   214 ########
> >>      4    73 ###
> >>      5    68 ###
> >>      6    54 ##
> >>     20     9 #
> >>     18     9 #
> >>     19     6 #
> >>     33     1 #
> >>      3     1 #
> >>     28     1 #
> >>     27     1 #
> >>     25     1 #
> >>     22     1 #
> >>     21     1 #
> >> <case b>
> >>
> >> I do not see the difference; there is just noise of 1/2/3 microseconds.
> >>
> > Let me further have a look at what we use for lazy in terms of hrtimers though.
> 
> Thanks for tracing it. Yeah, it would be nice to count how many times the fio
> test triggers do_nocb_deferred_wakeup(). If it is only a few, then maybe the
> problem with hrtimer_interrupt() is something else.
> 

urezki@pc638:~$ fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=1000 > /dev/null 2>&1&
[1] 697
urezki@pc638:~$ sudo cat /sys/kernel/debug/tracing/trace_pipe  | grep Timer
          <idle>-0       [005] ..s1.   175.762689: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [002] ..s1.   186.002386: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [005] ..s1.   186.002391: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.   228.497061: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [001] ..s1.   242.831956: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-705     [004] ..s..   272.009997: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [002] ..s1.   282.253447: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   347.786758: rcu_nocb_wake: rcu_preempt 4 Timer
            cron-714     [004] ..s1.   362.429287: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] .Ns1.   377.993814: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [002] ..s1.   393.353371: rcu_nocb_wake: rcu_preempt 0 Timer
    kworker/6:1H-172     [006] ..s1.   410.760933: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.   427.148456: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   438.220173: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] .Ns1.   451.719831: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   472.199389: rcu_nocb_wake: rcu_preempt 4 Timer
           <...>-717     [000] ..s..   482.434074: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   484.969012: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.   495.750151: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-702     [000] ..s1.   516.439595: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   578.212713: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   595.583863: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   597.634693: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.   617.598664: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   655.996890: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   660.606691: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   680.059867: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   688.227533: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   698.491126: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-703     [000] ..s1.   705.925823: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   721.018256: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.   732.793797: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   747.129268: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.   766.075583: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-701     [004] ..s2.   767.047547: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   782.456032: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   797.303567: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/6:1H-172     [006] ..s1.   813.686996: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.   828.534430: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-700     [004] ..s2.   850.536698: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.   860.789367: rcu_nocb_wake: rcu_preempt 4 Timer
     ksoftirqd/6-63      [006] ..s..   888.948515: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-699     [000] ..s..   905.854912: rcu_nocb_wake: rcu_preempt 0 Timer
    kworker/6:1H-172     [006] ..s1.   911.987781: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.   924.275337: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.   938.208899: rcu_nocb_wake: rcu_preempt 4 Timer
    kworker/0:1H-100     [000] ..s1.   938.236053: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.   948.338615: rcu_nocb_wake: rcu_preempt 4 Timer
             fio-702     [000] ..s2.   958.216274: rcu_nocb_wake: rcu_preempt 0 Timer
           <...>-728     [004] ..s..   962.432149: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [002] ..s1.   978.545645: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [003] ..s1.  1002.608935: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1002.614904: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.  1012.848592: rcu_nocb_wake: rcu_preempt 0 Timer
    kworker/6:1H-172     [006] ..s1.  1021.552408: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [002] ..s1.  1042.543682: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [001] ..s1.  1068.654892: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [002] ..s1.  1080.430541: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [000] ..s1.  1098.349992: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [002] ..s1.  1115.757464: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [005] ..s1.  1170.539822: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [000] ..s1.  1178.191592: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [000] ..s1.  1188.459281: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1196.139077: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [000] ..s1.  1216.762430: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1238.374788: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.  1241.194246: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [003] ..s1.  1251.433991: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1261.673634: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.  1261.673637: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [007] ..s1.  1271.913347: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.  1308.263711: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [000] ..s1.  1308.263738: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [005] ..s1.  1308.270707: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [000] ..s1.  1326.183705: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1326.183709: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.  1326.193714: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.  1336.423403: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [003] ..s1.  1336.423404: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [000] ..s1.  1368.678480: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [007] ..s1.  1398.373573: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.  1428.580662: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.  1428.580666: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [007] ..s1.  1468.515358: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [000] ..s1.  1488.482412: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [000] ..s1.  1548.385082: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [005] ..s1.  1548.385083: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [003] ..s1.  1572.448367: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1572.448367: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [004] ..s1.  1572.453972: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [005] ..s1.  1582.688060: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.  1582.688062: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [004] ..s1.  1592.927895: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [007] ..s1.  1608.287294: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [001] ..s1.  1608.799312: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [002] ..s1.  1621.086909: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [005] ..s1.  1631.326609: rcu_nocb_wake: rcu_preempt 4 Timer
          <idle>-0       [003] .Ns1.  1653.854052: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [001] ..s1.  1668.189531: rcu_nocb_wake: rcu_preempt 0 Timer
          <idle>-0       [007] ..s1.  1678.429219: rcu_nocb_wake: rcu_preempt 4 Timer
<snip>

A 1000-second run of the "fio" test triggers do_nocb_deferred_wakeup_timer()
around 100 times. Most of them are handled from soft-irq context. I have also
looked at the perf figures for hrtimer_interrupt() and do not see any obvious
performance issues with it; a rough way to post-process the capture is sketched
after the setup summary below.

8xCPUs KVM system;
1000HZ timer;
lazy + all offloading.
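A rough post-processing sketch (an assumption about tooling, not necessarily
how the figures above were produced): it assumes the trace_pipe output above
was saved to a file, here called nocb_wake.txt, and that the third field is
the usual latency-format flags column, where 's' means the event fired in
softirq context.

awk '/rcu_nocb_wake:.*Timer/ {
         total++
         if ($3 ~ /s/)          # flags column: "s" == softirq context
                 softirq++
     }
     END { printf "timer wakeups: %d, in softirq context: %d\n", total, softirq }' nocb_wake.txt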

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-12  0:22                                 ` Joel Fernandes
  2024-02-12  9:05                                   ` Uladzislau Rezki
@ 2024-02-12  9:44                                   ` Uladzislau Rezki
  1 sibling, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-12  9:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Andrea Righi, Paul E. McKenney, Joel Fernandes,
	rcu, Cc: Frederic Weisbecker

> >>
> >> I have done some basic research about hrtimer_interrupt() latency on my
> >> HW with latest Linux kernel. I have compared below cases:
> >>
> >> case a: 1000HZ + lazy + nocb_all_cpus
> >> case b: 1000HZ + nocb_all_cpus
> >>
> >> I used "ftrace" to measure time (in microseconds). Steps:
> >>
> >> echo 0 > tracing_on
> >> echo function_graph > current_tracer
> >> echo funcgraph-proc > trace_options
> >> echo funcgraph-abstime > trace_options
> >> echo hrtimer_interrupt > set_ftrace_filter
> >>
> >> fio --rw=write --bs=1M --size=1G --numjobs=8 --name=worker --time_based --runtime=50&
> >>
> >> echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
> >>
> >> data is based on 10 seconds collection:
> >>
> >> <case a>
> >>      6  2102 ############################################################
> >>      8  2079 ############################################################
> >>     10  1464 ##########################################
> >>      7   897 ##########################
> 
> So first column is microseconds and second one is count?
> 
Right!

> >>      9   625 ##################
> >>     12   490 ##############
> >>     13   479 ##############
> >>     11   289 #########
> >>      5   249 ########
> >>     14   124 ####
> >>     15    72 ###
> >>     16    41 ##
> >>     17    24 #
> >>      4    22 #
> >>     18    12 #
> >>     22     2 #
> >>     19     1 #
> >> <case a>
> >>
> >> <case b>
> >>      9  1658 ############################################################
> >>     13  1308 ################################################
> >>     12  1224 #############################################
> 
> Assuming that, it does seem the "best" case is off by 3 microseconds (9 vs 6),
> which still would not warrant being regarded as a bug and is possibly just noise.
> 
I also think it is just noise.
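For reference, a per-microsecond histogram like the ones quoted above can be
derived from the function_graph capture with something like this (a sketch,
not necessarily the exact post-processing used for the data above; the '#'
bars in the quoted output simply scale with the count in the second column):

awk '/hrtimer_interrupt/ {
         for (i = 2; i <= NF; i++)
                 if ($i == "us") {          # the duration precedes the "us" unit field
                         hist[int($(i - 1))]++
                         break
                 }
     }
     END {
         for (us in hist)
                 print us, hist[us]
     }' /sys/kernel/debug/tracing/trace | sort -k2 -rn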


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-06 18:44                 ` Paul E. McKenney
  2024-02-07 13:04                   ` Uladzislau Rezki
@ 2024-02-15  7:51                   ` Andrea Righi
  2024-02-15 13:15                     ` Uladzislau Rezki
  1 sibling, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-02-15  7:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Tue, Feb 06, 2024 at 10:44:33AM -0800, Paul E. McKenney wrote:
...
> > 
> >  - with lazy RCU enabled, hrtimer_interrupt() takes about 2x longer to
> >    return with respect to the other cases (is this expected?)
> 
> This last is surprising at first glance, but I could be missing
> something.  Joel, Uladzislau, thoughts?

On this specific point, I repeated the tests with the latest 6.8
kernel and now I don't notice this difference anymore.

hrtimer_interrupt() time is always in the range [4K, 16K) ns, both with
lazy RCU off and lazy RCU on.

I don't really know why I was getting different results, probably some
noise triggered by something running in user-space...?

In any case I will update my old post with these new results and I
wouldn't consider this an issue for now.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-15  7:51                   ` Andrea Righi
@ 2024-02-15 13:15                     ` Uladzislau Rezki
  2024-02-15 14:02                       ` Joel Fernandes
  0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2024-02-15 13:15 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Paul E. McKenney, Uladzislau Rezki, Joel Fernandes,
	Joel Fernandes, rcu, Cc: Frederic Weisbecker

On Thu, Feb 15, 2024 at 08:51:54AM +0100, Andrea Righi wrote:
> On Tue, Feb 06, 2024 at 10:44:33AM -0800, Paul E. McKenney wrote:
> ...
> > > 
> > >  - with lazy RCU enabled, hrtimer_interrupt() takes about 2x longer to
> > >    return with respect to the other cases (is this expected?)
> > 
> > This last is surprising at first glance, but I could be missing
> > something.  Joel, Uladzislau, thoughts?
> 
> On this specific point, I repeated the tests with the latest 6.8
> kernel and now I don't notice this difference anymore.
> 
> hrtimer_interrupt() time is always in the range [4K, 16K) ns, both with
> lazy RCU off and lazy RCU on.
> 
> I don't really know why I was getting different results, probably some
> noise triggered by something running in user-space...?
> 
> In any case I will update my old post with these new results and I
> wouldn't consider this an issue for now.
> 
Thank you for your confirmation. Our results are aligned, then: I also do not
see any difference or issues with the hrtimer_interrupt() duration.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Observation on NOHZ_FULL
  2024-02-15 13:15                     ` Uladzislau Rezki
@ 2024-02-15 14:02                       ` Joel Fernandes
  0 siblings, 0 replies; 35+ messages in thread
From: Joel Fernandes @ 2024-02-15 14:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrea Righi, Paul E. McKenney, Joel Fernandes, rcu,
	Cc: Frederic Weisbecker

On Thu, Feb 15, 2024 at 8:16 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Thu, Feb 15, 2024 at 08:51:54AM +0100, Andrea Righi wrote:
> > On Tue, Feb 06, 2024 at 10:44:33AM -0800, Paul E. McKenney wrote:
> > ...
> > > >
> > > >  - with lazy RCU enabled, hrtimer_interrupt() takes about 2x longer to
> > > >    return with respect to the other cases (is this expected?)
> > >
> > > This last is surprising at first glance, but I could be missing
> > > something.  Joel, Uladzislau, thoughts?
> >
> > On this specific point, I repeated the tests with the latest 6.8
> > kernel and now I don't notice this difference anymore.
> >
> > hrtimer_interrupt() time is always in the range [4K, 16K) ns, both with
> > lazy RCU off and lazy RCU on.
> >
> > I don't really know why I was getting different results, probably some
> > noise triggered by something running in user-space...?
> >
> > In any case I will update my old post with these new results and I
> > wouldn't consider this an issue for now.
> >
> Thank you for your confirmation. Our results are aligned, then: I also do not
> see any difference or issues with the hrtimer_interrupt() duration.
>

Thanks Vlad and Andrea. That's a relief !!! ;-)

 - Joel

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2024-02-15 14:02 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
2024-01-29 17:47 Observation on NOHZ_FULL Joel Fernandes
2024-01-29 20:41 ` Paul E. McKenney
2024-01-29 22:16   ` Joel Fernandes
2024-01-30  6:58     ` Andrea Righi
2024-01-30 10:17       ` Paul E. McKenney
2024-01-30 11:06         ` Uladzislau Rezki
2024-01-30 11:27           ` Andrea Righi
2024-01-30 11:41             ` Uladzislau Rezki
2024-01-30 13:39             ` Paul E. McKenney
2024-02-06 17:51               ` Andrea Righi
2024-02-06 18:44                 ` Paul E. McKenney
2024-02-07 13:04                   ` Uladzislau Rezki
2024-02-07 15:12                     ` Andrea Righi
2024-02-07 15:49                       ` Joel Fernandes
2024-02-15  7:51                   ` Andrea Righi
2024-02-15 13:15                     ` Uladzislau Rezki
2024-02-15 14:02                       ` Joel Fernandes
2024-02-07 15:48                 ` Joel Fernandes
2024-02-07 16:31                   ` Andrea Righi
2024-02-07 16:52                     ` Joel Fernandes
2024-02-07 17:05                       ` Andrea Righi
2024-02-08  5:54                         ` Joel Fernandes
2024-02-08  6:55                           ` Andrea Righi
2024-02-08 12:53                             ` Uladzislau Rezki
2024-02-08 14:51                               ` Uladzislau Rezki
2024-02-12  0:22                                 ` Joel Fernandes
2024-02-12  9:05                                   ` Uladzislau Rezki
2024-02-12  9:44                                   ` Uladzislau Rezki
2024-01-29 20:48 ` Uladzislau Rezki
2024-01-29 22:20   ` Joel Fernandes
2024-01-29 22:43     ` Frederic Weisbecker
2024-01-29 22:53       ` Joel Fernandes
2024-01-29 23:11         ` Frederic Weisbecker
2024-01-29 23:36           ` Joel Fernandes
2024-01-30  0:40       ` Paul E. McKenney
