All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nohz1: Documentation
@ 2013-03-18 16:29 Paul E. McKenney
  2013-03-18 18:13 ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-18 16:29 UTC (permalink / raw)
  To: fweisbec; +Cc: linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

First attempt at documentation for adaptive ticks.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

nohz1: Documentation

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
new file mode 100644
index 0000000..7279109
--- /dev/null
+++ b/Documentation/timers/NO_HZ.txt
@@ -0,0 +1,200 @@
+		NO_HZ: Reducing Scheduling-Clock Ticks
+
+
+This document covers kernel configuration variables used to reduce
+the number of scheduling-clock interrupts.  These reductions can be
+helpful in improving energy efficiency and in reducing "OS jitter",
+the latter being very important for some types of computationally
+intensive high-performance computing (HPC) applications and for real-time
+applications.
+
+Within the Linux kernel, there are two major aspects of scheduling-clock
+interrupt reduction:
+
+1.	Idle CPUs.
+
+2.	CPUs having only one runnable task.
+
+These two cases are described in the following sections.
+
+
+IDLE CPUs
+
+If a CPU is idle, there is little point in sending it a scheduling-clock
+interrupt.  After all, the primary purpose of a scheduling-clock interrupt
+is to force a busy CPU to shift its attention among multiple duties,
+but an idle CPU by definition has no duties to shift its attention among.
+
+The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
+scheduling-clock interrupts to idle CPUs, which is critically important
+both to battery-powered devices and to highly virtualized mainframes.
+A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
+battery very quickly, easily 2-3x as fast as would the same device running
+a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could
+easily find that half of its CPU time was consumed by scheduling-clock
+interrupts.  In these situations, there is therefore strong motivation
+to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
+dyntick-idle mode is not free:
+
+1.	It increases the number of instructions executed on the path
+	to and from the idle loop.
+
+2.	Many architectures will place dyntick-idle CPUs into deep sleep
+	states, which further degrades from-idle transition latencies.
+
+Therefore, systems with aggressive real-time response constraints
+often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
+transition latencies.
+
+An idle CPU that is not receiving scheduling-clock interrupts is said to
+be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
+tickless".  The remainder of this document will use "dyntick-idle mode".
+
+There is also a boot parameter "nohz=" that can be used to disable
+dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
+By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
+dyntick-idle mode.
+
+
+CPUs WITH ONLY ONE RUNNABLE TASK
+
+If a CPU has only one runnable task, there is again little point in
+sending it a scheduling-clock interrupt.  Recall that the primary
+purpose of a scheduling-clock interrupt is to force a busy CPU to
+shift its attention among many things requiring its attention -- and
+there is nowhere else for a CPU with but one runnable task to shift its
+attention to.
+
+The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
+sending scheduling-clock interrupts to CPUs with a single runnable task.
+This is important for applications with aggressive real-time response
+constraints because it allows them to improve their worst-case response
+times by the maximum duration of a scheduling-clock interrupt.  It is also
+important for computationally intensive iterative workloads with short
+iterations:  If any CPU is delayed during a given iteration, all the
+other CPUs will be forced to wait idle while the delayed CPU finished.
+Thus, the delay is multiplied by one less than the number of CPUs.
+In these situations, there is again strong motivation to avoid sending
+scheduling-clock interrupts to CPUs that have but one runnable task that
+is executing in user mode.
+
+Note that if a given CPU is in adaptive-ticks mode while executing in
+user mode, transitioning to kernel mode does not automatically force
+that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
+mode only if needed, for example, if that CPU enqueues an RCU callback.
+
+Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
+not come for free:
+
+1.	The user/kernel transitions are slightly more expensive due
+	to the need to inform kernel subsystems (such as RCU) about
+	the change in mode.
+
+2.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
+	not at all) because they currently rely on scheduling-tick
+	interrupts.  This will likely be fixed in one of two ways: (1)
+	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
+	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
+	to cause the POSIX CPU timer to fire properly.
+
+3.	If there are more perf events pending than the hardware can
+	accommodate, they are normally round-robined so as to collect
+	all of them over time.  Adaptive-tick mode may prevent this
+	round-robining from happening.  This will likely be fixed by
+	preventing CPUs with large numbers of perf events pending from
+	entering adaptive-tick mode.
+
+4.	Scheduler statistics for adaptive-idle CPUs may be computed
+	slightly differently than those for non-adaptive-idle CPUs.
+	This may in turn perturb load-balancing of real-time tasks.
+
+5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
+
+Although improvements are expected over time, adaptive ticks is quite
+useful for many types of real-time and compute-intensive applications.
+However, the drawbacks listed above mean that adaptive ticks should not
+be enabled by default across the board at the current time.
+
+
+RCU IMPLICATIONS
+
+There are situations in which idle CPUs cannot be permitted to
+enter either dyntick-idle mode or adaptive-tick mode, the most
+familiar being the case where that CPU has RCU callbacks pending.
+
+The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
+CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
+timer will awaken these CPUs every four jiffies in order to ensure that
+the RCU callbacks are processed in a timely fashion.
+
+Another approach is to offload RCU callback processing to "rcuo" kthreads
+using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
+selected via several methods:
+
+1.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
+	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
+	3, 4, and 5.
+
+2.	The RCU_NOCB_CPU_ZERO=y Kconfig option, which causes CPU 0 to
+	be offloaded.  This is the build-time equivalent of "rcu_nocbs=0".
+
+3.	The RCU_NOCB_CPU_ALL=y Kconfig option, which causes all CPUs
+	to be offloaded.  On a 16-CPU system, this is equivalent to
+	"rcu_nocbs=0-15".
+
+The offloaded CPUs never have RCU callbacks queued, and therefore RCU
+never prevents offloaded CPUs from entering either dyntick-idle mode or
+adaptive-tick mode.  That said, note that it is up to userspace to
+pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
+scheduler will decide where to run them, which might or might not be
+where you want them to run.
+
+
+KNOWN ISSUES
+
+o	Dyntick-idle slows transitions to and from idle slightly.
+	In practice, this has not been a problem except for the most
+	aggressive real-time workloads, which have the option of disabling
+	dyntick-idle mode, an option that most of them take.
+
+o	Adaptive-ticks slows user/kernel transitions slightly.
+	This is not expected to be a problem for computational-intensive
+	workloads, which have few such transitions.  Careful benchmarking
+	will be required to determine whether or not other workloads
+	are significantly affected by this effect.
+
+o	Adaptive-ticks does not do anything unless there is only one
+	runnable task for a given CPU, even though there are a number
+	of other situations where the scheduling-clock tick is not
+	needed.  To give but one example, consider a CPU that has one
+	runnable high-priority SCHED_FIFO task and an arbitrary number
+	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
+	required to run the SCHED_FIFO task until either it blocks or
+	some other higher-priority task awakens on (or is assigned to)
+	this CPU, so there is no point in sending a scheduling-clock
+	interrupt to this CPU.
+
+	Better handling of these sorts of situations is future work.
+
+o	A reboot is required to reconfigure both adaptive idle and RCU
+	callback offloading.  Runtime reconfiguration could be provided
+	if needed, however, due to the complexity of reconfiguring RCU
+	at runtime, there would need to be an earthshakingly good reason.
+	Especially given the option of simply offloading RCU callbacks
+	from all CPUs.
+
+o	Additional configuration is required to deal with other sources
+	of OS jitter, including interrupts and system-utility tasks
+	and processes.
+
+o	Some sources of OS jitter can currently be eliminated only by
+	constraining the workload.  For example, the only way to eliminate
+	OS jitter due to global TLB shootdowns is to avoid the unmapping
+	operations (such as kernel module unload operations) that result
+	in these shootdowns.  For another example, page faults and TLB
+	misses can be reduced (and in some cases eliminated) by using
+	huge pages and by constraining the amount of memory used by the
+	application.
+
+o	At least one CPU must keep the scheduling-clock interrupt going
+	in order to support accurate timekeeping.


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 16:29 [PATCH] nohz1: Documentation Paul E. McKenney
@ 2013-03-18 18:13 ` Rob Landley
  2013-03-18 18:46   ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2013-03-18 18:13 UTC (permalink / raw)
  To: paulmck
  Cc: fweisbec, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> First attempt at documentation for adaptive ticks.
> 
> Thoughts?
> 
> 							Thanx, Paul

It's really long and repetitive? And really seems like it's kconfig
help text?

   The CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y options cause the kernel
   to (respectively) avoid sending scheduling-clock interrupts to idle
   processors, or to processors with only a single single runnable task.
   You can disable this at boot time with kernel parameter "nohz=off".

   This reduces power consumption by allowing processors to suspend more
   deeply for longer periods, and can also improve some computationally
   intensive workloads. The downside is coming out of a deeper sleep can
   reduce realtime response to wakeup events.

   This is split into two config options because the second isn't quite
   finished and won't reliably deliver posix timer interrupts, perf
   events, or do as well on CPU load balancing. The  
CONFIG_RCU_FAST_NO_HZ
   option enables a workaround to force tick delivery every 4 jiffies to
   handle RCU events. See the CONFIG_RCU_NOCB_CPU option for a different
   workaround.

> +1.	It increases the number of instructions executed on the path
> +	to and from the idle loop.

This detail didn't get mentioned in my summary.

> +5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.

I have no idea what that one is, my summary didn't mention it.

> +Another approach is to offload RCU callback processing to "rcuo"  
> kthreads
> +using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> +selected via several methods:
> +
> +1.	The "rcu_nocbs=" kernel boot parameter, which takes a  
> comma-separated
> +	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs  
> 1,
> +	3, 4, and 5.
> +
> +2.	The RCU_NOCB_CPU_ZERO=y Kconfig option, which causes CPU 0 to
> +	be offloaded.  This is the build-time equivalent of  
> "rcu_nocbs=0".
> +
> +3.	The RCU_NOCB_CPU_ALL=y Kconfig option, which causes all CPUs
> +	to be offloaded.  On a 16-CPU system, this is equivalent to
> +	"rcu_nocbs=0-15".
> +
> +The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> +never prevents offloaded CPUs from entering either dyntick-idle mode  
> or
> +adaptive-tick mode.  That said, note that it is up to userspace to
> +pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
> +scheduler will decide where to run them, which might or might not be
> +where you want them to run.

Ok, this whole chunk was just confusing and I glossed it. Why on earth  
do
you offer three wildly different ways to do the same thing? (You have  
config
options to set defaults?) I _think_ the gloss is just:

   RCU_NOCB_CPU_ALL=y moves each processor's RCU callback handling into
   its own kernel thread, which the user can pin to specific CPUs if
   desired. If you only want to move specific processors' RCU handling
   to threads, list those processors on the kernel command line ala
   "rcu_nocbs=1,3-5".

But that's a guess.

> +o	Additional configuration is required to deal with other sources
> +	of OS jitter, including interrupts and system-utility tasks
> +	and processes.
> +
> +o	Some sources of OS jitter can currently be eliminated only by
> +	constraining the workload.  For example, the only way to  
> eliminate
> +	OS jitter due to global TLB shootdowns is to avoid the unmapping
> +	operations (such as kernel module unload operations) that result
> +	in these shootdowns.  For another example, page faults and TLB
> +	misses can be reduced (and in some cases eliminated) by using
> +	huge pages and by constraining the amount of memory used by the
> +	application.

If you want to write a doc on reducing system jitter, go for it. This is
a topic transition near the end of a document.

> +o	At least one CPU must keep the scheduling-clock interrupt going
> +	in order to support accurate timekeeping.

How? You never said how to tell a processor _not_ to suppress interrupts
when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.

I take it the problem is the value in the sysenter page won't get  
updated,
so gettimeofday() will see a stale value until the CPU hog stops
suppressing interrupts? I thought the first half of NOHZ had a way of
dealing with that many moons ago? (Did sysenter cause a regression?)

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 18:13 ` Rob Landley
@ 2013-03-18 18:46   ` Frederic Weisbecker
  2013-03-18 19:59     ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-18 18:46 UTC (permalink / raw)
  To: Rob Landley
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

2013/3/18 Rob Landley <rob@landley.net>:
> On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> And really seems like it's kconfig help text?

It's more exhaustive than a Kconfig help. A Kconfig help text should
have the level of detail that describe the purpose and impact of a
feature, as well as some quick reference/pointer to the interface.

Deeper explanation which include implementation internals, finegrained
constraints, TODO list, detailed interface are better here.

>   The CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y options cause the kernel
>   to (respectively) avoid sending scheduling-clock interrupts to idle
>   processors, or to processors with only a single single runnable task.
>   You can disable this at boot time with kernel parameter "nohz=off".
>
>   This reduces power consumption by allowing processors to suspend more
>   deeply for longer periods, and can also improve some computationally
>   intensive workloads. The downside is coming out of a deeper sleep can
>   reduce realtime response to wakeup events.
>
>   This is split into two config options because the second isn't quite
>   finished and won't reliably deliver posix timer interrupts, perf
>   events, or do as well on CPU load balancing. The CONFIG_RCU_FAST_NO_HZ
>   option enables a workaround to force tick delivery every 4 jiffies to
>   handle RCU events. See the CONFIG_RCU_NOCB_CPU option for a different
>   workaround.

I really think we want to keep all the detailed explanations from
Paul's doc. What we need is not a quick reference but a very detailed
documentation.

>
>> +1.     It increases the number of instructions executed on the path
>> +       to and from the idle loop.
>
>
> This detail didn't get mentioned in my summary.

And it's an important point.

>
>
>> +5.     The LB_BIAS scheduler feature is disabled by adaptive ticks.
>
>
> I have no idea what that one is, my summary didn't mention it.

Nobody seem to know what that thing is, except probably the scheduler
warlocks :o)
All I know is that it's hard to implement without the tick. So I
disabled it in my tree.

>> +o      Some sources of OS jitter can currently be eliminated only by
>> +       constraining the workload.  For example, the only way to eliminate
>> +       OS jitter due to global TLB shootdowns is to avoid the unmapping
>> +       operations (such as kernel module unload operations) that result
>> +       in these shootdowns.  For another example, page faults and TLB
>> +       misses can be reduced (and in some cases eliminated) by using
>> +       huge pages and by constraining the amount of memory used by the
>> +       application.
>
>
> If you want to write a doc on reducing system jitter, go for it. This is
> a topic transition near the end of a document.
>
>
>> +o      At least one CPU must keep the scheduling-clock interrupt going
>> +       in order to support accurate timekeeping.
>
>
> How? You never said how to tell a processor _not_ to suppress interrupts
> when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.

Ah indeed it would be nice to point out that there must be an online
CPU outside the value range of the nohz_mask=  boot parameter.

> I take it the problem is the value in the sysenter page won't get updated,
> so gettimeofday() will see a stale value until the CPU hog stops
> suppressing interrupts? I thought the first half of NOHZ had a way of
> dealing with that many moons ago? (Did sysenter cause a regression?)

With CONFIG_NO_HZ, there is always a tick running that updates GTOD
and jiffies as long as there is non-idle CPU. If every CPUs are idle
and one suddenly wakes up, GTOD and jiffies values are caught up.

With full dynticks we have a new problem: there can be a CPU using
jiffies of GTOD without running the tick (we are not idle so there can
be such users). So there must a ticking CPU somewhere.

> Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 18:46   ` Frederic Weisbecker
@ 2013-03-18 19:59     ` Rob Landley
  2013-03-18 20:48       ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2013-03-18 19:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote:
> 2013/3/18 Rob Landley <rob@landley.net>:
> > On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> > And really seems like it's kconfig help text?
> 
> It's more exhaustive than a Kconfig help. A Kconfig help text should
> have the level of detail that describe the purpose and impact of a
> feature, as well as some quick reference/pointer to the interface.
> 
> Deeper explanation which include implementation internals, finegrained
> constraints, TODO list, detailed interface are better here.
...
> I really think we want to keep all the detailed explanations from
> Paul's doc. What we need is not a quick reference but a very detailed
> documentation.

It's much _longer_, I'm not sure it contains significantly more  
information. ("Using more power will shorten battery life" is a nice  
observation, but is it specific to your subsystem? I dunno, maybe it's  
a personal idiosyncrasy, but I tend to think that people start with use  
cases and need to find infrastructure. The other direction seems less  
interesting somehow. Like a pan with a picture on the front of what you  
might want to bake with it.)

> >> +1.     It increases the number of instructions executed on the  
> path
> >> +       to and from the idle loop.
> >
> >
> > This detail didn't get mentioned in my summary.
> 
> And it's an important point.

I mentioned increased latency coming out of idle. Increased latency  
going _to_ idle is an important point? (And pretty much _every_ kconfig  
option has ramifications at that level which realtime people tend to  
want to bench.)

Also, I mentioned this one because all the other details I deleted  
pretty much _did_ get taken into account in my summary.

> >> +5.     The LB_BIAS scheduler feature is disabled by adaptive  
> ticks.
> >
> >
> > I have no idea what that one is, my summary didn't mention it.
> 
> Nobody seem to know what that thing is, except probably the scheduler
> warlocks :o)
> All I know is that it's hard to implement without the tick. So I
> disabled it in my tree.

Is it also an important point?

> >> +o      At least one CPU must keep the scheduling-clock interrupt  
> going
> >> +       in order to support accurate timekeeping.
> >
> >
> > How? You never said how to tell a processor _not_ to suppress  
> interrupts
> > when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.
> 
> Ah indeed it would be nice to point out that there must be an online
> CPU outside the value range of the nohz_mask=  boot parameter.

There's a nohz_mask boot parameter?

> > I take it the problem is the value in the sysenter page won't get  
> updated,
> > so gettimeofday() will see a stale value until the CPU hog stops
> > suppressing interrupts? I thought the first half of NOHZ had a way  
> of
> > dealing with that many moons ago? (Did sysenter cause a regression?)
> 
> With CONFIG_NO_HZ, there is always a tick running that updates GTOD
> and jiffies as long as there is non-idle CPU. If every CPUs are idle
> and one suddenly wakes up, GTOD and jiffies values are caught up.
> 
> With full dynticks we have a new problem: there can be a CPU using
> jiffies of GTOD without running the tick (we are not idle so there can
> be such users). So there must a ticking CPU somewhere.

I.E. because gettimeofday() just checks a memory location without  
requiring a kernel transition, there's no opportunity for the kernel to  
trigger and run catch-up code.

So you'd need a timer to remove the read flag on the page containing  
the jiffies value after it was considered sufficiently stale, and then  
have the page fault update the value restore the read flag and reset  
the timer to switch it off again, and then just tell CPU-intensive code  
that wanted to take advantage of running uninterrupted not to mess with  
jiffies unless they wanted to trigger interrupts to keep it current.

By the way, I find this "full" name strange if you yourself have a list  
of more cases where ticks could be dropped, but which you haven't  
implemented yet. The system being entirely idle means unnecessary ticks  
can be dropped. The system having no scheduling decisions to make on a  
processor also means unnecessary ticks can be dropped. But there are  
two config options and they get treated as entirely different  
subsystems...

I suppose one of them having a bucket of workarounds and caveats is the  
reason? One is just "let the system behave more efficiently, only  
reason it's a config option is increased latency waking up from idle  
can annoy the realtime guys". The second is "let the system behave more  
efficiently in a way that opens up a bunch of sharp edges and requires  
extensive micromanagement". But those sharp edges seem more  
"unfinished" than really a design limitation...

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 19:59     ` Rob Landley
@ 2013-03-18 20:48       ` Frederic Weisbecker
  2013-03-18 22:25         ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-18 20:48 UTC (permalink / raw)
  To: Rob Landley
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

2013/3/18 Rob Landley <rob@landley.net>:
> On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote:
>> I really think we want to keep all the detailed explanations from
>> Paul's doc. What we need is not a quick reference but a very detailed
>> documentation.
>
>
> It's much _longer_, I'm not sure it contains significantly more information.
> ("Using more power will shorten battery life" is a nice observation, but is
> it specific to your subsystem? I dunno, maybe it's a personal idiosyncrasy,
> but I tend to think that people start with use cases and need to find
> infrastructure. The other direction seems less interesting somehow. Like a
> pan with a picture on the front of what you might want to bake with it.)

People start with a usecase, find an infrastructure and finally its
documentation that tell them the tradeoffs, constraints, possible
enhancements. Yes both directions are valuable.

Another point in favor of taking that direction: consider LB_BIAS. Do
you know what it's all about? Me neither. Too bad there is no
documentation. Obscure kernel code make kernel hacking closer to
reverse engineering. As the kernel grows in complexity, this all will
have some interesting effect in the future. And I'm just rephrasing
what people like Andrew already started to say a few years ago.

Addition of detailed documentation of core (and even less core) kernel
code is hardly arguable.

>> >> +1.     It increases the number of instructions executed on the path
>> >> +       to and from the idle loop.
>> >
>> >
>> > This detail didn't get mentioned in my summary.
>>
>> And it's an important point.
>
>
> I mentioned increased latency coming out of idle. Increased latency going
> _to_ idle is an important point? (And pretty much _every_ kconfig option has
> ramifications at that level which realtime people tend to want to bench.)

Yeah, increased latency in going to idle has consequences in term of
energy saving, latency and throughput.

>
> Also, I mentioned this one because all the other details I deleted pretty
> much _did_ get taken into account in my summary.

Certainly not with the same level of detail.

>
>> >> +5.     The LB_BIAS scheduler feature is disabled by adaptive ticks.
>> >
>> >
>> > I have no idea what that one is, my summary didn't mention it.
>>
>> Nobody seem to know what that thing is, except probably the scheduler
>> warlocks :o)
>> All I know is that it's hard to implement without the tick. So I
>> disabled it in my tree.
>
>
> Is it also an important point?

Yes, users must be informed about limitations.

>
>> >> +o      At least one CPU must keep the scheduling-clock interrupt going
>> >> +       in order to support accurate timekeeping.
>> >
>> >
>> > How? You never said how to tell a processor _not_ to suppress interrupts
>> > when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.
>>
>> Ah indeed it would be nice to point out that there must be an online
>> CPU outside the value range of the nohz_mask=  boot parameter.
>
>
> There's a nohz_mask boot parameter?

Yeah we need to document that too.

>
>> > I take it the problem is the value in the sysenter page won't get
>> > updated,
>> > so gettimeofday() will see a stale value until the CPU hog stops
>> > suppressing interrupts? I thought the first half of NOHZ had a way of
>> > dealing with that many moons ago? (Did sysenter cause a regression?)
>>
>> With CONFIG_NO_HZ, there is always a tick running that updates GTOD
>> and jiffies as long as there is non-idle CPU. If every CPUs are idle
>> and one suddenly wakes up, GTOD and jiffies values are caught up.
>>
>> With full dynticks we have a new problem: there can be a CPU using
>> jiffies of GTOD without running the tick (we are not idle so there can
>> be such users). So there must a ticking CPU somewhere.
>
>
> I.E. because gettimeofday() just checks a memory location without requiring
> a kernel transition, there's no opportunity for the kernel to trigger and
> run catch-up code.

Isn't that value updated by the kernel?

>
> So you'd need a timer to remove the read flag on the page containing the
> jiffies value after it was considered sufficiently stale, and then have the
> page fault update the value restore the read flag and reset the timer to
> switch it off again, and then just tell CPU-intensive code that wanted to
> take advantage of running uninterrupted not to mess with jiffies unless they
> wanted to trigger interrupts to keep it current.

I fear making the jiffies read faultable is not something we can
afford. That means there will be several places where we couldn't use
it. And there would be some performance issues. Also such a timer
defeats the initial purpose of reducing timers interrupts.

GTOD is another issue but page faults would be a performance problem
as well. And timer too.

>
> By the way, I find this "full" name strange if you yourself have a list of
> more cases where ticks could be dropped, but which you haven't implemented
> yet.

Yeah. Full dynticks works because it suggest tick periods are
dynamics. But full tickless or full nohz is not true. Some renaming
are on the work anyway.

> The system being entirely idle means unnecessary ticks can be dropped.
> The system having no scheduling decisions to make on a processor also means
> unnecessary ticks can be dropped. But there are two config options and they
> get treated as entirely different subsystems...

No they share a lot of common infrastructure. Also full dynticks
depends on dynticks-idle.

> I suppose one of them having a bucket of workarounds and caveats is the
> reason? One is just "let the system behave more efficiently, only reason
> it's a config option is increased latency waking up from idle can annoy the
> realtime guys". The second is "let the system behave more efficiently in a
> way that opens up a bunch of sharp edges and requires extensive
> micromanagement". But those sharp edges seem more "unfinished" than really a
> design limitation...

The reason of having a seperate Kconfig for the new feature is because
it adds some overhead even in the off-case.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 20:48       ` Frederic Weisbecker
@ 2013-03-18 22:25         ` Paul E. McKenney
  2013-03-20 23:32           ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-18 22:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rob Landley, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

On Mon, Mar 18, 2013 at 09:48:31PM +0100, Frederic Weisbecker wrote:
> 2013/3/18 Rob Landley <rob@landley.net>:
> > On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote:

[ . . . ]

> >> >> +o      At least one CPU must keep the scheduling-clock interrupt going
> >> >> +       in order to support accurate timekeeping.
> >> >
> >> >
> >> > How? You never said how to tell a processor _not_ to suppress interrupts
> >> > when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.
> >>
> >> Ah indeed it would be nice to point out that there must be an online
> >> CPU outside the value range of the nohz_mask=  boot parameter.
> >
> >
> > There's a nohz_mask boot parameter?
> 
> Yeah we need to document that too.

Good catch both of you, fixed!

[ . . . ]

> > The system being entirely idle means unnecessary ticks can be dropped.
> > The system having no scheduling decisions to make on a processor also means
> > unnecessary ticks can be dropped. But there are two config options and they
> > get treated as entirely different subsystems...
> 
> No they share a lot of common infrastructure. Also full dynticks
> depends on dynticks-idle.

Good point, added this.

> > I suppose one of them having a bucket of workarounds and caveats is the
> > reason? One is just "let the system behave more efficiently, only reason
> > it's a config option is increased latency waking up from idle can annoy the
> > realtime guys". The second is "let the system behave more efficiently in a
> > way that opens up a bunch of sharp edges and requires extensive
> > micromanagement". But those sharp edges seem more "unfinished" than really a
> > design limitation...
> 
> The reason of having a seperate Kconfig for the new feature is because
> it adds some overhead even in the off-case.

Good point, added words stating that all of the costs of CONFIG_NO_HZ
are also incurred by CONFIG_NO_HZ_FULL.

Rob also noted that the presentation of the NOCB Kconfig options and boot
parameters was confusing, so I reworked this to put the Kconfig options
first (build then boot!) and to indicate that the RCU_NOCB_CPU_NONE,
RCU_NOCB_CPU_ZERO, and RCU_NOCB_CPU_ALL options are mutually exclusive.

Rob also noted that the current draft is wordy, which I will address
in a later draft.

							Thanx, Paul

------------------------------------------------------------------------

		NO_HZ: Reducing Scheduling-Clock Ticks


This document covers Kconfig options and boot parameters used to reduce
the number of scheduling-clock interrupts.  These reductions can be
helpful in improving energy efficiency and in reducing "OS jitter",
the latter being very important for some types of computationally
intensive high-performance computing (HPC) applications and for real-time
applications.

Within the Linux kernel, there are two major aspects of scheduling-clock
interrupt reduction:

1.	Idle CPUs.

2.	CPUs having only one runnable task.

These two cases are described in the following sections.


IDLE CPUs

If a CPU is idle, there is little point in sending it a scheduling-clock
interrupt.  After all, the primary purpose of a scheduling-clock interrupt
is to force a busy CPU to shift its attention among multiple duties,
but an idle CPU by definition has no duties to shift its attention among.

The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
scheduling-clock interrupts to idle CPUs, which is critically important
both to battery-powered devices and to highly virtualized mainframes.
A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
battery very quickly, easily 2-3x as fast as would the same device running
a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could
easily find that half of its CPU time was consumed by scheduling-clock
interrupts.  In these situations, there is therefore strong motivation
to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
dyntick-idle mode is not free:

1.	It increases the number of instructions executed on the path
	to and from the idle loop.

2.	Many architectures will place dyntick-idle CPUs into deep sleep
	states, which further degrades from-idle transition latencies.

Therefore, systems with aggressive real-time response constraints
often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
transition latencies.

An idle CPU that is not receiving scheduling-clock interrupts is said to
be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
tickless".  The remainder of this document will use "dyntick-idle mode".

There is also a boot parameter "nohz=" that can be used to disable
dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
dyntick-idle mode.


CPUs WITH ONLY ONE RUNNABLE TASK

If a CPU has only one runnable task, there is again little point in
sending it a scheduling-clock interrupt.  Recall that the primary
purpose of a scheduling-clock interrupt is to force a busy CPU to
shift its attention among many things requiring its attention -- and
there is nowhere else for a CPU with but one runnable task to shift its
attention to.

The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
sending scheduling-clock interrupts to CPUs with a single runnable task.
This is important for applications with aggressive real-time response
constraints because it allows them to improve their worst-case response
times by the maximum duration of a scheduling-clock interrupt.  It is also
important for computationally intensive iterative workloads with short
iterations:  If any CPU is delayed during a given iteration, all the
other CPUs will be forced to wait idle while the delayed CPU finished.
Thus, the delay is multiplied by one less than the number of CPUs.
In these situations, there is again strong motivation to avoid sending
scheduling-clock interrupts to CPUs that have but one runnable task that
is executing in user mode.

The "full_nohz=" boot parameter specifies which CPUs are to be
adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
6, 7, and 8 are to be adaptive-ticks CPUs.  By default, no CPUs will
be adaptive-ticks CPUs.  Not that you are prohibited from marking all
of the CPUs as adaptive-tick CPUs:  At least one non-adaptive-tick CPU
must remain online to handle timekeeping tasks in order to ensure that
gettimeofday() returns sane values on adaptive-tick CPUs.

Note that if a given CPU is in adaptive-ticks mode while executing in
user mode, transitioning to kernel mode does not automatically force
that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
mode only if needed, for example, if that CPU enqueues an RCU callback.

Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
not come for free:

1.	CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run
	adaptive ticks without also running dyntick idle.  This dependency
	of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the
	implementation.  Therefore, all of the costs of CONFIG_NO_HZ
	are also incurred by CONFIG_NO_HZ_FULL.

2.	The user/kernel transitions are slightly more expensive due
	to the need to inform kernel subsystems (such as RCU) about
	the change in mode.

3.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
	not at all) because they currently rely on scheduling-tick
	interrupts.  This will likely be fixed in one of two ways: (1)
	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
	to cause the POSIX CPU timer to fire properly.

4.	If there are more perf events pending than the hardware can
	accommodate, they are normally round-robined so as to collect
	all of them over time.  Adaptive-tick mode may prevent this
	round-robining from happening.  This will likely be fixed by
	preventing CPUs with large numbers of perf events pending from
	entering adaptive-tick mode.

5.	Scheduler statistics for adaptive-idle CPUs may be computed
	slightly differently than those for non-adaptive-idle CPUs.
	This may in turn perturb load-balancing of real-time tasks.

6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.

Although improvements are expected over time, adaptive ticks is quite
useful for many types of real-time and compute-intensive applications.
However, the drawbacks listed above mean that adaptive ticks should not
be enabled by default across the board at the current time.


RCU IMPLICATIONS

There are situations in which idle CPUs cannot be permitted to
enter either dyntick-idle mode or adaptive-tick mode, the most
familiar being the case where that CPU has RCU callbacks pending.

The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
timer will awaken these CPUs every four jiffies in order to ensure that
the RCU callbacks are processed in a timely fashion.

Another approach is to offload RCU callback processing to "rcuo" kthreads
using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
selected via several methods:

1.	One of three mutually exclusive Kconfig options specify a
	build-time default for the CPUs to offload:

	a.	The RCU_NOCB_CPU_NONE=y Kconfig option results in
		no CPUs being offloaded.

	b.	The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to
		be offloaded.

	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
		to be offloaded.

2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
	3, 4, and 5.  The specified CPUs will be offloaded in addition
	to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or
	RCU_NOCB_CPU_ALL.

The offloaded CPUs never have RCU callbacks queued, and therefore RCU
never prevents offloaded CPUs from entering either dyntick-idle mode or
adaptive-tick mode.  That said, note that it is up to userspace to
pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
scheduler will decide where to run them, which might or might not be
where you want them to run.


KNOWN ISSUES

o	Dyntick-idle slows transitions to and from idle slightly.
	In practice, this has not been a problem except for the most
	aggressive real-time workloads, which have the option of disabling
	dyntick-idle mode, an option that most of them take.

o	Adaptive-ticks slows user/kernel transitions slightly.
	This is not expected to be a problem for computational-intensive
	workloads, which have few such transitions.  Careful benchmarking
	will be required to determine whether or not other workloads
	are significantly affected by this effect.

o	Adaptive-ticks does not do anything unless there is only one
	runnable task for a given CPU, even though there are a number
	of other situations where the scheduling-clock tick is not
	needed.  To give but one example, consider a CPU that has one
	runnable high-priority SCHED_FIFO task and an arbitrary number
	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
	required to run the SCHED_FIFO task until either it blocks or
	some other higher-priority task awakens on (or is assigned to)
	this CPU, so there is no point in sending a scheduling-clock
	interrupt to this CPU.

	Better handling of these sorts of situations is future work.

o	A reboot is required to reconfigure both adaptive idle and RCU
	callback offloading.  Runtime reconfiguration could be provided
	if needed, however, due to the complexity of reconfiguring RCU
	at runtime, there would need to be an earthshakingly good reason.
	Especially given the option of simply offloading RCU callbacks
	from all CPUs.

o	Additional configuration is required to deal with other sources
	of OS jitter, including interrupts and system-utility tasks
	and processes.  This configuration normally involves binding
	interrupts and tasks to particular CPUs.

o	Some sources of OS jitter can currently be eliminated only by
	constraining the workload.  For example, the only way to eliminate
	OS jitter due to global TLB shootdowns is to avoid the unmapping
	operations (such as kernel module unload operations) that result
	in these shootdowns.  For another example, page faults and TLB
	misses can be reduced (and in some cases eliminated) by using
	huge pages and by constraining the amount of memory used by the
	application.

o	At least one CPU must keep the scheduling-clock interrupt going
	in order to support accurate timekeeping.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-18 22:25         ` Paul E. McKenney
@ 2013-03-20 23:32           ` Steven Rostedt
  2013-03-20 23:55             ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2013-03-20 23:32 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong,
	khilman, geoff, tglx

On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote:

> ------------------------------------------------------------------------
> 
> 		NO_HZ: Reducing Scheduling-Clock Ticks
> 
> 
> This document covers Kconfig options and boot parameters used to reduce
> the number of scheduling-clock interrupts.  These reductions can be
> helpful in improving energy efficiency and in reducing "OS jitter",
> the latter being very important for some types of computationally
> intensive high-performance computing (HPC) applications and for real-time
> applications.
> 
> Within the Linux kernel, there are two major aspects of scheduling-clock
> interrupt reduction:
> 
> 1.	Idle CPUs.
> 
> 2.	CPUs having only one runnable task.
> 
> These two cases are described in the following sections.
> 
> 
> IDLE CPUs
> 
> If a CPU is idle, there is little point in sending it a scheduling-clock
> interrupt.  After all, the primary purpose of a scheduling-clock interrupt
> is to force a busy CPU to shift its attention among multiple duties,
> but an idle CPU by definition has no duties to shift its attention among.
> 
> The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
> scheduling-clock interrupts to idle CPUs, which is critically important
> both to battery-powered devices and to highly virtualized mainframes.
> A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
> battery very quickly, easily 2-3x as fast as would the same device running
> a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could

So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster
than the
same device running CONFIG_NO_HZ=n ?

:-)

> easily find that half of its CPU time was consumed by scheduling-clock
> interrupts.  In these situations, there is therefore strong motivation
> to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
> dyntick-idle mode is not free:
> 
> 1.	It increases the number of instructions executed on the path
> 	to and from the idle loop.
> 
> 2.	Many architectures will place dyntick-idle CPUs into deep sleep
> 	states, which further degrades from-idle transition latencies.
> 
> Therefore, systems with aggressive real-time response constraints
> often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
> transition latencies.
> 
> An idle CPU that is not receiving scheduling-clock interrupts is said to
> be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
> tickless".  The remainder of this document will use "dyntick-idle mode".
> 
> There is also a boot parameter "nohz=" that can be used to disable
> dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
> By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
> dyntick-idle mode.
> 
> 
> CPUs WITH ONLY ONE RUNNABLE TASK
> 
> If a CPU has only one runnable task, there is again little point in
> sending it a scheduling-clock interrupt.  Recall that the primary
> purpose of a scheduling-clock interrupt is to force a busy CPU to
> shift its attention among many things requiring its attention -- and
> there is nowhere else for a CPU with but one runnable task to shift its
> attention to.
> 
> The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
> sending scheduling-clock interrupts to CPUs with a single runnable task.
> This is important for applications with aggressive real-time response
> constraints because it allows them to improve their worst-case response
> times by the maximum duration of a scheduling-clock interrupt.  It is also
> important for computationally intensive iterative workloads with short
> iterations:  If any CPU is delayed during a given iteration, all the
> other CPUs will be forced to wait idle while the delayed CPU finished.
> Thus, the delay is multiplied by one less than the number of CPUs.
> In these situations, there is again strong motivation to avoid sending
> scheduling-clock interrupts to CPUs that have but one runnable task that
> is executing in user mode.
> 
> The "full_nohz=" boot parameter specifies which CPUs are to be
> adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,

This is the first time you mention "adaptive-ticks". Probably should
define it before just using it, even though one should be able to figure
out what adaptive-ticks are, it does throw in a wrench when reading this
if you have no idea what an "adaptive-tick" is.


> 6, 7, and 8 are to be adaptive-ticks CPUs.  By default, no CPUs will
> be adaptive-ticks CPUs.  Not that you are prohibited from marking all
> of the CPUs as adaptive-tick CPUs:  At least one non-adaptive-tick CPU
> must remain online to handle timekeeping tasks in order to ensure that
> gettimeofday() returns sane values on adaptive-tick CPUs.
> 
> Note that if a given CPU is in adaptive-ticks mode while executing in
> user mode, transitioning to kernel mode does not automatically force
> that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
> mode only if needed, for example, if that CPU enqueues an RCU callback.
> 
> Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
> not come for free:
> 
> 1.	CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run
> 	adaptive ticks without also running dyntick idle.  This dependency
> 	of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the
> 	implementation.  Therefore, all of the costs of CONFIG_NO_HZ
> 	are also incurred by CONFIG_NO_HZ_FULL.

Not a comment on this document, but on the implementation. As idle NO_HZ
can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
can't have both (no idle but full). As we only care about not letting
the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
something that keeps idle from going into nohz mode. Hmm, I think there
may be an option to keep the CPU from going too deep into sleep. That
may be a better approach.


> 
> 2.	The user/kernel transitions are slightly more expensive due
> 	to the need to inform kernel subsystems (such as RCU) about
> 	the change in mode.
> 
> 3.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
> 	not at all) because they currently rely on scheduling-tick
> 	interrupts.  This will likely be fixed in one of two ways: (1)
> 	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
> 	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
> 	to cause the POSIX CPU timer to fire properly.
> 
> 4.	If there are more perf events pending than the hardware can
> 	accommodate, they are normally round-robined so as to collect
> 	all of them over time.  Adaptive-tick mode may prevent this
> 	round-robining from happening.  This will likely be fixed by
> 	preventing CPUs with large numbers of perf events pending from
> 	entering adaptive-tick mode.
> 
> 5.	Scheduler statistics for adaptive-idle CPUs may be computed
> 	slightly differently than those for non-adaptive-idle CPUs.
> 	This may in turn perturb load-balancing of real-time tasks.
> 
> 6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
> 
> Although improvements are expected over time, adaptive ticks is quite
> useful for many types of real-time and compute-intensive applications.
> However, the drawbacks listed above mean that adaptive ticks should not
> be enabled by default across the board at the current time.
> 
> 
> RCU IMPLICATIONS
> 
> There are situations in which idle CPUs cannot be permitted to
> enter either dyntick-idle mode or adaptive-tick mode, the most
> familiar being the case where that CPU has RCU callbacks pending.
> 
> The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
> CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
> timer will awaken these CPUs every four jiffies in order to ensure that
> the RCU callbacks are processed in a timely fashion.
> 
> Another approach is to offload RCU callback processing to "rcuo" kthreads
> using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> selected via several methods:
> 
> 1.	One of three mutually exclusive Kconfig options specify a
> 	build-time default for the CPUs to offload:
> 
> 	a.	The RCU_NOCB_CPU_NONE=y Kconfig option results in
> 		no CPUs being offloaded.
> 
> 	b.	The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to
> 		be offloaded.
> 
> 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> 		to be offloaded.

All CPUs don't have their RCU call backs on them? I'm a bit confused by
this. Or is it that the scheduler picks one CPU to do call backs? Does
this mean that to use rcu_ncbs= to be the only deciding factor, you
select RCU_NCB_CPU_NONE?

I think this needs to be explained better.

> 
> 2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
> 	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
> 	3, 4, and 5.  The specified CPUs will be offloaded in addition
> 	to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or
> 	RCU_NOCB_CPU_ALL.
> 
> The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> never prevents offloaded CPUs from entering either dyntick-idle mode or
> adaptive-tick mode.  That said, note that it is up to userspace to
> pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
> scheduler will decide where to run them, which might or might not be
> where you want them to run.
> 
> 
> KNOWN ISSUES
> 
> o	Dyntick-idle slows transitions to and from idle slightly.
> 	In practice, this has not been a problem except for the most
> 	aggressive real-time workloads, which have the option of disabling
> 	dyntick-idle mode, an option that most of them take.
> 
> o	Adaptive-ticks slows user/kernel transitions slightly.
> 	This is not expected to be a problem for computational-intensive
> 	workloads, which have few such transitions.  Careful benchmarking
> 	will be required to determine whether or not other workloads
> 	are significantly affected by this effect.

It should be mentioned that only CPUs that are in adaptive-tick mode
have this issue. Other CPUs are still using the tick based accounting,
right?

> 
> o	Adaptive-ticks does not do anything unless there is only one
> 	runnable task for a given CPU, even though there are a number
> 	of other situations where the scheduling-clock tick is not
> 	needed.  To give but one example, consider a CPU that has one
> 	runnable high-priority SCHED_FIFO task and an arbitrary number
> 	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
> 	required to run the SCHED_FIFO task until either it blocks or
> 	some other higher-priority task awakens on (or is assigned to)
> 	this CPU, so there is no point in sending a scheduling-clock
> 	interrupt to this CPU.

You should point out that the example does not enable adaptive-ticks.
That point is hinted at, but not really expressed. That is, perhaps end
the paragraph with: 

"Even though the SCHED_FIFO task is the only task running, because the
SCHED_OTHER tasks are queued on the CPU, it currently will not enter
adaptive tick mode."


> 
> 	Better handling of these sorts of situations is future work.
> 
> o	A reboot is required to reconfigure both adaptive idle and RCU
> 	callback offloading.  Runtime reconfiguration could be provided
> 	if needed, however, due to the complexity of reconfiguring RCU
> 	at runtime, there would need to be an earthshakingly good reason.
> 	Especially given the option of simply offloading RCU callbacks
> 	from all CPUs.

When you enable for all CPUs, how do you tell what CPUs you don't want
the scheduler to pick for off loading? I mean, if you pick all CPUs, can
you at run time pick which ones should always off load and which ones
shouldn't?

> 
> o	Additional configuration is required to deal with other sources
> 	of OS jitter, including interrupts and system-utility tasks
> 	and processes.  This configuration normally involves binding
> 	interrupts and tasks to particular CPUs.
> 
> o	Some sources of OS jitter can currently be eliminated only by
> 	constraining the workload.  For example, the only way to eliminate
> 	OS jitter due to global TLB shootdowns is to avoid the unmapping
> 	operations (such as kernel module unload operations) that result
> 	in these shootdowns.  For another example, page faults and TLB
> 	misses can be reduced (and in some cases eliminated) by using
> 	huge pages and by constraining the amount of memory used by the
> 	application.
> 
> o	At least one CPU must keep the scheduling-clock interrupt going
> 	in order to support accurate timekeeping.

Thanks for writing this up Paul!

-- Steve



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-20 23:32           ` Steven Rostedt
@ 2013-03-20 23:55             ` Paul E. McKenney
  2013-03-21  0:27               ` Steven Rostedt
  2013-03-21 16:08               ` Christoph Lameter
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-20 23:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong,
	khilman, geoff, tglx

On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote:
> On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote:
> 
> > ------------------------------------------------------------------------
> > 
> > 		NO_HZ: Reducing Scheduling-Clock Ticks
> > 
> > 
> > This document covers Kconfig options and boot parameters used to reduce
> > the number of scheduling-clock interrupts.  These reductions can be
> > helpful in improving energy efficiency and in reducing "OS jitter",
> > the latter being very important for some types of computationally
> > intensive high-performance computing (HPC) applications and for real-time
> > applications.
> > 
> > Within the Linux kernel, there are two major aspects of scheduling-clock
> > interrupt reduction:
> > 
> > 1.	Idle CPUs.
> > 
> > 2.	CPUs having only one runnable task.
> > 
> > These two cases are described in the following sections.
> > 
> > 
> > IDLE CPUs
> > 
> > If a CPU is idle, there is little point in sending it a scheduling-clock
> > interrupt.  After all, the primary purpose of a scheduling-clock interrupt
> > is to force a busy CPU to shift its attention among multiple duties,
> > but an idle CPU by definition has no duties to shift its attention among.
> > 
> > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
> > scheduling-clock interrupts to idle CPUs, which is critically important
> > both to battery-powered devices and to highly virtualized mainframes.
> > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
> > battery very quickly, easily 2-3x as fast as would the same device running
> > a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could
> 
> So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster
> than the
> same device running CONFIG_NO_HZ=n ?
> 
> :-)

Good catch, fixed!

That said, there are two solutions as stated -- either the battery drains
immediately, or it takes infinitely long to drain.  ;-)

> > easily find that half of its CPU time was consumed by scheduling-clock
> > interrupts.  In these situations, there is therefore strong motivation
> > to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
> > dyntick-idle mode is not free:
> > 
> > 1.	It increases the number of instructions executed on the path
> > 	to and from the idle loop.
> > 
> > 2.	Many architectures will place dyntick-idle CPUs into deep sleep
> > 	states, which further degrades from-idle transition latencies.
> > 
> > Therefore, systems with aggressive real-time response constraints
> > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
> > transition latencies.
> > 
> > An idle CPU that is not receiving scheduling-clock interrupts is said to
> > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
> > tickless".  The remainder of this document will use "dyntick-idle mode".
> > 
> > There is also a boot parameter "nohz=" that can be used to disable
> > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
> > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
> > dyntick-idle mode.
> > 
> > 
> > CPUs WITH ONLY ONE RUNNABLE TASK
> > 
> > If a CPU has only one runnable task, there is again little point in
> > sending it a scheduling-clock interrupt.  Recall that the primary
> > purpose of a scheduling-clock interrupt is to force a busy CPU to
> > shift its attention among many things requiring its attention -- and
> > there is nowhere else for a CPU with but one runnable task to shift its
> > attention to.
> > 
> > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
> > sending scheduling-clock interrupts to CPUs with a single runnable task.
> > This is important for applications with aggressive real-time response
> > constraints because it allows them to improve their worst-case response
> > times by the maximum duration of a scheduling-clock interrupt.  It is also
> > important for computationally intensive iterative workloads with short
> > iterations:  If any CPU is delayed during a given iteration, all the
> > other CPUs will be forced to wait idle while the delayed CPU finished.
> > Thus, the delay is multiplied by one less than the number of CPUs.
> > In these situations, there is again strong motivation to avoid sending
> > scheduling-clock interrupts to CPUs that have but one runnable task that
> > is executing in user mode.
> > 
> > The "full_nohz=" boot parameter specifies which CPUs are to be
> > adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
> 
> This is the first time you mention "adaptive-ticks". Probably should
> define it before just using it, even though one should be able to figure
> out what adaptive-ticks are, it does throw in a wrench when reading this
> if you have no idea what an "adaptive-tick" is.

Good point, changed the first sentence of this paragraph to read:

	The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to
	avoid sending scheduling-clock interrupts to CPUs with a single
	runnable task, and such CPUs are said to be "adaptive-ticks CPUs".

> > 6, 7, and 8 are to be adaptive-ticks CPUs.  By default, no CPUs will
> > be adaptive-ticks CPUs.  Not that you are prohibited from marking all
> > of the CPUs as adaptive-tick CPUs:  At least one non-adaptive-tick CPU
> > must remain online to handle timekeeping tasks in order to ensure that
> > gettimeofday() returns sane values on adaptive-tick CPUs.
> > 
> > Note that if a given CPU is in adaptive-ticks mode while executing in
> > user mode, transitioning to kernel mode does not automatically force
> > that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
> > mode only if needed, for example, if that CPU enqueues an RCU callback.
> > 
> > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
> > not come for free:
> > 
> > 1.	CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run
> > 	adaptive ticks without also running dyntick idle.  This dependency
> > 	of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the
> > 	implementation.  Therefore, all of the costs of CONFIG_NO_HZ
> > 	are also incurred by CONFIG_NO_HZ_FULL.
> 
> Not a comment on this document, but on the implementation. As idle NO_HZ
> can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
> can't have both (no idle but full). As we only care about not letting
> the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
> something that keeps idle from going into nohz mode. Hmm, I think there
> may be an option to keep the CPU from going too deep into sleep. That
> may be a better approach.

Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and
idle=poll do the trick in this case?

If so, I do need to document it.

> > 2.	The user/kernel transitions are slightly more expensive due
> > 	to the need to inform kernel subsystems (such as RCU) about
> > 	the change in mode.
> > 
> > 3.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
> > 	not at all) because they currently rely on scheduling-tick
> > 	interrupts.  This will likely be fixed in one of two ways: (1)
> > 	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
> > 	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
> > 	to cause the POSIX CPU timer to fire properly.
> > 
> > 4.	If there are more perf events pending than the hardware can
> > 	accommodate, they are normally round-robined so as to collect
> > 	all of them over time.  Adaptive-tick mode may prevent this
> > 	round-robining from happening.  This will likely be fixed by
> > 	preventing CPUs with large numbers of perf events pending from
> > 	entering adaptive-tick mode.
> > 
> > 5.	Scheduler statistics for adaptive-idle CPUs may be computed
> > 	slightly differently than those for non-adaptive-idle CPUs.
> > 	This may in turn perturb load-balancing of real-time tasks.
> > 
> > 6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
> > 
> > Although improvements are expected over time, adaptive ticks is quite
> > useful for many types of real-time and compute-intensive applications.
> > However, the drawbacks listed above mean that adaptive ticks should not
> > be enabled by default across the board at the current time.
> > 
> > 
> > RCU IMPLICATIONS
> > 
> > There are situations in which idle CPUs cannot be permitted to
> > enter either dyntick-idle mode or adaptive-tick mode, the most
> > familiar being the case where that CPU has RCU callbacks pending.
> > 
> > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
> > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
> > timer will awaken these CPUs every four jiffies in order to ensure that
> > the RCU callbacks are processed in a timely fashion.
> > 
> > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> > selected via several methods:
> > 
> > 1.	One of three mutually exclusive Kconfig options specify a
> > 	build-time default for the CPUs to offload:
> > 
> > 	a.	The RCU_NOCB_CPU_NONE=y Kconfig option results in
> > 		no CPUs being offloaded.
> > 
> > 	b.	The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to
> > 		be offloaded.
> > 
> > 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> > 		to be offloaded.
> 
> All CPUs don't have their RCU call backs on them? I'm a bit confused by
> this. Or is it that the scheduler picks one CPU to do call backs? Does
> this mean that to use rcu_ncbs= to be the only deciding factor, you
> select RCU_NCB_CPU_NONE?
> 
> I think this needs to be explained better.

Does this help?

	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
		to be offloaded.  Note that the callbacks will be
		offloaded to "rcuo" kthreads, and that those kthreads
		will in fact run on some CPU.  However, this approach
		gives fine-grained control on exactly which CPUs the
		callbacks run on, the priority that they run at (including
		the default of SCHED_OTHER), and it further allows
		this control to be varied dynamically at runtime.

> > 2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
> > 	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
> > 	3, 4, and 5.  The specified CPUs will be offloaded in addition
> > 	to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or
> > 	RCU_NOCB_CPU_ALL.
> > 
> > The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> > never prevents offloaded CPUs from entering either dyntick-idle mode or
> > adaptive-tick mode.  That said, note that it is up to userspace to
> > pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
> > scheduler will decide where to run them, which might or might not be
> > where you want them to run.
> > 
> > 
> > KNOWN ISSUES
> > 
> > o	Dyntick-idle slows transitions to and from idle slightly.
> > 	In practice, this has not been a problem except for the most
> > 	aggressive real-time workloads, which have the option of disabling
> > 	dyntick-idle mode, an option that most of them take.
> > 
> > o	Adaptive-ticks slows user/kernel transitions slightly.
> > 	This is not expected to be a problem for computational-intensive
> > 	workloads, which have few such transitions.  Careful benchmarking
> > 	will be required to determine whether or not other workloads
> > 	are significantly affected by this effect.
> 
> It should be mentioned that only CPUs that are in adaptive-tick mode
> have this issue. Other CPUs are still using the tick based accounting,
> right?
> 
> > 
> > o	Adaptive-ticks does not do anything unless there is only one
> > 	runnable task for a given CPU, even though there are a number
> > 	of other situations where the scheduling-clock tick is not
> > 	needed.  To give but one example, consider a CPU that has one
> > 	runnable high-priority SCHED_FIFO task and an arbitrary number
> > 	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
> > 	required to run the SCHED_FIFO task until either it blocks or
> > 	some other higher-priority task awakens on (or is assigned to)
> > 	this CPU, so there is no point in sending a scheduling-clock
> > 	interrupt to this CPU.
> 
> You should point out that the example does not enable adaptive-ticks.
> That point is hinted at, but not really expressed. That is, perhaps end
> the paragraph with: 
> 
> "Even though the SCHED_FIFO task is the only task running, because the
> SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> adaptive tick mode."

Again, good point!

How about adding the following sentence at the end of this paragraph.

	However, the current implementation prohibits CPU with a single
	runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
	tasks from entering adaptive-ticks mode, even though it would
	be correct to allow it to do so.

> > 	Better handling of these sorts of situations is future work.
> > 
> > o	A reboot is required to reconfigure both adaptive idle and RCU
> > 	callback offloading.  Runtime reconfiguration could be provided
> > 	if needed, however, due to the complexity of reconfiguring RCU
> > 	at runtime, there would need to be an earthshakingly good reason.
> > 	Especially given the option of simply offloading RCU callbacks
> > 	from all CPUs.
> 
> When you enable for all CPUs, how do you tell what CPUs you don't want
> the scheduler to pick for off loading? I mean, if you pick all CPUs, can
> you at run time pick which ones should always off load and which ones
> shouldn't?

I must defer to Frederic on this one.

> > o	Additional configuration is required to deal with other sources
> > 	of OS jitter, including interrupts and system-utility tasks
> > 	and processes.  This configuration normally involves binding
> > 	interrupts and tasks to particular CPUs.
> > 
> > o	Some sources of OS jitter can currently be eliminated only by
> > 	constraining the workload.  For example, the only way to eliminate
> > 	OS jitter due to global TLB shootdowns is to avoid the unmapping
> > 	operations (such as kernel module unload operations) that result
> > 	in these shootdowns.  For another example, page faults and TLB
> > 	misses can be reduced (and in some cases eliminated) by using
> > 	huge pages and by constraining the amount of memory used by the
> > 	application.
> > 
> > o	At least one CPU must keep the scheduling-clock interrupt going
> > 	in order to support accurate timekeeping.
> 
> Thanks for writing this up Paul!

And to many other people, including yourself, for doing the actual work!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-20 23:55             ` Paul E. McKenney
@ 2013-03-21  0:27               ` Steven Rostedt
  2013-03-21  2:22                 ` Paul E. McKenney
                                   ` (2 more replies)
  2013-03-21 16:08               ` Christoph Lameter
  1 sibling, 3 replies; 43+ messages in thread
From: Steven Rostedt @ 2013-03-21  0:27 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong,
	khilman, geoff, tglx, Arjan van de Ven

[ Added Arjan in case he as anything to add about the idle=poll below ]


On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote:
> On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote:
> > On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote:
> > 
> > > ------------------------------------------------------------------------
> > > 
> > > 		NO_HZ: Reducing Scheduling-Clock Ticks
> > > 
> > > 
> > > This document covers Kconfig options and boot parameters used to reduce
> > > the number of scheduling-clock interrupts.  These reductions can be
> > > helpful in improving energy efficiency and in reducing "OS jitter",
> > > the latter being very important for some types of computationally
> > > intensive high-performance computing (HPC) applications and for real-time
> > > applications.
> > > 
> > > Within the Linux kernel, there are two major aspects of scheduling-clock
> > > interrupt reduction:
> > > 
> > > 1.	Idle CPUs.
> > > 
> > > 2.	CPUs having only one runnable task.
> > > 
> > > These two cases are described in the following sections.
> > > 
> > > 
> > > IDLE CPUs
> > > 
> > > If a CPU is idle, there is little point in sending it a scheduling-clock
> > > interrupt.  After all, the primary purpose of a scheduling-clock interrupt
> > > is to force a busy CPU to shift its attention among multiple duties,
> > > but an idle CPU by definition has no duties to shift its attention among.
> > > 
> > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
> > > scheduling-clock interrupts to idle CPUs, which is critically important
> > > both to battery-powered devices and to highly virtualized mainframes.
> > > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
> > > battery very quickly, easily 2-3x as fast as would the same device running
> > > a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could
> > 
> > So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster
> > than the

Hmm, Evolution had the above on one line in the composer, but it seems
to be chopping it when it sends. I recently did an update on this box,
which screwed up the formatting of what the composer does and what it
sends out :-/

I hit a hard return to have CONFIG_NO_HZ = 0 be lined up correctly
(since I already knew that evolution screwed this up)

> > same device running CONFIG_NO_HZ=n ?
> > 
> > :-)
> 
> Good catch, fixed!
> 
> That said, there are two solutions as stated -- either the battery drains
> immediately, or it takes infinitely long to drain.  ;-)

A typical paulmck response ;-)

> 
> > > easily find that half of its CPU time was consumed by scheduling-clock
> > > interrupts.  In these situations, there is therefore strong motivation
> > > to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
> > > dyntick-idle mode is not free:
> > > 
> > > 1.	It increases the number of instructions executed on the path
> > > 	to and from the idle loop.
> > > 
> > > 2.	Many architectures will place dyntick-idle CPUs into deep sleep
> > > 	states, which further degrades from-idle transition latencies.
> > > 
> > > Therefore, systems with aggressive real-time response constraints
> > > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
> > > transition latencies.
> > > 
> > > An idle CPU that is not receiving scheduling-clock interrupts is said to
> > > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
> > > tickless".  The remainder of this document will use "dyntick-idle mode".
> > > 
> > > There is also a boot parameter "nohz=" that can be used to disable
> > > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
> > > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
> > > dyntick-idle mode.
> > > 
> > > 
> > > CPUs WITH ONLY ONE RUNNABLE TASK
> > > 
> > > If a CPU has only one runnable task, there is again little point in
> > > sending it a scheduling-clock interrupt.  Recall that the primary
> > > purpose of a scheduling-clock interrupt is to force a busy CPU to
> > > shift its attention among many things requiring its attention -- and
> > > there is nowhere else for a CPU with but one runnable task to shift its
> > > attention to.
> > > 
> > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
> > > sending scheduling-clock interrupts to CPUs with a single runnable task.
> > > This is important for applications with aggressive real-time response
> > > constraints because it allows them to improve their worst-case response
> > > times by the maximum duration of a scheduling-clock interrupt.  It is also
> > > important for computationally intensive iterative workloads with short
> > > iterations:  If any CPU is delayed during a given iteration, all the
> > > other CPUs will be forced to wait idle while the delayed CPU finished.
> > > Thus, the delay is multiplied by one less than the number of CPUs.
> > > In these situations, there is again strong motivation to avoid sending
> > > scheduling-clock interrupts to CPUs that have but one runnable task that
> > > is executing in user mode.
> > > 
> > > The "full_nohz=" boot parameter specifies which CPUs are to be
> > > adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
> > 
> > This is the first time you mention "adaptive-ticks". Probably should
> > define it before just using it, even though one should be able to figure
> > out what adaptive-ticks are, it does throw in a wrench when reading this
> > if you have no idea what an "adaptive-tick" is.
> 
> Good point, changed the first sentence of this paragraph to read:
> 
> 	The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to
> 	avoid sending scheduling-clock interrupts to CPUs with a single
> 	runnable task, and such CPUs are said to be "adaptive-ticks CPUs".

Sounds good.

> 
> > > 6, 7, and 8 are to be adaptive-ticks CPUs.  By default, no CPUs will
> > > be adaptive-ticks CPUs.  Not that you are prohibited from marking all
> > > of the CPUs as adaptive-tick CPUs:  At least one non-adaptive-tick CPU
> > > must remain online to handle timekeeping tasks in order to ensure that
> > > gettimeofday() returns sane values on adaptive-tick CPUs.
> > > 
> > > Note that if a given CPU is in adaptive-ticks mode while executing in
> > > user mode, transitioning to kernel mode does not automatically force
> > > that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
> > > mode only if needed, for example, if that CPU enqueues an RCU callback.
> > > 
> > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
> > > not come for free:
> > > 
> > > 1.	CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run
> > > 	adaptive ticks without also running dyntick idle.  This dependency
> > > 	of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the
> > > 	implementation.  Therefore, all of the costs of CONFIG_NO_HZ
> > > 	are also incurred by CONFIG_NO_HZ_FULL.
> > 
> > Not a comment on this document, but on the implementation. As idle NO_HZ
> > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
> > can't have both (no idle but full). As we only care about not letting
> > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
> > something that keeps idle from going into nohz mode. Hmm, I think there
> > may be an option to keep the CPU from going too deep into sleep. That
> > may be a better approach.
> 
> Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and
> idle=poll do the trick in this case?

I'm not sure I would recommend idle=poll either. It would certainly
work, but it goes to the other extreme. You think NO_HZ=n drains a
battery? Try idle=poll.

Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait
may be better. It states that performance is the same as idle=poll (if
supported).

Also there's a kernel parameter for x86 called intel_idle.max_cstate=X.

As idle=poll will most likely run the processor very hot and you will
need to add more electricity not only for the computer but also for the
A/C, it would be nice to still have the CPU sleep, but just at a shallow
(fast wakeup) state.

Perhaps Arjan can add some input here?

> 
> If so, I do need to document it.
> 
> > > 2.	The user/kernel transitions are slightly more expensive due
> > > 	to the need to inform kernel subsystems (such as RCU) about
> > > 	the change in mode.
> > > 
> > > 3.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
> > > 	not at all) because they currently rely on scheduling-tick
> > > 	interrupts.  This will likely be fixed in one of two ways: (1)
> > > 	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
> > > 	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
> > > 	to cause the POSIX CPU timer to fire properly.
> > > 
> > > 4.	If there are more perf events pending than the hardware can
> > > 	accommodate, they are normally round-robined so as to collect
> > > 	all of them over time.  Adaptive-tick mode may prevent this
> > > 	round-robining from happening.  This will likely be fixed by
> > > 	preventing CPUs with large numbers of perf events pending from
> > > 	entering adaptive-tick mode.
> > > 
> > > 5.	Scheduler statistics for adaptive-idle CPUs may be computed
> > > 	slightly differently than those for non-adaptive-idle CPUs.
> > > 	This may in turn perturb load-balancing of real-time tasks.
> > > 
> > > 6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
> > > 
> > > Although improvements are expected over time, adaptive ticks is quite
> > > useful for many types of real-time and compute-intensive applications.
> > > However, the drawbacks listed above mean that adaptive ticks should not
> > > be enabled by default across the board at the current time.
> > > 
> > > 
> > > RCU IMPLICATIONS
> > > 
> > > There are situations in which idle CPUs cannot be permitted to
> > > enter either dyntick-idle mode or adaptive-tick mode, the most
> > > familiar being the case where that CPU has RCU callbacks pending.
> > > 
> > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
> > > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
> > > timer will awaken these CPUs every four jiffies in order to ensure that
> > > the RCU callbacks are processed in a timely fashion.
> > > 
> > > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > > using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> > > selected via several methods:
> > > 
> > > 1.	One of three mutually exclusive Kconfig options specify a
> > > 	build-time default for the CPUs to offload:
> > > 
> > > 	a.	The RCU_NOCB_CPU_NONE=y Kconfig option results in
> > > 		no CPUs being offloaded.
> > > 
> > > 	b.	The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to
> > > 		be offloaded.
> > > 
> > > 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> > > 		to be offloaded.
> > 
> > All CPUs don't have their RCU call backs on them? I'm a bit confused by
> > this. Or is it that the scheduler picks one CPU to do call backs? Does
> > this mean that to use rcu_ncbs= to be the only deciding factor, you
> > select RCU_NCB_CPU_NONE?
> > 
> > I think this needs to be explained better.
> 
> Does this help?
> 
> 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> 		to be offloaded.  Note that the callbacks will be
> 		offloaded to "rcuo" kthreads, and that those kthreads
> 		will in fact run on some CPU.  However, this approach
> 		gives fine-grained control on exactly which CPUs the
> 		callbacks run on, the priority that they run at (including
> 		the default of SCHED_OTHER), and it further allows
> 		this control to be varied dynamically at runtime.

Excellent!

> 
> > > 2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
> > > 	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
> > > 	3, 4, and 5.  The specified CPUs will be offloaded in addition
> > > 	to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or
> > > 	RCU_NOCB_CPU_ALL.
> > > 
> > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> > > never prevents offloaded CPUs from entering either dyntick-idle mode or
> > > adaptive-tick mode.  That said, note that it is up to userspace to
> > > pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
> > > scheduler will decide where to run them, which might or might not be
> > > where you want them to run.
> > > 
> > > 
> > > KNOWN ISSUES
> > > 
> > > o	Dyntick-idle slows transitions to and from idle slightly.
> > > 	In practice, this has not been a problem except for the most
> > > 	aggressive real-time workloads, which have the option of disabling
> > > 	dyntick-idle mode, an option that most of them take.
> > > 
> > > o	Adaptive-ticks slows user/kernel transitions slightly.
> > > 	This is not expected to be a problem for computational-intensive
> > > 	workloads, which have few such transitions.  Careful benchmarking
> > > 	will be required to determine whether or not other workloads
> > > 	are significantly affected by this effect.
> > 
> > It should be mentioned that only CPUs that are in adaptive-tick mode
> > have this issue. Other CPUs are still using the tick based accounting,
> > right?

?

> > 
> > > 
> > > o	Adaptive-ticks does not do anything unless there is only one
> > > 	runnable task for a given CPU, even though there are a number
> > > 	of other situations where the scheduling-clock tick is not
> > > 	needed.  To give but one example, consider a CPU that has one
> > > 	runnable high-priority SCHED_FIFO task and an arbitrary number
> > > 	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
> > > 	required to run the SCHED_FIFO task until either it blocks or
> > > 	some other higher-priority task awakens on (or is assigned to)
> > > 	this CPU, so there is no point in sending a scheduling-clock
> > > 	interrupt to this CPU.
> > 
> > You should point out that the example does not enable adaptive-ticks.
> > That point is hinted at, but not really expressed. That is, perhaps end
> > the paragraph with: 
> > 
> > "Even though the SCHED_FIFO task is the only task running, because the
> > SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> > adaptive tick mode."
> 
> Again, good point!
> 
> How about adding the following sentence at the end of this paragraph.
> 
> 	However, the current implementation prohibits CPU with a single
> 	runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
> 	tasks from entering adaptive-ticks mode, even though it would
> 	be correct to allow it to do so.

Sure.

> 
> > > 	Better handling of these sorts of situations is future work.
> > > 
> > > o	A reboot is required to reconfigure both adaptive idle and RCU
> > > 	callback offloading.  Runtime reconfiguration could be provided
> > > 	if needed, however, due to the complexity of reconfiguring RCU
> > > 	at runtime, there would need to be an earthshakingly good reason.
> > > 	Especially given the option of simply offloading RCU callbacks
> > > 	from all CPUs.
> > 
> > When you enable for all CPUs, how do you tell what CPUs you don't want
> > the scheduler to pick for off loading? I mean, if you pick all CPUs, can
> > you at run time pick which ones should always off load and which ones
> > shouldn't?
> 
> I must defer to Frederic on this one.

Well I was actually thinking more about the RCU NOCB mode. You answered
my question above about the rcu kthreads that do the callbacks instead
of them being pinned to a CPU.

-- Steve

> 
> > > o	Additional configuration is required to deal with other sources
> > > 	of OS jitter, including interrupts and system-utility tasks
> > > 	and processes.  This configuration normally involves binding
> > > 	interrupts and tasks to particular CPUs.
> > > 
> > > o	Some sources of OS jitter can currently be eliminated only by
> > > 	constraining the workload.  For example, the only way to eliminate
> > > 	OS jitter due to global TLB shootdowns is to avoid the unmapping
> > > 	operations (such as kernel module unload operations) that result
> > > 	in these shootdowns.  For another example, page faults and TLB
> > > 	misses can be reduced (and in some cases eliminated) by using
> > > 	huge pages and by constraining the amount of memory used by the
> > > 	application.
> > > 
> > > o	At least one CPU must keep the scheduling-clock interrupt going
> > > 	in order to support accurate timekeeping.
> > 
> > Thanks for writing this up Paul!
> 
> And to many other people, including yourself, for doing the actual work!
> 
> 							Thanx, Paul




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21  0:27               ` Steven Rostedt
@ 2013-03-21  2:22                 ` Paul E. McKenney
  2013-03-21 10:16                   ` Borislav Petkov
  2013-03-21 15:45                 ` Arjan van de Ven
  2013-03-21 18:01                 ` Frederic Weisbecker
  2 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21  2:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong,
	khilman, geoff, tglx, Arjan van de Ven

On Wed, Mar 20, 2013 at 08:27:11PM -0400, Steven Rostedt wrote:
> [ Added Arjan in case he as anything to add about the idle=poll below ]

Good point!

> On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote:
> > On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote:
> > > On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote:
> > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > 		NO_HZ: Reducing Scheduling-Clock Ticks
> > > > 
> > > > 
> > > > This document covers Kconfig options and boot parameters used to reduce
> > > > the number of scheduling-clock interrupts.  These reductions can be
> > > > helpful in improving energy efficiency and in reducing "OS jitter",
> > > > the latter being very important for some types of computationally
> > > > intensive high-performance computing (HPC) applications and for real-time
> > > > applications.
> > > > 
> > > > Within the Linux kernel, there are two major aspects of scheduling-clock
> > > > interrupt reduction:
> > > > 
> > > > 1.	Idle CPUs.
> > > > 
> > > > 2.	CPUs having only one runnable task.
> > > > 
> > > > These two cases are described in the following sections.
> > > > 
> > > > 
> > > > IDLE CPUs
> > > > 
> > > > If a CPU is idle, there is little point in sending it a scheduling-clock
> > > > interrupt.  After all, the primary purpose of a scheduling-clock interrupt
> > > > is to force a busy CPU to shift its attention among multiple duties,
> > > > but an idle CPU by definition has no duties to shift its attention among.
> > > > 
> > > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending
> > > > scheduling-clock interrupts to idle CPUs, which is critically important
> > > > both to battery-powered devices and to highly virtualized mainframes.
> > > > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its
> > > > battery very quickly, easily 2-3x as fast as would the same device running
> > > > a CONFIG_NO_HZ=n kernel.  A mainframe running 1,500 OS instances could
> > > 
> > > So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster
> > > than the
> 
> Hmm, Evolution had the above on one line in the composer, but it seems
> to be chopping it when it sends. I recently did an update on this box,
> which screwed up the formatting of what the composer does and what it
> sends out :-/
> 
> I hit a hard return to have CONFIG_NO_HZ = 0 be lined up correctly
> (since I already knew that evolution screwed this up)

Forever mutt!!!  ;-)

> > > same device running CONFIG_NO_HZ=n ?
> > > 
> > > :-)
> > 
> > Good catch, fixed!
> > 
> > That said, there are two solutions as stated -- either the battery drains
> > immediately, or it takes infinitely long to drain.  ;-)
> 
> A typical paulmck response ;-)

;-) ;-) ;-)

> > > > easily find that half of its CPU time was consumed by scheduling-clock
> > > > interrupts.  In these situations, there is therefore strong motivation
> > > > to avoid sending scheduling-clock interrupts to idle CPUs.  That said,
> > > > dyntick-idle mode is not free:
> > > > 
> > > > 1.	It increases the number of instructions executed on the path
> > > > 	to and from the idle loop.
> > > > 
> > > > 2.	Many architectures will place dyntick-idle CPUs into deep sleep
> > > > 	states, which further degrades from-idle transition latencies.
> > > > 
> > > > Therefore, systems with aggressive real-time response constraints
> > > > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle
> > > > transition latencies.
> > > > 
> > > > An idle CPU that is not receiving scheduling-clock interrupts is said to
> > > > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running
> > > > tickless".  The remainder of this document will use "dyntick-idle mode".
> > > > 
> > > > There is also a boot parameter "nohz=" that can be used to disable
> > > > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off".
> > > > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling
> > > > dyntick-idle mode.
> > > > 
> > > > 
> > > > CPUs WITH ONLY ONE RUNNABLE TASK
> > > > 
> > > > If a CPU has only one runnable task, there is again little point in
> > > > sending it a scheduling-clock interrupt.  Recall that the primary
> > > > purpose of a scheduling-clock interrupt is to force a busy CPU to
> > > > shift its attention among many things requiring its attention -- and
> > > > there is nowhere else for a CPU with but one runnable task to shift its
> > > > attention to.
> > > > 
> > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
> > > > sending scheduling-clock interrupts to CPUs with a single runnable task.
> > > > This is important for applications with aggressive real-time response
> > > > constraints because it allows them to improve their worst-case response
> > > > times by the maximum duration of a scheduling-clock interrupt.  It is also
> > > > important for computationally intensive iterative workloads with short
> > > > iterations:  If any CPU is delayed during a given iteration, all the
> > > > other CPUs will be forced to wait idle while the delayed CPU finished.
> > > > Thus, the delay is multiplied by one less than the number of CPUs.
> > > > In these situations, there is again strong motivation to avoid sending
> > > > scheduling-clock interrupts to CPUs that have but one runnable task that
> > > > is executing in user mode.
> > > > 
> > > > The "full_nohz=" boot parameter specifies which CPUs are to be
> > > > adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
> > > 
> > > This is the first time you mention "adaptive-ticks". Probably should
> > > define it before just using it, even though one should be able to figure
> > > out what adaptive-ticks are, it does throw in a wrench when reading this
> > > if you have no idea what an "adaptive-tick" is.
> > 
> > Good point, changed the first sentence of this paragraph to read:
> > 
> > 	The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to
> > 	avoid sending scheduling-clock interrupts to CPUs with a single
> > 	runnable task, and such CPUs are said to be "adaptive-ticks CPUs".
> 
> Sounds good.
> 
> > 
> > > > 6, 7, and 8 are to be adaptive-ticks CPUs.  By default, no CPUs will
> > > > be adaptive-ticks CPUs.  Not that you are prohibited from marking all
> > > > of the CPUs as adaptive-tick CPUs:  At least one non-adaptive-tick CPU
> > > > must remain online to handle timekeeping tasks in order to ensure that
> > > > gettimeofday() returns sane values on adaptive-tick CPUs.
> > > > 
> > > > Note that if a given CPU is in adaptive-ticks mode while executing in
> > > > user mode, transitioning to kernel mode does not automatically force
> > > > that CPU out of adaptive-ticks mode.  The CPU will exit adaptive-ticks
> > > > mode only if needed, for example, if that CPU enqueues an RCU callback.
> > > > 
> > > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do
> > > > not come for free:
> > > > 
> > > > 1.	CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run
> > > > 	adaptive ticks without also running dyntick idle.  This dependency
> > > > 	of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the
> > > > 	implementation.  Therefore, all of the costs of CONFIG_NO_HZ
> > > > 	are also incurred by CONFIG_NO_HZ_FULL.
> > > 
> > > Not a comment on this document, but on the implementation. As idle NO_HZ
> > > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
> > > can't have both (no idle but full). As we only care about not letting
> > > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
> > > something that keeps idle from going into nohz mode. Hmm, I think there
> > > may be an option to keep the CPU from going too deep into sleep. That
> > > may be a better approach.
> > 
> > Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and
> > idle=poll do the trick in this case?
> 
> I'm not sure I would recommend idle=poll either. It would certainly
> work, but it goes to the other extreme. You think NO_HZ=n drains a
> battery? Try idle=poll.

And a few people already run realtime on battery-powered systems, so
good point...

> Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait
> may be better. It states that performance is the same as idle=poll (if
> supported).
> 
> Also there's a kernel parameter for x86 called intel_idle.max_cstate=X.
> 
> As idle=poll will most likely run the processor very hot and you will
> need to add more electricity not only for the computer but also for the
> A/C, it would be nice to still have the CPU sleep, but just at a shallow
> (fast wakeup) state.

So maybe idle=mwait or intel_idle.max_cstate=? if supported, otherwise
if on AC power, idle=poll plus active cooling.  ;-)

> Perhaps Arjan can add some input here?

I would certainly like to hear it!

> > If so, I do need to document it.
> > 
> > > > 2.	The user/kernel transitions are slightly more expensive due
> > > > 	to the need to inform kernel subsystems (such as RCU) about
> > > > 	the change in mode.
> > > > 
> > > > 3.	POSIX CPU timers on adaptive-tick CPUs may fire late (or even
> > > > 	not at all) because they currently rely on scheduling-tick
> > > > 	interrupts.  This will likely be fixed in one of two ways: (1)
> > > > 	Prevent CPUs with POSIX CPU timers from entering adaptive-tick
> > > > 	mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism
> > > > 	to cause the POSIX CPU timer to fire properly.
> > > > 
> > > > 4.	If there are more perf events pending than the hardware can
> > > > 	accommodate, they are normally round-robined so as to collect
> > > > 	all of them over time.  Adaptive-tick mode may prevent this
> > > > 	round-robining from happening.  This will likely be fixed by
> > > > 	preventing CPUs with large numbers of perf events pending from
> > > > 	entering adaptive-tick mode.
> > > > 
> > > > 5.	Scheduler statistics for adaptive-idle CPUs may be computed
> > > > 	slightly differently than those for non-adaptive-idle CPUs.
> > > > 	This may in turn perturb load-balancing of real-time tasks.
> > > > 
> > > > 6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
> > > > 
> > > > Although improvements are expected over time, adaptive ticks is quite
> > > > useful for many types of real-time and compute-intensive applications.
> > > > However, the drawbacks listed above mean that adaptive ticks should not
> > > > be enabled by default across the board at the current time.
> > > > 
> > > > 
> > > > RCU IMPLICATIONS
> > > > 
> > > > There are situations in which idle CPUs cannot be permitted to
> > > > enter either dyntick-idle mode or adaptive-tick mode, the most
> > > > familiar being the case where that CPU has RCU callbacks pending.
> > > > 
> > > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such
> > > > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a
> > > > timer will awaken these CPUs every four jiffies in order to ensure that
> > > > the RCU callbacks are processed in a timely fashion.
> > > > 
> > > > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > > > using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> > > > selected via several methods:
> > > > 
> > > > 1.	One of three mutually exclusive Kconfig options specify a
> > > > 	build-time default for the CPUs to offload:
> > > > 
> > > > 	a.	The RCU_NOCB_CPU_NONE=y Kconfig option results in
> > > > 		no CPUs being offloaded.
> > > > 
> > > > 	b.	The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to
> > > > 		be offloaded.
> > > > 
> > > > 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> > > > 		to be offloaded.
> > > 
> > > All CPUs don't have their RCU call backs on them? I'm a bit confused by
> > > this. Or is it that the scheduler picks one CPU to do call backs? Does
> > > this mean that to use rcu_ncbs= to be the only deciding factor, you
> > > select RCU_NCB_CPU_NONE?
> > > 
> > > I think this needs to be explained better.
> > 
> > Does this help?
> > 
> > 	c.	The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs
> > 		to be offloaded.  Note that the callbacks will be
> > 		offloaded to "rcuo" kthreads, and that those kthreads
> > 		will in fact run on some CPU.  However, this approach
> > 		gives fine-grained control on exactly which CPUs the
> > 		callbacks run on, the priority that they run at (including
> > 		the default of SCHED_OTHER), and it further allows
> > 		this control to be varied dynamically at runtime.
> 
> Excellent!
> 
> > 
> > > > 2.	The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated
> > > > 	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1,
> > > > 	3, 4, and 5.  The specified CPUs will be offloaded in addition
> > > > 	to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or
> > > > 	RCU_NOCB_CPU_ALL.
> > > > 
> > > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> > > > never prevents offloaded CPUs from entering either dyntick-idle mode or
> > > > adaptive-tick mode.  That said, note that it is up to userspace to
> > > > pin the "rcuo" kthreads to specific CPUs if desired.  Otherwise, the
> > > > scheduler will decide where to run them, which might or might not be
> > > > where you want them to run.
> > > > 
> > > > 
> > > > KNOWN ISSUES
> > > > 
> > > > o	Dyntick-idle slows transitions to and from idle slightly.
> > > > 	In practice, this has not been a problem except for the most
> > > > 	aggressive real-time workloads, which have the option of disabling
> > > > 	dyntick-idle mode, an option that most of them take.
> > > > 
> > > > o	Adaptive-ticks slows user/kernel transitions slightly.
> > > > 	This is not expected to be a problem for computational-intensive
> > > > 	workloads, which have few such transitions.  Careful benchmarking
> > > > 	will be required to determine whether or not other workloads
> > > > 	are significantly affected by this effect.
> > > 
> > > It should be mentioned that only CPUs that are in adaptive-tick mode
> > > have this issue. Other CPUs are still using the tick based accounting,
> > > right?
> 
> ?

True, but they still end up executing extra code to deal with the
possibility that they are in adaptive-tick mode.

> > > 
> > > > 
> > > > o	Adaptive-ticks does not do anything unless there is only one
> > > > 	runnable task for a given CPU, even though there are a number
> > > > 	of other situations where the scheduling-clock tick is not
> > > > 	needed.  To give but one example, consider a CPU that has one
> > > > 	runnable high-priority SCHED_FIFO task and an arbitrary number
> > > > 	of low-priority SCHED_OTHER tasks.  In this case, the CPU is
> > > > 	required to run the SCHED_FIFO task until either it blocks or
> > > > 	some other higher-priority task awakens on (or is assigned to)
> > > > 	this CPU, so there is no point in sending a scheduling-clock
> > > > 	interrupt to this CPU.
> > > 
> > > You should point out that the example does not enable adaptive-ticks.
> > > That point is hinted at, but not really expressed. That is, perhaps end
> > > the paragraph with: 
> > > 
> > > "Even though the SCHED_FIFO task is the only task running, because the
> > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> > > adaptive tick mode."
> > 
> > Again, good point!
> > 
> > How about adding the following sentence at the end of this paragraph.
> > 
> > 	However, the current implementation prohibits CPU with a single
> > 	runnable SCHED_FIFO task and multiple runnable SCHED_OTHER
> > 	tasks from entering adaptive-ticks mode, even though it would
> > 	be correct to allow it to do so.
> 
> Sure.
> 
> > 
> > > > 	Better handling of these sorts of situations is future work.
> > > > 
> > > > o	A reboot is required to reconfigure both adaptive idle and RCU
> > > > 	callback offloading.  Runtime reconfiguration could be provided
> > > > 	if needed, however, due to the complexity of reconfiguring RCU
> > > > 	at runtime, there would need to be an earthshakingly good reason.
> > > > 	Especially given the option of simply offloading RCU callbacks
> > > > 	from all CPUs.
> > > 
> > > When you enable for all CPUs, how do you tell what CPUs you don't want
> > > the scheduler to pick for off loading? I mean, if you pick all CPUs, can
> > > you at run time pick which ones should always off load and which ones
> > > shouldn't?
> > 
> > I must defer to Frederic on this one.
> 
> Well I was actually thinking more about the RCU NOCB mode. You answered
> my question above about the rcu kthreads that do the callbacks instead
> of them being pinned to a CPU.

Ah, color me confused.  ;-)

							Thanx, Paul

> -- Steve
> 
> > 
> > > > o	Additional configuration is required to deal with other sources
> > > > 	of OS jitter, including interrupts and system-utility tasks
> > > > 	and processes.  This configuration normally involves binding
> > > > 	interrupts and tasks to particular CPUs.
> > > > 
> > > > o	Some sources of OS jitter can currently be eliminated only by
> > > > 	constraining the workload.  For example, the only way to eliminate
> > > > 	OS jitter due to global TLB shootdowns is to avoid the unmapping
> > > > 	operations (such as kernel module unload operations) that result
> > > > 	in these shootdowns.  For another example, page faults and TLB
> > > > 	misses can be reduced (and in some cases eliminated) by using
> > > > 	huge pages and by constraining the amount of memory used by the
> > > > 	application.
> > > > 
> > > > o	At least one CPU must keep the scheduling-clock interrupt going
> > > > 	in order to support accurate timekeeping.
> > > 
> > > Thanks for writing this up Paul!
> > 
> > And to many other people, including yourself, for doing the actual work!
> > 
> > 							Thanx, Paul
> 
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21  2:22                 ` Paul E. McKenney
@ 2013-03-21 10:16                   ` Borislav Petkov
  2013-03-21 15:18                     ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Borislav Petkov @ 2013-03-21 10:16 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx, Arjan van de Ven

On Wed, Mar 20, 2013 at 07:22:59PM -0700, Paul E. McKenney wrote:
> > > > > The "full_nohz=" boot parameter specifies which CPUs are to be
> > > > > adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
> > > > 
> > > > This is the first time you mention "adaptive-ticks". Probably should
> > > > define it before just using it, even though one should be able to figure
> > > > out what adaptive-ticks are, it does throw in a wrench when reading this
> > > > if you have no idea what an "adaptive-tick" is.
> > > 
> > > Good point, changed the first sentence of this paragraph to read:
> > > 
> > > 	The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to
> > > 	avoid sending scheduling-clock interrupts to CPUs with a single
> > > 	runnable task, and such CPUs are said to be "adaptive-ticks CPUs".
> > 
> > Sounds good.

Yeah,

so I read this last night too and I have to say, very clearly written,
even for dummies like me.

But this "adaptive-ticks CPUs" reads kinda strange throughout the whole
text, it feels a bit weird. And since the cmdline option is called
"full_nohz", you might just as well call them the "full_nohz CPUs" or
the "full_nohz subset of CPUs" for simplicity and so that you don't have
yet another new term in the text denoting the same idea. I mean, all
those names kinda suck and need the full definition of what adaptive
ticking actually means anyway. :)

Btw, congrats on coining a new noun: "Adaptive-tick mode may prevent
this round-robining from happening."
     ^^^^^^^^^^^^^^

Funny. :-)

I spose now one can say: "The kids in the garden are round-robining on
the carousel."

or

"The kernel developers are round-robined for pull requests."

Or maybe it wasn't you who coined it after /me doing a little search. It
looks like technical people are pushing hard for it to be committed in
the upstream English language repository. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 10:16                   ` Borislav Petkov
@ 2013-03-21 15:18                     ` Paul E. McKenney
  2013-03-21 16:00                       ` Borislav Petkov
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 15:18 UTC (permalink / raw)
  To: Borislav Petkov, Steven Rostedt, Frederic Weisbecker,
	Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx,
	Arjan van de Ven

On Thu, Mar 21, 2013 at 11:16:50AM +0100, Borislav Petkov wrote:
> On Wed, Mar 20, 2013 at 07:22:59PM -0700, Paul E. McKenney wrote:
> > > > > > The "full_nohz=" boot parameter specifies which CPUs are to be
> > > > > > adaptive-ticks CPUs.  For example, "full_nohz=1,6-8" says that CPUs 1,
> > > > > 
> > > > > This is the first time you mention "adaptive-ticks". Probably should
> > > > > define it before just using it, even though one should be able to figure
> > > > > out what adaptive-ticks are, it does throw in a wrench when reading this
> > > > > if you have no idea what an "adaptive-tick" is.
> > > > 
> > > > Good point, changed the first sentence of this paragraph to read:
> > > > 
> > > > 	The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to
> > > > 	avoid sending scheduling-clock interrupts to CPUs with a single
> > > > 	runnable task, and such CPUs are said to be "adaptive-ticks CPUs".
> > > 
> > > Sounds good.
> 
> Yeah,
> 
> so I read this last night too and I have to say, very clearly written,
> even for dummies like me.

Can't say that I think of you as a dummy, but glad you liked it!

> But this "adaptive-ticks CPUs" reads kinda strange throughout the whole
> text, it feels a bit weird. And since the cmdline option is called
> "full_nohz", you might just as well call them the "full_nohz CPUs" or
> the "full_nohz subset of CPUs" for simplicity and so that you don't have
> yet another new term in the text denoting the same idea. I mean, all
> those names kinda suck and need the full definition of what adaptive
> ticking actually means anyway. :)

I am happy with either "adaptive-ticks CPUs" or "full_nohz CPUs", and
leave the choice to Frederic.

> Btw, congrats on coining a new noun: "Adaptive-tick mode may prevent
> this round-robining from happening."
>      ^^^^^^^^^^^^^^

Actually, this is a generic transformation.  Given an English verb,
you almost always add "ing" to create a noun.  Since "round-robin" is
used as a verb, as in "The scheduler will round-robin between the two
SCHED_RR tasks", "round-robining" may be used as a noun denoting the
action corresponding to the verb "round-robin".  There is no doubt
an argument as to whether this should be spelled "round-robining" or
"round-robinning", but I will leave this to those who care enough to
argue about it.  ;-)

> Funny. :-)
> 
> I spose now one can say: "The kids in the garden are round-robining on
> the carousel."
> 
> or
> 
> "The kernel developers are round-robined for pull requests."

;-)

> Or maybe it wasn't you who coined it after /me doing a little search. It
> looks like technical people are pushing hard for it to be committed in
> the upstream English language repository. :-)

The thing about English is that it is an open-source language, and always
has been.  English is defined by its usage, and the wise dictionary-makers
try their best to keep up.  (The unwise ones attempt to stop the evolution
of the English language.)  Everything good and everything bad about
English stems from this property.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21  0:27               ` Steven Rostedt
  2013-03-21  2:22                 ` Paul E. McKenney
@ 2013-03-21 15:45                 ` Arjan van de Ven
  2013-03-21 17:18                   ` Paul E. McKenney
  2013-03-22  4:59                   ` Rob Landley
  2013-03-21 18:01                 ` Frederic Weisbecker
  2 siblings, 2 replies; 43+ messages in thread
From: Arjan van de Ven @ 2013-03-21 15:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh,
	zhong, khilman, geoff, tglx

On 3/20/2013 5:27 PM, Steven Rostedt wrote:
> I'm not sure I would recommend idle=poll either. It would certainly
> work, but it goes to the other extreme. You think NO_HZ=n drains a
> battery? Try idle=poll.


do not ever use idle=poll on anything production.. really bad idea.

if you temporary cannot cope with the latency, you can use the PMQOS system
to limit (including going all the way to idle=poll).
but using idle=poll completely is very nasty for the hardware.

In addition we should document that idle=poll will cost you peak performance,
possibly quite a bit.

the same is true for the kernel paramter to some extend; it's there to work around
broken bioses/hardware/etc; if you have a latency/runtime requirement, it's much better
to use PMQOS for this from userspace.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 15:18                     ` Paul E. McKenney
@ 2013-03-21 16:00                       ` Borislav Petkov
  0 siblings, 0 replies; 43+ messages in thread
From: Borislav Petkov @ 2013-03-21 16:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx, Arjan van de Ven

On Thu, Mar 21, 2013 at 08:18:11AM -0700, Paul E. McKenney wrote:
> Actually, this is a generic transformation. Given an English verb,
> you almost always add "ing" to create a noun. Since "round-robin" is
> used as a verb,

... which sounds, in this case, weird IMHO. :-)

> as in "The scheduler will round-robin between the two SCHED_RR
> tasks",

I think the "correct" way to say it is "The scheduler will select tasks
in a round-robin fashion..." But while it is correct (for some accepted
definition of correct), this is slow, has too many words and we don't
want that - we want fast! We want a lot less instructions in the pipe!
This way, we burn a lot less energy when talking. :-)

> "round-robining" may be used as a noun denoting the action
> corresponding to the verb "round-robin". There is no doubt an
> argument as to whether this should be spelled "round-robining" or
> "round-robinning", but I will leave this to those who care enough to
> argue about it. ;-)

Hey sir, you're preaching to the choir - I'm all for doing all kinds of
weird/funny experiments with language...

> The thing about English is that it is an open-source language, and
> always has been. English is defined by its usage, and the wise
> dictionary-makers try their best to keep up.

... yes, and then there are the English language Nazis who wouldn't
allow that - their rules are stricter than software APIs and breaking
userspace compatibility.

Technical people, OTOH, are much more willing and not afraid to take the
language and mold it in such a form so that it works for them instead of
adhering to ancient rules. Which is cool. That's why I was pointing out
the "round-robining" - nice and cool. And look how much shorter it is:

round-robining = iterate over the items on a list by periodically
switching from one to the next in a circular order.

Now imagine the pressure on I$ the two versions create. And compare. :-)

> (The unwise ones attempt to stop the evolution of the English
> language.) Everything good and everything bad about English stems from
> this property. ;-)

Yeah, I've had to deal with enough of those evolution-stopping idiots
during my days at the university. Well, I've got three words for them:
"Resistance is futile!"

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-20 23:55             ` Paul E. McKenney
  2013-03-21  0:27               ` Steven Rostedt
@ 2013-03-21 16:08               ` Christoph Lameter
  2013-03-21 17:15                 ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-21 16:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Wed, 20 Mar 2013, Paul E. McKenney wrote:

> > > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > > using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> > > selected via several methods:

Why are there multiple rcuo threads? Would a single thread that may be
able to run on multiple cpus not be sufficient?

> > "Even though the SCHED_FIFO task is the only task running, because the
> > SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> > adaptive tick mode."
>
> Again, good point!

Uggh. That will cause problems and did cause problems when I tried to use
nohz.

The OS always has some sched other tasks around that become runnable after
a while (like for example the vm statistics update, or the notorious slab
scanning). As long as SCHED_FIFO is active and there is no process in the
same scheduling class then tick needs to be off. Also wish that this would
work with SCHED_OTHER if there is only a single task with a certain renice
value (-10?) and the rest is runnable at lower priorities. Maybe in that
case stop the tick for a longer period and then give the lower priority
tasks a chance to run but then switch off the tick again.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 16:08               ` Christoph Lameter
@ 2013-03-21 17:15                 ` Paul E. McKenney
  2013-03-21 18:39                   ` Christoph Lameter
  2013-03-21 18:44                   ` Steven Rostedt
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 17:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 04:08:08PM +0000, Christoph Lameter wrote:
> On Wed, 20 Mar 2013, Paul E. McKenney wrote:
> 
> > > > Another approach is to offload RCU callback processing to "rcuo" kthreads
> > > > using the CONFIG_RCU_NOCB_CPU=y.  The specific CPUs to offload may be
> > > > selected via several methods:
> 
> Why are there multiple rcuo threads? Would a single thread that may be
> able to run on multiple cpus not be sufficient?

In many cases, this would indeed be sufficient.  However, if you have
enough CPUs posting RCU callbacks, then the single thread would become
a bottleneck, eventually resulting in an OOM.  Per-CPU kthreads avoid
this possibility.

That said, if you know that your workload's RCU callbacks could be
serviced by a single CPU, you can bind all the rcuo kthreads to a
single CPU.

> > > "Even though the SCHED_FIFO task is the only task running, because the
> > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter
> > > adaptive tick mode."
> >
> > Again, good point!
> 
> Uggh. That will cause problems and did cause problems when I tried to use
> nohz.
> 
> The OS always has some sched other tasks around that become runnable after
> a while (like for example the vm statistics update, or the notorious slab
> scanning). As long as SCHED_FIFO is active and there is no process in the
> same scheduling class then tick needs to be off. Also wish that this would
> work with SCHED_OTHER if there is only a single task with a certain renice
> value (-10?) and the rest is runnable at lower priorities. Maybe in that
> case stop the tick for a longer period and then give the lower priority
> tasks a chance to run but then switch off the tick again.

These sound to me like good future enhancements.

In the meantime, one approach is to bind all these SCHED_OTHER tasks
to designated housekeeping CPU(s) that don't run your main workload.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 15:45                 ` Arjan van de Ven
@ 2013-03-21 17:18                   ` Paul E. McKenney
  2013-03-21 17:41                     ` Arjan van de Ven
  2013-03-22  4:59                   ` Rob Landley
  1 sibling, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 17:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 08:45:07AM -0700, Arjan van de Ven wrote:
> On 3/20/2013 5:27 PM, Steven Rostedt wrote:
> >I'm not sure I would recommend idle=poll either. It would certainly
> >work, but it goes to the other extreme. You think NO_HZ=n drains a
> >battery? Try idle=poll.
> 
> 
> do not ever use idle=poll on anything production.. really bad idea.
> 
> if you temporary cannot cope with the latency, you can use the PMQOS system
> to limit (including going all the way to idle=poll).
> but using idle=poll completely is very nasty for the hardware.
> 
> In addition we should document that idle=poll will cost you peak performance,
> possibly quite a bit.
> 
> the same is true for the kernel paramter to some extend; it's there to work around
> broken bioses/hardware/etc; if you have a latency/runtime requirement, it's much better
> to use PMQOS for this from userspace.

Thank you for the info, Arjan!  Does the following capture the tradeoffs?

o	Dyntick-idle slows transitions to and from idle slightly.
	In practice, this has not been a problem except for the most
	aggressive real-time workloads, which have the option of disabling
	dyntick-idle mode, an option that most of them take.  However,
	some workloads will no doubt want to use adaptive ticks to
	eliminate scheduling-clock-tick latencies.  Here are some
	options for these workloads:

	o	Use PMQOS from userspace to inform the kernel of your
		latency requirements (preferred).

	o	Use the "idle=mwait" boot parameter.

	o	Use the "intel_idle.max_cstate=" to limit the maximum
		depth C-state depth.

	o	Use the "idle=poll" boot parameter.  However, please note
		that use of this parameter can cause your CPU to overheat,
		which may cause thermal throttling to degrade your
		latencies --and that this degradation can be even worse
		than that of dyntick-idle.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 17:18                   ` Paul E. McKenney
@ 2013-03-21 17:41                     ` Arjan van de Ven
  2013-03-21 18:02                       ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Arjan van de Ven @ 2013-03-21 17:41 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On 3/21/2013 10:18 AM, Paul E. McKenney wrote:
> 	o	Use the "idle=poll" boot parameter.  However, please note
> 		that use of this parameter can cause your CPU to overheat,
> 		which may cause thermal throttling to degrade your
> 		latencies --and that this degradation can be even worse
> 		than that of dyntick-idle.

it also disables (effectively) Turbo Mode on Intel cpus... which can cost you a serious percentage of performance

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21  0:27               ` Steven Rostedt
  2013-03-21  2:22                 ` Paul E. McKenney
  2013-03-21 15:45                 ` Arjan van de Ven
@ 2013-03-21 18:01                 ` Frederic Weisbecker
  2013-03-21 18:26                   ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-21 18:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Rob Landley, linux-kernel, josh, zhong, khilman, geoff,
	tglx, Arjan van de Ven

2013/3/21 Steven Rostedt <rostedt@goodmis.org>:
> [ Added Arjan in case he as anything to add about the idle=poll below ]
>
>
> On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote:
>> On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote:
>> > Not a comment on this document, but on the implementation. As idle NO_HZ
>> > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
>> > can't have both (no idle but full). As we only care about not letting
>> > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
>> > something that keeps idle from going into nohz mode. Hmm, I think there
>> > may be an option to keep the CPU from going too deep into sleep. That
>> > may be a better approach.
>>
>> Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and
>> idle=poll do the trick in this case?
>
> I'm not sure I would recommend idle=poll either. It would certainly
> work, but it goes to the other extreme. You think NO_HZ=n drains a
> battery? Try idle=poll.
>
> Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait
> may be better. It states that performance is the same as idle=poll (if
> supported).
>
> Also there's a kernel parameter for x86 called intel_idle.max_cstate=X.
>
> As idle=poll will most likely run the processor very hot and you will
> need to add more electricity not only for the computer but also for the
> A/C, it would be nice to still have the CPU sleep, but just at a shallow
> (fast wakeup) state.
>
> Perhaps Arjan can add some input here?

But I note that it's an interesting usecase. May be we'll want to make
CONFIG_NO_HZ_FULL (or whatever it's going to be called) not depend on
CONFIG_NO_HZ_IDLE in the long.

We'll see.

Also, just a guess, but on dynticks-idle may be wakeup from deep CPU
sleep state is not the only latency source. Reprogramming the timer
tick on idle exit may be another one? Not sure how fast it is to write
to the clock device. I supect it's not that free. So probably you
would like to get rid of the entire dynticks-idle infrastructure for
real time.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 17:41                     ` Arjan van de Ven
@ 2013-03-21 18:02                       ` Paul E. McKenney
  2013-03-22 18:37                         ` Kevin Hilman
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 18:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote:
> On 3/21/2013 10:18 AM, Paul E. McKenney wrote:
> >	o	Use the "idle=poll" boot parameter.  However, please note
> >		that use of this parameter can cause your CPU to overheat,
> >		which may cause thermal throttling to degrade your
> >		latencies --and that this degradation can be even worse
> >		than that of dyntick-idle.
> 
> it also disables (effectively) Turbo Mode on Intel cpus... which can cost you a serious percentage of performance

Thank you, added!  Please see below for the updated list.

							Thanx, Paul

------------------------------------------------------------------------

o	Dyntick-idle slows transitions to and from idle slightly.
	In practice, this has not been a problem except for the most
	aggressive real-time workloads, which have the option of disabling
	dyntick-idle mode, an option that most of them take.  However,
	some workloads will no doubt want to use adaptive ticks to
	eliminate scheduling-clock-tick latencies.  Here are some
	options for these workloads:

	a.	Use PMQOS from userspace to inform the kernel of your
		latency requirements (preferred).

	b.	Use the "idle=mwait" boot parameter.

	c.	Use the "intel_idle.max_cstate=" to limit the maximum
		depth C-state depth.

	d.	Use the "idle=poll" boot parameter.  However, please note
		that use of this parameter can cause your CPU to overheat,
		which may cause thermal throttling to degrade your
		latencies -- and that this degradation can be even worse
		than that of dyntick-idle.  Furthermore, this parameter
		effectively disables Turbo Mode on Intel CPUs, which
		can significantly reduce maximum performance.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:01                 ` Frederic Weisbecker
@ 2013-03-21 18:26                   ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 18:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman,
	geoff, tglx, Arjan van de Ven

On Thu, Mar 21, 2013 at 07:01:01PM +0100, Frederic Weisbecker wrote:
> 2013/3/21 Steven Rostedt <rostedt@goodmis.org>:
> > [ Added Arjan in case he as anything to add about the idle=poll below ]
> >
> >
> > On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote:
> >> On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote:
> >> > Not a comment on this document, but on the implementation. As idle NO_HZ
> >> > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you
> >> > can't have both (no idle but full). As we only care about not letting
> >> > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add
> >> > something that keeps idle from going into nohz mode. Hmm, I think there
> >> > may be an option to keep the CPU from going too deep into sleep. That
> >> > may be a better approach.
> >>
> >> Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and
> >> idle=poll do the trick in this case?
> >
> > I'm not sure I would recommend idle=poll either. It would certainly
> > work, but it goes to the other extreme. You think NO_HZ=n drains a
> > battery? Try idle=poll.
> >
> > Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait
> > may be better. It states that performance is the same as idle=poll (if
> > supported).
> >
> > Also there's a kernel parameter for x86 called intel_idle.max_cstate=X.
> >
> > As idle=poll will most likely run the processor very hot and you will
> > need to add more electricity not only for the computer but also for the
> > A/C, it would be nice to still have the CPU sleep, but just at a shallow
> > (fast wakeup) state.
> >
> > Perhaps Arjan can add some input here?
> 
> But I note that it's an interesting usecase. May be we'll want to make
> CONFIG_NO_HZ_FULL (or whatever it's going to be called) not depend on
> CONFIG_NO_HZ_IDLE in the long.
> 
> We'll see.
> 
> Also, just a guess, but on dynticks-idle may be wakeup from deep CPU
> sleep state is not the only latency source. Reprogramming the timer
> tick on idle exit may be another one? Not sure how fast it is to write
> to the clock device. I supect it's not that free. So probably you
> would like to get rid of the entire dynticks-idle infrastructure for
> real time.

Agreed, and the first known-issues bullet calls that possibility out.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 17:15                 ` Paul E. McKenney
@ 2013-03-21 18:39                   ` Christoph Lameter
  2013-03-21 18:58                     ` Paul E. McKenney
  2013-03-21 18:44                   ` Steven Rostedt
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-21 18:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, 21 Mar 2013, Paul E. McKenney wrote:

> > Why are there multiple rcuo threads? Would a single thread that may be
> > able to run on multiple cpus not be sufficient?
>
> In many cases, this would indeed be sufficient.  However, if you have
> enough CPUs posting RCU callbacks, then the single thread would become
> a bottleneck, eventually resulting in an OOM.  Per-CPU kthreads avoid
> this possibility.

Spawn another if the load gets too high for a single cpu?

> That said, if you know that your workload's RCU callbacks could be
> serviced by a single CPU, you can bind all the rcuo kthreads to a
> single CPU.

Yeah doing that right now but I'd like to see it handled without manual
intervention.

> > > Again, good point!
> >
> > Uggh. That will cause problems and did cause problems when I tried to use
> > nohz.
> >
> > The OS always has some sched other tasks around that become runnable after
> > a while (like for example the vm statistics update, or the notorious slab
> > scanning). As long as SCHED_FIFO is active and there is no process in the
> > same scheduling class then tick needs to be off. Also wish that this would
> > work with SCHED_OTHER if there is only a single task with a certain renice
> > value (-10?) and the rest is runnable at lower priorities. Maybe in that
> > case stop the tick for a longer period and then give the lower priority
> > tasks a chance to run but then switch off the tick again.
>
> These sound to me like good future enhancements.
>
> In the meantime, one approach is to bind all these SCHED_OTHER tasks
> to designated housekeeping CPU(s) that don't run your main workload.

One cannot bind kevent threads and other per cpu threads to another
processor. So right now there is no way to avoid this issue.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 17:15                 ` Paul E. McKenney
  2013-03-21 18:39                   ` Christoph Lameter
@ 2013-03-21 18:44                   ` Steven Rostedt
  2013-03-21 18:53                     ` Christoph Lameter
  2013-03-21 18:59                     ` Paul E. McKenney
  1 sibling, 2 replies; 43+ messages in thread
From: Steven Rostedt @ 2013-03-21 18:44 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Frederic Weisbecker, Rob Landley,
	linux-kernel, josh, zhong, khilman, geoff, tglx

On Thu, 2013-03-21 at 10:15 -0700, Paul E. McKenney wrote:

> > The OS always has some sched other tasks around that become runnable after
> > a while (like for example the vm statistics update, or the notorious slab
> > scanning). As long as SCHED_FIFO is active and there is no process in the
> > same scheduling class then tick needs to be off. Also wish that this would
> > work with SCHED_OTHER if there is only a single task with a certain renice
> > value (-10?) and the rest is runnable at lower priorities. Maybe in that
> > case stop the tick for a longer period and then give the lower priority
> > tasks a chance to run but then switch off the tick again.
> 
> These sound to me like good future enhancements.

Exactly. Please, this is a complex enough change to something that is
critical to the entire system (similar to RCU itself). Lets take baby
steps here and get it right each step of the way.

For now, no, if more than one process is scheduled on the CPU, we fall
out of dynamic tick mode. In the future, we can add SCHED_FIFO task
scheduled in to trigger it. But lets conquer that after we successfully
conquer the current changes.

-- Steve



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:44                   ` Steven Rostedt
@ 2013-03-21 18:53                     ` Christoph Lameter
  2013-03-21 19:16                       ` Steven Rostedt
  2013-03-21 18:59                     ` Paul E. McKenney
  1 sibling, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-21 18:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh,
	zhong, khilman, geoff, tglx

On Thu, 21 Mar 2013, Steven Rostedt wrote:

> For now, no, if more than one process is scheduled on the CPU, we fall
> out of dynamic tick mode. In the future, we can add SCHED_FIFO task
> scheduled in to trigger it. But lets conquer that after we successfully
> conquer the current changes.

Be glad to see whatever is possible merged as soon as possible. But be
aware that we will fall out of dyntick mode at mininum every 2 seconds
because that is when the per cpu vm stats and the slab scanning is
occurring. These are both deferrable activities.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:39                   ` Christoph Lameter
@ 2013-03-21 18:58                     ` Paul E. McKenney
  2013-03-21 20:04                       ` Christoph Lameter
  2013-03-22 19:01                       ` Kevin Hilman
  0 siblings, 2 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 18:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 06:39:09PM +0000, Christoph Lameter wrote:
> On Thu, 21 Mar 2013, Paul E. McKenney wrote:
> 
> > > Why are there multiple rcuo threads? Would a single thread that may be
> > > able to run on multiple cpus not be sufficient?
> >
> > In many cases, this would indeed be sufficient.  However, if you have
> > enough CPUs posting RCU callbacks, then the single thread would become
> > a bottleneck, eventually resulting in an OOM.  Per-CPU kthreads avoid
> > this possibility.
> 
> Spawn another if the load gets too high for a single cpu?
> 
> > That said, if you know that your workload's RCU callbacks could be
> > serviced by a single CPU, you can bind all the rcuo kthreads to a
> > single CPU.
> 
> Yeah doing that right now but I'd like to see it handled without manual
> intervention.

Given that RCU has no idea where you want them to run, some manual
intervention would most likely be required even if RCU spawned them
dynamically, right?

> > > > Again, good point!
> > >
> > > Uggh. That will cause problems and did cause problems when I tried to use
> > > nohz.
> > >
> > > The OS always has some sched other tasks around that become runnable after
> > > a while (like for example the vm statistics update, or the notorious slab
> > > scanning). As long as SCHED_FIFO is active and there is no process in the
> > > same scheduling class then tick needs to be off. Also wish that this would
> > > work with SCHED_OTHER if there is only a single task with a certain renice
> > > value (-10?) and the rest is runnable at lower priorities. Maybe in that
> > > case stop the tick for a longer period and then give the lower priority
> > > tasks a chance to run but then switch off the tick again.
> >
> > These sound to me like good future enhancements.
> >
> > In the meantime, one approach is to bind all these SCHED_OTHER tasks
> > to designated housekeeping CPU(s) that don't run your main workload.
> 
> One cannot bind kevent threads and other per cpu threads to another
> processor. So right now there is no way to avoid this issue.

Yep, my approach works only for those threads that are free to migrate.
Of course, in some cases, you can avoid per-CPU threads running by pinning
interrupts, avoiding certain operations in your workload, and so on.

So, again, removing scheduling-clock interrupts in more situations is
a good future enhancement.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:44                   ` Steven Rostedt
  2013-03-21 18:53                     ` Christoph Lameter
@ 2013-03-21 18:59                     ` Paul E. McKenney
  1 sibling, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 18:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Christoph Lameter, Frederic Weisbecker, Rob Landley,
	linux-kernel, josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 02:44:22PM -0400, Steven Rostedt wrote:
> On Thu, 2013-03-21 at 10:15 -0700, Paul E. McKenney wrote:
> 
> > > The OS always has some sched other tasks around that become runnable after
> > > a while (like for example the vm statistics update, or the notorious slab
> > > scanning). As long as SCHED_FIFO is active and there is no process in the
> > > same scheduling class then tick needs to be off. Also wish that this would
> > > work with SCHED_OTHER if there is only a single task with a certain renice
> > > value (-10?) and the rest is runnable at lower priorities. Maybe in that
> > > case stop the tick for a longer period and then give the lower priority
> > > tasks a chance to run but then switch off the tick again.
> > 
> > These sound to me like good future enhancements.
> 
> Exactly. Please, this is a complex enough change to something that is
> critical to the entire system (similar to RCU itself). Lets take baby
> steps here and get it right each step of the way.
> 
> For now, no, if more than one process is scheduled on the CPU, we fall
> out of dynamic tick mode. In the future, we can add SCHED_FIFO task
> scheduled in to trigger it. But lets conquer that after we successfully
> conquer the current changes.

What Steve said!!!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:53                     ` Christoph Lameter
@ 2013-03-21 19:16                       ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2013-03-21 19:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh,
	zhong, khilman, geoff, tglx

On Thu, 2013-03-21 at 18:53 +0000, Christoph Lameter wrote:
> On Thu, 21 Mar 2013, Steven Rostedt wrote:
> 
> > For now, no, if more than one process is scheduled on the CPU, we fall
> > out of dynamic tick mode. In the future, we can add SCHED_FIFO task
> > scheduled in to trigger it. But lets conquer that after we successfully
> > conquer the current changes.
> 
> Be glad to see whatever is possible merged as soon as possible. But be
> aware that we will fall out of dyntick mode at mininum every 2 seconds
> because that is when the per cpu vm stats and the slab scanning is
> occurring. These are both deferrable activities.
> 

Thanks for giving us the heads up.

Yeah, I understand your concern. Even when the current patch set is in,
I'm not claiming success. Just like how most of -rt is now in mainline.
It's not completely finished until the rest of -rt is there. I feel the
same with the dynamic tick patches. It will get in in stages. But it
truly isn't there until we have it  fully functional, that even you will
be pleased with the result ;-)

-- Steve



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:58                     ` Paul E. McKenney
@ 2013-03-21 20:04                       ` Christoph Lameter
  2013-03-21 20:42                         ` Frederic Weisbecker
                                           ` (2 more replies)
  2013-03-22 19:01                       ` Kevin Hilman
  1 sibling, 3 replies; 43+ messages in thread
From: Christoph Lameter @ 2013-03-21 20:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, 21 Mar 2013, Paul E. McKenney wrote:

> > Yeah doing that right now but I'd like to see it handled without manual
> > intervention.
>
> Given that RCU has no idea where you want them to run, some manual
> intervention would most likely be required even if RCU spawned them
> dynamically, right?

If rcuoXX is a SCHED_OTHER process/thread then the kernel will move it to
another processor from the one running the SCHED_FIFO task. There would be
no manual intervention required.

> So, again, removing scheduling-clock interrupts in more situations is
> a good future enhancement.

The point here is that the check for a single runnable process is wrong
because it accounts for tasks in all scheduling classes.

It would be better to check if there is only one runnable task in the
highest scheduling class. That would work and defer the SCHED_OTHER kernel
threads for the SCHED_FIFO thread.

I am wondering how you actually can get NOHZ to work right? There is
always a kernel thread that is scheduled in a couple of ticks.

I guess what will happens with this patchset is:

1. SCHED_FIFO thread begins to run. There is only a single runnable task
so adaptive tick mode is enabled.

2. After 2 seconds or so some or other thing needs to run (keventd thread
needs to run vm statistics f.e.). It becomes runnable. nr_running > 1.
Adaptive tick mode is disabled? Occurs on my system. Or is there some
other trick to avoid kernel threads becoming runnable?

3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to
run with the tick. The kernel thread is also runnable but will not be
given cpu time since the SCHED_FIFO thread has priority?

So the SCHED_FIFO thread enjoys 2 seconds of no tick time and then ticks
occur uselessly from there on?


I have not been able to consistently get the tick switched off with
the nohz patchset. How do others use nohz? Is it only usable for short
periods of less than 2 seconds?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 20:04                       ` Christoph Lameter
@ 2013-03-21 20:42                         ` Frederic Weisbecker
  2013-03-21 21:02                           ` Christoph Lameter
  2013-03-21 20:50                         ` Paul E. McKenney
  2013-03-22  9:52                         ` Mats Liljegren
  2 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-21 20:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

2013/3/21 Christoph Lameter <cl@linux.com>:
> On Thu, 21 Mar 2013, Paul E. McKenney wrote:
>> So, again, removing scheduling-clock interrupts in more situations is
>> a good future enhancement.
>
> The point here is that the check for a single runnable process is wrong
> because it accounts for tasks in all scheduling classes.
>
> It would be better to check if there is only one runnable task in the
> highest scheduling class. That would work and defer the SCHED_OTHER kernel
> threads for the SCHED_FIFO thread.

It sounds that simple but it's more complicated. It requires some more
hooks on the scheduler, namely in the sched_switch/dequeue path so
that when the last task of a class goes to sleep, we check what else
is running and whether we need to restart the tick or not depending on
the class of the next task and how many tasks are there.

This will probably need to go in sched_class::dequeue_task(). There is
some careful and subtle attention to put on that.

Of course we want to improve that in the long run. But for now we have
a KISS solution that works. And like Steve said, the patchset is
complicated enough. Move baby steps forward to ease the upstream
integration.

> I am wondering how you actually can get NOHZ to work right? There is
> always a kernel thread that is scheduled in a couple of ticks.
>
> I guess what will happens with this patchset is:
>
> 1. SCHED_FIFO thread begins to run. There is only a single runnable task
> so adaptive tick mode is enabled.
>
> 2. After 2 seconds or so some or other thing needs to run (keventd thread
> needs to run vm statistics f.e.). It becomes runnable. nr_running > 1.
> Adaptive tick mode is disabled? Occurs on my system. Or is there some
> other trick to avoid kernel threads becoming runnable?
>
> 3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to
> run with the tick. The kernel thread is also runnable but will not be
> given cpu time since the SCHED_FIFO thread has priority?
>
> So the SCHED_FIFO thread enjoys 2 seconds of no tick time and then ticks
> occur uselessly from there on?
>
>
> I have not been able to consistently get the tick switched off with
> the nohz patchset. How do others use nohz? Is it only usable for short
> periods of less than 2 seconds?

Sure, for now just don't use SCHED_FIFO and you will have a much more
extended dynticks coverage.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 20:04                       ` Christoph Lameter
  2013-03-21 20:42                         ` Frederic Weisbecker
@ 2013-03-21 20:50                         ` Paul E. McKenney
  2013-03-22 14:38                           ` Christoph Lameter
  2013-03-22  9:52                         ` Mats Liljegren
  2 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-21 20:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, Mar 21, 2013 at 08:04:08PM +0000, Christoph Lameter wrote:
> On Thu, 21 Mar 2013, Paul E. McKenney wrote:
> 
> > > Yeah doing that right now but I'd like to see it handled without manual
> > > intervention.
> >
> > Given that RCU has no idea where you want them to run, some manual
> > intervention would most likely be required even if RCU spawned them
> > dynamically, right?
> 
> If rcuoXX is a SCHED_OTHER process/thread then the kernel will move it to
> another processor from the one running the SCHED_FIFO task. There would be
> no manual intervention required.

Assuming that the SCHED_FIFO task was running at the time that RCU
decided to spawn the kthread, and assuming that there was at least
one CPU not running a SCHED_FIFO task, agreed.  But these assumptions
do not hold in general.

> > So, again, removing scheduling-clock interrupts in more situations is
> > a good future enhancement.
> 
> The point here is that the check for a single runnable process is wrong
> because it accounts for tasks in all scheduling classes.

Incomplete, yes.  Only a starting point, yes.  Wrong, no.

> It would be better to check if there is only one runnable task in the
> highest scheduling class. That would work and defer the SCHED_OTHER kernel
> threads for the SCHED_FIFO thread.

Agreed, that would be better.  Hopefully we will handle that and other
similar cases at some point.

> I am wondering how you actually can get NOHZ to work right? There is
> always a kernel thread that is scheduled in a couple of ticks.
> 
> I guess what will happens with this patchset is:
> 
> 1. SCHED_FIFO thread begins to run. There is only a single runnable task
> so adaptive tick mode is enabled.

Yep.

> 2. After 2 seconds or so some or other thing needs to run (keventd thread
> needs to run vm statistics f.e.). It becomes runnable. nr_running > 1.
> Adaptive tick mode is disabled? Occurs on my system. Or is there some
> other trick to avoid kernel threads becoming runnable?

Yes, adaptive tick mode would be disabled at that point.

> 3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to
> run with the tick. The kernel thread is also runnable but will not be
> given cpu time since the SCHED_FIFO thread has priority?

Yep.

> So the SCHED_FIFO thread enjoys 2 seconds of no tick time and then ticks
> occur uselessly from there on?

If the SCHED_FIFO thread never sleeps at all, this would be the outcome.
On the other hand, if the SCHED_FIFO thread never sleeps at all, the
various per-CPU kthreads are deferred forever, which might not be so
good long term.

If the SCHED_FIFO thread does sleep at some point, the SCHED_OTHER threads
would run, the CPU would go idle, and then when the SCHED_OTHER thread
started up again, it would start up in adaptive-idle mode.

> I have not been able to consistently get the tick switched off with
> the nohz patchset. How do others use nohz? Is it only usable for short
> periods of less than 2 seconds?

I believe that many other SCHED_FIFO users run their SCHED_FIFO threads
in short bursts to respond to some real-time event.  They would not tend
to have a SCHED_FIFO thread with a busy period exceeding two seconds,
and therefore would be less likely to encounter this issue.

So, how long of busy periods are you contemplating for your SCHED_FIFO
threads?  Is it possible to tune/adjust the offending per-CPU ktheads
to wake up less frequently than that time?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 20:42                         ` Frederic Weisbecker
@ 2013-03-21 21:02                           ` Christoph Lameter
  2013-03-21 21:06                             ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-21 21:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, 21 Mar 2013, Frederic Weisbecker wrote:

> Sure, for now just don't use SCHED_FIFO and you will have a much more
> extended dynticks coverage.

Ah. Ok. Important information. That would mean no tick for the 2 second
intervals between the vm stats etc. Much much better than now where we
have a tick 1000 times per second.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 21:02                           ` Christoph Lameter
@ 2013-03-21 21:06                             ` Frederic Weisbecker
  0 siblings, 0 replies; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-21 21:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

2013/3/21 Christoph Lameter <cl@linux.com>:
> On Thu, 21 Mar 2013, Frederic Weisbecker wrote:
>
>> Sure, for now just don't use SCHED_FIFO and you will have a much more
>> extended dynticks coverage.
>
> Ah. Ok. Important information. That would mean no tick for the 2 second
> intervals between the vm stats etc. Much much better than now where we
> have a tick 1000 times per second.

I can't guarantee no tick, there can be many reasons for the tick to
happen. But if you don't run SCHED_FIFO, the pending kernel thread can
execute quickly and give back the CPU for your task alone, then the
tick can shut down again.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 15:45                 ` Arjan van de Ven
  2013-03-21 17:18                   ` Paul E. McKenney
@ 2013-03-22  4:59                   ` Rob Landley
  1 sibling, 0 replies; 43+ messages in thread
From: Rob Landley @ 2013-03-22  4:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Steven Rostedt, paulmck, Frederic Weisbecker, linux-kernel, josh,
	zhong, khilman, geoff, tglx

On 03/21/2013 10:45:07 AM, Arjan van de Ven wrote:
> On 3/20/2013 5:27 PM, Steven Rostedt wrote:
>> I'm not sure I would recommend idle=poll either. It would certainly
>> work, but it goes to the other extreme. You think NO_HZ=n drains a
>> battery? Try idle=poll.
> 
> 
> do not ever use idle=poll on anything production.. really bad idea.
> 
> if you temporary cannot cope with the latency, you can use the PMQOS  
> system
> to limit (including going all the way to idle=poll).
> but using idle=poll completely is very nasty for the hardware.
> 
> In addition we should document that idle=poll will cost you peak  
> performance,
> possibly quite a bit.

Where should that be documented?

> the same is true for the kernel paramter to some extend; it's there  
> to work around
> broken bioses/hardware/etc; if you have a latency/runtime  
> requirement, it's much better
> to use PMQOS for this from userspace.

I googled and found  
http://elinux.org/images/f/f9/Elc2008_pm_qos_slides.pdf

Rob

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 20:04                       ` Christoph Lameter
  2013-03-21 20:42                         ` Frederic Weisbecker
  2013-03-21 20:50                         ` Paul E. McKenney
@ 2013-03-22  9:52                         ` Mats Liljegren
  2 siblings, 0 replies; 43+ messages in thread
From: Mats Liljegren @ 2013-03-22  9:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Steven Rostedt, Frederic Weisbecker,
	Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx

Christoph Lameter wrote:
> On Thu, 21 Mar 2013, Paul E. McKenney wrote:
> 
> > > Yeah doing that right now but I'd like to see it handled without manual
> > > intervention.
> >
> > Given that RCU has no idea where you want them to run, some manual
> > intervention would most likely be required even if RCU spawned them
> > dynamically, right?
> 
> If rcuoXX is a SCHED_OTHER process/thread then the kernel will move it to
> another processor from the one running the SCHED_FIFO task. There would be
> no manual intervention required.
> 
> > So, again, removing scheduling-clock interrupts in more situations is
> > a good future enhancement.
> 
> The point here is that the check for a single runnable process is wrong
> because it accounts for tasks in all scheduling classes.
> 
> It would be better to check if there is only one runnable task in the
> highest scheduling class. That would work and defer the SCHED_OTHER kernel
> threads for the SCHED_FIFO thread.
> 
> I am wondering how you actually can get NOHZ to work right? There is
> always a kernel thread that is scheduled in a couple of ticks.

In my case I use 2 CPU PandaBoard where I use cpuset to create a
non-realtime domain for CPU0 and a real-time domain for CPU1. I then move
all kernel threads and IRQs to CPU0, leaving only the application specific
IRQ for CPU1. I then start a singe thread on CPU1.

I use a quite down-stripped version of Linux built using Yocto. I have run
the application for a minute and got 70-80 ticks, most (all?) occurring
during start and exit of the application. I use 100Hz ticks.

So personally I do get something by using full NOHZ in its current
incarnation. I'd like some better interrupt latency though, so disabling
 nohz-idle might be interesting for me. But that's another story...

-- Mats

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 20:50                         ` Paul E. McKenney
@ 2013-03-22 14:38                           ` Christoph Lameter
  2013-03-22 16:28                             ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-22 14:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Thu, 21 Mar 2013, Paul E. McKenney wrote:

> So, how long of busy periods are you contemplating for your SCHED_FIFO
> threads?  Is it possible to tune/adjust the offending per-CPU ktheads
> to wake up less frequently than that time?

Test programs right now run 10 seconds. 30 seconds would definitely be
enough for the worst case.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-22 14:38                           ` Christoph Lameter
@ 2013-03-22 16:28                             ` Paul E. McKenney
  2013-03-25 14:31                               ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-22 16:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote:
> On Thu, 21 Mar 2013, Paul E. McKenney wrote:
> 
> > So, how long of busy periods are you contemplating for your SCHED_FIFO
> > threads?  Is it possible to tune/adjust the offending per-CPU ktheads
> > to wake up less frequently than that time?
> 
> Test programs right now run 10 seconds. 30 seconds would definitely be
> enough for the worst case.

OK, that might be doable for some workloads.  What happens when you
try tuning the 2-second wakeup interval to (say) 45 seconds?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:02                       ` Paul E. McKenney
@ 2013-03-22 18:37                         ` Kevin Hilman
  2013-03-22 19:25                           ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Hilman @ 2013-03-22 18:37 UTC (permalink / raw)
  To: paulmck
  Cc: Arjan van de Ven, Steven Rostedt, Frederic Weisbecker,
	Rob Landley, linux-kernel, josh, zhong, geoff, tglx

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote:
>> On 3/21/2013 10:18 AM, Paul E. McKenney wrote:
>> >	o	Use the "idle=poll" boot parameter.  However, please note
>> >		that use of this parameter can cause your CPU to overheat,
>> >		which may cause thermal throttling to degrade your
>> >		latencies --and that this degradation can be even worse
>> >		than that of dyntick-idle.
>> 
>> it also disables (effectively) Turbo Mode on Intel cpus... which can
>> cost you a serious percentage of performance
>
> Thank you, added!  Please see below for the updated list.
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> o	Dyntick-idle slows transitions to and from idle slightly.
> 	In practice, this has not been a problem except for the most
> 	aggressive real-time workloads, which have the option of disabling
> 	dyntick-idle mode, an option that most of them take.  However,
> 	some workloads will no doubt want to use adaptive ticks to
> 	eliminate scheduling-clock-tick latencies.  Here are some
> 	options for these workloads:
>
> 	a.	Use PMQOS from userspace to inform the kernel of your
> 		latency requirements (preferred).

This is not only the preferred approach, but the *only* approach
available on non-x86 systems.  Perhaps the others should be marked as
x86-only?

Kevin

> 	b.	Use the "idle=mwait" boot parameter.
>
> 	c.	Use the "intel_idle.max_cstate=" to limit the maximum
> 		depth C-state depth.
>
> 	d.	Use the "idle=poll" boot parameter.  However, please note
> 		that use of this parameter can cause your CPU to overheat,
> 		which may cause thermal throttling to degrade your
> 		latencies -- and that this degradation can be even worse
> 		than that of dyntick-idle.  Furthermore, this parameter
> 		effectively disables Turbo Mode on Intel CPUs, which
> 		can significantly reduce maximum performance.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-21 18:58                     ` Paul E. McKenney
  2013-03-21 20:04                       ` Christoph Lameter
@ 2013-03-22 19:01                       ` Kevin Hilman
  1 sibling, 0 replies; 43+ messages in thread
From: Kevin Hilman @ 2013-03-22 19:01 UTC (permalink / raw)
  To: paulmck
  Cc: Christoph Lameter, Steven Rostedt, Frederic Weisbecker,
	Rob Landley, linux-kernel, josh, zhong, geoff, tglx

[...]

>> >
>> > In the meantime, one approach is to bind all these SCHED_OTHER tasks
>> > to designated housekeeping CPU(s) that don't run your main workload.
>> 
>> One cannot bind kevent threads and other per cpu threads to another
>> processor. So right now there is no way to avoid this issue.
>
> Yep, my approach works only for those threads that are free to migrate.
> Of course, in some cases, you can avoid per-CPU threads running by pinning
> interrupts, avoiding certain operations in your workload, and so on.

Speaking of threads that are not free to migrate, you might add a bit to
the doc explaining that these various kernel threads that cannot migrate
are also potential sources of jitter and also reasons why a CPU may exit
(or not enter) full nohz mode.

And thanks a ton for writing up this detailed doc.  Speaking as someone
trying to get full nohz working on a new arch (ARM), getting my head
around all of this has been challenging, and your doc is a great help.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-22 18:37                         ` Kevin Hilman
@ 2013-03-22 19:25                           ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2013-03-22 19:25 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Arjan van de Ven, Steven Rostedt, Frederic Weisbecker,
	Rob Landley, linux-kernel, josh, zhong, geoff, tglx

On Fri, Mar 22, 2013 at 11:37:55AM -0700, Kevin Hilman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote:
> >> On 3/21/2013 10:18 AM, Paul E. McKenney wrote:
> >> >	o	Use the "idle=poll" boot parameter.  However, please note
> >> >		that use of this parameter can cause your CPU to overheat,
> >> >		which may cause thermal throttling to degrade your
> >> >		latencies --and that this degradation can be even worse
> >> >		than that of dyntick-idle.
> >> 
> >> it also disables (effectively) Turbo Mode on Intel cpus... which can
> >> cost you a serious percentage of performance
> >
> > Thank you, added!  Please see below for the updated list.
> >
> > 							Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > o	Dyntick-idle slows transitions to and from idle slightly.
> > 	In practice, this has not been a problem except for the most
> > 	aggressive real-time workloads, which have the option of disabling
> > 	dyntick-idle mode, an option that most of them take.  However,
> > 	some workloads will no doubt want to use adaptive ticks to
> > 	eliminate scheduling-clock-tick latencies.  Here are some
> > 	options for these workloads:
> >
> > 	a.	Use PMQOS from userspace to inform the kernel of your
> > 		latency requirements (preferred).
> 
> This is not only the preferred approach, but the *only* approach
> available on non-x86 systems.  Perhaps the others should be marked as
> x86-only?

Good point, added that.

							Thanx, Paul

> Kevin
> 
> > 	b.	Use the "idle=mwait" boot parameter.
> >
> > 	c.	Use the "intel_idle.max_cstate=" to limit the maximum
> > 		depth C-state depth.
> >
> > 	d.	Use the "idle=poll" boot parameter.  However, please note
> > 		that use of this parameter can cause your CPU to overheat,
> > 		which may cause thermal throttling to degrade your
> > 		latencies -- and that this degradation can be even worse
> > 		than that of dyntick-idle.  Furthermore, this parameter
> > 		effectively disables Turbo Mode on Intel CPUs, which
> > 		can significantly reduce maximum performance.
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-22 16:28                             ` Paul E. McKenney
@ 2013-03-25 14:31                               ` Christoph Lameter
  2013-03-25 14:37                                 ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-25 14:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Fri, 22 Mar 2013, Paul E. McKenney wrote:

> On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote:
> > On Thu, 21 Mar 2013, Paul E. McKenney wrote:
> >
> > > So, how long of busy periods are you contemplating for your SCHED_FIFO
> > > threads?  Is it possible to tune/adjust the offending per-CPU ktheads
> > > to wake up less frequently than that time?
> >
> > Test programs right now run 10 seconds. 30 seconds would definitely be
> > enough for the worst case.
>
> OK, that might be doable for some workloads.  What happens when you
> try tuning the 2-second wakeup interval to (say) 45 seconds?

The vm kernel threads do no useful work if no system calls are being done.
If there is no kernel action then they can be deferred indefinitely.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-25 14:31                               ` Christoph Lameter
@ 2013-03-25 14:37                                 ` Frederic Weisbecker
  2013-03-25 15:18                                   ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-25 14:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

2013/3/25 Christoph Lameter <cl@linux.com>:
> On Fri, 22 Mar 2013, Paul E. McKenney wrote:
>
>> On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote:
>> > On Thu, 21 Mar 2013, Paul E. McKenney wrote:
>> >
>> > > So, how long of busy periods are you contemplating for your SCHED_FIFO
>> > > threads?  Is it possible to tune/adjust the offending per-CPU ktheads
>> > > to wake up less frequently than that time?
>> >
>> > Test programs right now run 10 seconds. 30 seconds would definitely be
>> > enough for the worst case.
>>
>> OK, that might be doable for some workloads.  What happens when you
>> try tuning the 2-second wakeup interval to (say) 45 seconds?
>
> The vm kernel threads do no useful work if no system calls are being done.
> If there is no kernel action then they can be deferred indefinitely.
>

We can certainly add some user deferrable timer_list. But that's going
to be for extreme usecases (those who require pure isolation) because
we'll need to settle that with a timer reprogramming into user/kernel
boundaries. That won't be free.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-25 14:37                                 ` Frederic Weisbecker
@ 2013-03-25 15:18                                   ` Christoph Lameter
  2013-03-25 15:20                                     ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2013-03-25 15:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

On Mon, 25 Mar 2013, Frederic Weisbecker wrote:

> > The vm kernel threads do no useful work if no system calls are being done.
> > If there is no kernel action then they can be deferred indefinitely.
> >
>
> We can certainly add some user deferrable timer_list. But that's going
> to be for extreme usecases (those who require pure isolation) because
> we'll need to settle that with a timer reprogramming into user/kernel
> boundaries. That won't be free.

These timers are already marked deferrable and are deferred for the idle
dynticks case. Could we reuse the same logic? See timer.h around the
define of TIMER_DEFERRABLE. I just assumed so far that the dyntick idle
logic would have been used for this case.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] nohz1: Documentation
  2013-03-25 15:18                                   ` Christoph Lameter
@ 2013-03-25 15:20                                     ` Frederic Weisbecker
  0 siblings, 0 replies; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-25 15:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel,
	josh, zhong, khilman, geoff, tglx

2013/3/25 Christoph Lameter <cl@linux.com>:
> On Mon, 25 Mar 2013, Frederic Weisbecker wrote:
>
>> > The vm kernel threads do no useful work if no system calls are being done.
>> > If there is no kernel action then they can be deferred indefinitely.
>> >
>>
>> We can certainly add some user deferrable timer_list. But that's going
>> to be for extreme usecases (those who require pure isolation) because
>> we'll need to settle that with a timer reprogramming into user/kernel
>> boundaries. That won't be free.
>
> These timers are already marked deferrable and are deferred for the idle
> dynticks case. Could we reuse the same logic? See timer.h around the
> define of TIMER_DEFERRABLE. I just assumed so far that the dyntick idle
> logic would have been used for this case.

We need to audit all deferreable timers to check if their
deferrability in idle also applies for userspace. If so may be we can
consider that.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2013-03-25 15:27 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-18 16:29 [PATCH] nohz1: Documentation Paul E. McKenney
2013-03-18 18:13 ` Rob Landley
2013-03-18 18:46   ` Frederic Weisbecker
2013-03-18 19:59     ` Rob Landley
2013-03-18 20:48       ` Frederic Weisbecker
2013-03-18 22:25         ` Paul E. McKenney
2013-03-20 23:32           ` Steven Rostedt
2013-03-20 23:55             ` Paul E. McKenney
2013-03-21  0:27               ` Steven Rostedt
2013-03-21  2:22                 ` Paul E. McKenney
2013-03-21 10:16                   ` Borislav Petkov
2013-03-21 15:18                     ` Paul E. McKenney
2013-03-21 16:00                       ` Borislav Petkov
2013-03-21 15:45                 ` Arjan van de Ven
2013-03-21 17:18                   ` Paul E. McKenney
2013-03-21 17:41                     ` Arjan van de Ven
2013-03-21 18:02                       ` Paul E. McKenney
2013-03-22 18:37                         ` Kevin Hilman
2013-03-22 19:25                           ` Paul E. McKenney
2013-03-22  4:59                   ` Rob Landley
2013-03-21 18:01                 ` Frederic Weisbecker
2013-03-21 18:26                   ` Paul E. McKenney
2013-03-21 16:08               ` Christoph Lameter
2013-03-21 17:15                 ` Paul E. McKenney
2013-03-21 18:39                   ` Christoph Lameter
2013-03-21 18:58                     ` Paul E. McKenney
2013-03-21 20:04                       ` Christoph Lameter
2013-03-21 20:42                         ` Frederic Weisbecker
2013-03-21 21:02                           ` Christoph Lameter
2013-03-21 21:06                             ` Frederic Weisbecker
2013-03-21 20:50                         ` Paul E. McKenney
2013-03-22 14:38                           ` Christoph Lameter
2013-03-22 16:28                             ` Paul E. McKenney
2013-03-25 14:31                               ` Christoph Lameter
2013-03-25 14:37                                 ` Frederic Weisbecker
2013-03-25 15:18                                   ` Christoph Lameter
2013-03-25 15:20                                     ` Frederic Weisbecker
2013-03-22  9:52                         ` Mats Liljegren
2013-03-22 19:01                       ` Kevin Hilman
2013-03-21 18:44                   ` Steven Rostedt
2013-03-21 18:53                     ` Christoph Lameter
2013-03-21 19:16                       ` Steven Rostedt
2013-03-21 18:59                     ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.