* [PATCH] nohz1: Documentation @ 2013-03-18 16:29 Paul E. McKenney 2013-03-18 18:13 ` Rob Landley 0 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-18 16:29 UTC (permalink / raw) To: fweisbec; +Cc: linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx First attempt at documentation for adaptive ticks. Thoughts? Thanx, Paul ------------------------------------------------------------------------ nohz1: Documentation Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt new file mode 100644 index 0000000..7279109 --- /dev/null +++ b/Documentation/timers/NO_HZ.txt @@ -0,0 +1,200 @@ + NO_HZ: Reducing Scheduling-Clock Ticks + + +This document covers kernel configuration variables used to reduce +the number of scheduling-clock interrupts. These reductions can be +helpful in improving energy efficiency and in reducing "OS jitter", +the latter being very important for some types of computationally +intensive high-performance computing (HPC) applications and for real-time +applications. + +Within the Linux kernel, there are two major aspects of scheduling-clock +interrupt reduction: + +1. Idle CPUs. + +2. CPUs having only one runnable task. + +These two cases are described in the following sections. + + +IDLE CPUs + +If a CPU is idle, there is little point in sending it a scheduling-clock +interrupt. After all, the primary purpose of a scheduling-clock interrupt +is to force a busy CPU to shift its attention among multiple duties, +but an idle CPU by definition has no duties to shift its attention among. + +The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending +scheduling-clock interrupts to idle CPUs, which is critically important +both to battery-powered devices and to highly virtualized mainframes. +A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its +battery very quickly, easily 2-3x as fast as would the same device running +a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could +easily find that half of its CPU time was consumed by scheduling-clock +interrupts. In these situations, there is therefore strong motivation +to avoid sending scheduling-clock interrupts to idle CPUs. That said, +dyntick-idle mode is not free: + +1. It increases the number of instructions executed on the path + to and from the idle loop. + +2. Many architectures will place dyntick-idle CPUs into deep sleep + states, which further degrades from-idle transition latencies. + +Therefore, systems with aggressive real-time response constraints +often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle +transition latencies. + +An idle CPU that is not receiving scheduling-clock interrupts is said to +be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running +tickless". The remainder of this document will use "dyntick-idle mode". + +There is also a boot parameter "nohz=" that can be used to disable +dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". +By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling +dyntick-idle mode. + + +CPUs WITH ONLY ONE RUNNABLE TASK + +If a CPU has only one runnable task, there is again little point in +sending it a scheduling-clock interrupt. 
Recall that the primary +purpose of a scheduling-clock interrupt is to force a busy CPU to +shift its attention among many things requiring its attention -- and +there is nowhere else for a CPU with but one runnable task to shift its +attention to. + +The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid +sending scheduling-clock interrupts to CPUs with a single runnable task. +This is important for applications with aggressive real-time response +constraints because it allows them to improve their worst-case response +times by the maximum duration of a scheduling-clock interrupt. It is also +important for computationally intensive iterative workloads with short +iterations: If any CPU is delayed during a given iteration, all the +other CPUs will be forced to wait idle while the delayed CPU finished. +Thus, the delay is multiplied by one less than the number of CPUs. +In these situations, there is again strong motivation to avoid sending +scheduling-clock interrupts to CPUs that have but one runnable task that +is executing in user mode. + +Note that if a given CPU is in adaptive-ticks mode while executing in +user mode, transitioning to kernel mode does not automatically force +that CPU out of adaptive-ticks mode. The CPU will exit adaptive-ticks +mode only if needed, for example, if that CPU enqueues an RCU callback. + +Just as with dyntick-idle mode, the benefits of adaptive-tick mode do +not come for free: + +1. The user/kernel transitions are slightly more expensive due + to the need to inform kernel subsystems (such as RCU) about + the change in mode. + +2. POSIX CPU timers on adaptive-tick CPUs may fire late (or even + not at all) because they currently rely on scheduling-tick + interrupts. This will likely be fixed in one of two ways: (1) + Prevent CPUs with POSIX CPU timers from entering adaptive-tick + mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism + to cause the POSIX CPU timer to fire properly. + +3. If there are more perf events pending than the hardware can + accommodate, they are normally round-robined so as to collect + all of them over time. Adaptive-tick mode may prevent this + round-robining from happening. This will likely be fixed by + preventing CPUs with large numbers of perf events pending from + entering adaptive-tick mode. + +4. Scheduler statistics for adaptive-idle CPUs may be computed + slightly differently than those for non-adaptive-idle CPUs. + This may in turn perturb load-balancing of real-time tasks. + +5. The LB_BIAS scheduler feature is disabled by adaptive ticks. + +Although improvements are expected over time, adaptive ticks is quite +useful for many types of real-time and compute-intensive applications. +However, the drawbacks listed above mean that adaptive ticks should not +be enabled by default across the board at the current time. + + +RCU IMPLICATIONS + +There are situations in which idle CPUs cannot be permitted to +enter either dyntick-idle mode or adaptive-tick mode, the most +familiar being the case where that CPU has RCU callbacks pending. + +The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such +CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a +timer will awaken these CPUs every four jiffies in order to ensure that +the RCU callbacks are processed in a timely fashion. + +Another approach is to offload RCU callback processing to "rcuo" kthreads +using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be +selected via several methods: + +1. 
The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated + list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, + 3, 4, and 5. + +2. The RCU_NOCB_CPU_ZERO=y Kconfig option, which causes CPU 0 to + be offloaded. This is the build-time equivalent of "rcu_nocbs=0". + +3. The RCU_NOCB_CPU_ALL=y Kconfig option, which causes all CPUs + to be offloaded. On a 16-CPU system, this is equivalent to + "rcu_nocbs=0-15". + +The offloaded CPUs never have RCU callbacks queued, and therefore RCU +never prevents offloaded CPUs from entering either dyntick-idle mode or +adaptive-tick mode. That said, note that it is up to userspace to +pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the +scheduler will decide where to run them, which might or might not be +where you want them to run. + + +KNOWN ISSUES + +o Dyntick-idle slows transitions to and from idle slightly. + In practice, this has not been a problem except for the most + aggressive real-time workloads, which have the option of disabling + dyntick-idle mode, an option that most of them take. + +o Adaptive-ticks slows user/kernel transitions slightly. + This is not expected to be a problem for computational-intensive + workloads, which have few such transitions. Careful benchmarking + will be required to determine whether or not other workloads + are significantly affected by this effect. + +o Adaptive-ticks does not do anything unless there is only one + runnable task for a given CPU, even though there are a number + of other situations where the scheduling-clock tick is not + needed. To give but one example, consider a CPU that has one + runnable high-priority SCHED_FIFO task and an arbitrary number + of low-priority SCHED_OTHER tasks. In this case, the CPU is + required to run the SCHED_FIFO task until either it blocks or + some other higher-priority task awakens on (or is assigned to) + this CPU, so there is no point in sending a scheduling-clock + interrupt to this CPU. + + Better handling of these sorts of situations is future work. + +o A reboot is required to reconfigure both adaptive idle and RCU + callback offloading. Runtime reconfiguration could be provided + if needed, however, due to the complexity of reconfiguring RCU + at runtime, there would need to be an earthshakingly good reason. + Especially given the option of simply offloading RCU callbacks + from all CPUs. + +o Additional configuration is required to deal with other sources + of OS jitter, including interrupts and system-utility tasks + and processes. + +o Some sources of OS jitter can currently be eliminated only by + constraining the workload. For example, the only way to eliminate + OS jitter due to global TLB shootdowns is to avoid the unmapping + operations (such as kernel module unload operations) that result + in these shootdowns. For another example, page faults and TLB + misses can be reduced (and in some cases eliminated) by using + huge pages and by constraining the amount of memory used by the + application. + +o At least one CPU must keep the scheduling-clock interrupt going + in order to support accurate timekeeping. ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation
  2013-03-18 16:29 [PATCH] nohz1: Documentation Paul E. McKenney
@ 2013-03-18 18:13 ` Rob Landley
  2013-03-18 18:46   ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2013-03-18 18:13 UTC (permalink / raw)
  To: paulmck
  Cc: fweisbec, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> First attempt at documentation for adaptive ticks.
>
> Thoughts?
>
> 							Thanx, Paul

It's really long and repetitive? And really seems like it's kconfig
help text?

The CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y options cause the kernel
to (respectively) avoid sending scheduling-clock interrupts to idle
processors, or to processors with only a single runnable task. You
can disable this at boot time with kernel parameter "nohz=off".

This reduces power consumption by allowing processors to suspend more
deeply for longer periods, and can also improve some computationally
intensive workloads. The downside is coming out of a deeper sleep can
reduce realtime response to wakeup events.

This is split into two config options because the second isn't quite
finished and won't reliably deliver posix timer interrupts, perf
events, or do as well on CPU load balancing. The CONFIG_RCU_FAST_NO_HZ
option enables a workaround to force tick delivery every 4 jiffies to
handle RCU events. See the CONFIG_RCU_NOCB_CPU option for a different
workaround.

> +1.	It increases the number of instructions executed on the path
> +	to and from the idle loop.

This detail didn't get mentioned in my summary.

> +5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.

I have no idea what that one is, my summary didn't mention it.

> +Another approach is to offload RCU callback processing to "rcuo"
> kthreads
> +using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be
> +selected via several methods:
> +
> +1.	The "rcu_nocbs=" kernel boot parameter, which takes a
> comma-separated
> +	list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs
> 1,
> +	3, 4, and 5.
> +
> +2.	The RCU_NOCB_CPU_ZERO=y Kconfig option, which causes CPU 0 to
> +	be offloaded. This is the build-time equivalent of
> "rcu_nocbs=0".
> +
> +3.	The RCU_NOCB_CPU_ALL=y Kconfig option, which causes all CPUs
> +	to be offloaded. On a 16-CPU system, this is equivalent to
> +	"rcu_nocbs=0-15".
> +
> +The offloaded CPUs never have RCU callbacks queued, and therefore RCU
> +never prevents offloaded CPUs from entering either dyntick-idle mode
> or
> +adaptive-tick mode. That said, note that it is up to userspace to
> +pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the
> +scheduler will decide where to run them, which might or might not be
> +where you want them to run.

Ok, this whole chunk was just confusing and I glossed it. Why on earth
do you offer three wildly different ways to do the same thing? (You
have config options to set defaults?)

I _think_ the gloss is just:

RCU_NOCB_CPU_ALL=y moves each processor's RCU callback handling into
its own kernel thread, which the user can pin to specific CPUs if
desired. If you only want to move specific processors' RCU handling to
threads, list those processors on the kernel command line ala
"rcu_nocbs=1,3-5".

But that's a guess.

> +o	Additional configuration is required to deal with other sources
> +	of OS jitter, including interrupts and system-utility tasks
> +	and processes.
> +
> +o	Some sources of OS jitter can currently be eliminated only by
> +	constraining the workload. For example, the only way to
> eliminate
> +	OS jitter due to global TLB shootdowns is to avoid the unmapping
> +	operations (such as kernel module unload operations) that result
> +	in these shootdowns. For another example, page faults and TLB
> +	misses can be reduced (and in some cases eliminated) by using
> +	huge pages and by constraining the amount of memory used by the
> +	application.

If you want to write a doc on reducing system jitter, go for it. This
is a topic transition near the end of a document.

> +o	At least one CPU must keep the scheduling-clock interrupt going
> +	in order to support accurate timekeeping.

How? You never said how to tell a processor _not_ to suppress
interrupts when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.

I take it the problem is the value in the sysenter page won't get
updated, so gettimeofday() will see a stale value until the CPU hog
stops suppressing interrupts? I thought the first half of NOHZ had a
way of dealing with that many moons ago? (Did sysenter cause a
regression?)

Rob

^ permalink raw reply [flat|nested] 43+ messages in thread
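Rob's gloss leans on the "rcu_nocbs=1,3-5" list syntax. For readers who want to see exactly which CPUs such a string names, here is a small, purely illustrative userspace parser for that comma-and-range convention; it is a sketch of the format quoted from the patch, not the kernel's own parser.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a CPU list such as "1,3-5" (CPUs 1, 3, 4, 5) into a cpu_set_t. */
static int parse_cpulist(const char *list, cpu_set_t *set)
{
	char *dup = strdup(list), *tok, *save = NULL;

	CPU_ZERO(set);
	for (tok = strtok_r(dup, ",", &save); tok;
	     tok = strtok_r(NULL, ",", &save)) {
		int lo, hi;

		if (sscanf(tok, "%d-%d", &lo, &hi) == 2) {
			for (; lo <= hi; lo++)
				CPU_SET(lo, set);	/* a "lo-hi" range */
		} else if (sscanf(tok, "%d", &lo) == 1) {
			CPU_SET(lo, set);		/* a single CPU */
		} else {
			free(dup);
			return -1;
		}
	}
	free(dup);
	return 0;
}

int main(void)
{
	cpu_set_t set;

	if (parse_cpulist("1,3-5", &set) == 0)
		printf("CPU 4 selected: %s\n",
		       CPU_ISSET(4, &set) ? "yes" : "no");
	return 0;
}

Running it prints "CPU 4 selected: yes", matching the doc's statement that "1,3-5" selects CPUs 1, 3, 4, and 5.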
* Re: [PATCH] nohz1: Documentation
  2013-03-18 18:13 ` Rob Landley
@ 2013-03-18 18:46   ` Frederic Weisbecker
  2013-03-18 19:59     ` Rob Landley
  0 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-18 18:46 UTC (permalink / raw)
  To: Rob Landley
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

2013/3/18 Rob Landley <rob@landley.net>:
> On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> And really seems like it's kconfig help text?

It's more exhaustive than a Kconfig help. A Kconfig help text should
have the level of detail that describes the purpose and impact of a
feature, as well as some quick reference/pointer to the interface.

Deeper explanations, which include implementation internals,
fine-grained constraints, TODO lists, and detailed interfaces, are
better placed here.

> The CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y options cause the kernel
> to (respectively) avoid sending scheduling-clock interrupts to idle
> processors, or to processors with only a single runnable task.
> You can disable this at boot time with kernel parameter "nohz=off".
>
> This reduces power consumption by allowing processors to suspend more
> deeply for longer periods, and can also improve some computationally
> intensive workloads. The downside is coming out of a deeper sleep can
> reduce realtime response to wakeup events.
>
> This is split into two config options because the second isn't quite
> finished and won't reliably deliver posix timer interrupts, perf
> events, or do as well on CPU load balancing. The CONFIG_RCU_FAST_NO_HZ
> option enables a workaround to force tick delivery every 4 jiffies to
> handle RCU events. See the CONFIG_RCU_NOCB_CPU option for a different
> workaround.

I really think we want to keep all the detailed explanations from
Paul's doc. What we need is not a quick reference but very detailed
documentation.

>> +1.	It increases the number of instructions executed on the path
>> +	to and from the idle loop.
>
> This detail didn't get mentioned in my summary.

And it's an important point.

>> +5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
>
> I have no idea what that one is, my summary didn't mention it.

Nobody seems to know what that thing is, except probably the scheduler
warlocks :o)

All I know is that it's hard to implement without the tick. So I
disabled it in my tree.

>> +o	Some sources of OS jitter can currently be eliminated only by
>> +	constraining the workload. For example, the only way to eliminate
>> +	OS jitter due to global TLB shootdowns is to avoid the unmapping
>> +	operations (such as kernel module unload operations) that result
>> +	in these shootdowns. For another example, page faults and TLB
>> +	misses can be reduced (and in some cases eliminated) by using
>> +	huge pages and by constraining the amount of memory used by the
>> +	application.
>
> If you want to write a doc on reducing system jitter, go for it. This
> is a topic transition near the end of a document.

>> +o	At least one CPU must keep the scheduling-clock interrupt going
>> +	in order to support accurate timekeeping.
>
> How? You never said how to tell a processor _not_ to suppress
> interrupts when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.

Ah, indeed it would be nice to point out that there must be an online
CPU outside the value range of the nohz_mask= boot parameter.

> I take it the problem is the value in the sysenter page won't get
> updated, so gettimeofday() will see a stale value until the CPU hog
> stops suppressing interrupts? I thought the first half of NOHZ had a
> way of dealing with that many moons ago? (Did sysenter cause a
> regression?)

With CONFIG_NO_HZ, there is always a tick running that updates GTOD
and jiffies as long as there is a non-idle CPU. If all CPUs are idle
and one suddenly wakes up, the GTOD and jiffies values are caught up.

With full dynticks we have a new problem: there can be a CPU using
jiffies or GTOD without running the tick (we are not idle, so there
can be such users). So there must be a ticking CPU somewhere.

> Rob

^ permalink raw reply [flat|nested] 43+ messages in thread
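Frederic's closing point describes a non-idle CPU that wants to run without the tick, which is exactly the usage pattern CONFIG_NO_HZ_FULL targets: one task, pinned, spinning in pure user mode. A hedged sketch of such a workload follows; the choice of CPU 1 is an assumption that would have to match whatever CPU the boot parameters mark as adaptive-tick.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	volatile unsigned long iters = 0;

	CPU_ZERO(&set);
	CPU_SET(1, &set);	/* assumed adaptive-tick CPU */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/*
	 * Pure user-mode loop: no syscalls, so this stays the CPU's
	 * only runnable task and the tick can (in principle) be
	 * stopped out from under it.
	 */
	for (;;)
		iters++;
}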
* Re: [PATCH] nohz1: Documentation
  2013-03-18 18:46 ` Frederic Weisbecker
@ 2013-03-18 19:59   ` Rob Landley
  2013-03-18 20:48     ` Frederic Weisbecker
  0 siblings, 1 reply; 43+ messages in thread
From: Rob Landley @ 2013-03-18 19:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote:
> 2013/3/18 Rob Landley <rob@landley.net>:
> > On 03/18/2013 11:29:42 AM, Paul E. McKenney wrote:
> > And really seems like it's kconfig help text?
>
> It's more exhaustive than a Kconfig help. A Kconfig help text should
> have the level of detail that describes the purpose and impact of a
> feature, as well as some quick reference/pointer to the interface.
>
> Deeper explanations, which include implementation internals,
> fine-grained constraints, TODO lists, and detailed interfaces, are
> better placed here.

...

> I really think we want to keep all the detailed explanations from
> Paul's doc. What we need is not a quick reference but very detailed
> documentation.

It's much _longer_, I'm not sure it contains significantly more
information. ("Using more power will shorten battery life" is a nice
observation, but is it specific to your subsystem? I dunno, maybe it's
a personal idiosyncrasy, but I tend to think that people start with
use cases and need to find infrastructure. The other direction seems
less interesting somehow. Like a pan with a picture on the front of
what you might want to bake with it.)

> >> +1.	It increases the number of instructions executed on the path
> >> +	to and from the idle loop.
> >
> > This detail didn't get mentioned in my summary.
>
> And it's an important point.

I mentioned increased latency coming out of idle. Increased latency
going _to_ idle is an important point? (And pretty much _every_
kconfig option has ramifications at that level which realtime people
tend to want to bench.)

Also, I mentioned this one because all the other details I deleted
pretty much _did_ get taken into account in my summary.

> >> +5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
> >
> > I have no idea what that one is, my summary didn't mention it.
>
> Nobody seems to know what that thing is, except probably the scheduler
> warlocks :o)
> All I know is that it's hard to implement without the tick. So I
> disabled it in my tree.

Is it also an important point?

> >> +o	At least one CPU must keep the scheduling-clock interrupt going
> >> +	in order to support accurate timekeeping.
> >
> > How? You never said how to tell a processor _not_ to suppress
> > interrupts when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.
>
> Ah, indeed it would be nice to point out that there must be an online
> CPU outside the value range of the nohz_mask= boot parameter.

There's a nohz_mask boot parameter?

> > I take it the problem is the value in the sysenter page won't get
> > updated, so gettimeofday() will see a stale value until the CPU hog
> > stops suppressing interrupts? I thought the first half of NOHZ had
> > a way of dealing with that many moons ago? (Did sysenter cause a
> > regression?)
>
> With CONFIG_NO_HZ, there is always a tick running that updates GTOD
> and jiffies as long as there is a non-idle CPU. If all CPUs are idle
> and one suddenly wakes up, the GTOD and jiffies values are caught up.
>
> With full dynticks we have a new problem: there can be a CPU using
> jiffies or GTOD without running the tick (we are not idle, so there
> can be such users). So there must be a ticking CPU somewhere.

I.E. because gettimeofday() just checks a memory location without
requiring a kernel transition, there's no opportunity for the kernel
to trigger and run catch-up code.

So you'd need a timer to remove the read flag on the page containing
the jiffies value after it was considered sufficiently stale, and then
have the page fault update the value restore the read flag and reset
the timer to switch it off again, and then just tell CPU-intensive
code that wanted to take advantage of running uninterrupted not to
mess with jiffies unless they wanted to trigger interrupts to keep it
current.

By the way, I find this "full" name strange if you yourself have a
list of more cases where ticks could be dropped, but which you haven't
implemented yet. The system being entirely idle means unnecessary
ticks can be dropped. The system having no scheduling decisions to
make on a processor also means unnecessary ticks can be dropped. But
there are two config options and they get treated as entirely
different subsystems...

I suppose one of them having a bucket of workarounds and caveats is
the reason? One is just "let the system behave more efficiently, only
reason it's a config option is increased latency waking up from idle
can annoy the realtime guys". The second is "let the system behave
more efficiently in a way that opens up a bunch of sharp edges and
requires extensive micromanagement". But those sharp edges seem more
"unfinished" than really a design limitation...

Rob

^ permalink raw reply [flat|nested] 43+ messages in thread
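Rob's "just checks a memory location" description matches the vDSO fast path that gettimeofday() normally takes. One rough way to see the difference he is pointing at is to time the vDSO call against a forced kernel entry. This sketch assumes an architecture such as x86-64 that provides both SYS_gettimeofday and a vDSO implementation; its output is machine-dependent.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

static long elapsed_us(struct timeval *a, struct timeval *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000L +
	       (b->tv_usec - a->tv_usec);
}

int main(void)
{
	struct timeval tv, start, end;
	long i, n = 1000000;

	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		gettimeofday(&tv, NULL);		/* vDSO fast path */
	gettimeofday(&end, NULL);
	printf("libc/vDSO: %ld us\n", elapsed_us(&start, &end));

	gettimeofday(&start, NULL);
	for (i = 0; i < n; i++)
		syscall(SYS_gettimeofday, &tv, NULL);	/* forced kernel entry */
	gettimeofday(&end, NULL);
	printf("syscall:   %ld us\n", elapsed_us(&start, &end));
	return 0;
}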
* Re: [PATCH] nohz1: Documentation
  2013-03-18 19:59 ` Rob Landley
@ 2013-03-18 20:48   ` Frederic Weisbecker
  2013-03-18 22:25     ` Paul E. McKenney
  0 siblings, 1 reply; 43+ messages in thread
From: Frederic Weisbecker @ 2013-03-18 20:48 UTC (permalink / raw)
  To: Rob Landley
  Cc: paulmck, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx

2013/3/18 Rob Landley <rob@landley.net>:
> On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote:
>> I really think we want to keep all the detailed explanations from
>> Paul's doc. What we need is not a quick reference but very detailed
>> documentation.
>
> It's much _longer_, I'm not sure it contains significantly more
> information. ("Using more power will shorten battery life" is a nice
> observation, but is it specific to your subsystem? I dunno, maybe it's
> a personal idiosyncrasy, but I tend to think that people start with
> use cases and need to find infrastructure. The other direction seems
> less interesting somehow. Like a pan with a picture on the front of
> what you might want to bake with it.)

People start with a use case, find an infrastructure, and finally its
documentation that tells them the tradeoffs, constraints, and possible
enhancements. Yes, both directions are valuable.

Another point in favor of taking that direction: consider LB_BIAS. Do
you know what it's all about? Me neither. Too bad there is no
documentation. Obscure kernel code makes kernel hacking closer to
reverse engineering. As the kernel grows in complexity, this all will
have some interesting effects in the future. And I'm just rephrasing
what people like Andrew already started to say a few years ago.
Addition of detailed documentation of core (and even less core) kernel
code is hardly arguable.

>> >> +1.	It increases the number of instructions executed on the path
>> >> +	to and from the idle loop.
>> >
>> > This detail didn't get mentioned in my summary.
>>
>> And it's an important point.
>
> I mentioned increased latency coming out of idle. Increased latency
> going _to_ idle is an important point? (And pretty much _every_
> kconfig option has ramifications at that level which realtime people
> tend to want to bench.)

Yeah, increased latency in going to idle has consequences in terms of
energy saving, latency, and throughput.

> Also, I mentioned this one because all the other details I deleted
> pretty much _did_ get taken into account in my summary.

Certainly not with the same level of detail.

>> >> +5.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
>> >
>> > I have no idea what that one is, my summary didn't mention it.
>>
>> Nobody seems to know what that thing is, except probably the scheduler
>> warlocks :o)
>> All I know is that it's hard to implement without the tick. So I
>> disabled it in my tree.
>
> Is it also an important point?

Yes, users must be informed about limitations.

>> >> +o	At least one CPU must keep the scheduling-clock interrupt going
>> >> +	in order to support accurate timekeeping.
>> >
>> > How? You never said how to tell a processor _not_ to suppress
>> > interrupts when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled.
>>
>> Ah, indeed it would be nice to point out that there must be an online
>> CPU outside the value range of the nohz_mask= boot parameter.
>
> There's a nohz_mask boot parameter?

Yeah, we need to document that too.

>> > I take it the problem is the value in the sysenter page won't get
>> > updated, so gettimeofday() will see a stale value until the CPU hog
>> > stops suppressing interrupts? I thought the first half of NOHZ had
>> > a way of dealing with that many moons ago? (Did sysenter cause a
>> > regression?)
>>
>> With CONFIG_NO_HZ, there is always a tick running that updates GTOD
>> and jiffies as long as there is a non-idle CPU. If all CPUs are idle
>> and one suddenly wakes up, the GTOD and jiffies values are caught up.
>>
>> With full dynticks we have a new problem: there can be a CPU using
>> jiffies or GTOD without running the tick (we are not idle, so there
>> can be such users). So there must be a ticking CPU somewhere.
>
> I.E. because gettimeofday() just checks a memory location without
> requiring a kernel transition, there's no opportunity for the kernel
> to trigger and run catch-up code.

Isn't that value updated by the kernel?

> So you'd need a timer to remove the read flag on the page containing
> the jiffies value after it was considered sufficiently stale, and then
> have the page fault update the value restore the read flag and reset
> the timer to switch it off again, and then just tell CPU-intensive
> code that wanted to take advantage of running uninterrupted not to
> mess with jiffies unless they wanted to trigger interrupts to keep it
> current.

I fear making the jiffies read faultable is not something we can
afford. That means there would be several places where we couldn't use
it. And there would be some performance issues. Also, such a timer
defeats the initial purpose of reducing timer interrupts.

GTOD is another issue, but page faults would be a performance problem
as well. And the timer too.

> By the way, I find this "full" name strange if you yourself have a
> list of more cases where ticks could be dropped, but which you haven't
> implemented yet.

Yeah. "Full dynticks" works because it suggests tick periods are
dynamic. But "full tickless" or "full nohz" is not true. Some renaming
is in the works anyway.

> The system being entirely idle means unnecessary ticks can be dropped.
> The system having no scheduling decisions to make on a processor also
> means unnecessary ticks can be dropped. But there are two config
> options and they get treated as entirely different subsystems...

No, they share a lot of common infrastructure. Also, full dynticks
depends on dynticks-idle.

> I suppose one of them having a bucket of workarounds and caveats is
> the reason? One is just "let the system behave more efficiently, only
> reason it's a config option is increased latency waking up from idle
> can annoy the realtime guys". The second is "let the system behave
> more efficiently in a way that opens up a bunch of sharp edges and
> requires extensive micromanagement". But those sharp edges seem more
> "unfinished" than really a design limitation...

The reason for having a separate Kconfig option for the new feature is
that it adds some overhead even in the off case.

^ permalink raw reply [flat|nested] 43+ messages in thread
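Since much of this subthread is about whether and when the tick actually fires, it is worth noting that this is observable from userspace: the per-CPU local-timer interrupt counts appear in /proc/interrupts. The sketch below diffs them over one second. The "LOC:" label and row layout are an x86 assumption; on a working CONFIG_NO_HZ_FULL setup, an adaptive-tick CPU running a single user-mode task should show a near-zero rate.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 64

/* Read the per-CPU counts from the "LOC:" row of /proc/interrupts. */
static int read_loc(unsigned long *loc)
{
	char line[4096];
	FILE *f = fopen("/proc/interrupts", "r");
	int n = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "LOC:", 4) == 0) {
			char *p = line + 4, *end;

			while (n < MAX_CPUS) {
				unsigned long v = strtoul(p, &end, 10);

				if (end == p)	/* hit the text label */
					break;
				loc[n++] = v;
				p = end;
			}
			break;
		}
	}
	fclose(f);
	return n;
}

int main(void)
{
	unsigned long before[MAX_CPUS], after[MAX_CPUS];
	int i, n = read_loc(before);

	sleep(1);
	if (n <= 0 || read_loc(after) != n)
		return 1;
	for (i = 0; i < n; i++)
		printf("CPU%d: %lu ticks/s\n", i, after[i] - before[i]);
	return 0;
}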
* Re: [PATCH] nohz1: Documentation 2013-03-18 20:48 ` Frederic Weisbecker @ 2013-03-18 22:25 ` Paul E. McKenney 2013-03-20 23:32 ` Steven Rostedt 0 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-18 22:25 UTC (permalink / raw) To: Frederic Weisbecker Cc: Rob Landley, linux-kernel, josh, rostedt, zhong, khilman, geoff, tglx On Mon, Mar 18, 2013 at 09:48:31PM +0100, Frederic Weisbecker wrote: > 2013/3/18 Rob Landley <rob@landley.net>: > > On 03/18/2013 01:46:32 PM, Frederic Weisbecker wrote: [ . . . ] > >> >> +o At least one CPU must keep the scheduling-clock interrupt going > >> >> + in order to support accurate timekeeping. > >> > > >> > > >> > How? You never said how to tell a processor _not_ to suppress interrupts > >> > when CONFIG_THE_OTHER_HALF_OF_NOHZ is enabled. > >> > >> Ah indeed it would be nice to point out that there must be an online > >> CPU outside the value range of the nohz_mask= boot parameter. > > > > > > There's a nohz_mask boot parameter? > > Yeah we need to document that too. Good catch both of you, fixed! [ . . . ] > > The system being entirely idle means unnecessary ticks can be dropped. > > The system having no scheduling decisions to make on a processor also means > > unnecessary ticks can be dropped. But there are two config options and they > > get treated as entirely different subsystems... > > No they share a lot of common infrastructure. Also full dynticks > depends on dynticks-idle. Good point, added this. > > I suppose one of them having a bucket of workarounds and caveats is the > > reason? One is just "let the system behave more efficiently, only reason > > it's a config option is increased latency waking up from idle can annoy the > > realtime guys". The second is "let the system behave more efficiently in a > > way that opens up a bunch of sharp edges and requires extensive > > micromanagement". But those sharp edges seem more "unfinished" than really a > > design limitation... > > The reason of having a seperate Kconfig for the new feature is because > it adds some overhead even in the off-case. Good point, added words stating that all of the costs of CONFIG_NO_HZ are also incurred by CONFIG_NO_HZ_FULL. Rob also noted that the presentation of the NOCB Kconfig options and boot parameters was confusing, so I reworked this to put the Kconfig options first (build then boot!) and to indicate that the RCU_NOCB_CPU_NONE, RCU_NOCB_CPU_ZERO, and RCU_NOCB_CPU_ALL options are mutually exclusive. Rob also noted that the current draft is wordy, which I will address in a later draft. Thanx, Paul ------------------------------------------------------------------------ NO_HZ: Reducing Scheduling-Clock Ticks This document covers Kconfig options and boot parameters used to reduce the number of scheduling-clock interrupts. These reductions can be helpful in improving energy efficiency and in reducing "OS jitter", the latter being very important for some types of computationally intensive high-performance computing (HPC) applications and for real-time applications. Within the Linux kernel, there are two major aspects of scheduling-clock interrupt reduction: 1. Idle CPUs. 2. CPUs having only one runnable task. These two cases are described in the following sections. IDLE CPUs If a CPU is idle, there is little point in sending it a scheduling-clock interrupt. 
After all, the primary purpose of a scheduling-clock interrupt is to force a busy CPU to shift its attention among multiple duties, but an idle CPU by definition has no duties to shift its attention among. The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to idle CPUs, which is critically important both to battery-powered devices and to highly virtualized mainframes. A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its battery very quickly, easily 2-3x as fast as would the same device running a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could easily find that half of its CPU time was consumed by scheduling-clock interrupts. In these situations, there is therefore strong motivation to avoid sending scheduling-clock interrupts to idle CPUs. That said, dyntick-idle mode is not free: 1. It increases the number of instructions executed on the path to and from the idle loop. 2. Many architectures will place dyntick-idle CPUs into deep sleep states, which further degrades from-idle transition latencies. Therefore, systems with aggressive real-time response constraints often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle transition latencies. An idle CPU that is not receiving scheduling-clock interrupts is said to be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running tickless". The remainder of this document will use "dyntick-idle mode". There is also a boot parameter "nohz=" that can be used to disable dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling dyntick-idle mode. CPUs WITH ONLY ONE RUNNABLE TASK If a CPU has only one runnable task, there is again little point in sending it a scheduling-clock interrupt. Recall that the primary purpose of a scheduling-clock interrupt is to force a busy CPU to shift its attention among many things requiring its attention -- and there is nowhere else for a CPU with but one runnable task to shift its attention to. The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task. This is important for applications with aggressive real-time response constraints because it allows them to improve their worst-case response times by the maximum duration of a scheduling-clock interrupt. It is also important for computationally intensive iterative workloads with short iterations: If any CPU is delayed during a given iteration, all the other CPUs will be forced to wait idle while the delayed CPU finished. Thus, the delay is multiplied by one less than the number of CPUs. In these situations, there is again strong motivation to avoid sending scheduling-clock interrupts to CPUs that have but one runnable task that is executing in user mode. The "full_nohz=" boot parameter specifies which CPUs are to be adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks CPUs. By default, no CPUs will be adaptive-ticks CPUs. Not that you are prohibited from marking all of the CPUs as adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain online to handle timekeeping tasks in order to ensure that gettimeofday() returns sane values on adaptive-tick CPUs. Note that if a given CPU is in adaptive-ticks mode while executing in user mode, transitioning to kernel mode does not automatically force that CPU out of adaptive-ticks mode. 
The CPU will exit adaptive-ticks mode only if needed, for example, if that CPU enqueues an RCU callback. Just as with dyntick-idle mode, the benefits of adaptive-tick mode do not come for free: 1. CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run adaptive ticks without also running dyntick idle. This dependency of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the implementation. Therefore, all of the costs of CONFIG_NO_HZ are also incurred by CONFIG_NO_HZ_FULL. 2. The user/kernel transitions are slightly more expensive due to the need to inform kernel subsystems (such as RCU) about the change in mode. 3. POSIX CPU timers on adaptive-tick CPUs may fire late (or even not at all) because they currently rely on scheduling-tick interrupts. This will likely be fixed in one of two ways: (1) Prevent CPUs with POSIX CPU timers from entering adaptive-tick mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism to cause the POSIX CPU timer to fire properly. 4. If there are more perf events pending than the hardware can accommodate, they are normally round-robined so as to collect all of them over time. Adaptive-tick mode may prevent this round-robining from happening. This will likely be fixed by preventing CPUs with large numbers of perf events pending from entering adaptive-tick mode. 5. Scheduler statistics for adaptive-idle CPUs may be computed slightly differently than those for non-adaptive-idle CPUs. This may in turn perturb load-balancing of real-time tasks. 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. Although improvements are expected over time, adaptive ticks is quite useful for many types of real-time and compute-intensive applications. However, the drawbacks listed above mean that adaptive ticks should not be enabled by default across the board at the current time. RCU IMPLICATIONS There are situations in which idle CPUs cannot be permitted to enter either dyntick-idle mode or adaptive-tick mode, the most familiar being the case where that CPU has RCU callbacks pending. The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a timer will awaken these CPUs every four jiffies in order to ensure that the RCU callbacks are processed in a timely fashion. Another approach is to offload RCU callback processing to "rcuo" kthreads using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be selected via several methods: 1. One of three mutually exclusive Kconfig options specify a build-time default for the CPUs to offload: a. The RCU_NOCB_CPU_NONE=y Kconfig option results in no CPUs being offloaded. b. The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to be offloaded. c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs to be offloaded. 2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, 3, 4, and 5. The specified CPUs will be offloaded in addition to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or RCU_NOCB_CPU_ALL. The offloaded CPUs never have RCU callbacks queued, and therefore RCU never prevents offloaded CPUs from entering either dyntick-idle mode or adaptive-tick mode. That said, note that it is up to userspace to pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the scheduler will decide where to run them, which might or might not be where you want them to run. KNOWN ISSUES o Dyntick-idle slows transitions to and from idle slightly. 
In practice, this has not been a problem except for the most aggressive real-time workloads, which have the option of disabling dyntick-idle mode, an option that most of them take. o Adaptive-ticks slows user/kernel transitions slightly. This is not expected to be a problem for computational-intensive workloads, which have few such transitions. Careful benchmarking will be required to determine whether or not other workloads are significantly affected by this effect. o Adaptive-ticks does not do anything unless there is only one runnable task for a given CPU, even though there are a number of other situations where the scheduling-clock tick is not needed. To give but one example, consider a CPU that has one runnable high-priority SCHED_FIFO task and an arbitrary number of low-priority SCHED_OTHER tasks. In this case, the CPU is required to run the SCHED_FIFO task until either it blocks or some other higher-priority task awakens on (or is assigned to) this CPU, so there is no point in sending a scheduling-clock interrupt to this CPU. Better handling of these sorts of situations is future work. o A reboot is required to reconfigure both adaptive idle and RCU callback offloading. Runtime reconfiguration could be provided if needed, however, due to the complexity of reconfiguring RCU at runtime, there would need to be an earthshakingly good reason. Especially given the option of simply offloading RCU callbacks from all CPUs. o Additional configuration is required to deal with other sources of OS jitter, including interrupts and system-utility tasks and processes. This configuration normally involves binding interrupts and tasks to particular CPUs. o Some sources of OS jitter can currently be eliminated only by constraining the workload. For example, the only way to eliminate OS jitter due to global TLB shootdowns is to avoid the unmapping operations (such as kernel module unload operations) that result in these shootdowns. For another example, page faults and TLB misses can be reduced (and in some cases eliminated) by using huge pages and by constraining the amount of memory used by the application. o At least one CPU must keep the scheduling-clock interrupt going in order to support accurate timekeeping. ^ permalink raw reply [flat|nested] 43+ messages in thread
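The page-fault and huge-page advice in the KNOWN ISSUES list above has a standard userspace counterpart: pre-fault and lock the application's memory before entering the jitter-sensitive phase. A sketch of that pattern follows, assuming 2 MB huge pages that have been reserved beforehand via /proc/sys/vm/nr_hugepages; the buffer size and page size are illustrative.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SZ (2UL * 1024 * 1024)	/* one assumed 2 MB huge page */

int main(void)
{
	/* Huge-page backing cuts TLB misses; needs reserved hugepages. */
	void *buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	/* Lock current and future mappings so nothing is paged out... */
	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("mlockall");
		return 1;
	}

	/* ...and touch the buffer up front so the hot loop never faults. */
	memset(buf, 0, BUF_SZ);

	/* ... jitter-sensitive work on buf goes here ... */
	return 0;
}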
* Re: [PATCH] nohz1: Documentation 2013-03-18 22:25 ` Paul E. McKenney @ 2013-03-20 23:32 ` Steven Rostedt 2013-03-20 23:55 ` Paul E. McKenney 0 siblings, 1 reply; 43+ messages in thread From: Steven Rostedt @ 2013-03-20 23:32 UTC (permalink / raw) To: paulmck Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote: > ------------------------------------------------------------------------ > > NO_HZ: Reducing Scheduling-Clock Ticks > > > This document covers Kconfig options and boot parameters used to reduce > the number of scheduling-clock interrupts. These reductions can be > helpful in improving energy efficiency and in reducing "OS jitter", > the latter being very important for some types of computationally > intensive high-performance computing (HPC) applications and for real-time > applications. > > Within the Linux kernel, there are two major aspects of scheduling-clock > interrupt reduction: > > 1. Idle CPUs. > > 2. CPUs having only one runnable task. > > These two cases are described in the following sections. > > > IDLE CPUs > > If a CPU is idle, there is little point in sending it a scheduling-clock > interrupt. After all, the primary purpose of a scheduling-clock interrupt > is to force a busy CPU to shift its attention among multiple duties, > but an idle CPU by definition has no duties to shift its attention among. > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending > scheduling-clock interrupts to idle CPUs, which is critically important > both to battery-powered devices and to highly virtualized mainframes. > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its > battery very quickly, easily 2-3x as fast as would the same device running > a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster than the same device running CONFIG_NO_HZ=n ? :-) > easily find that half of its CPU time was consumed by scheduling-clock > interrupts. In these situations, there is therefore strong motivation > to avoid sending scheduling-clock interrupts to idle CPUs. That said, > dyntick-idle mode is not free: > > 1. It increases the number of instructions executed on the path > to and from the idle loop. > > 2. Many architectures will place dyntick-idle CPUs into deep sleep > states, which further degrades from-idle transition latencies. > > Therefore, systems with aggressive real-time response constraints > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle > transition latencies. > > An idle CPU that is not receiving scheduling-clock interrupts is said to > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running > tickless". The remainder of this document will use "dyntick-idle mode". > > There is also a boot parameter "nohz=" that can be used to disable > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling > dyntick-idle mode. > > > CPUs WITH ONLY ONE RUNNABLE TASK > > If a CPU has only one runnable task, there is again little point in > sending it a scheduling-clock interrupt. Recall that the primary > purpose of a scheduling-clock interrupt is to force a busy CPU to > shift its attention among many things requiring its attention -- and > there is nowhere else for a CPU with but one runnable task to shift its > attention to. 
> > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid > sending scheduling-clock interrupts to CPUs with a single runnable task. > This is important for applications with aggressive real-time response > constraints because it allows them to improve their worst-case response > times by the maximum duration of a scheduling-clock interrupt. It is also > important for computationally intensive iterative workloads with short > iterations: If any CPU is delayed during a given iteration, all the > other CPUs will be forced to wait idle while the delayed CPU finished. > Thus, the delay is multiplied by one less than the number of CPUs. > In these situations, there is again strong motivation to avoid sending > scheduling-clock interrupts to CPUs that have but one runnable task that > is executing in user mode. > > The "full_nohz=" boot parameter specifies which CPUs are to be > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, This is the first time you mention "adaptive-ticks". Probably should define it before just using it, even though one should be able to figure out what adaptive-ticks are, it does throw in a wrench when reading this if you have no idea what an "adaptive-tick" is. > 6, 7, and 8 are to be adaptive-ticks CPUs. By default, no CPUs will > be adaptive-ticks CPUs. Not that you are prohibited from marking all > of the CPUs as adaptive-tick CPUs: At least one non-adaptive-tick CPU > must remain online to handle timekeeping tasks in order to ensure that > gettimeofday() returns sane values on adaptive-tick CPUs. > > Note that if a given CPU is in adaptive-ticks mode while executing in > user mode, transitioning to kernel mode does not automatically force > that CPU out of adaptive-ticks mode. The CPU will exit adaptive-ticks > mode only if needed, for example, if that CPU enqueues an RCU callback. > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do > not come for free: > > 1. CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run > adaptive ticks without also running dyntick idle. This dependency > of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the > implementation. Therefore, all of the costs of CONFIG_NO_HZ > are also incurred by CONFIG_NO_HZ_FULL. Not a comment on this document, but on the implementation. As idle NO_HZ can hurt RT, but RT would want to have full NO_HZ, it's a shame that you can't have both (no idle but full). As we only care about not letting the CPU go into deep sleep, I wonder if it wouldn't be too hard to add something that keeps idle from going into nohz mode. Hmm, I think there may be an option to keep the CPU from going too deep into sleep. That may be a better approach. > > 2. The user/kernel transitions are slightly more expensive due > to the need to inform kernel subsystems (such as RCU) about > the change in mode. > > 3. POSIX CPU timers on adaptive-tick CPUs may fire late (or even > not at all) because they currently rely on scheduling-tick > interrupts. This will likely be fixed in one of two ways: (1) > Prevent CPUs with POSIX CPU timers from entering adaptive-tick > mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism > to cause the POSIX CPU timer to fire properly. > > 4. If there are more perf events pending than the hardware can > accommodate, they are normally round-robined so as to collect > all of them over time. Adaptive-tick mode may prevent this > round-robining from happening. 
This will likely be fixed by > preventing CPUs with large numbers of perf events pending from > entering adaptive-tick mode. > > 5. Scheduler statistics for adaptive-idle CPUs may be computed > slightly differently than those for non-adaptive-idle CPUs. > This may in turn perturb load-balancing of real-time tasks. > > 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. > > Although improvements are expected over time, adaptive ticks is quite > useful for many types of real-time and compute-intensive applications. > However, the drawbacks listed above mean that adaptive ticks should not > be enabled by default across the board at the current time. > > > RCU IMPLICATIONS > > There are situations in which idle CPUs cannot be permitted to > enter either dyntick-idle mode or adaptive-tick mode, the most > familiar being the case where that CPU has RCU callbacks pending. > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a > timer will awaken these CPUs every four jiffies in order to ensure that > the RCU callbacks are processed in a timely fashion. > > Another approach is to offload RCU callback processing to "rcuo" kthreads > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > selected via several methods: > > 1. One of three mutually exclusive Kconfig options specify a > build-time default for the CPUs to offload: > > a. The RCU_NOCB_CPU_NONE=y Kconfig option results in > no CPUs being offloaded. > > b. The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to > be offloaded. > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > to be offloaded. All CPUs don't have their RCU call backs on them? I'm a bit confused by this. Or is it that the scheduler picks one CPU to do call backs? Does this mean that to use rcu_ncbs= to be the only deciding factor, you select RCU_NCB_CPU_NONE? I think this needs to be explained better. > > 2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated > list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, > 3, 4, and 5. The specified CPUs will be offloaded in addition > to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or > RCU_NOCB_CPU_ALL. > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU > never prevents offloaded CPUs from entering either dyntick-idle mode or > adaptive-tick mode. That said, note that it is up to userspace to > pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the > scheduler will decide where to run them, which might or might not be > where you want them to run. > > > KNOWN ISSUES > > o Dyntick-idle slows transitions to and from idle slightly. > In practice, this has not been a problem except for the most > aggressive real-time workloads, which have the option of disabling > dyntick-idle mode, an option that most of them take. > > o Adaptive-ticks slows user/kernel transitions slightly. > This is not expected to be a problem for computational-intensive > workloads, which have few such transitions. Careful benchmarking > will be required to determine whether or not other workloads > are significantly affected by this effect. It should be mentioned that only CPUs that are in adaptive-tick mode have this issue. Other CPUs are still using the tick based accounting, right? 
> > o Adaptive-ticks does not do anything unless there is only one > runnable task for a given CPU, even though there are a number > of other situations where the scheduling-clock tick is not > needed. To give but one example, consider a CPU that has one > runnable high-priority SCHED_FIFO task and an arbitrary number > of low-priority SCHED_OTHER tasks. In this case, the CPU is > required to run the SCHED_FIFO task until either it blocks or > some other higher-priority task awakens on (or is assigned to) > this CPU, so there is no point in sending a scheduling-clock > interrupt to this CPU. You should point out that the example does not enable adaptive-ticks. That point is hinted at, but not really expressed. That is, perhaps end the paragraph with: "Even though the SCHED_FIFO task is the only task running, because the SCHED_OTHER tasks are queued on the CPU, it currently will not enter adaptive tick mode." > > Better handling of these sorts of situations is future work. > > o A reboot is required to reconfigure both adaptive idle and RCU > callback offloading. Runtime reconfiguration could be provided > if needed, however, due to the complexity of reconfiguring RCU > at runtime, there would need to be an earthshakingly good reason. > Especially given the option of simply offloading RCU callbacks > from all CPUs. When you enable for all CPUs, how do you tell what CPUs you don't want the scheduler to pick for off loading? I mean, if you pick all CPUs, can you at run time pick which ones should always off load and which ones shouldn't? > > o Additional configuration is required to deal with other sources > of OS jitter, including interrupts and system-utility tasks > and processes. This configuration normally involves binding > interrupts and tasks to particular CPUs. > > o Some sources of OS jitter can currently be eliminated only by > constraining the workload. For example, the only way to eliminate > OS jitter due to global TLB shootdowns is to avoid the unmapping > operations (such as kernel module unload operations) that result > in these shootdowns. For another example, page faults and TLB > misses can be reduced (and in some cases eliminated) by using > huge pages and by constraining the amount of memory used by the > application. > > o At least one CPU must keep the scheduling-clock interrupt going > in order to support accurate timekeeping. Thanks for writing this up Paul! -- Steve ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation 2013-03-20 23:32 ` Steven Rostedt @ 2013-03-20 23:55 ` Paul E. McKenney 2013-03-21 0:27 ` Steven Rostedt 2013-03-21 16:08 ` Christoph Lameter 0 siblings, 2 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-20 23:55 UTC (permalink / raw) To: Steven Rostedt Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote: > On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote: > > > ------------------------------------------------------------------------ > > > > NO_HZ: Reducing Scheduling-Clock Ticks > > > > > > This document covers Kconfig options and boot parameters used to reduce > > the number of scheduling-clock interrupts. These reductions can be > > helpful in improving energy efficiency and in reducing "OS jitter", > > the latter being very important for some types of computationally > > intensive high-performance computing (HPC) applications and for real-time > > applications. > > > > Within the Linux kernel, there are two major aspects of scheduling-clock > > interrupt reduction: > > > > 1. Idle CPUs. > > > > 2. CPUs having only one runnable task. > > > > These two cases are described in the following sections. > > > > > > IDLE CPUs > > > > If a CPU is idle, there is little point in sending it a scheduling-clock > > interrupt. After all, the primary purpose of a scheduling-clock interrupt > > is to force a busy CPU to shift its attention among multiple duties, > > but an idle CPU by definition has no duties to shift its attention among. > > > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending > > scheduling-clock interrupts to idle CPUs, which is critically important > > both to battery-powered devices and to highly virtualized mainframes. > > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its > > battery very quickly, easily 2-3x as fast as would the same device running > > a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could > > So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster > than the > same device running CONFIG_NO_HZ=n ? > > :-) Good catch, fixed! That said, there are two solutions as stated -- either the battery drains immediately, or it takes infinitely long to drain. ;-) > > easily find that half of its CPU time was consumed by scheduling-clock > > interrupts. In these situations, there is therefore strong motivation > > to avoid sending scheduling-clock interrupts to idle CPUs. That said, > > dyntick-idle mode is not free: > > > > 1. It increases the number of instructions executed on the path > > to and from the idle loop. > > > > 2. Many architectures will place dyntick-idle CPUs into deep sleep > > states, which further degrades from-idle transition latencies. > > > > Therefore, systems with aggressive real-time response constraints > > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle > > transition latencies. > > > > An idle CPU that is not receiving scheduling-clock interrupts is said to > > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running > > tickless". The remainder of this document will use "dyntick-idle mode". > > > > There is also a boot parameter "nohz=" that can be used to disable > > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". > > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling > > dyntick-idle mode. 
> > > > > > CPUs WITH ONLY ONE RUNNABLE TASK > > > > If a CPU has only one runnable task, there is again little point in > > sending it a scheduling-clock interrupt. Recall that the primary > > purpose of a scheduling-clock interrupt is to force a busy CPU to > > shift its attention among many things requiring its attention -- and > > there is nowhere else for a CPU with but one runnable task to shift its > > attention to. > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid > > sending scheduling-clock interrupts to CPUs with a single runnable task. > > This is important for applications with aggressive real-time response > > constraints because it allows them to improve their worst-case response > > times by the maximum duration of a scheduling-clock interrupt. It is also > > important for computationally intensive iterative workloads with short > > iterations: If any CPU is delayed during a given iteration, all the > > other CPUs will be forced to wait idle while the delayed CPU finished. > > Thus, the delay is multiplied by one less than the number of CPUs. > > In these situations, there is again strong motivation to avoid sending > > scheduling-clock interrupts to CPUs that have but one runnable task that > > is executing in user mode. > > > > The "full_nohz=" boot parameter specifies which CPUs are to be > > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, > > This is the first time you mention "adaptive-ticks". Probably should > define it before just using it, even though one should be able to figure > out what adaptive-ticks are, it does throw in a wrench when reading this > if you have no idea what an "adaptive-tick" is. Good point, changed the first sentence of this paragraph to read: The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task, and such CPUs are said to be "adaptive-ticks CPUs". > > 6, 7, and 8 are to be adaptive-ticks CPUs. By default, no CPUs will > > be adaptive-ticks CPUs. Not that you are prohibited from marking all > > of the CPUs as adaptive-tick CPUs: At least one non-adaptive-tick CPU > > must remain online to handle timekeeping tasks in order to ensure that > > gettimeofday() returns sane values on adaptive-tick CPUs. > > > > Note that if a given CPU is in adaptive-ticks mode while executing in > > user mode, transitioning to kernel mode does not automatically force > > that CPU out of adaptive-ticks mode. The CPU will exit adaptive-ticks > > mode only if needed, for example, if that CPU enqueues an RCU callback. > > > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do > > not come for free: > > > > 1. CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run > > adaptive ticks without also running dyntick idle. This dependency > > of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the > > implementation. Therefore, all of the costs of CONFIG_NO_HZ > > are also incurred by CONFIG_NO_HZ_FULL. > > Not a comment on this document, but on the implementation. As idle NO_HZ > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you > can't have both (no idle but full). As we only care about not letting > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add > something that keeps idle from going into nohz mode. Hmm, I think there > may be an option to keep the CPU from going too deep into sleep. That > may be a better approach. 
Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and idle=poll do the trick in this case? If so, I do need to document it. > > 2. The user/kernel transitions are slightly more expensive due > > to the need to inform kernel subsystems (such as RCU) about > > the change in mode. > > > > 3. POSIX CPU timers on adaptive-tick CPUs may fire late (or even > > not at all) because they currently rely on scheduling-tick > > interrupts. This will likely be fixed in one of two ways: (1) > > Prevent CPUs with POSIX CPU timers from entering adaptive-tick > > mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism > > to cause the POSIX CPU timer to fire properly. > > > > 4. If there are more perf events pending than the hardware can > > accommodate, they are normally round-robined so as to collect > > all of them over time. Adaptive-tick mode may prevent this > > round-robining from happening. This will likely be fixed by > > preventing CPUs with large numbers of perf events pending from > > entering adaptive-tick mode. > > > > 5. Scheduler statistics for adaptive-idle CPUs may be computed > > slightly differently than those for non-adaptive-idle CPUs. > > This may in turn perturb load-balancing of real-time tasks. > > > > 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. > > > > Although improvements are expected over time, adaptive ticks is quite > > useful for many types of real-time and compute-intensive applications. > > However, the drawbacks listed above mean that adaptive ticks should not > > be enabled by default across the board at the current time. > > > > > > RCU IMPLICATIONS > > > > There are situations in which idle CPUs cannot be permitted to > > enter either dyntick-idle mode or adaptive-tick mode, the most > > familiar being the case where that CPU has RCU callbacks pending. > > > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such > > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a > > timer will awaken these CPUs every four jiffies in order to ensure that > > the RCU callbacks are processed in a timely fashion. > > > > Another approach is to offload RCU callback processing to "rcuo" kthreads > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > > selected via several methods: > > > > 1. One of three mutually exclusive Kconfig options specify a > > build-time default for the CPUs to offload: > > > > a. The RCU_NOCB_CPU_NONE=y Kconfig option results in > > no CPUs being offloaded. > > > > b. The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to > > be offloaded. > > > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > > to be offloaded. > > All CPUs don't have their RCU call backs on them? I'm a bit confused by > this. Or is it that the scheduler picks one CPU to do call backs? Does > this mean that to use rcu_ncbs= to be the only deciding factor, you > select RCU_NCB_CPU_NONE? > > I think this needs to be explained better. Does this help? c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs to be offloaded. Note that the callbacks will be offloaded to "rcuo" kthreads, and that those kthreads will in fact run on some CPU. However, this approach gives fine-grained control on exactly which CPUs the callbacks run on, the priority that they run at (including the default of SCHED_OTHER), and it further allows this control to be varied dynamically at runtime. > > 2. 
The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated > > list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, > > 3, 4, and 5. The specified CPUs will be offloaded in addition > > to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or > > RCU_NOCB_CPU_ALL. > > > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU > > never prevents offloaded CPUs from entering either dyntick-idle mode or > > adaptive-tick mode. That said, note that it is up to userspace to > > pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the > > scheduler will decide where to run them, which might or might not be > > where you want them to run. > > > > > > KNOWN ISSUES > > > > o Dyntick-idle slows transitions to and from idle slightly. > > In practice, this has not been a problem except for the most > > aggressive real-time workloads, which have the option of disabling > > dyntick-idle mode, an option that most of them take. > > > > o Adaptive-ticks slows user/kernel transitions slightly. > > This is not expected to be a problem for computational-intensive > > workloads, which have few such transitions. Careful benchmarking > > will be required to determine whether or not other workloads > > are significantly affected by this effect. > > It should be mentioned that only CPUs that are in adaptive-tick mode > have this issue. Other CPUs are still using the tick based accounting, > right? > > > > > o Adaptive-ticks does not do anything unless there is only one > > runnable task for a given CPU, even though there are a number > > of other situations where the scheduling-clock tick is not > > needed. To give but one example, consider a CPU that has one > > runnable high-priority SCHED_FIFO task and an arbitrary number > > of low-priority SCHED_OTHER tasks. In this case, the CPU is > > required to run the SCHED_FIFO task until either it blocks or > > some other higher-priority task awakens on (or is assigned to) > > this CPU, so there is no point in sending a scheduling-clock > > interrupt to this CPU. > > You should point out that the example does not enable adaptive-ticks. > That point is hinted at, but not really expressed. That is, perhaps end > the paragraph with: > > "Even though the SCHED_FIFO task is the only task running, because the > SCHED_OTHER tasks are queued on the CPU, it currently will not enter > adaptive tick mode." Again, good point! How about adding the following sentence at the end of this paragraph. However, the current implementation prohibits CPU with a single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER tasks from entering adaptive-ticks mode, even though it would be correct to allow it to do so. > > Better handling of these sorts of situations is future work. > > > > o A reboot is required to reconfigure both adaptive idle and RCU > > callback offloading. Runtime reconfiguration could be provided > > if needed, however, due to the complexity of reconfiguring RCU > > at runtime, there would need to be an earthshakingly good reason. > > Especially given the option of simply offloading RCU callbacks > > from all CPUs. > > When you enable for all CPUs, how do you tell what CPUs you don't want > the scheduler to pick for off loading? I mean, if you pick all CPUs, can > you at run time pick which ones should always off load and which ones > shouldn't? I must defer to Frederic on this one. 
> > o Additional configuration is required to deal with other sources > > of OS jitter, including interrupts and system-utility tasks > > and processes. This configuration normally involves binding > > interrupts and tasks to particular CPUs. > > > > o Some sources of OS jitter can currently be eliminated only by > > constraining the workload. For example, the only way to eliminate > > OS jitter due to global TLB shootdowns is to avoid the unmapping > > operations (such as kernel module unload operations) that result > > in these shootdowns. For another example, page faults and TLB > > misses can be reduced (and in some cases eliminated) by using > > huge pages and by constraining the amount of memory used by the > > application. > > > > o At least one CPU must keep the scheduling-clock interrupt going > > in order to support accurate timekeeping. > > Thanks for writing this up Paul! And to many other people, including yourself, for doing the actual work! Thanx, Paul ^ permalink raw reply [flat|nested] 43+ messages in thread
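For concreteness, a sketch of a kernel command line combining the boot parameters discussed in this message, using the draft's parameter names ("nohz=", "full_nohz=", and "rcu_nocbs=") and assuming a kernel built with CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and CONFIG_RCU_NOCB_CPU=y on a machine with at least nine CPUs:

        nohz=on full_nohz=1,6-8 rcu_nocbs=1,6-8

This enables dyntick-idle mode everywhere, designates CPUs 1 and 6-8 as adaptive-ticks CPUs, and offloads those same CPUs' RCU callbacks to "rcuo" kthreads, leaving CPU 0 among the non-adaptive-tick CPUs available for timekeeping.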
* Re: [PATCH] nohz1: Documentation 2013-03-20 23:55 ` Paul E. McKenney @ 2013-03-21 0:27 ` Steven Rostedt 2013-03-21 2:22 ` Paul E. McKenney ` (2 more replies) 2013-03-21 16:08 ` Christoph Lameter 1 sibling, 3 replies; 43+ messages in thread From: Steven Rostedt @ 2013-03-21 0:27 UTC (permalink / raw) To: paulmck Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven [ Added Arjan in case he as anything to add about the idle=poll below ] On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote: > On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote: > > On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote: > > > > > ------------------------------------------------------------------------ > > > > > > NO_HZ: Reducing Scheduling-Clock Ticks > > > > > > > > > This document covers Kconfig options and boot parameters used to reduce > > > the number of scheduling-clock interrupts. These reductions can be > > > helpful in improving energy efficiency and in reducing "OS jitter", > > > the latter being very important for some types of computationally > > > intensive high-performance computing (HPC) applications and for real-time > > > applications. > > > > > > Within the Linux kernel, there are two major aspects of scheduling-clock > > > interrupt reduction: > > > > > > 1. Idle CPUs. > > > > > > 2. CPUs having only one runnable task. > > > > > > These two cases are described in the following sections. > > > > > > > > > IDLE CPUs > > > > > > If a CPU is idle, there is little point in sending it a scheduling-clock > > > interrupt. After all, the primary purpose of a scheduling-clock interrupt > > > is to force a busy CPU to shift its attention among multiple duties, > > > but an idle CPU by definition has no duties to shift its attention among. > > > > > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending > > > scheduling-clock interrupts to idle CPUs, which is critically important > > > both to battery-powered devices and to highly virtualized mainframes. > > > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its > > > battery very quickly, easily 2-3x as fast as would the same device running > > > a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could > > > > So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster > > than the Hmm, Evolution had the above on one line in the composer, but it seems to be chopping it when it sends. I recently did an update on this box, which screwed up the formatting of what the composer does and what it sends out :-/ I hit a hard return to have CONFIG_NO_HZ = 0 be lined up correctly (since I already knew that evolution screwed this up) > > same device running CONFIG_NO_HZ=n ? > > > > :-) > > Good catch, fixed! > > That said, there are two solutions as stated -- either the battery drains > immediately, or it takes infinitely long to drain. ;-) A typical paulmck response ;-) > > > > easily find that half of its CPU time was consumed by scheduling-clock > > > interrupts. In these situations, there is therefore strong motivation > > > to avoid sending scheduling-clock interrupts to idle CPUs. That said, > > > dyntick-idle mode is not free: > > > > > > 1. It increases the number of instructions executed on the path > > > to and from the idle loop. > > > > > > 2. Many architectures will place dyntick-idle CPUs into deep sleep > > > states, which further degrades from-idle transition latencies. 
> > > > > > Therefore, systems with aggressive real-time response constraints > > > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle > > > transition latencies. > > > > > > An idle CPU that is not receiving scheduling-clock interrupts is said to > > > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running > > > tickless". The remainder of this document will use "dyntick-idle mode". > > > > > > There is also a boot parameter "nohz=" that can be used to disable > > > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". > > > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling > > > dyntick-idle mode. > > > > > > > > > CPUs WITH ONLY ONE RUNNABLE TASK > > > > > > If a CPU has only one runnable task, there is again little point in > > > sending it a scheduling-clock interrupt. Recall that the primary > > > purpose of a scheduling-clock interrupt is to force a busy CPU to > > > shift its attention among many things requiring its attention -- and > > > there is nowhere else for a CPU with but one runnable task to shift its > > > attention to. > > > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid > > > sending scheduling-clock interrupts to CPUs with a single runnable task. > > > This is important for applications with aggressive real-time response > > > constraints because it allows them to improve their worst-case response > > > times by the maximum duration of a scheduling-clock interrupt. It is also > > > important for computationally intensive iterative workloads with short > > > iterations: If any CPU is delayed during a given iteration, all the > > > other CPUs will be forced to wait idle while the delayed CPU finished. > > > Thus, the delay is multiplied by one less than the number of CPUs. > > > In these situations, there is again strong motivation to avoid sending > > > scheduling-clock interrupts to CPUs that have but one runnable task that > > > is executing in user mode. > > > > > > The "full_nohz=" boot parameter specifies which CPUs are to be > > > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, > > > > This is the first time you mention "adaptive-ticks". Probably should > > define it before just using it, even though one should be able to figure > > out what adaptive-ticks are, it does throw in a wrench when reading this > > if you have no idea what an "adaptive-tick" is. > > Good point, changed the first sentence of this paragraph to read: > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to > avoid sending scheduling-clock interrupts to CPUs with a single > runnable task, and such CPUs are said to be "adaptive-ticks CPUs". Sounds good. > > > > 6, 7, and 8 are to be adaptive-ticks CPUs. By default, no CPUs will > > > be adaptive-ticks CPUs. Not that you are prohibited from marking all > > > of the CPUs as adaptive-tick CPUs: At least one non-adaptive-tick CPU > > > must remain online to handle timekeeping tasks in order to ensure that > > > gettimeofday() returns sane values on adaptive-tick CPUs. > > > > > > Note that if a given CPU is in adaptive-ticks mode while executing in > > > user mode, transitioning to kernel mode does not automatically force > > > that CPU out of adaptive-ticks mode. The CPU will exit adaptive-ticks > > > mode only if needed, for example, if that CPU enqueues an RCU callback. > > > > > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do > > > not come for free: > > > > > > 1. 
CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run > > > adaptive ticks without also running dyntick idle. This dependency > > > of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the > > > implementation. Therefore, all of the costs of CONFIG_NO_HZ > > > are also incurred by CONFIG_NO_HZ_FULL. > > > > Not a comment on this document, but on the implementation. As idle NO_HZ > > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you > > can't have both (no idle but full). As we only care about not letting > > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add > > something that keeps idle from going into nohz mode. Hmm, I think there > > may be an option to keep the CPU from going too deep into sleep. That > > may be a better approach. > > Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and > idle=poll do the trick in this case? I'm not sure I would recommend idle=poll either. It would certainly work, but it goes to the other extreme. You think NO_HZ=n drains a battery? Try idle=poll. Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait may be better. It states that performance is the same as idle=poll (if supported). Also there's a kernel parameter for x86 called intel_idle.max_cstate=X. As idle=poll will most likely run the processor very hot and you will need to add more electricity not only for the computer but also for the A/C, it would be nice to still have the CPU sleep, but just at a shallow (fast wakeup) state. Perhaps Arjan can add some input here? > > If so, I do need to document it. > > > > 2. The user/kernel transitions are slightly more expensive due > > > to the need to inform kernel subsystems (such as RCU) about > > > the change in mode. > > > > > > 3. POSIX CPU timers on adaptive-tick CPUs may fire late (or even > > > not at all) because they currently rely on scheduling-tick > > > interrupts. This will likely be fixed in one of two ways: (1) > > > Prevent CPUs with POSIX CPU timers from entering adaptive-tick > > > mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism > > > to cause the POSIX CPU timer to fire properly. > > > > > > 4. If there are more perf events pending than the hardware can > > > accommodate, they are normally round-robined so as to collect > > > all of them over time. Adaptive-tick mode may prevent this > > > round-robining from happening. This will likely be fixed by > > > preventing CPUs with large numbers of perf events pending from > > > entering adaptive-tick mode. > > > > > > 5. Scheduler statistics for adaptive-idle CPUs may be computed > > > slightly differently than those for non-adaptive-idle CPUs. > > > This may in turn perturb load-balancing of real-time tasks. > > > > > > 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. > > > > > > Although improvements are expected over time, adaptive ticks is quite > > > useful for many types of real-time and compute-intensive applications. > > > However, the drawbacks listed above mean that adaptive ticks should not > > > be enabled by default across the board at the current time. > > > > > > > > > RCU IMPLICATIONS > > > > > > There are situations in which idle CPUs cannot be permitted to > > > enter either dyntick-idle mode or adaptive-tick mode, the most > > > familiar being the case where that CPU has RCU callbacks pending. 
> > > > > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such > > > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a > > > timer will awaken these CPUs every four jiffies in order to ensure that > > > the RCU callbacks are processed in a timely fashion. > > > > > > Another approach is to offload RCU callback processing to "rcuo" kthreads > > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > > > selected via several methods: > > > > > > 1. One of three mutually exclusive Kconfig options specify a > > > build-time default for the CPUs to offload: > > > > > > a. The RCU_NOCB_CPU_NONE=y Kconfig option results in > > > no CPUs being offloaded. > > > > > > b. The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to > > > be offloaded. > > > > > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > > > to be offloaded. > > > > All CPUs don't have their RCU call backs on them? I'm a bit confused by > > this. Or is it that the scheduler picks one CPU to do call backs? Does > > this mean that to use rcu_ncbs= to be the only deciding factor, you > > select RCU_NCB_CPU_NONE? > > > > I think this needs to be explained better. > > Does this help? > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > to be offloaded. Note that the callbacks will be > offloaded to "rcuo" kthreads, and that those kthreads > will in fact run on some CPU. However, this approach > gives fine-grained control on exactly which CPUs the > callbacks run on, the priority that they run at (including > the default of SCHED_OTHER), and it further allows > this control to be varied dynamically at runtime. Excellent! > > > > 2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated > > > list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, > > > 3, 4, and 5. The specified CPUs will be offloaded in addition > > > to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or > > > RCU_NOCB_CPU_ALL. > > > > > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU > > > never prevents offloaded CPUs from entering either dyntick-idle mode or > > > adaptive-tick mode. That said, note that it is up to userspace to > > > pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the > > > scheduler will decide where to run them, which might or might not be > > > where you want them to run. > > > > > > > > > KNOWN ISSUES > > > > > > o Dyntick-idle slows transitions to and from idle slightly. > > > In practice, this has not been a problem except for the most > > > aggressive real-time workloads, which have the option of disabling > > > dyntick-idle mode, an option that most of them take. > > > > > > o Adaptive-ticks slows user/kernel transitions slightly. > > > This is not expected to be a problem for computational-intensive > > > workloads, which have few such transitions. Careful benchmarking > > > will be required to determine whether or not other workloads > > > are significantly affected by this effect. > > > > It should be mentioned that only CPUs that are in adaptive-tick mode > > have this issue. Other CPUs are still using the tick based accounting, > > right? ? > > > > > > > > o Adaptive-ticks does not do anything unless there is only one > > > runnable task for a given CPU, even though there are a number > > > of other situations where the scheduling-clock tick is not > > > needed. 
To give but one example, consider a CPU that has one > > > runnable high-priority SCHED_FIFO task and an arbitrary number > > > of low-priority SCHED_OTHER tasks. In this case, the CPU is > > > required to run the SCHED_FIFO task until either it blocks or > > > some other higher-priority task awakens on (or is assigned to) > > > this CPU, so there is no point in sending a scheduling-clock > > > interrupt to this CPU. > > > > You should point out that the example does not enable adaptive-ticks. > > That point is hinted at, but not really expressed. That is, perhaps end > > the paragraph with: > > > > "Even though the SCHED_FIFO task is the only task running, because the > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter > > adaptive tick mode." > > Again, good point! > > How about adding the following sentence at the end of this paragraph. > > However, the current implementation prohibits CPU with a single > runnable SCHED_FIFO task and multiple runnable SCHED_OTHER > tasks from entering adaptive-ticks mode, even though it would > be correct to allow it to do so. Sure. > > > > Better handling of these sorts of situations is future work. > > > > > > o A reboot is required to reconfigure both adaptive idle and RCU > > > callback offloading. Runtime reconfiguration could be provided > > > if needed, however, due to the complexity of reconfiguring RCU > > > at runtime, there would need to be an earthshakingly good reason. > > > Especially given the option of simply offloading RCU callbacks > > > from all CPUs. > > > > When you enable for all CPUs, how do you tell what CPUs you don't want > > the scheduler to pick for off loading? I mean, if you pick all CPUs, can > > you at run time pick which ones should always off load and which ones > > shouldn't? > > I must defer to Frederic on this one. Well I was actually thinking more about the RCU NOCB mode. You answered my question above about the rcu kthreads that do the callbacks instead of them being pinned to a CPU. -- Steve > > > > o Additional configuration is required to deal with other sources > > > of OS jitter, including interrupts and system-utility tasks > > > and processes. This configuration normally involves binding > > > interrupts and tasks to particular CPUs. > > > > > > o Some sources of OS jitter can currently be eliminated only by > > > constraining the workload. For example, the only way to eliminate > > > OS jitter due to global TLB shootdowns is to avoid the unmapping > > > operations (such as kernel module unload operations) that result > > > in these shootdowns. For another example, page faults and TLB > > > misses can be reduced (and in some cases eliminated) by using > > > huge pages and by constraining the amount of memory used by the > > > application. > > > > > > o At least one CPU must keep the scheduling-clock interrupt going > > > in order to support accurate timekeeping. > > > > Thanks for writing this up Paul! > > And to many other people, including yourself, for doing the actual work! > > Thanx, Paul ^ permalink raw reply [flat|nested] 43+ messages in thread
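A sketch of the userspace pinning of the "rcuo" kthreads mentioned above, assuming those kthreads show up in pgrep under names containing "rcuo" and that the util-linux taskset utility is available:

        for pid in $(pgrep rcuo); do
                taskset -cp 0 "$pid"    # herd all rcuo kthreads onto housekeeping CPU 0
        done

Whether confining the rcuo kthreads to a housekeeping CPU is a win depends on the workload; the point is only that their placement is left to userspace policy rather than fixed by the kernel.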
* Re: [PATCH] nohz1: Documentation 2013-03-21 0:27 ` Steven Rostedt @ 2013-03-21 2:22 ` Paul E. McKenney 2013-03-21 10:16 ` Borislav Petkov 2013-03-21 15:45 ` Arjan van de Ven 2013-03-21 18:01 ` Frederic Weisbecker 2 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 2:22 UTC (permalink / raw) To: Steven Rostedt Cc: Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven On Wed, Mar 20, 2013 at 08:27:11PM -0400, Steven Rostedt wrote: > [ Added Arjan in case he as anything to add about the idle=poll below ] Good point! > On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote: > > On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote: > > > On Mon, 2013-03-18 at 15:25 -0700, Paul E. McKenney wrote: > > > > > > > ------------------------------------------------------------------------ > > > > > > > > NO_HZ: Reducing Scheduling-Clock Ticks > > > > > > > > > > > > This document covers Kconfig options and boot parameters used to reduce > > > > the number of scheduling-clock interrupts. These reductions can be > > > > helpful in improving energy efficiency and in reducing "OS jitter", > > > > the latter being very important for some types of computationally > > > > intensive high-performance computing (HPC) applications and for real-time > > > > applications. > > > > > > > > Within the Linux kernel, there are two major aspects of scheduling-clock > > > > interrupt reduction: > > > > > > > > 1. Idle CPUs. > > > > > > > > 2. CPUs having only one runnable task. > > > > > > > > These two cases are described in the following sections. > > > > > > > > > > > > IDLE CPUs > > > > > > > > If a CPU is idle, there is little point in sending it a scheduling-clock > > > > interrupt. After all, the primary purpose of a scheduling-clock interrupt > > > > is to force a busy CPU to shift its attention among multiple duties, > > > > but an idle CPU by definition has no duties to shift its attention among. > > > > > > > > The CONFIG_NO_HZ=y Kconfig option causes the kernel to avoid sending > > > > scheduling-clock interrupts to idle CPUs, which is critically important > > > > both to battery-powered devices and to highly virtualized mainframes. > > > > A battery-powered device running a CONFIG_NO_HZ=n kernel would drain its > > > > battery very quickly, easily 2-3x as fast as would the same device running > > > > a CONFIG_NO_HZ=n kernel. A mainframe running 1,500 OS instances could > > > > > > So a device running CONFIG_NO_HZ=n would drain its battery 2-3x faster > > > than the > > Hmm, Evolution had the above on one line in the composer, but it seems > to be chopping it when it sends. I recently did an update on this box, > which screwed up the formatting of what the composer does and what it > sends out :-/ > > I hit a hard return to have CONFIG_NO_HZ = 0 be lined up correctly > (since I already knew that evolution screwed this up) Forever mutt!!! ;-) > > > same device running CONFIG_NO_HZ=n ? > > > > > > :-) > > > > Good catch, fixed! > > > > That said, there are two solutions as stated -- either the battery drains > > immediately, or it takes infinitely long to drain. ;-) > > A typical paulmck response ;-) ;-) ;-) ;-) > > > > easily find that half of its CPU time was consumed by scheduling-clock > > > > interrupts. In these situations, there is therefore strong motivation > > > > to avoid sending scheduling-clock interrupts to idle CPUs. That said, > > > > dyntick-idle mode is not free: > > > > > > > > 1. 
It increases the number of instructions executed on the path > > > > to and from the idle loop. > > > > > > > > 2. Many architectures will place dyntick-idle CPUs into deep sleep > > > > states, which further degrades from-idle transition latencies. > > > > > > > > Therefore, systems with aggressive real-time response constraints > > > > often run CONFIG_NO_HZ=n kernels in order to avoid degrading from-idle > > > > transition latencies. > > > > > > > > An idle CPU that is not receiving scheduling-clock interrupts is said to > > > > be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running > > > > tickless". The remainder of this document will use "dyntick-idle mode". > > > > > > > > There is also a boot parameter "nohz=" that can be used to disable > > > > dyntick-idle mode in CONFIG_NO_HZ=y kernels by specifying "nohz=off". > > > > By default, CONFIG_NO_HZ=y kernels boot with "nohz=on", enabling > > > > dyntick-idle mode. > > > > > > > > > > > > CPUs WITH ONLY ONE RUNNABLE TASK > > > > > > > > If a CPU has only one runnable task, there is again little point in > > > > sending it a scheduling-clock interrupt. Recall that the primary > > > > purpose of a scheduling-clock interrupt is to force a busy CPU to > > > > shift its attention among many things requiring its attention -- and > > > > there is nowhere else for a CPU with but one runnable task to shift its > > > > attention to. > > > > > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid > > > > sending scheduling-clock interrupts to CPUs with a single runnable task. > > > > This is important for applications with aggressive real-time response > > > > constraints because it allows them to improve their worst-case response > > > > times by the maximum duration of a scheduling-clock interrupt. It is also > > > > important for computationally intensive iterative workloads with short > > > > iterations: If any CPU is delayed during a given iteration, all the > > > > other CPUs will be forced to wait idle while the delayed CPU finished. > > > > Thus, the delay is multiplied by one less than the number of CPUs. > > > > In these situations, there is again strong motivation to avoid sending > > > > scheduling-clock interrupts to CPUs that have but one runnable task that > > > > is executing in user mode. > > > > > > > > The "full_nohz=" boot parameter specifies which CPUs are to be > > > > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, > > > > > > This is the first time you mention "adaptive-ticks". Probably should > > > define it before just using it, even though one should be able to figure > > > out what adaptive-ticks are, it does throw in a wrench when reading this > > > if you have no idea what an "adaptive-tick" is. > > > > Good point, changed the first sentence of this paragraph to read: > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to > > avoid sending scheduling-clock interrupts to CPUs with a single > > runnable task, and such CPUs are said to be "adaptive-ticks CPUs". > > Sounds good. > > > > > > > 6, 7, and 8 are to be adaptive-ticks CPUs. By default, no CPUs will > > > > be adaptive-ticks CPUs. Not that you are prohibited from marking all > > > > of the CPUs as adaptive-tick CPUs: At least one non-adaptive-tick CPU > > > > must remain online to handle timekeeping tasks in order to ensure that > > > > gettimeofday() returns sane values on adaptive-tick CPUs. 
> > > > > > > > Note that if a given CPU is in adaptive-ticks mode while executing in > > > > user mode, transitioning to kernel mode does not automatically force > > > > that CPU out of adaptive-ticks mode. The CPU will exit adaptive-ticks > > > > mode only if needed, for example, if that CPU enqueues an RCU callback. > > > > > > > > Just as with dyntick-idle mode, the benefits of adaptive-tick mode do > > > > not come for free: > > > > > > > > 1. CONFIG_NO_HZ_FULL depends on CONFIG_NO_HZ, so you cannot run > > > > adaptive ticks without also running dyntick idle. This dependency > > > > of CONFIG_NO_HZ_FULL on CONFIG_NO_HZ extends down into the > > > > implementation. Therefore, all of the costs of CONFIG_NO_HZ > > > > are also incurred by CONFIG_NO_HZ_FULL. > > > > > > Not a comment on this document, but on the implementation. As idle NO_HZ > > > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you > > > can't have both (no idle but full). As we only care about not letting > > > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add > > > something that keeps idle from going into nohz mode. Hmm, I think there > > > may be an option to keep the CPU from going too deep into sleep. That > > > may be a better approach. > > > > Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and > > idle=poll do the trick in this case? > > I'm not sure I would recommend idle=poll either. It would certainly > work, but it goes to the other extreme. You think NO_HZ=n drains a > battery? Try idle=poll. And a few people already run realtime on battery-powered systems, so good point... > Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait > may be better. It states that performance is the same as idle=poll (if > supported). > > Also there's a kernel parameter for x86 called intel_idle.max_cstate=X. > > As idle=poll will most likely run the processor very hot and you will > need to add more electricity not only for the computer but also for the > A/C, it would be nice to still have the CPU sleep, but just at a shallow > (fast wakeup) state. So maybe idle=mwait or intel_idle.max_cstate=? if supported, otherwise if on AC power, idle=poll plus active cooling. ;-) > Perhaps Arjan can add some input here? I would certainly like to hear it! > > If so, I do need to document it. > > > > > > 2. The user/kernel transitions are slightly more expensive due > > > > to the need to inform kernel subsystems (such as RCU) about > > > > the change in mode. > > > > > > > > 3. POSIX CPU timers on adaptive-tick CPUs may fire late (or even > > > > not at all) because they currently rely on scheduling-tick > > > > interrupts. This will likely be fixed in one of two ways: (1) > > > > Prevent CPUs with POSIX CPU timers from entering adaptive-tick > > > > mode, or (2) Use hrtimers or other adaptive-ticks-immune mechanism > > > > to cause the POSIX CPU timer to fire properly. > > > > > > > > 4. If there are more perf events pending than the hardware can > > > > accommodate, they are normally round-robined so as to collect > > > > all of them over time. Adaptive-tick mode may prevent this > > > > round-robining from happening. This will likely be fixed by > > > > preventing CPUs with large numbers of perf events pending from > > > > entering adaptive-tick mode. > > > > > > > > 5. Scheduler statistics for adaptive-idle CPUs may be computed > > > > slightly differently than those for non-adaptive-idle CPUs. > > > > This may in turn perturb load-balancing of real-time tasks. 
> > > > > > > > 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. > > > > > > > > Although improvements are expected over time, adaptive ticks is quite > > > > useful for many types of real-time and compute-intensive applications. > > > > However, the drawbacks listed above mean that adaptive ticks should not > > > > be enabled by default across the board at the current time. > > > > > > > > > > > > RCU IMPLICATIONS > > > > > > > > There are situations in which idle CPUs cannot be permitted to > > > > enter either dyntick-idle mode or adaptive-tick mode, the most > > > > familiar being the case where that CPU has RCU callbacks pending. > > > > > > > > The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such > > > > CPUs to enter dyntick-idle mode or adaptive-tick mode anyway, though a > > > > timer will awaken these CPUs every four jiffies in order to ensure that > > > > the RCU callbacks are processed in a timely fashion. > > > > > > > > Another approach is to offload RCU callback processing to "rcuo" kthreads > > > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > > > > selected via several methods: > > > > > > > > 1. One of three mutually exclusive Kconfig options specify a > > > > build-time default for the CPUs to offload: > > > > > > > > a. The RCU_NOCB_CPU_NONE=y Kconfig option results in > > > > no CPUs being offloaded. > > > > > > > > b. The RCU_NOCB_CPU_ZERO=y Kconfig option causes CPU 0 to > > > > be offloaded. > > > > > > > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > > > > to be offloaded. > > > > > > All CPUs don't have their RCU call backs on them? I'm a bit confused by > > > this. Or is it that the scheduler picks one CPU to do call backs? Does > > > this mean that to use rcu_ncbs= to be the only deciding factor, you > > > select RCU_NCB_CPU_NONE? > > > > > > I think this needs to be explained better. > > > > Does this help? > > > > c. The RCU_NOCB_CPU_ALL=y Kconfig option causes all CPUs > > to be offloaded. Note that the callbacks will be > > offloaded to "rcuo" kthreads, and that those kthreads > > will in fact run on some CPU. However, this approach > > gives fine-grained control on exactly which CPUs the > > callbacks run on, the priority that they run at (including > > the default of SCHED_OTHER), and it further allows > > this control to be varied dynamically at runtime. > > Excellent! > > > > > > > 2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated > > > > list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, > > > > 3, 4, and 5. The specified CPUs will be offloaded in addition > > > > to any CPUs specified as offloaded by RCU_NOCB_CPU_ZERO or > > > > RCU_NOCB_CPU_ALL. > > > > > > > > The offloaded CPUs never have RCU callbacks queued, and therefore RCU > > > > never prevents offloaded CPUs from entering either dyntick-idle mode or > > > > adaptive-tick mode. That said, note that it is up to userspace to > > > > pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the > > > > scheduler will decide where to run them, which might or might not be > > > > where you want them to run. > > > > > > > > > > > > KNOWN ISSUES > > > > > > > > o Dyntick-idle slows transitions to and from idle slightly. > > > > In practice, this has not been a problem except for the most > > > > aggressive real-time workloads, which have the option of disabling > > > > dyntick-idle mode, an option that most of them take. 
> > > > > > > > o Adaptive-ticks slows user/kernel transitions slightly. > > > > This is not expected to be a problem for computational-intensive > > > > workloads, which have few such transitions. Careful benchmarking > > > > will be required to determine whether or not other workloads > > > > are significantly affected by this effect. > > > > > > It should be mentioned that only CPUs that are in adaptive-tick mode > > > have this issue. Other CPUs are still using the tick based accounting, > > > right? > > ? True, but they still end up executing extra code to deal with the possibility that they are in adaptive-tick mode. > > > > > > > > > > > o Adaptive-ticks does not do anything unless there is only one > > > > runnable task for a given CPU, even though there are a number > > > > of other situations where the scheduling-clock tick is not > > > > needed. To give but one example, consider a CPU that has one > > > > runnable high-priority SCHED_FIFO task and an arbitrary number > > > > of low-priority SCHED_OTHER tasks. In this case, the CPU is > > > > required to run the SCHED_FIFO task until either it blocks or > > > > some other higher-priority task awakens on (or is assigned to) > > > > this CPU, so there is no point in sending a scheduling-clock > > > > interrupt to this CPU. > > > > > > You should point out that the example does not enable adaptive-ticks. > > > That point is hinted at, but not really expressed. That is, perhaps end > > > the paragraph with: > > > > > > "Even though the SCHED_FIFO task is the only task running, because the > > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter > > > adaptive tick mode." > > > > Again, good point! > > > > How about adding the following sentence at the end of this paragraph. > > > > However, the current implementation prohibits CPU with a single > > runnable SCHED_FIFO task and multiple runnable SCHED_OTHER > > tasks from entering adaptive-ticks mode, even though it would > > be correct to allow it to do so. > > Sure. > > > > > > > Better handling of these sorts of situations is future work. > > > > > > > > o A reboot is required to reconfigure both adaptive idle and RCU > > > > callback offloading. Runtime reconfiguration could be provided > > > > if needed, however, due to the complexity of reconfiguring RCU > > > > at runtime, there would need to be an earthshakingly good reason. > > > > Especially given the option of simply offloading RCU callbacks > > > > from all CPUs. > > > > > > When you enable for all CPUs, how do you tell what CPUs you don't want > > > the scheduler to pick for off loading? I mean, if you pick all CPUs, can > > > you at run time pick which ones should always off load and which ones > > > shouldn't? > > > > I must defer to Frederic on this one. > > Well I was actually thinking more about the RCU NOCB mode. You answered > my question above about the rcu kthreads that do the callbacks instead > of them being pinned to a CPU. Ah, color me confused. ;-) Thanx, Paul > -- Steve > > > > > > > o Additional configuration is required to deal with other sources > > > > of OS jitter, including interrupts and system-utility tasks > > > > and processes. This configuration normally involves binding > > > > interrupts and tasks to particular CPUs. > > > > > > > > o Some sources of OS jitter can currently be eliminated only by > > > > constraining the workload. 
For example, the only way to eliminate > > > > OS jitter due to global TLB shootdowns is to avoid the unmapping > > > > operations (such as kernel module unload operations) that result > > > > in these shootdowns. For another example, page faults and TLB > > > > misses can be reduced (and in some cases eliminated) by using > > > > huge pages and by constraining the amount of memory used by the > > > > application. > > > > > > > > o At least one CPU must keep the scheduling-clock interrupt going > > > > in order to support accurate timekeeping. > > > > > > Thanks for writing this up Paul! > > > > And to many other people, including yourself, for doing the actual work! > > > > Thanx, Paul > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
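A sketch of the interrupt and task binding mentioned in the OS-jitter item above, assuming a hypothetical device interrupt number 44 and a hypothetical housekeeping daemon; /proc/irq/&lt;N&gt;/smp_affinity takes a hexadecimal CPU mask:

        echo 1 > /proc/irq/44/smp_affinity          # route IRQ 44 to CPU 0 only (mask 0x1)
        taskset -cp 0 $(pidof housekeeping_daemon)  # keep utility work off the isolated CPUs

On systems running irqbalance, that daemon would also need to be stopped or configured so that it does not later override these affinity settings.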
* Re: [PATCH] nohz1: Documentation 2013-03-21 2:22 ` Paul E. McKenney @ 2013-03-21 10:16 ` Borislav Petkov 2013-03-21 15:18 ` Paul E. McKenney 0 siblings, 1 reply; 43+ messages in thread From: Borislav Petkov @ 2013-03-21 10:16 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven On Wed, Mar 20, 2013 at 07:22:59PM -0700, Paul E. McKenney wrote: > > > > > The "full_nohz=" boot parameter specifies which CPUs are to be > > > > > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, > > > > > > > > This is the first time you mention "adaptive-ticks". Probably should > > > > define it before just using it, even though one should be able to figure > > > > out what adaptive-ticks are, it does throw in a wrench when reading this > > > > if you have no idea what an "adaptive-tick" is. > > > > > > Good point, changed the first sentence of this paragraph to read: > > > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to > > > avoid sending scheduling-clock interrupts to CPUs with a single > > > runnable task, and such CPUs are said to be "adaptive-ticks CPUs". > > > > Sounds good. Yeah, so I read this last night too and I have to say, very clearly written, even for dummies like me. But this "adaptive-ticks CPUs" reads kinda strange throughout the whole text, it feels a bit weird. And since the cmdline option is called "full_nohz", you might just as well call them the "full_nohz CPUs" or the "full_nohz subset of CPUs" for simplicity and so that you don't have yet another new term in the text denoting the same idea. I mean, all those names kinda suck and need the full definition of what adaptive ticking actually means anyway. :) Btw, congrats on coining a new noun: "Adaptive-tick mode may prevent this round-robining from happening." ^^^^^^^^^^^^^^ Funny. :-) I spose now one can say: "The kids in the garden are round-robining on the carousel." or "The kernel developers are round-robined for pull requests." Or maybe it wasn't you who coined it after /me doing a little search. It looks like technical people are pushing hard for it to be committed in the upstream English language repository. :-) Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation 2013-03-21 10:16 ` Borislav Petkov @ 2013-03-21 15:18 ` Paul E. McKenney 2013-03-21 16:00 ` Borislav Petkov 0 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 15:18 UTC (permalink / raw) To: Borislav Petkov, Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven On Thu, Mar 21, 2013 at 11:16:50AM +0100, Borislav Petkov wrote: > On Wed, Mar 20, 2013 at 07:22:59PM -0700, Paul E. McKenney wrote: > > > > > > The "full_nohz=" boot parameter specifies which CPUs are to be > > > > > > adaptive-ticks CPUs. For example, "full_nohz=1,6-8" says that CPUs 1, > > > > > > > > > > This is the first time you mention "adaptive-ticks". Probably should > > > > > define it before just using it, even though one should be able to figure > > > > > out what adaptive-ticks are, it does throw in a wrench when reading this > > > > > if you have no idea what an "adaptive-tick" is. > > > > > > > > Good point, changed the first sentence of this paragraph to read: > > > > > > > > The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to > > > > avoid sending scheduling-clock interrupts to CPUs with a single > > > > runnable task, and such CPUs are said to be "adaptive-ticks CPUs". > > > > > > Sounds good. > > Yeah, > > so I read this last night too and I have to say, very clearly written, > even for dummies like me. Can't say that I think of you as a dummy, but glad you liked it! > But this "adaptive-ticks CPUs" reads kinda strange throughout the whole > text, it feels a bit weird. And since the cmdline option is called > "full_nohz", you might just as well call them the "full_nohz CPUs" or > the "full_nohz subset of CPUs" for simplicity and so that you don't have > yet another new term in the text denoting the same idea. I mean, all > those names kinda suck and need the full definition of what adaptive > ticking actually means anyway. :) I am happy with either "adaptive-ticks CPUs" or "full_nohz CPUs", and leave the choice to Frederic. > Btw, congrats on coining a new noun: "Adaptive-tick mode may prevent > this round-robining from happening." > ^^^^^^^^^^^^^^ Actually, this is a generic transformation. Given an English verb, you almost always add "ing" to create a noun. Since "round-robin" is used as a verb, as in "The scheduler will round-robin between the two SCHED_RR tasks", "round-robining" may be used as a noun denoting the action corresponding to the verb "round-robin". There is no doubt an argument as to whether this should be spelled "round-robining" or "round-robinning", but I will leave this to those who care enough to argue about it. ;-) > Funny. :-) > > I spose now one can say: "The kids in the garden are round-robining on > the carousel." > > or > > "The kernel developers are round-robined for pull requests." ;-) > Or maybe it wasn't you who coined it after /me doing a little search. It > looks like technical people are pushing hard for it to be committed in > the upstream English language repository. :-) The thing about English is that it is an open-source language, and always has been. English is defined by its usage, and the wise dictionary-makers try their best to keep up. (The unwise ones attempt to stop the evolution of the English language.) Everything good and everything bad about English stems from this property. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation 2013-03-21 15:18 ` Paul E. McKenney @ 2013-03-21 16:00 ` Borislav Petkov 0 siblings, 0 replies; 43+ messages in thread From: Borislav Petkov @ 2013-03-21 16:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven On Thu, Mar 21, 2013 at 08:18:11AM -0700, Paul E. McKenney wrote: > Actually, this is a generic transformation. Given an English verb, > you almost always add "ing" to create a noun. Since "round-robin" is > used as a verb, ... which sounds, in this case, weird IMHO. :-) > as in "The scheduler will round-robin between the two SCHED_RR > tasks", I think the "correct" way to say it is "The scheduler will select tasks in a round-robin fashion..." But while it is correct (for some accepted definition of correct), this is slow, has too many words and we don't want that - we want fast! We want a lot less instructions in the pipe! This way, we burn a lot less energy when talking. :-) > "round-robining" may be used as a noun denoting the action > corresponding to the verb "round-robin". There is no doubt an > argument as to whether this should be spelled "round-robining" or > "round-robinning", but I will leave this to those who care enough to > argue about it. ;-) Hey sir, you're preaching to the choir - I'm all for doing all kinds of weird/funny experiments with language... > The thing about English is that it is an open-source language, and > always has been. English is defined by its usage, and the wise > dictionary-makers try their best to keep up. ... yes, and then there are the English language Nazis who wouldn't allow that - their rules are stricter than software APIs and breaking userspace compatibility. Technical people, OTOH, are much more willing and not afraid to take the language and mold it in such a form so that it works for them instead of adhering to ancient rules. Which is cool. That's why I was pointing out the "round-robining" - nice and cool. And look how much shorter it is: round-robining = iterate over the items on a list by periodically switching from one to the next in a circular order. Now imagine the pressure on I$ the two versions create. And compare. :-) > (The unwise ones attempt to stop the evolution of the English > language.) Everything good and everything bad about English stems from > this property. ;-) Yeah, I've had to deal with enough of those evolution-stopping idiots during my days at the university. Well, I've got three words for them: "Resistance is futile!" :-) -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation 2013-03-21 0:27 ` Steven Rostedt 2013-03-21 2:22 ` Paul E. McKenney @ 2013-03-21 15:45 ` Arjan van de Ven 2013-03-21 17:18 ` Paul E. McKenney 2013-03-22 4:59 ` Rob Landley 2013-03-21 18:01 ` Frederic Weisbecker 2 siblings, 2 replies; 43+ messages in thread From: Arjan van de Ven @ 2013-03-21 15:45 UTC (permalink / raw) To: Steven Rostedt Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On 3/20/2013 5:27 PM, Steven Rostedt wrote: > I'm not sure I would recommend idle=poll either. It would certainly > work, but it goes to the other extreme. You think NO_HZ=n drains a > battery? Try idle=poll. do not ever use idle=poll on anything production.. really bad idea. if you temporarily cannot cope with the latency, you can use the PMQOS system to limit (including going all the way to idle=poll). but using idle=poll completely is very nasty for the hardware. In addition we should document that idle=poll will cost you peak performance, possibly quite a bit. the same is true for the kernel parameter to some extent; it's there to work around broken BIOSes/hardware/etc; if you have a latency/runtime requirement, it's much better to use PMQOS for this from userspace. ^ permalink raw reply [flat|nested] 43+ messages in thread
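To make the PM QoS suggestion concrete, a minimal userspace sketch. It assumes the /dev/cpu_dma_latency interface, which accepts a binary 32-bit latency target in microseconds and honors the request only for as long as the file descriptor stays open:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int32_t latency_us = 0; /* 0 requests the shallowest C-state */
                int fd = open("/dev/cpu_dma_latency", O_RDWR);

                if (fd < 0) {
                        perror("open /dev/cpu_dma_latency");
                        return 1;
                }
                if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
                        perror("write");
                        return 1;
                }
                pause();        /* constraint is dropped when fd closes, e.g. at exit */
                return 0;
        }

Unlike idle=poll, this confines the deep-sleep restriction to the lifetime of one latency-sensitive application instead of imposing it for the whole uptime.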
* Re: [PATCH] nohz1: Documentation 2013-03-21 15:45 ` Arjan van de Ven @ 2013-03-21 17:18 ` Paul E. McKenney 2013-03-21 17:41 ` Arjan van de Ven 2013-03-22 4:59 ` Rob Landley 1 sibling, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 17:18 UTC (permalink / raw) To: Arjan van de Ven Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 08:45:07AM -0700, Arjan van de Ven wrote: > On 3/20/2013 5:27 PM, Steven Rostedt wrote: > >I'm not sure I would recommend idle=poll either. It would certainly > >work, but it goes to the other extreme. You think NO_HZ=n drains a > >battery? Try idle=poll. > > do not ever use idle=poll on anything production.. really bad idea. > > if you temporarily cannot cope with the latency, you can use the PMQOS system > to limit (including going all the way to idle=poll). > but using idle=poll completely is very nasty for the hardware. > > In addition we should document that idle=poll will cost you peak performance, > possibly quite a bit. > > the same is true for the kernel parameter to some extent; it's there to work around > broken BIOSes/hardware/etc; if you have a latency/runtime requirement, it's much better > to use PMQOS for this from userspace. Thank you for the info, Arjan! Does the following capture the tradeoffs? o Dyntick-idle slows transitions to and from idle slightly. In practice, this has not been a problem except for the most aggressive real-time workloads, which have the option of disabling dyntick-idle mode, an option that most of them take. However, some workloads will no doubt want to use adaptive ticks to eliminate scheduling-clock-tick latencies. Here are some options for these workloads: o Use PMQOS from userspace to inform the kernel of your latency requirements (preferred). o Use the "idle=mwait" boot parameter. o Use the "intel_idle.max_cstate=" boot parameter to limit the maximum C-state depth. o Use the "idle=poll" boot parameter. However, please note that use of this parameter can cause your CPU to overheat, which may cause thermal throttling to degrade your latencies -- and that this degradation can be even worse than that of dyntick-idle. Thanx, Paul ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] nohz1: Documentation 2013-03-21 17:18 ` Paul E. McKenney @ 2013-03-21 17:41 ` Arjan van de Ven 2013-03-21 18:02 ` Paul E. McKenney 0 siblings, 1 reply; 43+ messages in thread From: Arjan van de Ven @ 2013-03-21 17:41 UTC (permalink / raw) To: paulmck Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On 3/21/2013 10:18 AM, Paul E. McKenney wrote: > o Use the "idle=poll" boot parameter. However, please note > that use of this parameter can cause your CPU to overheat, > which may cause thermal throttling to degrade your > latencies -- and that this degradation can be even worse > than that of dyntick-idle. It also disables (effectively) Turbo Mode on Intel CPUs... which can cost you a serious percentage of performance.
* Re: [PATCH] nohz1: Documentation 2013-03-21 17:41 ` Arjan van de Ven @ 2013-03-21 18:02 ` Paul E. McKenney 2013-03-22 18:37 ` Kevin Hilman 0 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 18:02 UTC (permalink / raw) To: Arjan van de Ven Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote: > On 3/21/2013 10:18 AM, Paul E. McKenney wrote: > > o Use the "idle=poll" boot parameter. However, please note > > that use of this parameter can cause your CPU to overheat, > > which may cause thermal throttling to degrade your > > latencies -- and that this degradation can be even worse > > than that of dyntick-idle. > > It also disables (effectively) Turbo Mode on Intel CPUs... which can cost you a serious percentage of performance. Thank you, added! Please see below for the updated list. Thanx, Paul ------------------------------------------------------------------------ o Dyntick-idle slows transitions to and from idle slightly. In practice, this has not been a problem except for the most aggressive real-time workloads, which have the option of disabling dyntick-idle mode, an option that most of them take. However, some workloads will no doubt want to use adaptive ticks to eliminate scheduling-clock-tick latencies. Here are some options for these workloads: a. Use PMQOS from userspace to inform the kernel of your latency requirements (preferred). b. Use the "idle=mwait" boot parameter. c. Use the "intel_idle.max_cstate=" boot parameter to limit the maximum C-state depth. d. Use the "idle=poll" boot parameter. However, please note that use of this parameter can cause your CPU to overheat, which may cause thermal throttling to degrade your latencies -- and that this degradation can be even worse than that of dyntick-idle. Furthermore, this parameter effectively disables Turbo Mode on Intel CPUs, which can significantly reduce maximum performance.
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:02 ` Paul E. McKenney @ 2013-03-22 18:37 ` Kevin Hilman 2013-03-22 19:25 ` Paul E. McKenney 0 siblings, 1 reply; 43+ messages in thread From: Kevin Hilman @ 2013-03-22 18:37 UTC (permalink / raw) To: paulmck Cc: Arjan van de Ven, Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, geoff, tglx "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote: >> On 3/21/2013 10:18 AM, Paul E. McKenney wrote: >> > o Use the "idle=poll" boot parameter. However, please note >> > that use of this parameter can cause your CPU to overheat, >> > which may cause thermal throttling to degrade your >> > latencies -- and that this degradation can be even worse >> > than that of dyntick-idle. >> >> It also disables (effectively) Turbo Mode on Intel CPUs... which can >> cost you a serious percentage of performance. > > Thank you, added! Please see below for the updated list. > > Thanx, Paul > > ------------------------------------------------------------------------ > > o Dyntick-idle slows transitions to and from idle slightly. > In practice, this has not been a problem except for the most > aggressive real-time workloads, which have the option of disabling > dyntick-idle mode, an option that most of them take. However, > some workloads will no doubt want to use adaptive ticks to > eliminate scheduling-clock-tick latencies. Here are some > options for these workloads: > > a. Use PMQOS from userspace to inform the kernel of your > latency requirements (preferred). This is not only the preferred approach, but the *only* approach available on non-x86 systems. Perhaps the others should be marked as x86-only? Kevin > b. Use the "idle=mwait" boot parameter. > > c. Use the "intel_idle.max_cstate=" boot parameter to limit > the maximum C-state depth. > > d. Use the "idle=poll" boot parameter. However, please note > that use of this parameter can cause your CPU to overheat, > which may cause thermal throttling to degrade your > latencies -- and that this degradation can be even worse > than that of dyntick-idle. Furthermore, this parameter > effectively disables Turbo Mode on Intel CPUs, which > can significantly reduce maximum performance.
* Re: [PATCH] nohz1: Documentation 2013-03-22 18:37 ` Kevin Hilman @ 2013-03-22 19:25 ` Paul E. McKenney 0 siblings, 0 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-22 19:25 UTC (permalink / raw) To: Kevin Hilman Cc: Arjan van de Ven, Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, geoff, tglx On Fri, Mar 22, 2013 at 11:37:55AM -0700, Kevin Hilman wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes: > > > On Thu, Mar 21, 2013 at 10:41:30AM -0700, Arjan van de Ven wrote: > >> On 3/21/2013 10:18 AM, Paul E. McKenney wrote: > >> > o Use the "idle=poll" boot parameter. However, please note > >> > that use of this parameter can cause your CPU to overheat, > >> > which may cause thermal throttling to degrade your > >> > latencies -- and that this degradation can be even worse > >> > than that of dyntick-idle. > >> > >> It also disables (effectively) Turbo Mode on Intel CPUs... which can > >> cost you a serious percentage of performance. > > > > Thank you, added! Please see below for the updated list. > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > o Dyntick-idle slows transitions to and from idle slightly. > > In practice, this has not been a problem except for the most > > aggressive real-time workloads, which have the option of disabling > > dyntick-idle mode, an option that most of them take. However, > > some workloads will no doubt want to use adaptive ticks to > > eliminate scheduling-clock-tick latencies. Here are some > > options for these workloads: > > > > a. Use PMQOS from userspace to inform the kernel of your > > latency requirements (preferred). > > This is not only the preferred approach, but the *only* approach > available on non-x86 systems. Perhaps the others should be marked as > x86-only? Good point, added that. Thanx, Paul > Kevin > > > b. Use the "idle=mwait" boot parameter. > > > > c. Use the "intel_idle.max_cstate=" boot parameter to limit > > the maximum C-state depth. > > > > d. Use the "idle=poll" boot parameter. However, please note > > that use of this parameter can cause your CPU to overheat, > > which may cause thermal throttling to degrade your > > latencies -- and that this degradation can be even worse > > than that of dyntick-idle. Furthermore, this parameter > > effectively disables Turbo Mode on Intel CPUs, which > > can significantly reduce maximum performance.
* Re: [PATCH] nohz1: Documentation 2013-03-21 15:45 ` Arjan van de Ven 2013-03-21 17:18 ` Paul E. McKenney @ 2013-03-22 4:59 ` Rob Landley 1 sibling, 0 replies; 43+ messages in thread From: Rob Landley @ 2013-03-22 4:59 UTC (permalink / raw) To: Arjan van de Ven Cc: Steven Rostedt, paulmck, Frederic Weisbecker, linux-kernel, josh, zhong, khilman, geoff, tglx On 03/21/2013 10:45:07 AM, Arjan van de Ven wrote: > On 3/20/2013 5:27 PM, Steven Rostedt wrote: >> I'm not sure I would recommend idle=poll either. It would certainly >> work, but it goes to the other extreme. You think NO_HZ=n drains a >> battery? Try idle=poll. > > Do not ever use idle=poll on anything production... really bad idea. > > If you temporarily cannot cope with the latency, you can use the PMQOS > system to limit it (including going all the way to idle=poll behavior), > but using idle=poll outright is very nasty for the hardware. > > In addition, we should document that idle=poll will cost you peak > performance, possibly quite a bit. Where should that be documented? > The same is true for the kernel parameter to some extent; it's there > to work around broken BIOSes/hardware/etc.; if you have a latency/runtime > requirement, it's much better to use PMQOS for this from userspace. I googled and found http://elinux.org/images/f/f9/Elc2008_pm_qos_slides.pdf Rob
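For concreteness, here is a minimal sketch of the userspace PMQOS approach recommended above, using the /dev/cpu_dma_latency interface covered in those slides (and in Documentation/power/pm_qos_interface.txt). The interface takes a binary s32 latency target in microseconds, and the request stays in force only while the file descriptor is held open; the 10-microsecond target below is an arbitrary assumption about the application's requirement, not a recommendation.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int32_t target_us = 10;  /* assumed latency requirement */
            int fd = open("/dev/cpu_dma_latency", O_WRONLY);

            if (fd < 0 || write(fd, &target_us, sizeof(target_us)) != sizeof(target_us)) {
                    perror("cpu_dma_latency");
                    return 1;
            }
            /* The constraint holds only while the fd remains open. */
            pause();   /* latency-sensitive work would run here */
            close(fd); /* closing restores the default C-state policy */
            return 0;
    }

Note the contrast with idle=poll: deep C-states are forbidden only while the latency-sensitive application is actually running, which is exactly the property Arjan is arguing for.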
* Re: [PATCH] nohz1: Documentation 2013-03-21 0:27 ` Steven Rostedt 2013-03-21 2:22 ` Paul E. McKenney 2013-03-21 15:45 ` Arjan van de Ven @ 2013-03-21 18:01 ` Frederic Weisbecker 2013-03-21 18:26 ` Paul E. McKenney 2 siblings, 1 reply; 43+ messages in thread From: Frederic Weisbecker @ 2013-03-21 18:01 UTC (permalink / raw) To: Steven Rostedt Cc: paulmck, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven 2013/3/21 Steven Rostedt <rostedt@goodmis.org>: > [ Added Arjan in case he has anything to add about the idle=poll below ] > > > On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote: >> On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote: >> > Not a comment on this document, but on the implementation. As idle NO_HZ >> > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you >> > can't have both (no idle but full). As we only care about not letting >> > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add >> > something that keeps idle from going into nohz mode. Hmm, I think there >> > may be an option to keep the CPU from going too deep into sleep. That >> > may be a better approach. >> >> Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and >> idle=poll do the trick in this case? > > I'm not sure I would recommend idle=poll either. It would certainly > work, but it goes to the other extreme. You think NO_HZ=n drains a > battery? Try idle=poll. > > Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait > may be better. It states that performance is the same as idle=poll (if > supported). > > Also there's a kernel parameter for x86 called intel_idle.max_cstate=X. > > As idle=poll will most likely run the processor very hot and you will > need to add more electricity not only for the computer but also for the > A/C, it would be nice to still have the CPU sleep, but just at a shallow > (fast wakeup) state. > > Perhaps Arjan can add some input here? But I note that it's an interesting usecase. Maybe we'll want to make CONFIG_NO_HZ_FULL (or whatever it's going to be called) not depend on CONFIG_NO_HZ_IDLE in the long run. We'll see. Also, just a guess, but with dynticks-idle, wakeup from a deep CPU sleep state may not be the only latency source. Reprogramming the timer tick on idle exit may be another one? I'm not sure how fast it is to write to the clock device; I suspect it's not free. So probably you would like to get rid of the entire dynticks-idle infrastructure for real time.
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:01 ` Frederic Weisbecker @ 2013-03-21 18:26 ` Paul E. McKenney 0 siblings, 0 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 18:26 UTC (permalink / raw) To: Frederic Weisbecker Cc: Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx, Arjan van de Ven On Thu, Mar 21, 2013 at 07:01:01PM +0100, Frederic Weisbecker wrote: > 2013/3/21 Steven Rostedt <rostedt@goodmis.org>: > > [ Added Arjan in case he has anything to add about the idle=poll below ] > > > > > > On Wed, 2013-03-20 at 16:55 -0700, Paul E. McKenney wrote: > >> On Wed, Mar 20, 2013 at 07:32:18PM -0400, Steven Rostedt wrote: > >> > Not a comment on this document, but on the implementation. As idle NO_HZ > >> > can hurt RT, but RT would want to have full NO_HZ, it's a shame that you > >> > can't have both (no idle but full). As we only care about not letting > >> > the CPU go into deep sleep, I wonder if it wouldn't be too hard to add > >> > something that keeps idle from going into nohz mode. Hmm, I think there > >> > may be an option to keep the CPU from going too deep into sleep. That > >> > may be a better approach. > >> > >> Would the combination of CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, and > >> idle=poll do the trick in this case? > > > > I'm not sure I would recommend idle=poll either. It would certainly > > work, but it goes to the other extreme. You think NO_HZ=n drains a > > battery? Try idle=poll. > > > > Looking at Documentation/kernel-parameters.txt, it looks like idle=mwait > > may be better. It states that performance is the same as idle=poll (if > > supported). > > > > Also there's a kernel parameter for x86 called intel_idle.max_cstate=X. > > > > As idle=poll will most likely run the processor very hot and you will > > need to add more electricity not only for the computer but also for the > > A/C, it would be nice to still have the CPU sleep, but just at a shallow > > (fast wakeup) state. > > > > Perhaps Arjan can add some input here? > > But I note that it's an interesting usecase. Maybe we'll want to make > CONFIG_NO_HZ_FULL (or whatever it's going to be called) not depend on > CONFIG_NO_HZ_IDLE in the long run. > > We'll see. > > Also, just a guess, but with dynticks-idle, wakeup from a deep CPU > sleep state may not be the only latency source. Reprogramming the timer > tick on idle exit may be another one? I'm not sure how fast it is to write > to the clock device; I suspect it's not free. So probably you > would like to get rid of the entire dynticks-idle infrastructure for > real time. Agreed, and the first known-issues bullet calls that possibility out. Thanx, Paul
* Re: [PATCH] nohz1: Documentation 2013-03-20 23:55 ` Paul E. McKenney 2013-03-21 0:27 ` Steven Rostedt @ 2013-03-21 16:08 ` Christoph Lameter 2013-03-21 17:15 ` Paul E. McKenney 1 sibling, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-21 16:08 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Wed, 20 Mar 2013, Paul E. McKenney wrote: > > > Another approach is to offload RCU callback processing to "rcuo" kthreads > > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > > > selected via several methods: Why are there multiple rcuo threads? Would a single thread that may be able to run on multiple CPUs not be sufficient? > > "Even though the SCHED_FIFO task is the only task running, because the > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter > > adaptive tick mode." > > Again, good point! Uggh. That will cause problems and did cause problems when I tried to use nohz. The OS always has some SCHED_OTHER tasks around that become runnable after a while (like for example the vm statistics update, or the notorious slab scanning). As long as SCHED_FIFO is active and there is no process in the same scheduling class, then the tick needs to be off. I also wish that this would work with SCHED_OTHER if there is only a single task with a certain renice value (-10?) and the rest are runnable at lower priorities. Maybe in that case stop the tick for a longer period and then give the lower-priority tasks a chance to run, but then switch off the tick again.
* Re: [PATCH] nohz1: Documentation 2013-03-21 16:08 ` Christoph Lameter @ 2013-03-21 17:15 ` Paul E. McKenney 2013-03-21 18:39 ` Christoph Lameter 2013-03-21 18:44 ` Steven Rostedt 0 siblings, 2 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 17:15 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 04:08:08PM +0000, Christoph Lameter wrote: > On Wed, 20 Mar 2013, Paul E. McKenney wrote: > > > > > Another approach is to offload RCU callback processing to "rcuo" kthreads > > > > using the CONFIG_RCU_NOCB_CPU=y. The specific CPUs to offload may be > > > > selected via several methods: > > Why are there multiple rcuo threads? Would a single thread that may be > able to run on multiple CPUs not be sufficient? In many cases, this would indeed be sufficient. However, if you have enough CPUs posting RCU callbacks, then the single thread would become a bottleneck, eventually resulting in an OOM. Per-CPU kthreads avoid this possibility. That said, if you know that your workload's RCU callbacks could be serviced by a single CPU, you can bind all the rcuo kthreads to a single CPU. > > > "Even though the SCHED_FIFO task is the only task running, because the > > > SCHED_OTHER tasks are queued on the CPU, it currently will not enter > > > adaptive tick mode." > > > > Again, good point! > > Uggh. That will cause problems and did cause problems when I tried to use > nohz. > > The OS always has some SCHED_OTHER tasks around that become runnable after > a while (like for example the vm statistics update, or the notorious slab > scanning). As long as SCHED_FIFO is active and there is no process in the > same scheduling class, then the tick needs to be off. I also wish that this would > work with SCHED_OTHER if there is only a single task with a certain renice > value (-10?) and the rest are runnable at lower priorities. Maybe in that > case stop the tick for a longer period and then give the lower-priority > tasks a chance to run, but then switch off the tick again. These sound to me like good future enhancements. In the meantime, one approach is to bind all these SCHED_OTHER tasks to designated housekeeping CPU(s) that don't run your main workload. Thanx, Paul
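As an aside, the "bind all the rcuo kthreads to a single CPU" suggestion above can be scripted from userspace. The following hedged sketch scans /proc for tasks whose comm begins with "rcuo" and pins them to a housekeeping CPU; the choice of CPU 0 and the name-prefix match are assumptions about the particular system, and error handling is abbreviated.

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
            cpu_set_t set;
            DIR *proc = opendir("/proc");
            struct dirent *de;
            char path[64], comm[64];

            CPU_ZERO(&set);
            CPU_SET(0, &set);  /* assumption: CPU 0 is the housekeeping CPU */

            while (proc && (de = readdir(proc)) != NULL) {
                    pid_t pid = atoi(de->d_name);
                    FILE *f;

                    if (pid <= 0)
                            continue;  /* skip non-PID entries such as "self" */
                    snprintf(path, sizeof(path), "/proc/%d/comm", pid);
                    f = fopen(path, "r");
                    if (!f)
                            continue;
                    if (fgets(comm, sizeof(comm), f) && strncmp(comm, "rcuo", 4) == 0 &&
                        sched_setaffinity(pid, sizeof(set), &set) != 0)
                            perror("sched_setaffinity");
                    fclose(f);
            }
            if (proc)
                    closedir(proc);
            return 0;
    }

The same loop with a different name match could corral other migratable SCHED_OTHER kthreads onto the housekeeping CPU; truly per-CPU kthreads refuse the affinity change, which is the limitation raised next.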
* Re: [PATCH] nohz1: Documentation 2013-03-21 17:15 ` Paul E. McKenney @ 2013-03-21 18:39 ` Christoph Lameter 2013-03-21 18:58 ` Paul E. McKenney 2013-03-21 18:44 ` Steven Rostedt 1 sibling, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-21 18:39 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > Why are there multiple rcuo threads? Would a single thread that may be > > able to run on multiple CPUs not be sufficient? > > In many cases, this would indeed be sufficient. However, if you have > enough CPUs posting RCU callbacks, then the single thread would become > a bottleneck, eventually resulting in an OOM. Per-CPU kthreads avoid > this possibility. Spawn another if the load gets too high for a single CPU? > That said, if you know that your workload's RCU callbacks could be > serviced by a single CPU, you can bind all the rcuo kthreads to a > single CPU. Yeah, doing that right now, but I'd like to see it handled without manual intervention. > > > Again, good point! > > > > Uggh. That will cause problems and did cause problems when I tried to use > > nohz. > > > > The OS always has some SCHED_OTHER tasks around that become runnable after > > a while (like for example the vm statistics update, or the notorious slab > > scanning). As long as SCHED_FIFO is active and there is no process in the > > same scheduling class, then the tick needs to be off. I also wish that this would > > work with SCHED_OTHER if there is only a single task with a certain renice > > value (-10?) and the rest are runnable at lower priorities. Maybe in that > > case stop the tick for a longer period and then give the lower-priority > > tasks a chance to run, but then switch off the tick again. > > These sound to me like good future enhancements. > > In the meantime, one approach is to bind all these SCHED_OTHER tasks > to designated housekeeping CPU(s) that don't run your main workload. One cannot bind kevent threads and other per-CPU threads to another processor. So right now there is no way to avoid this issue.
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:39 ` Christoph Lameter @ 2013-03-21 18:58 ` Paul E. McKenney 2013-03-21 20:04 ` Christoph Lameter 2013-03-22 19:01 ` Kevin Hilman 1 sibling, 2 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 18:58 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 06:39:09PM +0000, Christoph Lameter wrote: > On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > > > Why are there multiple rcuo threads? Would a single thread that may be > > > able to run on multiple CPUs not be sufficient? > > > > In many cases, this would indeed be sufficient. However, if you have > > enough CPUs posting RCU callbacks, then the single thread would become > > a bottleneck, eventually resulting in an OOM. Per-CPU kthreads avoid > > this possibility. > > Spawn another if the load gets too high for a single CPU? > > > That said, if you know that your workload's RCU callbacks could be > > serviced by a single CPU, you can bind all the rcuo kthreads to a > > single CPU. > > Yeah, doing that right now, but I'd like to see it handled without manual > intervention. Given that RCU has no idea where you want them to run, some manual intervention would most likely be required even if RCU spawned them dynamically, right? > > > > Again, good point! > > > > > > Uggh. That will cause problems and did cause problems when I tried to use > > > nohz. > > > > > > The OS always has some SCHED_OTHER tasks around that become runnable after > > > a while (like for example the vm statistics update, or the notorious slab > > > scanning). As long as SCHED_FIFO is active and there is no process in the > > > same scheduling class, then the tick needs to be off. I also wish that this would > > > work with SCHED_OTHER if there is only a single task with a certain renice > > > value (-10?) and the rest are runnable at lower priorities. Maybe in that > > > case stop the tick for a longer period and then give the lower-priority > > > tasks a chance to run, but then switch off the tick again. > > > > These sound to me like good future enhancements. > > > > In the meantime, one approach is to bind all these SCHED_OTHER tasks > > to designated housekeeping CPU(s) that don't run your main workload. > > One cannot bind kevent threads and other per-CPU threads to another > processor. So right now there is no way to avoid this issue. Yep, my approach works only for those threads that are free to migrate. Of course, in some cases, you can avoid per-CPU threads running by pinning interrupts, avoiding certain operations in your workload, and so on. So, again, removing scheduling-clock interrupts in more situations is a good future enhancement. Thanx, Paul
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:58 ` Paul E. McKenney @ 2013-03-21 20:04 ` Christoph Lameter 2013-03-21 20:42 ` Frederic Weisbecker ` (2 more replies) 0 siblings, 3 replies; 43+ messages in thread From: Christoph Lameter @ 2013-03-21 20:04 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > Yeah, doing that right now, but I'd like to see it handled without manual > > intervention. > > Given that RCU has no idea where you want them to run, some manual > intervention would most likely be required even if RCU spawned them > dynamically, right? If rcuoXX is a SCHED_OTHER process/thread, then the kernel will move it to a processor other than the one running the SCHED_FIFO task. There would be no manual intervention required. > So, again, removing scheduling-clock interrupts in more situations is > a good future enhancement. The point here is that the check for a single runnable process is wrong because it accounts for tasks in all scheduling classes. It would be better to check whether there is only one runnable task in the highest scheduling class. That would work, deferring the SCHED_OTHER kernel threads while the SCHED_FIFO thread runs. I am wondering how you can actually get NOHZ to work right. There is always a kernel thread that is scheduled a couple of ticks into the future. I guess what will happen with this patchset is: 1. SCHED_FIFO thread begins to run. There is only a single runnable task, so adaptive tick mode is enabled. 2. After 2 seconds or so, something or other needs to run (e.g., the keventd thread needs to run the vm statistics update). It becomes runnable. nr_running > 1. Adaptive tick mode is disabled? That is what occurs on my system. Or is there some other trick to avoid kernel threads becoming runnable? 3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to run with the tick. The kernel thread is also runnable but will not be given CPU time since the SCHED_FIFO thread has priority? So the SCHED_FIFO thread enjoys 2 seconds of tick-free time, and then ticks occur uselessly from there on? I have not been able to consistently get the tick switched off with the nohz patchset. How do others use nohz? Is it only usable for short periods of less than 2 seconds?
* Re: [PATCH] nohz1: Documentation 2013-03-21 20:04 ` Christoph Lameter @ 2013-03-21 20:42 ` Frederic Weisbecker 2013-03-21 21:02 ` Christoph Lameter 2013-03-21 20:50 ` Paul E. McKenney 2013-03-22 9:52 ` Mats Liljegren 2 siblings, 1 reply; 43+ messages in thread From: Frederic Weisbecker @ 2013-03-21 20:42 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx 2013/3/21 Christoph Lameter <cl@linux.com>: > On Thu, 21 Mar 2013, Paul E. McKenney wrote: >> So, again, removing scheduling-clock interrupts in more situations is >> a good future enhancement. > > The point here is that the check for a single runnable process is wrong > because it accounts for tasks in all scheduling classes. > > It would be better to check whether there is only one runnable task in the > highest scheduling class. That would work, deferring the SCHED_OTHER kernel > threads while the SCHED_FIFO thread runs. It sounds that simple, but it's more complicated. It requires some more hooks on the scheduler, namely in the sched_switch/dequeue path, so that when the last task of a class goes to sleep, we check what else is running and whether we need to restart the tick or not, depending on the class of the next task and how many tasks there are. This will probably need to go in sched_class::dequeue_task(). This requires some careful and subtle attention. Of course we want to improve that in the long run. But for now we have a KISS solution that works. And like Steve said, the patchset is complicated enough. We're moving forward in baby steps to ease the upstream integration. > I am wondering how you can actually get NOHZ to work right. There is > always a kernel thread that is scheduled a couple of ticks into the future. > > I guess what will happen with this patchset is: > > 1. SCHED_FIFO thread begins to run. There is only a single runnable task, > so adaptive tick mode is enabled. > > 2. After 2 seconds or so, something or other needs to run (e.g., the keventd thread > needs to run the vm statistics update). It becomes runnable. nr_running > 1. > Adaptive tick mode is disabled? That is what occurs on my system. Or is there some > other trick to avoid kernel threads becoming runnable? > > 3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to > run with the tick. The kernel thread is also runnable but will not be > given CPU time since the SCHED_FIFO thread has priority? > > So the SCHED_FIFO thread enjoys 2 seconds of tick-free time, and then ticks > occur uselessly from there on? > > > I have not been able to consistently get the tick switched off with > the nohz patchset. How do others use nohz? Is it only usable for short > periods of less than 2 seconds? Sure, for now just don't use SCHED_FIFO and you will have much more extended dynticks coverage.
* Re: [PATCH] nohz1: Documentation 2013-03-21 20:42 ` Frederic Weisbecker @ 2013-03-21 21:02 ` Christoph Lameter 2013-03-21 21:06 ` Frederic Weisbecker 0 siblings, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-21 21:02 UTC (permalink / raw) To: Frederic Weisbecker Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 21 Mar 2013, Frederic Weisbecker wrote: > Sure, for now just don't use SCHED_FIFO and you will have much more > extended dynticks coverage. Ah, OK. Important information. That would mean no tick for the 2-second intervals between the vm stats updates, etc. Much, much better than now, where we have a tick 1000 times per second.
* Re: [PATCH] nohz1: Documentation 2013-03-21 21:02 ` Christoph Lameter @ 2013-03-21 21:06 ` Frederic Weisbecker 0 siblings, 0 replies; 43+ messages in thread From: Frederic Weisbecker @ 2013-03-21 21:06 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx 2013/3/21 Christoph Lameter <cl@linux.com>: > On Thu, 21 Mar 2013, Frederic Weisbecker wrote: > >> Sure, for now just don't use SCHED_FIFO and you will have much more >> extended dynticks coverage. > > Ah, OK. Important information. That would mean no tick for the 2-second > intervals between the vm stats updates, etc. Much, much better than now, > where we have a tick 1000 times per second. I can't guarantee no tick; there can be many reasons for the tick to happen. But if you don't run SCHED_FIFO, the pending kernel thread can execute quickly and give the CPU back to your lone task, and then the tick can shut down again.
* Re: [PATCH] nohz1: Documentation 2013-03-21 20:04 ` Christoph Lameter 2013-03-21 20:42 ` Frederic Weisbecker @ 2013-03-21 20:50 ` Paul E. McKenney 2013-03-22 14:38 ` Christoph Lameter 2013-03-22 9:52 ` Mats Liljegren 2 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 20:50 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 08:04:08PM +0000, Christoph Lameter wrote: > On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > > > Yeah, doing that right now, but I'd like to see it handled without manual > > > intervention. > > > > Given that RCU has no idea where you want them to run, some manual > > intervention would most likely be required even if RCU spawned them > > dynamically, right? > > If rcuoXX is a SCHED_OTHER process/thread, then the kernel will move it to > a processor other than the one running the SCHED_FIFO task. There would be > no manual intervention required. Assuming that the SCHED_FIFO task was running at the time that RCU decided to spawn the kthread, and assuming that there was at least one CPU not running a SCHED_FIFO task, agreed. But these assumptions do not hold in general. > > So, again, removing scheduling-clock interrupts in more situations is > > a good future enhancement. > > The point here is that the check for a single runnable process is wrong > because it accounts for tasks in all scheduling classes. Incomplete, yes. Only a starting point, yes. Wrong, no. > It would be better to check whether there is only one runnable task in the > highest scheduling class. That would work, deferring the SCHED_OTHER kernel > threads while the SCHED_FIFO thread runs. Agreed, that would be better. Hopefully we will handle that and other similar cases at some point. > I am wondering how you can actually get NOHZ to work right. There is > always a kernel thread that is scheduled a couple of ticks into the future. > > I guess what will happen with this patchset is: > > 1. SCHED_FIFO thread begins to run. There is only a single runnable task, > so adaptive tick mode is enabled. Yep. > 2. After 2 seconds or so, something or other needs to run (e.g., the keventd thread > needs to run the vm statistics update). It becomes runnable. nr_running > 1. > Adaptive tick mode is disabled? That is what occurs on my system. Or is there some > other trick to avoid kernel threads becoming runnable? Yes, adaptive tick mode would be disabled at that point. > 3. Now there are 2 runnable processes. The SCHED_FIFO thread continues to > run with the tick. The kernel thread is also runnable but will not be > given CPU time since the SCHED_FIFO thread has priority? Yep. > So the SCHED_FIFO thread enjoys 2 seconds of tick-free time, and then ticks > occur uselessly from there on? If the SCHED_FIFO thread never sleeps at all, this would be the outcome. On the other hand, if the SCHED_FIFO thread never sleeps at all, the various per-CPU kthreads are deferred forever, which might not be so good long term. If the SCHED_FIFO thread does sleep at some point, the SCHED_OTHER threads would run, the CPU would go idle, and then when the SCHED_OTHER thread started up again, it would start up in adaptive-idle mode. > I have not been able to consistently get the tick switched off with > the nohz patchset. How do others use nohz? Is it only usable for short > periods of less than 2 seconds? I believe that many other SCHED_FIFO users run their SCHED_FIFO threads in short bursts to respond to some real-time event.
They would not tend to have a SCHED_FIFO thread with a busy period exceeding two seconds, and therefore would be less likely to encounter this issue. So, how long are the busy periods you are contemplating for your SCHED_FIFO threads? Is it possible to tune/adjust the offending per-CPU kthreads to wake up less frequently than that time? Thanx, Paul
* Re: [PATCH] nohz1: Documentation 2013-03-21 20:50 ` Paul E. McKenney @ 2013-03-22 14:38 ` Christoph Lameter 2013-03-22 16:28 ` Paul E. McKenney 0 siblings, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-22 14:38 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 21 Mar 2013, Paul E. McKenney wrote: > So, how long are the busy periods you are contemplating for your SCHED_FIFO > threads? Is it possible to tune/adjust the offending per-CPU kthreads > to wake up less frequently than that time? Test programs right now run for 10 seconds. 30 seconds would definitely be enough for the worst case.
* Re: [PATCH] nohz1: Documentation 2013-03-22 14:38 ` Christoph Lameter @ 2013-03-22 16:28 ` Paul E. McKenney 2013-03-25 14:31 ` Christoph Lameter 0 siblings, 1 reply; 43+ messages in thread From: Paul E. McKenney @ 2013-03-22 16:28 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote: > On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > > So, how long are the busy periods you are contemplating for your SCHED_FIFO > > threads? Is it possible to tune/adjust the offending per-CPU kthreads > > to wake up less frequently than that time? > > Test programs right now run for 10 seconds. 30 seconds would definitely be > enough for the worst case. OK, that might be doable for some workloads. What happens when you try tuning the 2-second wakeup interval to (say) 45 seconds? Thanx, Paul
* Re: [PATCH] nohz1: Documentation 2013-03-22 16:28 ` Paul E. McKenney @ 2013-03-25 14:31 ` Christoph Lameter 2013-03-25 14:37 ` Frederic Weisbecker 0 siblings, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-25 14:31 UTC (permalink / raw) To: Paul E. McKenney Cc: Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Fri, 22 Mar 2013, Paul E. McKenney wrote: > On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote: > > On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > > > > So, how long are the busy periods you are contemplating for your SCHED_FIFO > > > threads? Is it possible to tune/adjust the offending per-CPU kthreads > > > to wake up less frequently than that time? > > > > Test programs right now run for 10 seconds. 30 seconds would definitely be > > enough for the worst case. > > OK, that might be doable for some workloads. What happens when you > try tuning the 2-second wakeup interval to (say) 45 seconds? The vm kernel threads do no useful work if no system calls are being done. If there is no kernel action, then they can be deferred indefinitely.
* Re: [PATCH] nohz1: Documentation 2013-03-25 14:31 ` Christoph Lameter @ 2013-03-25 14:37 ` Frederic Weisbecker 2013-03-25 15:18 ` Christoph Lameter 0 siblings, 1 reply; 43+ messages in thread From: Frederic Weisbecker @ 2013-03-25 14:37 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx 2013/3/25 Christoph Lameter <cl@linux.com>: > On Fri, 22 Mar 2013, Paul E. McKenney wrote: > >> On Fri, Mar 22, 2013 at 02:38:58PM +0000, Christoph Lameter wrote: >> > On Thu, 21 Mar 2013, Paul E. McKenney wrote: >> > >> > > So, how long are the busy periods you are contemplating for your SCHED_FIFO >> > > threads? Is it possible to tune/adjust the offending per-CPU kthreads >> > > to wake up less frequently than that time? >> > >> > Test programs right now run for 10 seconds. 30 seconds would definitely be >> > enough for the worst case. >> >> OK, that might be doable for some workloads. What happens when you >> try tuning the 2-second wakeup interval to (say) 45 seconds? > > The vm kernel threads do no useful work if no system calls are being done. > If there is no kernel action, then they can be deferred indefinitely. > We can certainly add some user-deferrable timer_list. But that's going to be for extreme use cases (those that require pure isolation), because we'll need to handle that with timer reprogramming at the user/kernel boundaries. That won't be free.
* Re: [PATCH] nohz1: Documentation 2013-03-25 14:37 ` Frederic Weisbecker @ 2013-03-25 15:18 ` Christoph Lameter 2013-03-25 15:20 ` Frederic Weisbecker 0 siblings, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-25 15:18 UTC (permalink / raw) To: Frederic Weisbecker Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Mon, 25 Mar 2013, Frederic Weisbecker wrote: > > The vm kernel threads do no useful work if no system calls are being done. > > If there is no kernel action, then they can be deferred indefinitely. > > > > We can certainly add some user-deferrable timer_list. But that's going > to be for extreme use cases (those that require pure isolation), > because we'll need to handle that with timer reprogramming at the > user/kernel boundaries. That won't be free. These timers are already marked deferrable and are deferred in the idle dynticks case. Could we reuse the same logic? See timer.h around the define of TIMER_DEFERRABLE. I just assumed so far that the dyntick-idle logic would have been used for this case.
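For readers following along, here is a sketch of the deferrable-timer API being referred to, as it looks in kernels of this era; the two-second period and the statistics-style callback are placeholders, not actual kernel code. A timer initialized with init_timer_deferrable() does not by itself pull a dyntick-idle CPU out of its sleep; it is serviced when the CPU next wakes up for some other reason.

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    static struct timer_list stats_timer;

    /* Placeholder callback standing in for vm-statistics style housekeeping. */
    static void stats_timer_fn(unsigned long data)
    {
            /* ... fold per-CPU counters, scan caches, and so on ... */
            mod_timer(&stats_timer, jiffies + 2 * HZ);
    }

    static void stats_timer_start(void)
    {
            init_timer_deferrable(&stats_timer);
            stats_timer.function = stats_timer_fn;
            stats_timer.data = 0;
            mod_timer(&stats_timer, jiffies + 2 * HZ);
    }

The audit Frederic describes below would determine, for each such timer, whether deferring it while the CPU runs in userspace is as safe as deferring it while the CPU is idle.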
* Re: [PATCH] nohz1: Documentation 2013-03-25 15:18 ` Christoph Lameter @ 2013-03-25 15:20 ` Frederic Weisbecker 0 siblings, 0 replies; 43+ messages in thread From: Frederic Weisbecker @ 2013-03-25 15:20 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Steven Rostedt, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx 2013/3/25 Christoph Lameter <cl@linux.com>: > On Mon, 25 Mar 2013, Frederic Weisbecker wrote: > >> > The vm kernel threads do no useful work if no system calls are being done. >> > If there is no kernel action, then they can be deferred indefinitely. >> > >> >> We can certainly add some user-deferrable timer_list. But that's going >> to be for extreme use cases (those that require pure isolation), >> because we'll need to handle that with timer reprogramming at the >> user/kernel boundaries. That won't be free. > > These timers are already marked deferrable and are deferred in the idle > dynticks case. Could we reuse the same logic? See timer.h around the > define of TIMER_DEFERRABLE. I just assumed so far that the dyntick-idle > logic would have been used for this case. We need to audit all deferrable timers to check whether their deferrability in idle also applies to userspace. If so, maybe we can consider that.
* Re: [PATCH] nohz1: Documentation 2013-03-21 20:04 ` Christoph Lameter 2013-03-21 20:42 ` Frederic Weisbecker 2013-03-21 20:50 ` Paul E. McKenney @ 2013-03-22 9:52 ` Mats Liljegren 2 siblings, 0 replies; 43+ messages in thread From: Mats Liljegren @ 2013-03-22 9:52 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx Christoph Lameter wrote: > On Thu, 21 Mar 2013, Paul E. McKenney wrote: > > > > Yeah, doing that right now, but I'd like to see it handled without manual > > > intervention. > > > > Given that RCU has no idea where you want them to run, some manual > > intervention would most likely be required even if RCU spawned them > > dynamically, right? > > If rcuoXX is a SCHED_OTHER process/thread, then the kernel will move it to > a processor other than the one running the SCHED_FIFO task. There would be > no manual intervention required. > > > So, again, removing scheduling-clock interrupts in more situations is > > a good future enhancement. > > The point here is that the check for a single runnable process is wrong > because it accounts for tasks in all scheduling classes. > > It would be better to check whether there is only one runnable task in the > highest scheduling class. That would work, deferring the SCHED_OTHER kernel > threads while the SCHED_FIFO thread runs. > > I am wondering how you can actually get NOHZ to work right. There is > always a kernel thread that is scheduled a couple of ticks into the future. In my case I use a 2-CPU PandaBoard, where I use cpuset to create a non-realtime domain for CPU0 and a real-time domain for CPU1. I then move all kernel threads and IRQs to CPU0, leaving only the application-specific IRQ for CPU1. I then start a single thread on CPU1. I use a quite stripped-down version of Linux built using Yocto. I have run the application for a minute and got 70-80 ticks, most (all?) occurring during start and exit of the application. I use 100Hz ticks. So personally I do get something by using full NOHZ in its current incarnation. I'd like some better interrupt latency though, so disabling nohz-idle might be interesting for me. But that's another story... -- Mats
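The userspace half of a setup like the one Mats describes can be as small as the sketch below, which pins the test thread to the isolated CPU. The CPU number is an assumption, and the cpuset/IRQ isolation he mentions still has to be configured separately.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(1, &set);  /* assumption: CPU1 is the isolated, adaptive-tick CPU */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return 1;
            }
            /* Staying SCHED_OTHER gives broader dynticks coverage with the
             * current patch set, per Frederic's earlier advice. */
            for (;;) {
                    /* single-task, userspace-only workload under test */
            }
    }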
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:58 ` Paul E. McKenney 2013-03-21 20:04 ` Christoph Lameter @ 2013-03-22 19:01 ` Kevin Hilman 1 sibling, 0 replies; 43+ messages in thread From: Kevin Hilman @ 2013-03-22 19:01 UTC (permalink / raw) To: paulmck Cc: Christoph Lameter, Steven Rostedt, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, geoff, tglx [...] >> > >> > In the meantime, one approach is to bind all these SCHED_OTHER tasks >> > to designated housekeeping CPU(s) that don't run your main workload. >> >> One cannot bind kevent threads and other per-CPU threads to another >> processor. So right now there is no way to avoid this issue. > > Yep, my approach works only for those threads that are free to migrate. > Of course, in some cases, you can avoid per-CPU threads running by pinning > interrupts, avoiding certain operations in your workload, and so on. Speaking of threads that are not free to migrate, you might add a bit to the doc explaining that these various kernel threads that cannot migrate are potential sources of jitter, and are reasons why a CPU may exit (or not enter) full nohz mode. And thanks a ton for writing up this detailed doc. Speaking as someone trying to get full nohz working on a new arch (ARM), getting my head around all of this has been challenging, and your doc is a great help. Kevin
* Re: [PATCH] nohz1: Documentation 2013-03-21 17:15 ` Paul E. McKenney 2013-03-21 18:39 ` Christoph Lameter @ 2013-03-21 18:44 ` Steven Rostedt 2013-03-21 18:53 ` Christoph Lameter 2013-03-21 18:59 ` Paul E. McKenney 1 sibling, 2 replies; 43+ messages in thread From: Steven Rostedt @ 2013-03-21 18:44 UTC (permalink / raw) To: paulmck Cc: Christoph Lameter, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 2013-03-21 at 10:15 -0700, Paul E. McKenney wrote: > > The OS always has some SCHED_OTHER tasks around that become runnable after > > a while (like for example the vm statistics update, or the notorious slab > > scanning). As long as SCHED_FIFO is active and there is no process in the > > same scheduling class, then the tick needs to be off. I also wish that this would > > work with SCHED_OTHER if there is only a single task with a certain renice > > value (-10?) and the rest are runnable at lower priorities. Maybe in that > > case stop the tick for a longer period and then give the lower-priority > > tasks a chance to run, but then switch off the tick again. > > These sound to me like good future enhancements. Exactly. Please, this is a complex enough change to something that is critical to the entire system (similar to RCU itself). Let's take baby steps here and get it right each step of the way. For now, no, if more than one process is scheduled on the CPU, we fall out of dynamic tick mode. In the future, we can make a SCHED_FIFO task being scheduled in trigger it. But let's conquer that after we successfully conquer the current changes. -- Steve
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:44 ` Steven Rostedt @ 2013-03-21 18:53 ` Christoph Lameter 2013-03-21 19:16 ` Steven Rostedt 0 siblings, 1 reply; 43+ messages in thread From: Christoph Lameter @ 2013-03-21 18:53 UTC (permalink / raw) To: Steven Rostedt Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 21 Mar 2013, Steven Rostedt wrote: > For now, no, if more than one process is scheduled on the CPU, we fall > out of dynamic tick mode. In the future, we can make a SCHED_FIFO task > being scheduled in trigger it. But let's conquer that after we successfully > conquer the current changes. I'd be glad to see whatever is possible merged as soon as possible. But be aware that we will fall out of dyntick mode at minimum every 2 seconds, because that is when the per-CPU vm stats update and the slab scanning occur. These are both deferrable activities.
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:53 ` Christoph Lameter @ 2013-03-21 19:16 ` Steven Rostedt 0 siblings, 0 replies; 43+ messages in thread From: Steven Rostedt @ 2013-03-21 19:16 UTC (permalink / raw) To: Christoph Lameter Cc: paulmck, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, 2013-03-21 at 18:53 +0000, Christoph Lameter wrote: > On Thu, 21 Mar 2013, Steven Rostedt wrote: > > > For now, no, if more than one process is scheduled on the CPU, we fall > > out of dynamic tick mode. In the future, we can make a SCHED_FIFO task > > being scheduled in trigger it. But let's conquer that after we successfully > > conquer the current changes. > > I'd be glad to see whatever is possible merged as soon as possible. But be > aware that we will fall out of dyntick mode at minimum every 2 seconds, > because that is when the per-CPU vm stats update and the slab scanning > occur. These are both deferrable activities. > Thanks for giving us the heads-up. Yeah, I understand your concern. Even when the current patch set is in, I'm not claiming success. Just like how most of -rt is now in mainline: it's not completely finished until the rest of -rt is there. I feel the same about the dynamic tick patches. They will get in in stages. But it truly isn't there until we have it fully functional, such that even you will be pleased with the result ;-) -- Steve
* Re: [PATCH] nohz1: Documentation 2013-03-21 18:44 ` Steven Rostedt 2013-03-21 18:53 ` Christoph Lameter @ 2013-03-21 18:59 ` Paul E. McKenney 1 sibling, 0 replies; 43+ messages in thread From: Paul E. McKenney @ 2013-03-21 18:59 UTC (permalink / raw) To: Steven Rostedt Cc: Christoph Lameter, Frederic Weisbecker, Rob Landley, linux-kernel, josh, zhong, khilman, geoff, tglx On Thu, Mar 21, 2013 at 02:44:22PM -0400, Steven Rostedt wrote: > On Thu, 2013-03-21 at 10:15 -0700, Paul E. McKenney wrote: > > > > The OS always has some SCHED_OTHER tasks around that become runnable after > > > a while (like for example the vm statistics update, or the notorious slab > > > scanning). As long as SCHED_FIFO is active and there is no process in the > > > same scheduling class, then the tick needs to be off. I also wish that this would > > > work with SCHED_OTHER if there is only a single task with a certain renice > > > value (-10?) and the rest are runnable at lower priorities. Maybe in that > > > case stop the tick for a longer period and then give the lower-priority > > > tasks a chance to run, but then switch off the tick again. > > > > These sound to me like good future enhancements. > > Exactly. Please, this is a complex enough change to something that is > critical to the entire system (similar to RCU itself). Let's take baby > steps here and get it right each step of the way. > > For now, no, if more than one process is scheduled on the CPU, we fall > out of dynamic tick mode. In the future, we can make a SCHED_FIFO task > being scheduled in trigger it. But let's conquer that after we successfully > conquer the current changes. What Steve said!!! Thanx, Paul