* [NOHZ] Remove scheduler_tick_max_deferment @ 2014-10-31 16:01 Christoph Lameter 2014-11-01 19:18 ` Thomas Gleixner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Lameter @ 2014-10-31 16:01 UTC (permalink / raw) To: Frederic Weisbecker Cc: linux-kernel, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra The reasoning behind this function is not clear to me and removal seems to have a limited impact on the system overall. Even without the cap to 1 second the system will be limited by the boundaries on the period of interrupts by various devices (in my case I ended up with a 4 second interval on x86 because of the limitations of periodicy of the underlying interupt source). Moreover this artificial limits the impact the benefit that commit commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers) should be giving us. Without this patch timer interrupts will still occur in 1 second intervals but no vmstat kworker will run. On a processor where all other events have been redirected to other processors nothing will be going on just timer interrupts that do not do much. With this patch the maximum deferrability of other items will become evident and work can then proceed on eliminating those (like the 4 second limit that I encountered due to the timing limitations of the underlying hardware) Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -2174,7 +2174,6 @@ static inline void wake_up_nohz_cpu(int #ifdef CONFIG_NO_HZ_FULL extern bool sched_can_stop_tick(void); -extern u64 scheduler_tick_max_deferment(void); #else static inline bool sched_can_stop_tick(void) { return false; } #endif Index: linux/kernel/sched/core.c =================================================================== --- linux.orig/kernel/sched/core.c +++ linux/kernel/sched/core.c @@ -2573,34 +2573,6 @@ void scheduler_tick(void) rq_last_tick_reset(rq); } -#ifdef CONFIG_NO_HZ_FULL -/** - * scheduler_tick_max_deferment - * - * Keep at least one tick per second when a single - * active task is running because the scheduler doesn't - * yet completely support full dynticks environment. - * - * This makes sure that uptime, CFS vruntime, load - * balancing, etc... continue to move forward, even - * with a very low granularity. - * - * Return: Maximum deferment in nanoseconds. - */ -u64 scheduler_tick_max_deferment(void) -{ - struct rq *rq = this_rq(); - unsigned long next, now = ACCESS_ONCE(jiffies); - - next = rq->last_sched_tick + HZ; - - if (time_before_eq(next, now)) - return 0; - - return jiffies_to_nsecs(next - now); -} -#endif - notrace unsigned long get_parent_ip(unsigned long addr) { if (in_lock_functions(addr)) { Index: linux/kernel/time/tick-sched.c =================================================================== --- linux.orig/kernel/time/tick-sched.c +++ linux/kernel/time/tick-sched.c @@ -632,13 +632,6 @@ static ktime_t tick_nohz_stop_sched_tick time_delta = KTIME_MAX; } -#ifdef CONFIG_NO_HZ_FULL - if (!ts->inidle) { - time_delta = min(time_delta, - scheduler_tick_max_deferment()); - } -#endif - /* * calculate the expiry time for the next timer wheel * timer. delta_jiffies >= NEXT_TIMER_MAX_DELTA signals ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-10-31 16:01 [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter @ 2014-11-01 19:18 ` Thomas Gleixner 2014-11-01 21:52 ` Christoph Lameter 0 siblings, 1 reply; 24+ messages in thread From: Thomas Gleixner @ 2014-11-01 19:18 UTC (permalink / raw) To: Christoph Lameter Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Fri, 31 Oct 2014, Christoph Lameter wrote: > The reasoning behind this function is not clear to me and removal seems The comment above the function is clear enough. > to have a limited impact on the system overall. Even without the > cap to 1 second the system will be limited by the boundaries on the period > of interrupts by various devices (in my case I ended up with a 4 second > interval on x86 because of the limitations of periodicy of the underlying > interupt source). And just because it happens to do so on your machine it's not guaranteed. > Moreover this artificial limits the impact the benefit that commit > commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers) > should be giving us. > > Without this patch timer interrupts will still occur in 1 second intervals > but no vmstat kworker will run. On a processor where all other And what has this to do with vmstat kworker? Nothing. > events have been redirected to other processors nothing will be > going on just timer interrupts that do not do much. What the timer interrupt does is very clearly explained in the comment above the function you want to remove. > With this patch the maximum deferrability of other items will become > evident and work can then proceed on eliminating those What about eliminating the requirement for this function first? It's clear what it does and it's also clear that this can be done remotely from a housekeeping cpu when the full nohz cpu is busy looping in user space. The function is there because nobody has tackled that problem yet. > (like the 4 second limit that I encountered due to the timing > limitations of the underlying hardware) You don't want to tackle the 4 seconds limit of the underlying hardware. What you really want is to eliminate the [hr]timer which is preventing the hardware timer to be shutdown completely. But we care about that _after_ we solved the scheduler tick requirement because that is the most evident one. Thanks, tglx ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-01 19:18 ` Thomas Gleixner @ 2014-11-01 21:52 ` Christoph Lameter 2014-11-01 22:33 ` Thomas Gleixner 2014-11-10 20:26 ` Frederic Weisbecker 0 siblings, 2 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-01 21:52 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Sat, 1 Nov 2014, Thomas Gleixner wrote: > On Fri, 31 Oct 2014, Christoph Lameter wrote: > > The reasoning behind this function is not clear to me and removal seems > > The comment above the function is clear enough. I looked around into the functions called by the timer interrupt for accounting etc. They have measures to compensate if the HZ is not occurring for some time. > > to have a limited impact on the system overall. Even without the > > cap to 1 second the system will be limited by the boundaries on the period > > of interrupts by various devices (in my case I ended up with a 4 second > > interval on x86 because of the limitations of periodicy of the underlying > > interupt source). > > And just because it happens to do so on your machine it's not > guaranteed. When would it not occur? Where do we lack a measure to cope with missing timer interrupts now? > > Moreover this artificial limits the impact the benefit that commit > > commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers) > > should be giving us. > > > > Without this patch timer interrupts will still occur in 1 second intervals > > but no vmstat kworker will run. On a processor where all other > > And what has this to do with vmstat kworker? Nothing. This means that the timer interrupt occurs needlessly. > > events have been redirected to other processors nothing will be > > going on just timer interrupts that do not do much. > > What the timer interrupt does is very clearly explained in the comment > above the function you want to remove. Could you be specific? > > With this patch the maximum deferrability of other items will become > > evident and work can then proceed on eliminating those > > What about eliminating the requirement for this function first? It's > clear what it does and it's also clear that this can be done remotely > from a housekeeping cpu when the full nohz cpu is busy looping in user > space. The function is there because nobody has tackled that problem > yet. Where does that requirement exist? > You don't want to tackle the 4 seconds limit of the underlying > hardware. What you really want is to eliminate the [hr]timer which is > preventing the hardware timer to be shutdown completely. Yes after the 1 second issue here has been avoided we can move on to the 4 second one and so on. > But we care about that _after_ we solved the scheduler tick > requirement because that is the most evident one. Why does the scheduler require that tick? It seems that the processor is always busy running exactly 1 process when the tick is not occurring. Anything else will switch on the tick again. So the information that the scheduler has never becomes outdated. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-01 21:52 ` Christoph Lameter @ 2014-11-01 22:33 ` Thomas Gleixner 2014-11-06 17:24 ` Christoph Lameter 2014-11-10 20:26 ` Frederic Weisbecker 1 sibling, 1 reply; 24+ messages in thread From: Thomas Gleixner @ 2014-11-01 22:33 UTC (permalink / raw) To: Christoph Lameter Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Sat, 1 Nov 2014, Christoph Lameter wrote: > On Sat, 1 Nov 2014, Thomas Gleixner wrote: > > > On Fri, 31 Oct 2014, Christoph Lameter wrote: > > > The reasoning behind this function is not clear to me and removal seems > > > > The comment above the function is clear enough. > > I looked around into the functions called by the timer interrupt for > accounting etc. They have measures to compensate if the HZ is not > occurring for some time. Let's look at that comment first: * Keep at least one tick per second when a single * active task is running because the scheduler doesn't * yet completely support full dynticks environment. * * This makes sure that uptime, CFS vruntime, load * balancing, etc... continue to move forward, even * with a very low granularity. So this talks about the scheduler tick obviously, right? Now scheduler_tick() is invoked from update_process_times() and update_process_times() is invoked from tick_sched_handle() and that is invoked from either tick_sched_timer() or tick_nohz_handler(). tick_sched_timer() is the hrtimer callback of tick_cpu_sched.sched_timer. That's used when high resolution timers are enabled. tick_nohz_handler() is the event handler for the clock event device if high resolution timers are disabled. Now the callsite of scheduler_tick_max_deferment() does: time_delta = min(time_delta, scheduler_tick_max_deferment()); And that is used further down after some other checks to arm either tick_cpu_sched.sched_timer or the clockevent itself. Which then when fired will invoke scheduler_tick() .... Really hard to figure out, right? > > > to have a limited impact on the system overall. Even without the > > > cap to 1 second the system will be limited by the boundaries on the period > > > of interrupts by various devices (in my case I ended up with a 4 second > > > interval on x86 because of the limitations of periodicy of the underlying > > > interupt source). > > > > And just because it happens to do so on your machine it's not > > guaranteed. > > When would it not occur? Where do we lack a measure to cope with missing > timer interrupts now? I wont happen, if time_delta is KTIME_MAX and the following checks are not having a timer armed. if (unlikely(expires.tv64 == KTIME_MAX)) { if (ts->nohz_mode == NOHZ_MODE_HIGHRES) hrtimer_cancel(&ts->sched_timer); goto out; } Which does either not arm the clockevent device (non highres) or cancels ts->sched_timer (highres). So in that case your timer interrupt will stop completely and therefor the scheduler updates on that cpu wont happen anymore. > > But we care about that _after_ we solved the scheduler tick > > requirement because that is the most evident one. > > Why does the scheduler require that tick? It seems that the processor is > always busy running exactly 1 process when the tick is not > occurring. Anything else will switch on the tick again. So the information > that the scheduler has never becomes outdated. Surely vruntime, load balancing data, load accounting and all the other stuff which contributes to global and local state updates itself magically. As I said before: It can be delegated to a housekeeper, but this needs to be implemented first before we can remove that function. There is a world outside of vmstat kworker, really. Thanks, tglx ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-01 22:33 ` Thomas Gleixner @ 2014-11-06 17:24 ` Christoph Lameter 2014-11-10 7:11 ` Viresh Kumar 2014-11-10 22:43 ` Frederic Weisbecker 0 siblings, 2 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-06 17:24 UTC (permalink / raw) To: Thomas Gleixner Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Sat, 1 Nov 2014, Thomas Gleixner wrote: > * balancing, etc... continue to move forward, even > * with a very low granularity. > > So this talks about the scheduler tick obviously, right? Obviously. > Now scheduler_tick() is invoked from update_process_times() and > update_process_times() is invoked from tick_sched_handle() and that is > invoked from either tick_sched_timer() or tick_nohz_handler(). > > tick_sched_timer() is the hrtimer callback of tick_cpu_sched.sched_timer. > That's used when high resolution timers are enabled. > > tick_nohz_handler() is the event handler for the clock event device if > high resolution timers are disabled. > > Now the callsite of scheduler_tick_max_deferment() does: > > time_delta = min(time_delta, scheduler_tick_max_deferment()); > > And that is used further down after some other checks to arm either > tick_cpu_sched.sched_timer or the clockevent itself. > > Which then when fired will invoke scheduler_tick() .... > > Really hard to figure out, right? I thought there is already logic in there to compensate for times when the tick is off. tick_do_update_jiffies64 calculates the time differential and calculates the number of ticks from there calling do_timer() with the number of ticks that have passed since the last invocation. The global load calculation is then also made based on the number of ticks that have passed. So it compensates when reenabling. And the load during the dynticks busy period is known because one process is monopolizing the processor during that time. > I wont happen, if time_delta is KTIME_MAX and the following checks are > not having a timer armed. > > if (unlikely(expires.tv64 == KTIME_MAX)) { > if (ts->nohz_mode == NOHZ_MODE_HIGHRES) > hrtimer_cancel(&ts->sched_timer); > goto out; > } > > Which does either not arm the clockevent device (non highres) or > cancels ts->sched_timer (highres). > > So in that case your timer interrupt will stop completely and therefor > the scheduler updates on that cpu wont happen anymore. Why is that bad? The load is constant and the timer interrupt can be reenabled by the dynticks logic when a system call occurs that requires OS services. I thought that was already done that way by Frederic? > > Why does the scheduler require that tick? It seems that the processor is > > always busy running exactly 1 process when the tick is not > > occurring. Anything else will switch on the tick again. So the information > > that the scheduler has never becomes outdated. > > Surely vruntime, load balancing data, load accounting and all the > other stuff which contributes to global and local state updates itself > magically. There is logic in there that compensates when the tick is finally reenabled. Load balancing data is already not updated when the tick is disabled when the processor is idle right? What is so different here? > As I said before: It can be delegated to a housekeeper, but this needs > to be implemented first before we can remove that function. We did not need to housekeeper in the dynticks idle case. What is so different about dynticks busy? > There is a world outside of vmstat kworker, really. Absolutely but I thought the logic is already there to compensate for issues like the timer interrupt not occurring. I may not have the complete picture of the timer tick processing in my mind these days (it has been a lots of years since I did any work there after all) but as far as my arguably simplistic reading of the code goes I do not see why a housekeeper would be needed there. The load is constant and known in the dynticks busy case as it is in the dynticks idle case. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-06 17:24 ` Christoph Lameter @ 2014-11-10 7:11 ` Viresh Kumar 2014-11-10 15:31 ` Paul E. McKenney 2014-11-10 16:19 ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter 2014-11-10 22:43 ` Frederic Weisbecker 1 sibling, 2 replies; 24+ messages in thread From: Viresh Kumar @ 2014-11-10 7:11 UTC (permalink / raw) To: Christoph Lameter Cc: Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote: > We did not need to housekeeper in the dynticks idle case. What is so > different about dynticks busy? We do have a running task here and so the stats are important.. > I may not have the complete picture of the timer tick processing in my > mind these days (it has been a lots of years since I did any work there > after all) but as far as my arguably simplistic reading of the code goes I > do not see why a housekeeper would be needed there. The load is constant > and known in the dynticks busy case as it is in the dynticks idle case. I tried to initiate a thread on similar stuff, might be helpful: https://lkml.org/lkml/2014/5/22/131 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-10 7:11 ` Viresh Kumar @ 2014-11-10 15:31 ` Paul E. McKenney 2014-11-10 16:21 ` Christoph Lameter 2014-11-10 18:26 ` Christoph Lameter 2014-11-10 16:19 ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter 1 sibling, 2 replies; 24+ messages in thread From: Paul E. McKenney @ 2014-11-10 15:31 UTC (permalink / raw) To: Viresh Kumar Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Mon, Nov 10, 2014 at 12:41:38PM +0530, Viresh Kumar wrote: > On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote: > > > We did not need to housekeeper in the dynticks idle case. What is so > > different about dynticks busy? > > We do have a running task here and so the stats are important.. > > > I may not have the complete picture of the timer tick processing in my > > mind these days (it has been a lots of years since I did any work there > > after all) but as far as my arguably simplistic reading of the code goes I > > do not see why a housekeeper would be needed there. The load is constant > > and known in the dynticks busy case as it is in the dynticks idle case. > > I tried to initiate a thread on similar stuff, might be helpful: > > https://lkml.org/lkml/2014/5/22/131 Would it make sense for unlimited max deferment to be available as a boot parameter? That would allow people who want tick-free execution more than accurate stats to get that easily, while keeping stats accurate for everyone else. Thanx, Paul ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-10 15:31 ` Paul E. McKenney @ 2014-11-10 16:21 ` Christoph Lameter 2014-11-10 18:26 ` Christoph Lameter 1 sibling, 0 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-10 16:21 UTC (permalink / raw) To: Paul E. McKenney Cc: Viresh Kumar, Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Mon, 10 Nov 2014, Paul E. McKenney wrote: > Would it make sense for unlimited max deferment to be available as > a boot parameter? That would allow people who want tick-free execution > more than accurate stats to get that easily, while keeping stats accurate > for everyone else. Well at least it would help us to get started on figuring out how to deal with other processor interruptions that occur in intervals of seconds or minutes. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-10 15:31 ` Paul E. McKenney 2014-11-10 16:21 ` Christoph Lameter @ 2014-11-10 18:26 ` Christoph Lameter 2014-11-11 17:15 ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker 1 sibling, 1 reply; 24+ messages in thread From: Christoph Lameter @ 2014-11-10 18:26 UTC (permalink / raw) To: Paul E. McKenney Cc: Viresh Kumar, Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra > > Would it make sense for unlimited max deferment to be available as > a boot parameter? That would allow people who want tick-free execution > more than accurate stats to get that easily, while keeping stats accurate > for everyone else. Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable Add a way to configure this interval at boot and via /proc/sys/vm/max_defer_tick Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/kernel/sched/core.c =================================================================== --- linux.orig/kernel/sched/core.c +++ linux/kernel/sched/core.c @@ -2574,6 +2574,17 @@ void scheduler_tick(void) } #ifdef CONFIG_NO_HZ_FULL +int sysctl_max_defer_tick __read_mostly = 1; + +static int __init max_defer_tick_setup(char *str) +{ + sysctl_max_defer_tick = simple_strtol(str, NULL, 0); + pr_info("NO_HZ_FULL maxinum deferral of busy tick set to %d\n", + sysctl_max_defer_tick); + return 1; +} +__setup("max_defer_tick=", max_defer_tick_setup); + /** * scheduler_tick_max_deferment * @@ -2592,7 +2603,7 @@ u64 scheduler_tick_max_deferment(void) struct rq *rq = this_rq(); unsigned long next, now = ACCESS_ONCE(jiffies); - next = rq->last_sched_tick + HZ; + next = rq->last_sched_tick + sysctl_max_defer_tick * HZ; if (time_before_eq(next, now)) return 0; Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -1407,6 +1407,15 @@ static struct ctl_table vm_table[] = { .proc_handler = proc_dointvec_jiffies, }, #endif +#ifdef CONFIG_NO_HZ_FULL + { + .procname = "max_defer_tick", + .data = &sysctl_max_defer_tick, + .maxlen = sizeof(sysctl_max_defer_tick), + .mode = 0644, + .proc_handler = proc_dointvec_jiffies, + }, +#endif #ifdef CONFIG_MMU { .procname = "mmap_min_addr", Index: linux/Documentation/sysctl/vm.txt =================================================================== --- linux.orig/Documentation/sysctl/vm.txt +++ linux/Documentation/sysctl/vm.txt @@ -714,6 +714,13 @@ is 1 second. ============================================================== +max_defer_tick + +The maximum time that the tick may be deferred while a process is +monopolizing a cpu. + +============================================================== + swappiness This control is used to define how aggressive the kernel will swap Index: linux/Documentation/kernel-parameters.txt =================================================================== --- linux.orig/Documentation/kernel-parameters.txt +++ linux/Documentation/kernel-parameters.txt @@ -1876,6 +1876,10 @@ bytes respectively. Such letter suffixes devices can be requested on-demand with the /dev/loop-control interface. + max_defer_tick [NO_HZ_FULL] The number of seconds that the system may + defer the timer tick if a process is monopolizing the + cpu. + mce [X86-32] Machine Check Exception mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt Index: linux/include/linux/sched/sysctl.h =================================================================== --- linux.orig/include/linux/sched/sysctl.h +++ linux/include/linux/sched/sysctl.h @@ -53,6 +53,10 @@ extern unsigned int sysctl_numa_balancin extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +#ifdef CONFIG_NO_HZ_FULL +extern int sysctl_max_defer_tick; +#endif + #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; extern unsigned int sysctl_sched_nr_migrate; ^ permalink raw reply [flat|nested] 24+ messages in thread
* Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-10 18:26 ` Christoph Lameter @ 2014-11-11 17:15 ` Frederic Weisbecker 2014-11-11 17:39 ` Paul E. McKenney 2014-11-12 6:11 ` Viresh Kumar 0 siblings, 2 replies; 24+ messages in thread From: Frederic Weisbecker @ 2014-11-11 17:15 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Viresh Kumar, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On Mon, Nov 10, 2014 at 12:26:51PM -0600, Christoph Lameter wrote: > > > > Would it make sense for unlimited max deferment to be available as > > a boot parameter? That would allow people who want tick-free execution > > more than accurate stats to get that easily, while keeping stats accurate > > for everyone else. > > Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable > > Add a way to configure this interval at boot and via > /proc/sys/vm/max_defer_tick > > Signed-off-by: Christoph Lameter <cl@linux.com> Sorry but that's not solving the problem. All it does is to allow the user to tune bugs. Kevin Hilman proposed something similar using debugfs and I declined it as well. Integrating a hack like this is a good way to make sure that nobody will ever fix the real underlying issue. BTW, that's a good opportunity for me to generalize this case to the full dynticks development general issue. I got a lot of help from people to improve the kernel's isolation and full dynticks: Paul has spent a lot of time to improve RCU, you improved vmstat, full dynticks got ported to other archs, people like Viresh fixed some timers internals, Gilad fixed IPIs, Peterz reviewed a lot, etc... But now we reached a step where there are mostly core issues remaining that require some infrastrure change investments, some extensions or a bit of rethinking. We know we reach that step when people who want the features are stuck sending workarounds. Nothing like big rewrites is needed really, actually just a bunch of pretty self contained issues. And by self-contained I mean that each of these individual problems can be worked out seperately as they are unrelated enough altogether. Here is a summarized list: * Unbound workqueues affinity (to housekeeper) * Unbound timers affinity (to housekeeper) * 1 Hz residual scheduler tick offlining to housekeeper * Fix some scheduler accounting that don't even work with 1 Hz: cpu load accounting, rt_scale, load balancing, etc... * Lighten the syscall path and get rid of cputime accounting + RCU hooks for people who want isolation + fast syscalls and faults. * Work on non-affinable workqueues * Work on non-affinable timers * ... If I'm going to work alone on all that, this is going to take several years, honestly. But we know what to do and how. So all we need is (at least one) more full time core developer to get these things done in a reasonable amount of time. Thanks. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-11 17:15 ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker @ 2014-11-11 17:39 ` Paul E. McKenney 2014-11-11 18:00 ` Christoph Lameter 2014-11-12 6:11 ` Viresh Kumar 1 sibling, 1 reply; 24+ messages in thread From: Paul E. McKenney @ 2014-11-11 17:39 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Viresh Kumar, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On Tue, Nov 11, 2014 at 06:15:28PM +0100, Frederic Weisbecker wrote: > On Mon, Nov 10, 2014 at 12:26:51PM -0600, Christoph Lameter wrote: > > > > > > Would it make sense for unlimited max deferment to be available as > > > a boot parameter? That would allow people who want tick-free execution > > > more than accurate stats to get that easily, while keeping stats accurate > > > for everyone else. > > > > Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable > > > > Add a way to configure this interval at boot and via > > /proc/sys/vm/max_defer_tick > > > > Signed-off-by: Christoph Lameter <cl@linux.com> > > Sorry but that's not solving the problem. All it does is to allow the user > to tune bugs. > > Kevin Hilman proposed something similar using debugfs and I declined it as > well. Integrating a hack like this is a good way to make sure that nobody > will ever fix the real underlying issue. I guess I should have remembered that before suggesting this to Christoph, my apologies to all! > BTW, that's a good opportunity for me to generalize this case to the full > dynticks development general issue. I got a lot of help from people to improve > the kernel's isolation and full dynticks: Paul has spent a lot of time to improve > RCU, you improved vmstat, full dynticks got ported to other archs, people > like Viresh fixed some timers internals, Gilad fixed IPIs, Peterz reviewed a > lot, etc... > > But now we reached a step where there are mostly core issues remaining that > require some infrastrure change investments, some extensions or a bit of rethinking. > We know we reach that step when people who want the features are stuck sending > workarounds. > Nothing like big rewrites is needed really, actually just a bunch of pretty > self contained issues. And by self-contained I mean that each of these individual > problems can be worked out seperately as they are unrelated enough altogether. Here is > a summarized list: > > * Unbound workqueues affinity (to housekeeper) > * Unbound timers affinity (to housekeeper) > * 1 Hz residual scheduler tick offlining to housekeeper > * Fix some scheduler accounting that don't even work with 1 Hz: cpu load > accounting, rt_scale, load balancing, etc... > * Lighten the syscall path and get rid of cputime accounting + RCU hooks > for people who want isolation + fast syscalls and faults. I thought that the RCU hooks were well and truly down in the noise. Or is that not the case without cputime accounting to hide behind? Thanx, Paul > * Work on non-affinable workqueues > * Work on non-affinable timers > * ... > > If I'm going to work alone on all that, this is going to take several years, > honestly. > > But we know what to do and how. So all we need is (at least one) more full time > core developer to get these things done in a reasonable amount of time. > > Thanks. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-11 17:39 ` Paul E. McKenney @ 2014-11-11 18:00 ` Christoph Lameter 0 siblings, 0 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-11 18:00 UTC (permalink / raw) To: Paul E. McKenney Cc: Frederic Weisbecker, Viresh Kumar, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On Tue, 11 Nov 2014, Paul E. McKenney wrote: > I guess I should have remembered that before suggesting this to Christoph, > my apologies to all! No this is good. Do not worry. I can just throw out the one or other patch to ensure that the discussion continues and then we can figure out how to come up with a correct solution. Patches are a way to say that you are serious about an issue on lkml. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-11 17:15 ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker 2014-11-11 17:39 ` Paul E. McKenney @ 2014-11-12 6:11 ` Viresh Kumar 2014-11-12 13:54 ` Frederic Weisbecker 1 sibling, 1 reply; 24+ messages in thread From: Viresh Kumar @ 2014-11-12 6:11 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote: > Here is a summarized list: > > * Unbound workqueues affinity (to housekeeper) > * Unbound timers affinity (to housekeeper) > * 1 Hz residual scheduler tick offlining to housekeeper > * Fix some scheduler accounting that don't even work with 1 Hz: cpu load > accounting, rt_scale, load balancing, etc... > * Lighten the syscall path and get rid of cputime accounting + RCU hooks > for people who want isolation + fast syscalls and faults. > * Work on non-affinable workqueues > * Work on non-affinable timers > * ... + spurious interrupts with NOHZ_FULL on all architectures which break isolation but doesn't get caught with traces. Can be observed with this: diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 481fa54..91d490d 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1244,7 +1244,8 @@ void hrtimer_interrupt(struct clock_event_device *dev) { struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases); ktime_t expires_next, now, entry_time, delta; - int i, retries = 0; + int i, retries = 0, count = 0; + static int total_spurious; BUG_ON(!cpu_base->hres_active); cpu_base->nr_events++; @@ -1304,10 +1305,14 @@ void hrtimer_interrupt(struct clock_event_device *dev) break; } + count++; __run_hrtimer(timer, &basenow); } } + if (!count) + pr_err("____%s: Totalspurious: %d\n", __func__, ++total_spurious); + /* * Store the new expiry value so the migration code can verify * against it. ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-12 6:11 ` Viresh Kumar @ 2014-11-12 13:54 ` Frederic Weisbecker 2014-11-12 14:56 ` Viresh Kumar 0 siblings, 1 reply; 24+ messages in thread From: Frederic Weisbecker @ 2014-11-12 13:54 UTC (permalink / raw) To: Viresh Kumar Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On Wed, Nov 12, 2014 at 11:41:09AM +0530, Viresh Kumar wrote: > On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote: > > > Here is a summarized list: > > > > * Unbound workqueues affinity (to housekeeper) > > * Unbound timers affinity (to housekeeper) > > * 1 Hz residual scheduler tick offlining to housekeeper > > * Fix some scheduler accounting that don't even work with 1 Hz: cpu load > > accounting, rt_scale, load balancing, etc... > > * Lighten the syscall path and get rid of cputime accounting + RCU hooks > > for people who want isolation + fast syscalls and faults. > > * Work on non-affinable workqueues > > * Work on non-affinable timers > > * ... > > + spurious interrupts with NOHZ_FULL on all architectures which break isolation > but doesn't get caught with traces. Can be observed with this: > > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c > index 481fa54..91d490d 100644 > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1244,7 +1244,8 @@ void hrtimer_interrupt(struct clock_event_device *dev) > { > struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases); > ktime_t expires_next, now, entry_time, delta; > - int i, retries = 0; > + int i, retries = 0, count = 0; > + static int total_spurious; > > BUG_ON(!cpu_base->hres_active); > cpu_base->nr_events++; > @@ -1304,10 +1305,14 @@ void hrtimer_interrupt(struct clock_event_device *dev) > break; > } > > + count++; > __run_hrtimer(timer, &basenow); > } > } > > + if (!count) > + pr_err("____%s: Totalspurious: %d\n", __func__, > ++total_spurious); > + I'd rather leave that to tracepoints. Like trace_hrtimer_spurious(). Or better yet: have trace_hrtimer_interrupt() which we can compare against trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run in the interrupt. This way we avoid workarounds like the above count. > /* > * Store the new expiry value so the migration code can verify > * against it. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-12 13:54 ` Frederic Weisbecker @ 2014-11-12 14:56 ` Viresh Kumar 2014-11-12 15:06 ` Peter Zijlstra 0 siblings, 1 reply; 24+ messages in thread From: Viresh Kumar @ 2014-11-12 14:56 UTC (permalink / raw) To: Frederic Weisbecker Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Kevin Hilman On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote: > I'd rather leave that to tracepoints. Like trace_hrtimer_spurious(). Yeah, it was just to prove things right on the console without getting into traces. > Or better yet: have trace_hrtimer_interrupt() which we can compare against > trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run > in the interrupt. This way we avoid workarounds like the above count. Yeah, I also believe we better add this debug information to mainline kernel. I will try to get a patch for that soon. Would it be recommended to add both trace points? i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() ? -- viresh ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-12 14:56 ` Viresh Kumar @ 2014-11-12 15:06 ` Peter Zijlstra 2014-11-12 15:16 ` Viresh Kumar 0 siblings, 1 reply; 24+ messages in thread From: Peter Zijlstra @ 2014-11-12 15:06 UTC (permalink / raw) To: Viresh Kumar Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Kevin Hilman On Wed, Nov 12, 2014 at 08:26:05PM +0530, Viresh Kumar wrote: > On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote: > > I'd rather leave that to tracepoints. Like trace_hrtimer_spurious(). > > Yeah, it was just to prove things right on the console without getting > into traces. > > > Or better yet: have trace_hrtimer_interrupt() which we can compare against > > trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run > > in the interrupt. This way we avoid workarounds like the above count. > > Yeah, I also believe we better add this debug information to mainline kernel. > I will try to get a patch for that soon. > > Would it be recommended to add both trace points? > i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() I don't think you need to add anything. We already have tracepoints for every single interrupt (and therefore also for the hrtimer one) and we have expiry tracepoints. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-12 15:06 ` Peter Zijlstra @ 2014-11-12 15:16 ` Viresh Kumar 2014-11-13 7:22 ` Viresh Kumar 0 siblings, 1 reply; 24+ messages in thread From: Viresh Kumar @ 2014-11-12 15:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Kevin Hilman On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote: > I don't think you need to add anything. We already have tracepoints for > every single interrupt (and therefore also for the hrtimer one) and we > have expiry tracepoints. I will crosscheck this again but as far as I remember these spurious interrupts aren't caught with tracers currently. Not even the irq_enter/exit ones. I saw that piece of code long back (and obviously don't understand it completely). But the problem was that because these are spurious clockevent interrupts, we don't have anything to do on that interrupt. No tick/sched/softirq processing. And the irq-handling code returns before even calling trace_irq_enter(). And that's why I had to add those prints initially. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) 2014-11-12 15:16 ` Viresh Kumar @ 2014-11-13 7:22 ` Viresh Kumar 0 siblings, 0 replies; 24+ messages in thread From: Viresh Kumar @ 2014-11-13 7:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney, Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Kevin Hilman On 12 November 2014 20:46, Viresh Kumar <viresh.kumar@linaro.org> wrote: > On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote: >> I don't think you need to add anything. We already have tracepoints for >> every single interrupt (and therefore also for the hrtimer one) and we >> have expiry tracepoints. > > I will crosscheck this again but as far as I remember these spurious interrupts > aren't caught with tracers currently. Not even the irq_enter/exit ones. So yes, we do get irq_handler_entry/exit traces for the clockevent device. The problem here is that people miss these as there are no warnings issued here. And so nobody tried to resolve these spurious interrupts. This is how it looks on a ARM machine (With trace_hrtimer_spurious), though it is relevant for all architectures which emulate ONESHOT mode over PERIODIC mode: <idle>-0 [001] d.h2 719.642085: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.646506: hrtimer_spurious: count:105 <idle>-0 [001] d.h2 719.646511: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.646522: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719640000000 softexpires=719640000000 <idle>-0 [001] d..3 719.646545: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.647086: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.651540: hrtimer_spurious: count:106 <idle>-0 [001] d.h2 719.651545: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.651554: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719645000000 softexpires=719645000000 <idle>-0 [001] d..3 719.651576: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.652084: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.656571: hrtimer_spurious: count:107 <idle>-0 [001] d.h2 719.656575: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.656583: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719650000000 softexpires=719650000000 <idle>-0 [001] d..3 719.656601: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.657084: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.661611: hrtimer_spurious: count:108 <idle>-0 [001] d.h2 719.661616: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.661627: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719655000000 softexpires=719655000000 <idle>-0 [001] d..3 719.661652: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.662086: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.666643: hrtimer_spurious: count:109 <idle>-0 [001] d.h2 719.666648: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.666657: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719660000000 softexpires=719660000000 <idle>-0 [001] d..3 719.666680: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.667084: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.671678: hrtimer_spurious: count:110 <idle>-0 [001] d.h2 719.671682: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.671693: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719665000000 softexpires=719665000000 <idle>-0 [001] d..3 719.671714: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.672085: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.676712: hrtimer_spurious: count:111 <idle>-0 [001] d.h2 719.676717: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.676726: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719670000000 softexpires=719670000000 <idle>-0 [001] d..3 719.676749: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.677087: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.681745: hrtimer_spurious: count:112 <idle>-0 [001] d.h2 719.681750: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.681759: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719675000000 softexpires=719675000000 <idle>-0 [001] d..3 719.681780: hrtimer_cancel: hrtimer=ee7bb740 <idle>-0 [001] d.h2 719.682084: irq_handler_entry: irq=30 name=arch_timer <idle>-0 [001] d.h3 719.686783: hrtimer_spurious: count:113 <idle>-0 [001] d.h2 719.686787: irq_handler_exit: irq=30 ret=handled <idle>-0 [001] d..3 719.686797: hrtimer_start: hrtimer=ee7bb740 function=tick_sched_timer expires=719680000000 softexpires=719680000000 <idle>-0 [001] d..3 719.686819: hrtimer_cancel: hrtimer=ee7bb740 Normally this happens on a non-spurious interrupt: is-cpu-isolated-1963 [000] d.h1 719.642080: irq_handler_entry: irq=30 name=arch_timer is-cpu-isolated-1963 [000] d.h2 719.642086: hrtimer_cancel: hrtimer=ee7b3740 is-cpu-isolated-1963 [000] d.h1 719.642089: hrtimer_expire_entry: hrtimer=ee7b3740 function=tick_sched_timer now=719635008926 is-cpu-isolated-1963 [000] d.h1 719.642095: softirq_raise: vec=1 [action=TIMER] is-cpu-isolated-1963 [000] d.h1 719.642099: softirq_raise: vec=9 [action=RCU] is-cpu-isolated-1963 [000] d.h2 719.642104: sched_stat_runtime: comm=trace-isolation pid=1963 runtime=348042 [ns] vruntime=7804506025 [ns] is-cpu-isolated-1963 [000] d.h1 719.642111: hrtimer_expire_exit: hrtimer=ee7b3740 is-cpu-isolated-1963 [000] d.h2 719.642115: hrtimer_start: hrtimer=ee7b3740 function=tick_sched_timer expires=719640000000 softexpires=719640000000 is-cpu-isolated-1963 [000] d.h1 719.642121: irq_handler_exit: irq=30 ret=handled .. followed by softirq processing. Another trace point with trace_hrtimer_spurious() might end up being useful :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-10 7:11 ` Viresh Kumar 2014-11-10 15:31 ` Paul E. McKenney @ 2014-11-10 16:19 ` Christoph Lameter 1 sibling, 0 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-10 16:19 UTC (permalink / raw) To: Viresh Kumar Cc: Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Mon, 10 Nov 2014, Viresh Kumar wrote: > On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote: > > > We did not need to housekeeper in the dynticks idle case. What is so > > different about dynticks busy? > > We do have a running task here and so the stats are important.. The task is running in user space only and if there is any important kernel action the tick will be coming on. Stats are not that important and predictable I would say. We could add stuff to the vmstat kworker thread to check for activity that requires updates. > > I may not have the complete picture of the timer tick processing in my > > mind these days (it has been a lots of years since I did any work there > > after all) but as far as my arguably simplistic reading of the code goes I > > do not see why a housekeeper would be needed there. The load is constant > > and known in the dynticks busy case as it is in the dynticks idle case. > > I tried to initiate a thread on similar stuff, might be helpful: > > https://lkml.org/lkml/2014/5/22/131 Hmmm... yes that is interesting. I agree that the process should be exempted from load balancing since the idea of NOHZ is to lower OS noise and load balancing would add significant amounts of noise. NUMA balancing is also a source of noise and so we should have that off by default as well. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-06 17:24 ` Christoph Lameter 2014-11-10 7:11 ` Viresh Kumar @ 2014-11-10 22:43 ` Frederic Weisbecker 2014-11-11 14:58 ` Christoph Lameter 1 sibling, 1 reply; 24+ messages in thread From: Frederic Weisbecker @ 2014-11-10 22:43 UTC (permalink / raw) To: Christoph Lameter Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Thu, Nov 06, 2014 at 11:24:59AM -0600, Christoph Lameter wrote: > I thought there is already logic in there to compensate for times when the > tick is off. > > tick_do_update_jiffies64 calculates the time differential and calculates > the number of ticks from there calling do_timer() with the number of ticks > that have passed since the last invocation. The global load calculation > is then also made based on the number of ticks that have passed. So it > compensates when reenabling. And the load during the dynticks busy period > is known because one process is monopolizing the processor during that > time. jiffies accounting is well handled everywhere. But that's different than the scheduler. > > I wont happen, if time_delta is KTIME_MAX and the following checks are > > not having a timer armed. > > > > if (unlikely(expires.tv64 == KTIME_MAX)) { > > if (ts->nohz_mode == NOHZ_MODE_HIGHRES) > > hrtimer_cancel(&ts->sched_timer); > > goto out; > > } > > > > Which does either not arm the clockevent device (non highres) or > > cancels ts->sched_timer (highres). > > > > So in that case your timer interrupt will stop completely and therefor > > the scheduler updates on that cpu wont happen anymore. > > Why is that bad? The load is constant and the timer interrupt can be > reenabled by the dynticks logic when a system call occurs that requires OS > services. I thought that was already done that way by Frederic? Yeah it is. Perf events, RCU, posix cpu timers are examples of things that are well handled by this tick on demand system. But they are all seperate things than the scheduler. > > > > Why does the scheduler require that tick? It seems that the processor is > > > always busy running exactly 1 process when the tick is not > > > occurring. Anything else will switch on the tick again. So the information > > > that the scheduler has never becomes outdated. > > > > Surely vruntime, load balancing data, load accounting and all the > > other stuff which contributes to global and local state updates itself > > magically. > > There is logic in there that compensates when the tick is finally > reenabled. Load balancing data is already not updated when the tick is > disabled when the processor is idle right? What is so different here? That's completely different because idle and busy CPUs may play different roles in load balancing. Load balancing can be assigned to idle CPUs for example. But the scheduler still assumes that dynticks CPUs are always idle. And we certainly don't want to assign load balancing duty to nohz full CPUs. That too needs some work to be fixed properly. > > > As I said before: It can be delegated to a housekeeper, but this needs > > to be implemented first before we can remove that function. > > We did not need to housekeeper in the dynticks idle case. What is so > different about dynticks busy? Because when a task runs we need some things to move forward: timekeeping for example. We don't want to update jiffies and gettimeofday from full nohz syscalls kernel entry. So another CPU has to maintain that. Probably the game between timekeeping and vdso complicates even further the situation. > > > There is a world outside of vmstat kworker, really. > > Absolutely but I thought the logic is already there to compensate for > issues like the timer interrupt not occurring. > > I may not have the complete picture of the timer tick processing in my > mind these days (it has been a lots of years since I did any work there > after all) but as far as my arguably simplistic reading of the code goes I > do not see why a housekeeper would be needed there. The load is constant > and known in the dynticks busy case as it is in the dynticks idle case. This is because of the general confusion between idle and dynticks. There is no need for housekeeping if there is no activity at all on a CPU (idle) and the mind makes a shortcut by considering that dynticks doesn't need housekeeping. But housekeeping is needed as long as there is activity and kernel service. And that's the case whether hz or nohz. Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt entry path. We did that for cputime accounting and RCU. And it's possible to even do that for timekeeping. But then the kernel entrypoint is going to be extremely costly. It's worth CPU 0 as a sacrificial lamb. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-10 22:43 ` Frederic Weisbecker @ 2014-11-11 14:58 ` Christoph Lameter 2014-11-11 15:36 ` Frederic Weisbecker 0 siblings, 1 reply; 24+ messages in thread From: Christoph Lameter @ 2014-11-11 14:58 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Mon, 10 Nov 2014, Frederic Weisbecker wrote: > Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt > entry path. We did that for cputime accounting and RCU. And it's possible to > even do that for timekeeping. But then the kernel entrypoint is going to be extremely > costly. It's worth CPU 0 as a sacrificial lamb. Well we can redirect to the scheduler setting the task flag and handle costly stuff there if necessary? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-11 14:58 ` Christoph Lameter @ 2014-11-11 15:36 ` Frederic Weisbecker 2014-11-11 17:08 ` Christoph Lameter 0 siblings, 1 reply; 24+ messages in thread From: Frederic Weisbecker @ 2014-11-11 15:36 UTC (permalink / raw) To: Christoph Lameter Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Tue, Nov 11, 2014 at 08:58:35AM -0600, Christoph Lameter wrote: > On Mon, 10 Nov 2014, Frederic Weisbecker wrote: > > > Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt > > entry path. We did that for cputime accounting and RCU. And it's possible to > > even do that for timekeeping. But then the kernel entrypoint is going to be extremely > > costly. It's worth CPU 0 as a sacrificial lamb. > > Well we can redirect to the scheduler setting the task flag and handle > costly stuff there if necessary? "there" here is the syscall/exception/interrupt entry path. And like I said, updating timekeeping from these places is overkill (although we do it in interrupt entry on dynticks-idle, but we don't have timekeeping when all system is idle). ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-11 15:36 ` Frederic Weisbecker @ 2014-11-11 17:08 ` Christoph Lameter 0 siblings, 0 replies; 24+ messages in thread From: Christoph Lameter @ 2014-11-11 17:08 UTC (permalink / raw) To: Frederic Weisbecker Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Tue, 11 Nov 2014, Frederic Weisbecker wrote: > "there" here is the syscall/exception/interrupt entry path. And like I said, updating > timekeeping from these places is overkill (although we do it in interrupt entry on > dynticks-idle, but we don't have timekeeping when all system is idle). Well that is going to extremes I would think. Lets keep one processor alive so that the rest can be as noise free as possible. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [NOHZ] Remove scheduler_tick_max_deferment 2014-11-01 21:52 ` Christoph Lameter 2014-11-01 22:33 ` Thomas Gleixner @ 2014-11-10 20:26 ` Frederic Weisbecker 1 sibling, 0 replies; 24+ messages in thread From: Frederic Weisbecker @ 2014-11-10 20:26 UTC (permalink / raw) To: Christoph Lameter Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar, H. Peter Anvin, Ingo Molnar, Peter Zijlstra On Sat, Nov 01, 2014 at 04:52:13PM -0500, Christoph Lameter wrote: > On Sat, 1 Nov 2014, Thomas Gleixner wrote: > > > On Fri, 31 Oct 2014, Christoph Lameter wrote: > > > The reasoning behind this function is not clear to me and removal seems > > > > The comment above the function is clear enough. > > I looked around into the functions called by the timer interrupt for > accounting etc. They have measures to compensate if the HZ is not > occurring for some time. Not very well. They handle correctly dynticks idle but not dynticks full. Checkout update_cpu_load_active() -> __update_cpu_load() for example. There is a pending_update argument that take care of tickless delta but decay_load_miss() catch up with the missing cpu load assuming it was all 0 (idle) all that time. Generally speaking the scheduler assume dynticks to be idle dynticks. And that concerns the above example and probably many other accounting. Now the issue with update_cpu_load_active() is there, whether we keep 1 Hz or not, any delta of full dynticks workload makes it buggy because it's accounted as idle load. But removing the 1 Hz residual tick is dangerous because many accounting in the scheduler tick assume regular updates. It's mostly ok as long as the accounting is exclusively updated and read locally. But some accounting is also updated locally and read remotely. So if CPU 0 is full dynticks and runs for 1 hour in userspace and CPU 1 reads its stats, those will be buggy because of the missing updates. At best in this scenarion CPU 1 may consider that CPU 0 has been idle for 1 hour, at worst the stats can be junk and there can be crashes. Also a lot of the scheduler decisions is based on these accountings. Load balancing to the least. So we have two possible solutions: 1) Make the scheduler more full-dynticks aware. Which means that any remote stat accounting read must handle out of date results. That's going to be tricky: if you check scheduler_tick() and sched_class::task_tick(), even simply trying to sort out which stat is updated, can handle busy dynticks load, is read only locally or can be read remotely, handles overflow, etc... That's enough work for an army of ants. 2) Offload scheduler_tick() to the housekeeping. It looks like many of the updaters there can easily take a remote rq argument. There doesn't seem to be much local rq assumption. So that's the easiest solution. But we can't just remove scheduler_tick_max_deferment() and not fix things behind. The result will be unpredictably insane and dangerous. The only predictable thing that's going to happen if we do that is that nobody will ever fix it properly. ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2014-11-13 7:23 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-10-31 16:01 [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter 2014-11-01 19:18 ` Thomas Gleixner 2014-11-01 21:52 ` Christoph Lameter 2014-11-01 22:33 ` Thomas Gleixner 2014-11-06 17:24 ` Christoph Lameter 2014-11-10 7:11 ` Viresh Kumar 2014-11-10 15:31 ` Paul E. McKenney 2014-11-10 16:21 ` Christoph Lameter 2014-11-10 18:26 ` Christoph Lameter 2014-11-11 17:15 ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker 2014-11-11 17:39 ` Paul E. McKenney 2014-11-11 18:00 ` Christoph Lameter 2014-11-12 6:11 ` Viresh Kumar 2014-11-12 13:54 ` Frederic Weisbecker 2014-11-12 14:56 ` Viresh Kumar 2014-11-12 15:06 ` Peter Zijlstra 2014-11-12 15:16 ` Viresh Kumar 2014-11-13 7:22 ` Viresh Kumar 2014-11-10 16:19 ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter 2014-11-10 22:43 ` Frederic Weisbecker 2014-11-11 14:58 ` Christoph Lameter 2014-11-11 15:36 ` Frederic Weisbecker 2014-11-11 17:08 ` Christoph Lameter 2014-11-10 20:26 ` Frederic Weisbecker
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.