All of lore.kernel.org
 help / color / mirror / Atom feed
* [NOHZ] Remove scheduler_tick_max_deferment
@ 2014-10-31 16:01 Christoph Lameter
  2014-11-01 19:18 ` Thomas Gleixner
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2014-10-31 16:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

The reasoning behind this function is not clear to me and removal seems
to have a limited impact on the system overall. Even without the
cap to 1 second the system will be limited by the boundaries on the period
of interrupts by various devices (in my case I ended up with a 4 second
interval on x86 because of the limitations of periodicy of the underlying
interupt source).

Moreover this artificial limits the impact the benefit that commit
commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers)
should be giving us.

Without this patch timer interrupts will still occur in 1 second intervals
but no vmstat kworker will run. On a processor where all other
events have been redirected to other processors nothing will be
going on just timer interrupts that do not do much.

With this patch the maximum deferrability of other items will become
evident and work can then proceed on eliminating those (like the 4
second limit that I encountered due to the timing limitations of
the underlying hardware)

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2174,7 +2174,6 @@ static inline void wake_up_nohz_cpu(int

 #ifdef CONFIG_NO_HZ_FULL
 extern bool sched_can_stop_tick(void);
-extern u64 scheduler_tick_max_deferment(void);
 #else
 static inline bool sched_can_stop_tick(void) { return false; }
 #endif
Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -2573,34 +2573,6 @@ void scheduler_tick(void)
 	rq_last_tick_reset(rq);
 }

-#ifdef CONFIG_NO_HZ_FULL
-/**
- * scheduler_tick_max_deferment
- *
- * Keep at least one tick per second when a single
- * active task is running because the scheduler doesn't
- * yet completely support full dynticks environment.
- *
- * This makes sure that uptime, CFS vruntime, load
- * balancing, etc... continue to move forward, even
- * with a very low granularity.
- *
- * Return: Maximum deferment in nanoseconds.
- */
-u64 scheduler_tick_max_deferment(void)
-{
-	struct rq *rq = this_rq();
-	unsigned long next, now = ACCESS_ONCE(jiffies);
-
-	next = rq->last_sched_tick + HZ;
-
-	if (time_before_eq(next, now))
-		return 0;
-
-	return jiffies_to_nsecs(next - now);
-}
-#endif
-
 notrace unsigned long get_parent_ip(unsigned long addr)
 {
 	if (in_lock_functions(addr)) {
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -632,13 +632,6 @@ static ktime_t tick_nohz_stop_sched_tick
 			time_delta = KTIME_MAX;
 		}

-#ifdef CONFIG_NO_HZ_FULL
-		if (!ts->inidle) {
-			time_delta = min(time_delta,
-					 scheduler_tick_max_deferment());
-		}
-#endif
-
 		/*
 		 * calculate the expiry time for the next timer wheel
 		 * timer. delta_jiffies >= NEXT_TIMER_MAX_DELTA signals

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-10-31 16:01 [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter
@ 2014-11-01 19:18 ` Thomas Gleixner
  2014-11-01 21:52   ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2014-11-01 19:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Fri, 31 Oct 2014, Christoph Lameter wrote:
> The reasoning behind this function is not clear to me and removal seems

The comment above the function is clear enough.

> to have a limited impact on the system overall. Even without the
> cap to 1 second the system will be limited by the boundaries on the period
> of interrupts by various devices (in my case I ended up with a 4 second
> interval on x86 because of the limitations of periodicy of the underlying
> interupt source).

And just because it happens to do so on your machine it's not
guaranteed. 
 
> Moreover this artificial limits the impact the benefit that commit
> commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers)
> should be giving us.
> 
> Without this patch timer interrupts will still occur in 1 second intervals
> but no vmstat kworker will run. On a processor where all other

And what has this to do with vmstat kworker? Nothing.

> events have been redirected to other processors nothing will be
> going on just timer interrupts that do not do much.

What the timer interrupt does is very clearly explained in the comment
above the function you want to remove.
 
> With this patch the maximum deferrability of other items will become
> evident and work can then proceed on eliminating those

What about eliminating the requirement for this function first? It's
clear what it does and it's also clear that this can be done remotely
from a housekeeping cpu when the full nohz cpu is busy looping in user
space. The function is there because nobody has tackled that problem
yet.

> (like the 4 second limit that I encountered due to the timing
> limitations of the underlying hardware)

You don't want to tackle the 4 seconds limit of the underlying
hardware. What you really want is to eliminate the [hr]timer which is
preventing the hardware timer to be shutdown completely.

But we care about that _after_ we solved the scheduler tick
requirement because that is the most evident one.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-01 19:18 ` Thomas Gleixner
@ 2014-11-01 21:52   ` Christoph Lameter
  2014-11-01 22:33     ` Thomas Gleixner
  2014-11-10 20:26     ` Frederic Weisbecker
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-01 21:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Sat, 1 Nov 2014, Thomas Gleixner wrote:

> On Fri, 31 Oct 2014, Christoph Lameter wrote:
> > The reasoning behind this function is not clear to me and removal seems
>
> The comment above the function is clear enough.

I looked around into the functions called by the timer interrupt for
accounting etc. They have measures to compensate if the HZ is not
occurring for some time.

> > to have a limited impact on the system overall. Even without the
> > cap to 1 second the system will be limited by the boundaries on the period
> > of interrupts by various devices (in my case I ended up with a 4 second
> > interval on x86 because of the limitations of periodicy of the underlying
> > interupt source).
>
> And just because it happens to do so on your machine it's not
> guaranteed.

When would it not occur? Where do we lack a measure to cope with missing
timer interrupts now?

> > Moreover this artificial limits the impact the benefit that commit
> > commit 7cc36bbddde5cd0c98f0c06e3304ab833d662565 (on-demand vmstat workers)
> > should be giving us.
> >
> > Without this patch timer interrupts will still occur in 1 second intervals
> > but no vmstat kworker will run. On a processor where all other
>
> And what has this to do with vmstat kworker? Nothing.

This means that the timer interrupt occurs needlessly.

> > events have been redirected to other processors nothing will be
> > going on just timer interrupts that do not do much.
>
> What the timer interrupt does is very clearly explained in the comment
> above the function you want to remove.

Could you be specific?

> > With this patch the maximum deferrability of other items will become
> > evident and work can then proceed on eliminating those
>
> What about eliminating the requirement for this function first? It's
> clear what it does and it's also clear that this can be done remotely
> from a housekeeping cpu when the full nohz cpu is busy looping in user
> space. The function is there because nobody has tackled that problem
> yet.

Where does that requirement exist?


> You don't want to tackle the 4 seconds limit of the underlying
> hardware. What you really want is to eliminate the [hr]timer which is
> preventing the hardware timer to be shutdown completely.

Yes after the 1 second issue here has been avoided we can move on to the 4
second one and so on.

> But we care about that _after_ we solved the scheduler tick
> requirement because that is the most evident one.

Why does the scheduler require that tick? It seems that the processor is
always busy running exactly 1 process when the tick is not
occurring. Anything else will switch on the tick again. So the information
that the scheduler has never becomes outdated.





^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-01 21:52   ` Christoph Lameter
@ 2014-11-01 22:33     ` Thomas Gleixner
  2014-11-06 17:24       ` Christoph Lameter
  2014-11-10 20:26     ` Frederic Weisbecker
  1 sibling, 1 reply; 24+ messages in thread
From: Thomas Gleixner @ 2014-11-01 22:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Sat, 1 Nov 2014, Christoph Lameter wrote:
> On Sat, 1 Nov 2014, Thomas Gleixner wrote:
> 
> > On Fri, 31 Oct 2014, Christoph Lameter wrote:
> > > The reasoning behind this function is not clear to me and removal seems
> >
> > The comment above the function is clear enough.
> 
> I looked around into the functions called by the timer interrupt for
> accounting etc. They have measures to compensate if the HZ is not
> occurring for some time.

Let's look at that comment first:

 * Keep at least one tick per second when a single
 * active task is running because the scheduler doesn't
 * yet completely support full dynticks environment.
 *
 * This makes sure that uptime, CFS vruntime, load
 * balancing, etc... continue to move forward, even
 * with a very low granularity.

So this talks about the scheduler tick obviously, right?

Now scheduler_tick() is invoked from update_process_times() and
update_process_times() is invoked from tick_sched_handle() and that is
invoked from either tick_sched_timer() or tick_nohz_handler().

tick_sched_timer() is the hrtimer callback of tick_cpu_sched.sched_timer.
That's used when high resolution timers are enabled.

tick_nohz_handler() is the event handler for the clock event device if
high resolution timers are disabled.

Now the callsite of scheduler_tick_max_deferment() does:

   time_delta = min(time_delta, scheduler_tick_max_deferment());

And that is used further down after some other checks to arm either
tick_cpu_sched.sched_timer or the clockevent itself.

Which then when fired will invoke scheduler_tick() ....

Really hard to figure out, right?

> > > to have a limited impact on the system overall. Even without the
> > > cap to 1 second the system will be limited by the boundaries on the period
> > > of interrupts by various devices (in my case I ended up with a 4 second
> > > interval on x86 because of the limitations of periodicy of the underlying
> > > interupt source).
> >
> > And just because it happens to do so on your machine it's not
> > guaranteed.
> 
> When would it not occur? Where do we lack a measure to cope with missing
> timer interrupts now?

I wont happen, if time_delta is KTIME_MAX and the following checks are
not having a timer armed.

                 if (unlikely(expires.tv64 == KTIME_MAX)) {
                        if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
                                hrtimer_cancel(&ts->sched_timer);
                        goto out;
                }

Which does either not arm the clockevent device (non highres) or
cancels ts->sched_timer (highres).

So in that case your timer interrupt will stop completely and therefor
the scheduler updates on that cpu wont happen anymore.

> > But we care about that _after_ we solved the scheduler tick
> > requirement because that is the most evident one.
> 
> Why does the scheduler require that tick? It seems that the processor is
> always busy running exactly 1 process when the tick is not
> occurring. Anything else will switch on the tick again. So the information
> that the scheduler has never becomes outdated.

Surely vruntime, load balancing data, load accounting and all the
other stuff which contributes to global and local state updates itself
magically.

As I said before: It can be delegated to a housekeeper, but this needs
to be implemented first before we can remove that function.

There is a world outside of vmstat kworker, really.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-01 22:33     ` Thomas Gleixner
@ 2014-11-06 17:24       ` Christoph Lameter
  2014-11-10  7:11         ` Viresh Kumar
  2014-11-10 22:43         ` Frederic Weisbecker
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-06 17:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Sat, 1 Nov 2014, Thomas Gleixner wrote:

>  * balancing, etc... continue to move forward, even
>  * with a very low granularity.
>
> So this talks about the scheduler tick obviously, right?

Obviously.

> Now scheduler_tick() is invoked from update_process_times() and
> update_process_times() is invoked from tick_sched_handle() and that is
> invoked from either tick_sched_timer() or tick_nohz_handler().

>
> tick_sched_timer() is the hrtimer callback of tick_cpu_sched.sched_timer.
> That's used when high resolution timers are enabled.
>
> tick_nohz_handler() is the event handler for the clock event device if
> high resolution timers are disabled.
>
> Now the callsite of scheduler_tick_max_deferment() does:
>
>    time_delta = min(time_delta, scheduler_tick_max_deferment());
>
> And that is used further down after some other checks to arm either
> tick_cpu_sched.sched_timer or the clockevent itself.
>
> Which then when fired will invoke scheduler_tick() ....
>
> Really hard to figure out, right?

I thought there is already logic in there to compensate for times when the
tick is off.

tick_do_update_jiffies64 calculates the time differential and calculates
the number of ticks from there calling do_timer() with the number of ticks
that have passed since the last invocation. The global load calculation
is then also made based on the number of ticks that have passed. So it
compensates when reenabling. And the load during the dynticks busy period
is known because one process is monopolizing the processor during that
time.

> I wont happen, if time_delta is KTIME_MAX and the following checks are
> not having a timer armed.
>
>                  if (unlikely(expires.tv64 == KTIME_MAX)) {
>                         if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
>                                 hrtimer_cancel(&ts->sched_timer);
>                         goto out;
>                 }
>
> Which does either not arm the clockevent device (non highres) or
> cancels ts->sched_timer (highres).
>
> So in that case your timer interrupt will stop completely and therefor
> the scheduler updates on that cpu wont happen anymore.

Why is that bad? The load is constant and the timer interrupt can be
reenabled by the dynticks logic when a system call occurs that requires OS
services. I thought that was already done that way by Frederic?

> > Why does the scheduler require that tick? It seems that the processor is
> > always busy running exactly 1 process when the tick is not
> > occurring. Anything else will switch on the tick again. So the information
> > that the scheduler has never becomes outdated.
>
> Surely vruntime, load balancing data, load accounting and all the
> other stuff which contributes to global and local state updates itself
> magically.

There is logic in there that compensates when the tick is finally
reenabled. Load balancing data is already not updated when the tick is
disabled when the processor is idle right? What is so different here?

> As I said before: It can be delegated to a housekeeper, but this needs
> to be implemented first before we can remove that function.

We did not need to housekeeper in the dynticks idle case. What is so
different about dynticks busy?

> There is a world outside of vmstat kworker, really.

Absolutely but I thought the logic is already there to compensate for
issues like the timer interrupt not occurring.

I may not have the complete picture of the timer tick processing in my
mind these days (it has been a lots of years since I did any work there
after all) but as far as my arguably simplistic reading of the code goes I
do not see why a housekeeper would be needed there. The load is constant
and known in the dynticks busy case as it is in the dynticks idle case.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-06 17:24       ` Christoph Lameter
@ 2014-11-10  7:11         ` Viresh Kumar
  2014-11-10 15:31           ` Paul E. McKenney
  2014-11-10 16:19           ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter
  2014-11-10 22:43         ` Frederic Weisbecker
  1 sibling, 2 replies; 24+ messages in thread
From: Viresh Kumar @ 2014-11-10  7:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List,
	Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger,
	Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney,
	Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote:

> We did not need to housekeeper in the dynticks idle case. What is so
> different about dynticks busy?

We do have a running task here and so the stats are important..

> I may not have the complete picture of the timer tick processing in my
> mind these days (it has been a lots of years since I did any work there
> after all) but as far as my arguably simplistic reading of the code goes I
> do not see why a housekeeper would be needed there. The load is constant
> and known in the dynticks busy case as it is in the dynticks idle case.

I tried to initiate a thread on similar stuff, might be helpful:

https://lkml.org/lkml/2014/5/22/131

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-10  7:11         ` Viresh Kumar
@ 2014-11-10 15:31           ` Paul E. McKenney
  2014-11-10 16:21             ` Christoph Lameter
  2014-11-10 18:26             ` Christoph Lameter
  2014-11-10 16:19           ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter
  1 sibling, 2 replies; 24+ messages in thread
From: Paul E. McKenney @ 2014-11-10 15:31 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra

On Mon, Nov 10, 2014 at 12:41:38PM +0530, Viresh Kumar wrote:
> On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote:
> 
> > We did not need to housekeeper in the dynticks idle case. What is so
> > different about dynticks busy?
> 
> We do have a running task here and so the stats are important..
> 
> > I may not have the complete picture of the timer tick processing in my
> > mind these days (it has been a lots of years since I did any work there
> > after all) but as far as my arguably simplistic reading of the code goes I
> > do not see why a housekeeper would be needed there. The load is constant
> > and known in the dynticks busy case as it is in the dynticks idle case.
> 
> I tried to initiate a thread on similar stuff, might be helpful:
> 
> https://lkml.org/lkml/2014/5/22/131

Would it make sense for unlimited max deferment to be available as
a boot parameter?  That would allow people who want tick-free execution
more than accurate stats to get that easily, while keeping stats accurate
for everyone else.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-10  7:11         ` Viresh Kumar
  2014-11-10 15:31           ` Paul E. McKenney
@ 2014-11-10 16:19           ` Christoph Lameter
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-10 16:19 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Thomas Gleixner, Frederic Weisbecker, Linux Kernel Mailing List,
	Gilad Ben-Yossef, Tejun Heo, John Stultz, Mike Frysinger,
	Minchan Kim, Hakan Akkan, Max Krasnyansky, Paul E. McKenney,
	Hugh Dickins, H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Mon, 10 Nov 2014, Viresh Kumar wrote:

> On 6 November 2014 22:54, Christoph Lameter <cl@linux.com> wrote:
>
> > We did not need to housekeeper in the dynticks idle case. What is so
> > different about dynticks busy?
>
> We do have a running task here and so the stats are important..

The task is running in user space only and if there is any important
kernel action the tick will be coming on. Stats are not that important and
predictable I would say.

We could add stuff to the vmstat kworker thread to check for activity that
requires updates.

> > I may not have the complete picture of the timer tick processing in my
> > mind these days (it has been a lots of years since I did any work there
> > after all) but as far as my arguably simplistic reading of the code goes I
> > do not see why a housekeeper would be needed there. The load is constant
> > and known in the dynticks busy case as it is in the dynticks idle case.
>
> I tried to initiate a thread on similar stuff, might be helpful:
>
> https://lkml.org/lkml/2014/5/22/131

Hmmm... yes that is interesting. I agree that the process should be
exempted from load balancing since the idea of NOHZ is to lower OS noise
and load balancing would add significant amounts of noise.

NUMA balancing is also a source of noise and so we should have that off by
default as well.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-10 15:31           ` Paul E. McKenney
@ 2014-11-10 16:21             ` Christoph Lameter
  2014-11-10 18:26             ` Christoph Lameter
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-10 16:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Viresh Kumar, Thomas Gleixner, Frederic Weisbecker,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra

On Mon, 10 Nov 2014, Paul E. McKenney wrote:

> Would it make sense for unlimited max deferment to be available as
> a boot parameter?  That would allow people who want tick-free execution
> more than accurate stats to get that easily, while keeping stats accurate
> for everyone else.

Well at least it would help us to get started on figuring out how to
deal with other  processor interruptions that occur in intervals of
seconds or minutes.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-10 15:31           ` Paul E. McKenney
  2014-11-10 16:21             ` Christoph Lameter
@ 2014-11-10 18:26             ` Christoph Lameter
  2014-11-11 17:15               ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2014-11-10 18:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Viresh Kumar, Thomas Gleixner, Frederic Weisbecker,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra

>
> Would it make sense for unlimited max deferment to be available as
> a boot parameter?  That would allow people who want tick-free execution
> more than accurate stats to get that easily, while keeping stats accurate
> for everyone else.

Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable

Add a way to configure this interval at boot and via
/proc/sys/vm/max_defer_tick

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/kernel/sched/core.c
===================================================================
--- linux.orig/kernel/sched/core.c
+++ linux/kernel/sched/core.c
@@ -2574,6 +2574,17 @@ void scheduler_tick(void)
 }

 #ifdef CONFIG_NO_HZ_FULL
+int sysctl_max_defer_tick __read_mostly = 1;
+
+static int __init max_defer_tick_setup(char *str)
+{
+	sysctl_max_defer_tick = simple_strtol(str, NULL, 0);
+	pr_info("NO_HZ_FULL maxinum deferral of busy tick set to %d\n",
+		sysctl_max_defer_tick);
+	return 1;
+}
+__setup("max_defer_tick=", max_defer_tick_setup);
+
 /**
  * scheduler_tick_max_deferment
  *
@@ -2592,7 +2603,7 @@ u64 scheduler_tick_max_deferment(void)
 	struct rq *rq = this_rq();
 	unsigned long next, now = ACCESS_ONCE(jiffies);

-	next = rq->last_sched_tick + HZ;
+	next = rq->last_sched_tick + sysctl_max_defer_tick * HZ;

 	if (time_before_eq(next, now))
 		return 0;
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -1407,6 +1407,15 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec_jiffies,
 	},
 #endif
+#ifdef CONFIG_NO_HZ_FULL
+	{
+		.procname	= "max_defer_tick",
+		.data		= &sysctl_max_defer_tick,
+		.maxlen		= sizeof(sysctl_max_defer_tick),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_jiffies,
+	},
+#endif
 #ifdef CONFIG_MMU
 	{
 		.procname	= "mmap_min_addr",
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt
+++ linux/Documentation/sysctl/vm.txt
@@ -714,6 +714,13 @@ is 1 second.

 ==============================================================

+max_defer_tick
+
+The maximum time that the tick may be deferred while a process is
+monopolizing a cpu.
+
+==============================================================
+
 swappiness

 This control is used to define how aggressive the kernel will swap
Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -1876,6 +1876,10 @@ bytes respectively. Such letter suffixes
 			devices can be requested on-demand with the
 			/dev/loop-control interface.

+	max_defer_tick	[NO_HZ_FULL] The number of seconds that the system may
+			defer the timer tick if a process is monopolizing the
+			cpu.
+
 	mce		[X86-32] Machine Check Exception

 	mce=option	[X86-64] See Documentation/x86/x86_64/boot-options.txt
Index: linux/include/linux/sched/sysctl.h
===================================================================
--- linux.orig/include/linux/sched/sysctl.h
+++ linux/include/linux/sched/sysctl.h
@@ -53,6 +53,10 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;

+#ifdef CONFIG_NO_HZ_FULL
+extern int sysctl_max_defer_tick;
+#endif
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-01 21:52   ` Christoph Lameter
  2014-11-01 22:33     ` Thomas Gleixner
@ 2014-11-10 20:26     ` Frederic Weisbecker
  1 sibling, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2014-11-10 20:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Sat, Nov 01, 2014 at 04:52:13PM -0500, Christoph Lameter wrote:
> On Sat, 1 Nov 2014, Thomas Gleixner wrote:
> 
> > On Fri, 31 Oct 2014, Christoph Lameter wrote:
> > > The reasoning behind this function is not clear to me and removal seems
> >
> > The comment above the function is clear enough.
> 
> I looked around into the functions called by the timer interrupt for
> accounting etc. They have measures to compensate if the HZ is not
> occurring for some time.

Not very well. They handle correctly dynticks idle but not dynticks full.
Checkout update_cpu_load_active() -> __update_cpu_load() for example.

There is a pending_update argument that take care of tickless delta but
decay_load_miss() catch up with the missing cpu load assuming it was all 0 (idle)
all that time.

Generally speaking the scheduler assume dynticks to be idle dynticks. And that
concerns the above example and probably many other accounting.

Now the issue with update_cpu_load_active() is there, whether we keep 1 Hz or not,
any delta of full dynticks workload makes it buggy because it's accounted as idle
load.

But removing the 1 Hz residual tick is dangerous because many accounting in the
scheduler tick assume regular updates. It's mostly ok as long as the accounting
is exclusively updated and read locally. But some accounting is also updated locally
and read remotely. So if CPU 0 is full dynticks and runs for 1 hour in userspace and
CPU 1 reads its stats, those will be buggy because of the missing updates. At best
in this scenarion CPU 1 may consider that CPU 0 has been idle for 1 hour, at worst
the stats can be junk and there can be crashes. Also a lot of the scheduler decisions
is based on these accountings. Load balancing to the least.

So we have two possible solutions:

1) Make the scheduler more full-dynticks aware. Which means that any remote
stat accounting read must handle out of date results. That's going to be tricky: if
you check scheduler_tick() and sched_class::task_tick(), even simply trying to
sort out which stat is updated, can handle busy dynticks load, is read only locally
or can be read remotely, handles overflow, etc... That's enough work for an army of ants.

2) Offload scheduler_tick() to the housekeeping. It looks like many of the updaters
there can easily take a remote rq argument. There doesn't seem to be much local rq
assumption. So that's the easiest solution.

But we can't just remove scheduler_tick_max_deferment() and not fix things behind.
The result will be unpredictably insane and dangerous. The only predictable thing
that's going to happen if we do that is that nobody will ever fix it properly.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-06 17:24       ` Christoph Lameter
  2014-11-10  7:11         ` Viresh Kumar
@ 2014-11-10 22:43         ` Frederic Weisbecker
  2014-11-11 14:58           ` Christoph Lameter
  1 sibling, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-11-10 22:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Thu, Nov 06, 2014 at 11:24:59AM -0600, Christoph Lameter wrote:
> I thought there is already logic in there to compensate for times when the
> tick is off.
> 
> tick_do_update_jiffies64 calculates the time differential and calculates
> the number of ticks from there calling do_timer() with the number of ticks
> that have passed since the last invocation. The global load calculation
> is then also made based on the number of ticks that have passed. So it
> compensates when reenabling. And the load during the dynticks busy period
> is known because one process is monopolizing the processor during that
> time.

jiffies accounting is well handled everywhere. But that's different than the
scheduler.

> > I wont happen, if time_delta is KTIME_MAX and the following checks are
> > not having a timer armed.
> >
> >                  if (unlikely(expires.tv64 == KTIME_MAX)) {
> >                         if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
> >                                 hrtimer_cancel(&ts->sched_timer);
> >                         goto out;
> >                 }
> >
> > Which does either not arm the clockevent device (non highres) or
> > cancels ts->sched_timer (highres).
> >
> > So in that case your timer interrupt will stop completely and therefor
> > the scheduler updates on that cpu wont happen anymore.
> 
> Why is that bad? The load is constant and the timer interrupt can be
> reenabled by the dynticks logic when a system call occurs that requires OS
> services. I thought that was already done that way by Frederic?

Yeah it is. Perf events, RCU, posix cpu timers are examples of things that
are well handled by this tick on demand system. But they are all seperate
things than the scheduler.

> 
> > > Why does the scheduler require that tick? It seems that the processor is
> > > always busy running exactly 1 process when the tick is not
> > > occurring. Anything else will switch on the tick again. So the information
> > > that the scheduler has never becomes outdated.
> >
> > Surely vruntime, load balancing data, load accounting and all the
> > other stuff which contributes to global and local state updates itself
> > magically.
> 
> There is logic in there that compensates when the tick is finally
> reenabled. Load balancing data is already not updated when the tick is
> disabled when the processor is idle right? What is so different here?

That's completely different because idle and busy CPUs may play different
roles in load balancing. Load balancing can be assigned to idle CPUs for
example. But the scheduler still assumes that dynticks CPUs are always idle.
And we certainly don't want to assign load balancing duty to nohz full CPUs.

That too needs some work to be fixed properly.

> 
> > As I said before: It can be delegated to a housekeeper, but this needs
> > to be implemented first before we can remove that function.
> 
> We did not need to housekeeper in the dynticks idle case. What is so
> different about dynticks busy?

Because when a task runs we need some things to move forward: timekeeping
for example. We don't want to update jiffies and gettimeofday from full nohz
syscalls kernel entry. So another CPU has to maintain that.

Probably the game between timekeeping and vdso complicates even further the situation.

> 
> > There is a world outside of vmstat kworker, really.
> 
> Absolutely but I thought the logic is already there to compensate for
> issues like the timer interrupt not occurring.
> 
> I may not have the complete picture of the timer tick processing in my
> mind these days (it has been a lots of years since I did any work there
> after all) but as far as my arguably simplistic reading of the code goes I
> do not see why a housekeeper would be needed there. The load is constant
> and known in the dynticks busy case as it is in the dynticks idle case.

This is because of the general confusion between idle and dynticks.
There is no need for housekeeping if there is no activity at all on
a CPU (idle) and the mind makes a shortcut by considering that dynticks doesn't need
housekeeping.

But housekeeping is needed as long as there is activity and kernel service.
And that's the case whether hz or nohz.

Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt
entry path. We did that for cputime accounting and RCU. And it's possible to
even do that for timekeeping. But then the kernel entrypoint is going to be extremely
costly. It's worth CPU 0 as a sacrificial lamb.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-10 22:43         ` Frederic Weisbecker
@ 2014-11-11 14:58           ` Christoph Lameter
  2014-11-11 15:36             ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2014-11-11 14:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Mon, 10 Nov 2014, Frederic Weisbecker wrote:

> Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt
> entry path. We did that for cputime accounting and RCU. And it's possible to
> even do that for timekeeping. But then the kernel entrypoint is going to be extremely
> costly. It's worth CPU 0 as a sacrificial lamb.

Well we can redirect to the scheduler setting the task flag and handle
costly stuff there if necessary?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-11 14:58           ` Christoph Lameter
@ 2014-11-11 15:36             ` Frederic Weisbecker
  2014-11-11 17:08               ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-11-11 15:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Tue, Nov 11, 2014 at 08:58:35AM -0600, Christoph Lameter wrote:
> On Mon, 10 Nov 2014, Frederic Weisbecker wrote:
> 
> > Ok, I confess we moved part of that housekeeping to the syscall/exception/interrupt
> > entry path. We did that for cputime accounting and RCU. And it's possible to
> > even do that for timekeeping. But then the kernel entrypoint is going to be extremely
> > costly. It's worth CPU 0 as a sacrificial lamb.
> 
> Well we can redirect to the scheduler setting the task flag and handle
> costly stuff there if necessary?

"there" here is the syscall/exception/interrupt entry path. And like I said, updating
timekeeping from these places is overkill (although we do it in interrupt entry on
dynticks-idle, but we don't have timekeeping when all system is idle).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [NOHZ] Remove scheduler_tick_max_deferment
  2014-11-11 15:36             ` Frederic Weisbecker
@ 2014-11-11 17:08               ` Christoph Lameter
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-11 17:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, linux-kernel, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, Hugh Dickins, Viresh Kumar,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra

On Tue, 11 Nov 2014, Frederic Weisbecker wrote:

> "there" here is the syscall/exception/interrupt entry path. And like I said, updating
> timekeeping from these places is overkill (although we do it in interrupt entry on
> dynticks-idle, but we don't have timekeeping when all system is idle).

Well that is going to extremes I would think. Lets keep one processor
alive so that the rest can be as noise free as possible.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-10 18:26             ` Christoph Lameter
@ 2014-11-11 17:15               ` Frederic Weisbecker
  2014-11-11 17:39                 ` Paul E. McKenney
  2014-11-12  6:11                 ` Viresh Kumar
  0 siblings, 2 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2014-11-11 17:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Viresh Kumar, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On Mon, Nov 10, 2014 at 12:26:51PM -0600, Christoph Lameter wrote:
> >
> > Would it make sense for unlimited max deferment to be available as
> > a boot parameter?  That would allow people who want tick-free execution
> > more than accurate stats to get that easily, while keeping stats accurate
> > for everyone else.
> 
> Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable
> 
> Add a way to configure this interval at boot and via
> /proc/sys/vm/max_defer_tick
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>

Sorry but that's not solving the problem. All it does is to allow the user
to tune bugs.

Kevin Hilman proposed something similar using debugfs and I declined it as
well. Integrating a hack like this is a good way to make sure that nobody
will ever fix the real underlying issue.

BTW, that's a good opportunity for me to generalize this case to the full
dynticks development general issue. I got a lot of help from people to improve
the kernel's isolation and full dynticks: Paul has spent a lot of time to improve
RCU, you improved vmstat, full dynticks got ported to other archs, people
like Viresh fixed some timers internals, Gilad fixed IPIs, Peterz reviewed a
lot, etc...

But now we reached a step where there are mostly core issues remaining that
require some infrastrure change investments, some extensions or a bit of rethinking.
We know we reach that step when people who want the features are stuck sending
workarounds.
Nothing like big rewrites is needed really, actually just a bunch of pretty
self contained issues. And by self-contained I mean that each of these individual
problems can be worked out seperately as they are unrelated enough altogether. Here is
a summarized list:

* Unbound workqueues affinity (to housekeeper)
* Unbound timers affinity (to housekeeper)
* 1 Hz residual scheduler tick offlining to housekeeper
* Fix some scheduler accounting that don't even work with 1 Hz: cpu load
  accounting, rt_scale, load balancing, etc...
* Lighten the syscall path and get rid of cputime accounting + RCU hooks
  for people who want isolation + fast syscalls and faults.
* Work on non-affinable workqueues
* Work on non-affinable timers
* ...

If I'm going to work alone on all that, this is going to take several years,
honestly.

But we know what to do and how. So all we need is (at least one) more full time
core developer to get these things done in a reasonable amount of time.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-11 17:15               ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker
@ 2014-11-11 17:39                 ` Paul E. McKenney
  2014-11-11 18:00                   ` Christoph Lameter
  2014-11-12  6:11                 ` Viresh Kumar
  1 sibling, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-11-11 17:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Viresh Kumar, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On Tue, Nov 11, 2014 at 06:15:28PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 10, 2014 at 12:26:51PM -0600, Christoph Lameter wrote:
> > >
> > > Would it make sense for unlimited max deferment to be available as
> > > a boot parameter?  That would allow people who want tick-free execution
> > > more than accurate stats to get that easily, while keeping stats accurate
> > > for everyone else.
> > 
> > Subject: Make the maximum tick deferral for CONFIG_NO_HZ configurable
> > 
> > Add a way to configure this interval at boot and via
> > /proc/sys/vm/max_defer_tick
> > 
> > Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Sorry but that's not solving the problem. All it does is to allow the user
> to tune bugs.
> 
> Kevin Hilman proposed something similar using debugfs and I declined it as
> well. Integrating a hack like this is a good way to make sure that nobody
> will ever fix the real underlying issue.

I guess I should have remembered that before suggesting this to Christoph,
my apologies to all!

> BTW, that's a good opportunity for me to generalize this case to the full
> dynticks development general issue. I got a lot of help from people to improve
> the kernel's isolation and full dynticks: Paul has spent a lot of time to improve
> RCU, you improved vmstat, full dynticks got ported to other archs, people
> like Viresh fixed some timers internals, Gilad fixed IPIs, Peterz reviewed a
> lot, etc...
> 
> But now we reached a step where there are mostly core issues remaining that
> require some infrastrure change investments, some extensions or a bit of rethinking.
> We know we reach that step when people who want the features are stuck sending
> workarounds.
> Nothing like big rewrites is needed really, actually just a bunch of pretty
> self contained issues. And by self-contained I mean that each of these individual
> problems can be worked out seperately as they are unrelated enough altogether. Here is
> a summarized list:
> 
> * Unbound workqueues affinity (to housekeeper)
> * Unbound timers affinity (to housekeeper)
> * 1 Hz residual scheduler tick offlining to housekeeper
> * Fix some scheduler accounting that don't even work with 1 Hz: cpu load
>   accounting, rt_scale, load balancing, etc...
> * Lighten the syscall path and get rid of cputime accounting + RCU hooks
>   for people who want isolation + fast syscalls and faults.

I thought that the RCU hooks were well and truly down in the noise.
Or is that not the case without cputime accounting to hide behind?

							Thanx, Paul

> * Work on non-affinable workqueues
> * Work on non-affinable timers
> * ...
> 
> If I'm going to work alone on all that, this is going to take several years,
> honestly.
> 
> But we know what to do and how. So all we need is (at least one) more full time
> core developer to get these things done in a reasonable amount of time.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-11 17:39                 ` Paul E. McKenney
@ 2014-11-11 18:00                   ` Christoph Lameter
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2014-11-11 18:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Viresh Kumar, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On Tue, 11 Nov 2014, Paul E. McKenney wrote:

> I guess I should have remembered that before suggesting this to Christoph,
> my apologies to all!

No this is good. Do not worry. I can just throw out the one or other patch
to ensure that the discussion continues and then we can figure out how to
come up with a correct solution. Patches are a way to say that you are
serious about an issue on lkml.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-11 17:15               ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker
  2014-11-11 17:39                 ` Paul E. McKenney
@ 2014-11-12  6:11                 ` Viresh Kumar
  2014-11-12 13:54                   ` Frederic Weisbecker
  1 sibling, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2014-11-12  6:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Here is a summarized list:
>
> * Unbound workqueues affinity (to housekeeper)
> * Unbound timers affinity (to housekeeper)
> * 1 Hz residual scheduler tick offlining to housekeeper
> * Fix some scheduler accounting that don't even work with 1 Hz: cpu load
>   accounting, rt_scale, load balancing, etc...
> * Lighten the syscall path and get rid of cputime accounting + RCU hooks
>   for people who want isolation + fast syscalls and faults.
> * Work on non-affinable workqueues
> * Work on non-affinable timers
> * ...

+ spurious interrupts with NOHZ_FULL on all architectures which break isolation
but doesn't get caught with traces. Can be observed with this:

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 481fa54..91d490d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1244,7 +1244,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 {
        struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
        ktime_t expires_next, now, entry_time, delta;
-       int i, retries = 0;
+       int i, retries = 0, count = 0;
+       static int total_spurious;

        BUG_ON(!cpu_base->hres_active);
        cpu_base->nr_events++;
@@ -1304,10 +1305,14 @@ void hrtimer_interrupt(struct clock_event_device *dev)
                                break;
                        }

+                       count++;
                        __run_hrtimer(timer, &basenow);
                }
        }

+       if (!count)
+               pr_err("____%s: Totalspurious: %d\n", __func__,
++total_spurious);
+
        /*
         * Store the new expiry value so the migration code can verify
         * against it.

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-12  6:11                 ` Viresh Kumar
@ 2014-11-12 13:54                   ` Frederic Weisbecker
  2014-11-12 14:56                     ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2014-11-12 13:54 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On Wed, Nov 12, 2014 at 11:41:09AM +0530, Viresh Kumar wrote:
> On 11 November 2014 22:45, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Here is a summarized list:
> >
> > * Unbound workqueues affinity (to housekeeper)
> > * Unbound timers affinity (to housekeeper)
> > * 1 Hz residual scheduler tick offlining to housekeeper
> > * Fix some scheduler accounting that don't even work with 1 Hz: cpu load
> >   accounting, rt_scale, load balancing, etc...
> > * Lighten the syscall path and get rid of cputime accounting + RCU hooks
> >   for people who want isolation + fast syscalls and faults.
> > * Work on non-affinable workqueues
> > * Work on non-affinable timers
> > * ...
> 
> + spurious interrupts with NOHZ_FULL on all architectures which break isolation
> but doesn't get caught with traces. Can be observed with this:
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 481fa54..91d490d 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1244,7 +1244,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>  {
>         struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
>         ktime_t expires_next, now, entry_time, delta;
> -       int i, retries = 0;
> +       int i, retries = 0, count = 0;
> +       static int total_spurious;
> 
>         BUG_ON(!cpu_base->hres_active);
>         cpu_base->nr_events++;
> @@ -1304,10 +1305,14 @@ void hrtimer_interrupt(struct clock_event_device *dev)
>                                 break;
>                         }
> 
> +                       count++;
>                         __run_hrtimer(timer, &basenow);
>                 }
>         }
> 
> +       if (!count)
> +               pr_err("____%s: Totalspurious: %d\n", __func__,
> ++total_spurious);
> +

I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().

Or better yet: have trace_hrtimer_interrupt() which we can compare against
trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
in the interrupt. This way we avoid workarounds like the above count.

>         /*
>          * Store the new expiry value so the migration code can verify
>          * against it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-12 13:54                   ` Frederic Weisbecker
@ 2014-11-12 14:56                     ` Viresh Kumar
  2014-11-12 15:06                       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2014-11-12 14:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Paul E. McKenney, Thomas Gleixner,
	Linux Kernel Mailing List, Gilad Ben-Yossef, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Kevin Hilman

On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().

Yeah, it was just to prove things right on the console without getting
into traces.

> Or better yet: have trace_hrtimer_interrupt() which we can compare against
> trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
> in the interrupt. This way we avoid workarounds like the above count.

Yeah, I also believe we better add this debug information to mainline kernel.
I will try to get a patch for that soon.

Would it be recommended to add both trace points?
i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() ?

--
viresh

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-12 14:56                     ` Viresh Kumar
@ 2014-11-12 15:06                       ` Peter Zijlstra
  2014-11-12 15:16                         ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2014-11-12 15:06 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney,
	Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef,
	Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Kevin Hilman

On Wed, Nov 12, 2014 at 08:26:05PM +0530, Viresh Kumar wrote:
> On 12 November 2014 19:24, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > I'd rather leave that to tracepoints. Like trace_hrtimer_spurious().
> 
> Yeah, it was just to prove things right on the console without getting
> into traces.
> 
> > Or better yet: have trace_hrtimer_interrupt() which we can compare against
> > trace_hrtimer_expire_entry/exit() to check if any hrtimer callback have run
> > in the interrupt. This way we avoid workarounds like the above count.
> 
> Yeah, I also believe we better add this debug information to mainline kernel.
> I will try to get a patch for that soon.
> 
> Would it be recommended to add both trace points?
> i.e. trace_hrtimer_interrupt() and trace_hrtimer_spurious() 

I don't think you need to add anything. We already have tracepoints for
every single interrupt (and therefore also for the hrtimer one) and we
have expiry tracepoints.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-12 15:06                       ` Peter Zijlstra
@ 2014-11-12 15:16                         ` Viresh Kumar
  2014-11-13  7:22                           ` Viresh Kumar
  0 siblings, 1 reply; 24+ messages in thread
From: Viresh Kumar @ 2014-11-12 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney,
	Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef,
	Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Kevin Hilman

On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote:
> I don't think you need to add anything. We already have tracepoints for
> every single interrupt (and therefore also for the hrtimer one) and we
> have expiry tracepoints.

I will crosscheck this again but as far as I remember these spurious interrupts
aren't caught with tracers currently. Not even the irq_enter/exit ones.

I saw that piece of code long back (and obviously don't understand it
completely). But the problem was that because these are spurious
clockevent interrupts, we don't have anything to do on that interrupt.
No tick/sched/softirq processing. And the irq-handling code returns before
even calling trace_irq_enter().

And that's why I had to add those prints initially.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment)
  2014-11-12 15:16                         ` Viresh Kumar
@ 2014-11-13  7:22                           ` Viresh Kumar
  0 siblings, 0 replies; 24+ messages in thread
From: Viresh Kumar @ 2014-11-13  7:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Christoph Lameter, Paul E. McKenney,
	Thomas Gleixner, Linux Kernel Mailing List, Gilad Ben-Yossef,
	Tejun Heo, John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Hugh Dickins, H. Peter Anvin, Ingo Molnar,
	Kevin Hilman

On 12 November 2014 20:46, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 12 November 2014 20:36, Peter Zijlstra <peterz@infradead.org> wrote:
>> I don't think you need to add anything. We already have tracepoints for
>> every single interrupt (and therefore also for the hrtimer one) and we
>> have expiry tracepoints.
>
> I will crosscheck this again but as far as I remember these spurious interrupts
> aren't caught with tracers currently. Not even the irq_enter/exit ones.

So yes, we do get irq_handler_entry/exit traces for the clockevent device.

The problem here is that people miss these as there are no warnings issued
here. And so nobody tried to resolve these spurious interrupts.

This is how it looks on a ARM machine (With trace_hrtimer_spurious), though
it is relevant for all architectures which emulate ONESHOT mode over
PERIODIC mode:

          <idle>-0     [001] d.h2   719.642085: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.646506: hrtimer_spurious: count:105
          <idle>-0     [001] d.h2   719.646511: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.646522: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719640000000
softexpires=719640000000
          <idle>-0     [001] d..3   719.646545: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.647086: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.651540: hrtimer_spurious: count:106
          <idle>-0     [001] d.h2   719.651545: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.651554: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719645000000
softexpires=719645000000
          <idle>-0     [001] d..3   719.651576: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.652084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.656571: hrtimer_spurious: count:107
          <idle>-0     [001] d.h2   719.656575: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.656583: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719650000000
softexpires=719650000000
          <idle>-0     [001] d..3   719.656601: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.657084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.661611: hrtimer_spurious: count:108
          <idle>-0     [001] d.h2   719.661616: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.661627: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719655000000
softexpires=719655000000
          <idle>-0     [001] d..3   719.661652: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.662086: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.666643: hrtimer_spurious: count:109
          <idle>-0     [001] d.h2   719.666648: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.666657: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719660000000
softexpires=719660000000
          <idle>-0     [001] d..3   719.666680: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.667084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.671678: hrtimer_spurious: count:110
          <idle>-0     [001] d.h2   719.671682: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.671693: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719665000000
softexpires=719665000000
          <idle>-0     [001] d..3   719.671714: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.672085: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.676712: hrtimer_spurious: count:111
          <idle>-0     [001] d.h2   719.676717: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.676726: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719670000000
softexpires=719670000000
          <idle>-0     [001] d..3   719.676749: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.677087: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.681745: hrtimer_spurious: count:112
          <idle>-0     [001] d.h2   719.681750: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.681759: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719675000000
softexpires=719675000000
          <idle>-0     [001] d..3   719.681780: hrtimer_cancel: hrtimer=ee7bb740
          <idle>-0     [001] d.h2   719.682084: irq_handler_entry:
irq=30 name=arch_timer
          <idle>-0     [001] d.h3   719.686783: hrtimer_spurious: count:113
          <idle>-0     [001] d.h2   719.686787: irq_handler_exit:
irq=30 ret=handled
          <idle>-0     [001] d..3   719.686797: hrtimer_start:
hrtimer=ee7bb740 function=tick_sched_timer expires=719680000000
softexpires=719680000000
          <idle>-0     [001] d..3   719.686819: hrtimer_cancel: hrtimer=ee7bb740


Normally this happens on a non-spurious interrupt:

 is-cpu-isolated-1963  [000] d.h1   719.642080: irq_handler_entry:
irq=30 name=arch_timer
 is-cpu-isolated-1963  [000] d.h2   719.642086: hrtimer_cancel: hrtimer=ee7b3740
 is-cpu-isolated-1963  [000] d.h1   719.642089: hrtimer_expire_entry:
hrtimer=ee7b3740 function=tick_sched_timer now=719635008926
 is-cpu-isolated-1963  [000] d.h1   719.642095: softirq_raise: vec=1
[action=TIMER]
 is-cpu-isolated-1963  [000] d.h1   719.642099: softirq_raise: vec=9
[action=RCU]
 is-cpu-isolated-1963  [000] d.h2   719.642104: sched_stat_runtime:
comm=trace-isolation pid=1963 runtime=348042 [ns] vruntime=7804506025
[ns]
 is-cpu-isolated-1963  [000] d.h1   719.642111: hrtimer_expire_exit:
hrtimer=ee7b3740
 is-cpu-isolated-1963  [000] d.h2   719.642115: hrtimer_start:
hrtimer=ee7b3740 function=tick_sched_timer expires=719640000000
softexpires=719640000000
 is-cpu-isolated-1963  [000] d.h1   719.642121: irq_handler_exit:
irq=30 ret=handled

.. followed by softirq processing.

Another trace point with trace_hrtimer_spurious() might end up being useful :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-11-13  7:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-31 16:01 [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter
2014-11-01 19:18 ` Thomas Gleixner
2014-11-01 21:52   ` Christoph Lameter
2014-11-01 22:33     ` Thomas Gleixner
2014-11-06 17:24       ` Christoph Lameter
2014-11-10  7:11         ` Viresh Kumar
2014-11-10 15:31           ` Paul E. McKenney
2014-11-10 16:21             ` Christoph Lameter
2014-11-10 18:26             ` Christoph Lameter
2014-11-11 17:15               ` Future of NOHZ full/isolation development (was Re: [NOHZ] Remove scheduler_tick_max_deferment) Frederic Weisbecker
2014-11-11 17:39                 ` Paul E. McKenney
2014-11-11 18:00                   ` Christoph Lameter
2014-11-12  6:11                 ` Viresh Kumar
2014-11-12 13:54                   ` Frederic Weisbecker
2014-11-12 14:56                     ` Viresh Kumar
2014-11-12 15:06                       ` Peter Zijlstra
2014-11-12 15:16                         ` Viresh Kumar
2014-11-13  7:22                           ` Viresh Kumar
2014-11-10 16:19           ` [NOHZ] Remove scheduler_tick_max_deferment Christoph Lameter
2014-11-10 22:43         ` Frederic Weisbecker
2014-11-11 14:58           ` Christoph Lameter
2014-11-11 15:36             ` Frederic Weisbecker
2014-11-11 17:08               ` Christoph Lameter
2014-11-10 20:26     ` Frederic Weisbecker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.