linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
@ 2014-05-09  7:47 Anton Blanchard
  2014-05-09  9:52 ` Preeti U Murthy
  2014-05-09 13:41 ` Paul E. McKenney
  0 siblings, 2 replies; 15+ messages in thread
From: Anton Blanchard @ 2014-05-09  7:47 UTC (permalink / raw)
  To: benh, paulus, mpe, paulmck; +Cc: linuxppc-dev

I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.

Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.

In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:

        *next_tb = ~(u64)0;
        if (evt->event_handler)
                evt->event_handler(evt);

In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:

static int decrementer_set_next_event(unsigned long evt,
                                      struct clock_event_device *dev)
{
        /* Don't adjust the decrementer if some irq work is pending */
        if (test_irq_work_pending())
                return 0;
        __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.

This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:

       /* We may have raced with new irq work */
       if (test_irq_work_pending())
               set_dec(1);

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
---

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..4f0b676 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,9 +888,6 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev)
 {
-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 	set_dec(evt);
 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09  7:47 [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang Anton Blanchard
@ 2014-05-09  9:52 ` Preeti U Murthy
  2014-05-10  4:26   ` Benjamin Herrenschmidt
  2014-05-09 13:41 ` Paul E. McKenney
  1 sibling, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2014-05-09  9:52 UTC (permalink / raw)
  To: Anton Blanchard, benh; +Cc: paulmck, paulus, linuxppc-dev

Hi Anton,

On 05/09/2014 01:17 PM, Anton Blanchard wrote:
> I am seeing an issue where a CPU running perf eventually hangs.
> Traces show timer interrupts happening every 4 seconds even
> when a userspace task is running on the CPU. /proc/timer_list
> also shows pending hrtimers have not run in over an hour,
> including the scheduler.
> 
> Looking closer, decrementers_next_tb is getting set to
> 0xffffffffffffffff, and at that point we will never take
> a timer interrupt again.
> 
> In __timer_interrupt() we set decrementers_next_tb to
> 0xffffffffffffffff and rely on ->event_handler to update it:
> 
>         *next_tb = ~(u64)0;
>         if (evt->event_handler)
>                 evt->event_handler(evt);
> 
> In this case ->event_handler is hrtimer_interrupt. This will eventually
> call back through the clockevents code with the next event to be
> programmed:
> 
> static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
> {
>         /* Don't adjust the decrementer if some irq work is pending */
>         if (test_irq_work_pending())
>                 return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> If irq work came in between these two points, we will return
> before updating decrementers_next_tb and we never process a timer
> interrupt again.
> 
> This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
> with irq_work). Fix it by removing the early exit and relying on
> code later on in the function to force an early decrementer:
> 
>        /* We may have raced with new irq work */
>        if (test_irq_work_pending())
>                set_dec(1);
> 

There is another scenario we are missing. Its not necessary that on a
timer interrupt the event handler will call back through the
set_next_event().
If there are no pending timers then the event handler will not bother
programming the tick device and simply return.IOW, set_next_event() will
not be called. In that case we will miss taking care of pending irq work
altogether.

__timer_interrupt() -> event_handler -> next_time = KTIME_MAX ->
__timer_interrupt().

In __timer_interrupt() we do not check for pending irq anywhere after
the call to the event handler and we hence miss servicing irqs in the
above scenario.

How about you also move the check:
 if (test_irq_pending())
   set_dec(1)

in __timer_interrupt() outside the _else_ loop? This will ensure that no
matter what, before exiting timer interrupt handler we check for pending
irq work.

Regards
Preeti U Murthy

> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: stable@vger.kernel.org # 3.14+
> ---
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..4f0b676 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev)
>  {
> -	/* Don't adjust the decrementer if some irq work is pending */
> -	if (test_irq_work_pending())
> -		return 0;
>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>  	set_dec(evt);

How about if you move the test_irq_work_pending
Why do we have test_irq_work_pending() later in the function
decrementer_set_next_event()?
>  
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09  7:47 [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang Anton Blanchard
  2014-05-09  9:52 ` Preeti U Murthy
@ 2014-05-09 13:41 ` Paul E. McKenney
  2014-05-09 21:50   ` Gabriel Paubert
  1 sibling, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2014-05-09 13:41 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: paulus, linuxppc-dev

On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> I am seeing an issue where a CPU running perf eventually hangs.
> Traces show timer interrupts happening every 4 seconds even
> when a userspace task is running on the CPU.

Is this by chance every 4.2 seconds?  The reason I ask is that
Paul Clarke and I are seeing an interrupt every 4.2 seconds when
he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

						Thanx, Paul

>                                              /proc/timer_list
> also shows pending hrtimers have not run in over an hour,
> including the scheduler.
> 
> Looking closer, decrementers_next_tb is getting set to
> 0xffffffffffffffff, and at that point we will never take
> a timer interrupt again.
> 
> In __timer_interrupt() we set decrementers_next_tb to
> 0xffffffffffffffff and rely on ->event_handler to update it:
> 
>         *next_tb = ~(u64)0;
>         if (evt->event_handler)
>                 evt->event_handler(evt);
> 
> In this case ->event_handler is hrtimer_interrupt. This will eventually
> call back through the clockevents code with the next event to be
> programmed:
> 
> static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
> {
>         /* Don't adjust the decrementer if some irq work is pending */
>         if (test_irq_work_pending())
>                 return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> If irq work came in between these two points, we will return
> before updating decrementers_next_tb and we never process a timer
> interrupt again.
> 
> This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
> with irq_work). Fix it by removing the early exit and relying on
> code later on in the function to force an early decrementer:
> 
>        /* We may have raced with new irq work */
>        if (test_irq_work_pending())
>                set_dec(1);
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: stable@vger.kernel.org # 3.14+
> ---
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..4f0b676 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev)
>  {
> -	/* Don't adjust the decrementer if some irq work is pending */
> -	if (test_irq_work_pending())
> -		return 0;
>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>  	set_dec(evt);
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09 13:41 ` Paul E. McKenney
@ 2014-05-09 21:50   ` Gabriel Paubert
  2014-05-09 22:08     ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Gabriel Paubert @ 2014-05-09 21:50 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, paulus, Anton Blanchard

On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > I am seeing an issue where a CPU running perf eventually hangs.
> > Traces show timer interrupts happening every 4 seconds even
> > when a userspace task is running on the CPU.
> 
> Is this by chance every 4.2 seconds?  The reason I ask is that
> Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious?

	Gabriel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09 21:50   ` Gabriel Paubert
@ 2014-05-09 22:08     ` Paul E. McKenney
  2014-05-10  6:33       ` Paul Mackerras
  0 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2014-05-09 22:08 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linuxppc-dev, paulus, Anton Blanchard

On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
> On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > > I am seeing an issue where a CPU running perf eventually hangs.
> > > Traces show timer interrupts happening every 4 seconds even
> > > when a userspace task is running on the CPU.
> > 
> > Is this by chance every 4.2 seconds?  The reason I ask is that
> > Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> > he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)
> 
> Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious?

Now that you mention it...  ;-)

So you are telling me that we are not succeeding in completely turning
off the decrementer interrupt?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09  9:52 ` Preeti U Murthy
@ 2014-05-10  4:26   ` Benjamin Herrenschmidt
  2014-05-10 15:36     ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-10  4:26 UTC (permalink / raw)
  To: Preeti U Murthy; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> matter what, before exiting timer interrupt handler we check for pending
> irq work.

We still need to make sure that set_next_event() doesn't move the
dec beyond the next tick if there is a pending timer... maybe we
can fix it like this:

static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev)
{
	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

	/* Don't adjust the decrementer if some irq work is pending */
	if (!test_irq_work_pending())
		set_dec(evt);

	return 0;
}

Along with a single occurrence of:

	if (test_irq_work_pending())
		set_dec(1);

At the end of __timer_interrupt(), outside if the current else {}
case, this should work, don't you think ?

What about this completely untested patch ?

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..ba7e83b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -503,12 +503,13 @@ void __timer_interrupt(void)
                now = *next_tb - now;
                if (now <= DECREMENTER_MAX)
                        set_dec((int)now);
-               /* We may have raced with new irq work */
-               if (test_irq_work_pending())
-                       set_dec(1);
                __get_cpu_var(irq_stat).timer_irqs_others++;
        }
 
+       /* We may have raced with new irq work */
+       if (test_irq_work_pending())
+               set_dec(1);
+
 #ifdef CONFIG_PPC64
        /* collect purr register values often, for accurate calculations */
        if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
@@ -813,15 +814,11 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
                                      struct clock_event_device *dev)
 {
-       /* Don't adjust the decrementer if some irq work is pending */
-       if (test_irq_work_pending())
-               return 0;
        __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-       set_dec(evt);
 
-       /* We may have raced with new irq work */
-       if (test_irq_work_pending())
-               set_dec(1);
+       /* Don't adjust the decrementer if some irq work is pending */
+       if (!test_irq_work_pending())
+               set_dec(evt);
 
        return 0;
 }

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-09 22:08     ` Paul E. McKenney
@ 2014-05-10  6:33       ` Paul Mackerras
  2014-05-10 16:33         ` Paul E. McKenney
  0 siblings, 1 reply; 15+ messages in thread
From: Paul Mackerras @ 2014-05-10  6:33 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, Anton Blanchard

On Fri, May 09, 2014 at 03:08:45PM -0700, Paul E. McKenney wrote:
> On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
> > On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > > > I am seeing an issue where a CPU running perf eventually hangs.
> > > > Traces show timer interrupts happening every 4 seconds even
> > > > when a userspace task is running on the CPU.
> > > 
> > > Is this by chance every 4.2 seconds?  The reason I ask is that
> > > Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> > > he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)
> > 
> > Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious?
> 
> Now that you mention it...  ;-)
> 
> So you are telling me that we are not succeeding in completely turning
> off the decrementer interrupt?

There is no way to turn off the decrementer interrupt without turning
off external (device) interrupts.

On IBM Power CPUs since POWER6, the decrementer runs at 512MHz.  If
you set the decrementer to 0x7fffffff it will interrupt in 4.194
seconds, so that would be what you're seeing.  The only way to avoid
the interrupt becoming pending is to keep on setting it to a large
value before it gets to -1.

If an interrupt every 4.2 seconds is a problem in some applications,
then we need to talk to the Power architects.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-10  4:26   ` Benjamin Herrenschmidt
@ 2014-05-10 15:36     ` Preeti U Murthy
  2014-05-10 22:25       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2014-05-10 15:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>> matter what, before exiting timer interrupt handler we check for pending
>> irq work.
> 
> We still need to make sure that set_next_event() doesn't move the
> dec beyond the next tick if there is a pending timer... maybe we

Sorry, but didn't get this. s/if there is pending timer/if there is
pending irq work ?

> can fix it like this:

We can call set_next_event() from events like hrtimer_cancel() or
hrtimer_forward() as well. In that case we don't come to
decrementer_set_next_event() from __timer_interrupt(). Then, if we race
with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
below ), we might never set the decrementer to fire immediately right?

Or does this scenario never arise?

Regards
Preeti U Murthy
> 
> static int decrementer_set_next_event(unsigned long evt,
> 				      struct clock_event_device *dev)
> {
> 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> 	/* Don't adjust the decrementer if some irq work is pending */
> 	if (!test_irq_work_pending())
> 		set_dec(evt);
> 
> 	return 0;
> }
> 
> Along with a single occurrence of:
> 
> 	if (test_irq_work_pending())
> 		set_dec(1);
> 
> At the end of __timer_interrupt(), outside if the current else {}
> case, this should work, don't you think ?
> 
> What about this completely untested patch ?
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..ba7e83b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>                 now = *next_tb - now;
>                 if (now <= DECREMENTER_MAX)
>                         set_dec((int)now);
> -               /* We may have raced with new irq work */
> -               if (test_irq_work_pending())
> -                       set_dec(1);
>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>         }
> 
> +       /* We may have raced with new irq work */
> +       if (test_irq_work_pending())
> +               set_dec(1);
> +
>  #ifdef CONFIG_PPC64
>         /* collect purr register values often, for accurate calculations */
>         if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
> @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
>  {
> -       /* Don't adjust the decrementer if some irq work is pending */
> -       if (test_irq_work_pending())
> -               return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> -       set_dec(evt);
> 
> -       /* We may have raced with new irq work */
> -       if (test_irq_work_pending())
> -               set_dec(1);
> +       /* Don't adjust the decrementer if some irq work is pending */
> +       if (!test_irq_work_pending())
> +               set_dec(evt);
> 
>         return 0;
>  }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-10  6:33       ` Paul Mackerras
@ 2014-05-10 16:33         ` Paul E. McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2014-05-10 16:33 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Anton Blanchard

On Sat, May 10, 2014 at 04:33:37PM +1000, Paul Mackerras wrote:
> On Fri, May 09, 2014 at 03:08:45PM -0700, Paul E. McKenney wrote:
> > On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
> > > On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> > > > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > > > > I am seeing an issue where a CPU running perf eventually hangs.
> > > > > Traces show timer interrupts happening every 4 seconds even
> > > > > when a userspace task is running on the CPU.
> > > > 
> > > > Is this by chance every 4.2 seconds?  The reason I ask is that
> > > > Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> > > > he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)
> > > 
> > > Hmmm, it's close to 2^32 nanoseconds, isnt't it suspiscious?
> > 
> > Now that you mention it...  ;-)
> > 
> > So you are telling me that we are not succeeding in completely turning
> > off the decrementer interrupt?
> 
> There is no way to turn off the decrementer interrupt without turning
> off external (device) interrupts.
> 
> On IBM Power CPUs since POWER6, the decrementer runs at 512MHz.  If
> you set the decrementer to 0x7fffffff it will interrupt in 4.194
> seconds, so that would be what you're seeing.  The only way to avoid
> the interrupt becoming pending is to keep on setting it to a large
> value before it gets to -1.
> 
> If an interrupt every 4.2 seconds is a problem in some applications,
> then we need to talk to the Power architects.

Thank you for filling me in on this!  Might be worth doing just that.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-10 15:36     ` Preeti U Murthy
@ 2014-05-10 22:25       ` Benjamin Herrenschmidt
  2014-05-11  8:15         ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-10 22:25 UTC (permalink / raw)
  To: Preeti U Murthy; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
> >> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> >> matter what, before exiting timer interrupt handler we check for pending
> >> irq work.
> > 
> > We still need to make sure that set_next_event() doesn't move the
> > dec beyond the next tick if there is a pending timer... maybe we
> 
> Sorry, but didn't get this. s/if there is pending timer/if there is
> pending irq work ?

Yes, sorry :-) That's what I meant.

> > can fix it like this:
> 
> We can call set_next_event() from events like hrtimer_cancel() or
> hrtimer_forward() as well. In that case we don't come to
> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
> below ), we might never set the decrementer to fire immediately right?
> 
> Or does this scenario never arise?

So my proposed patch handles that no ?

With that patch, we do the set_dec(1) in two cases:

 - The existing arch_irq_work_raise() which is unchanged

 - At the end of __timer_interrupt() if an irq work is still pending

And the patch also makes decrementer_set_next_event() not modify the
decrementer if an irq work is pending, but *still* adjust next_tb unlike
what the code does now.

Thus the timer interrupt, when it happens, will re-adjust the dec
properly using next_tb.

Do we still miss a case ?

Cheers,
Ben.

> Regards
> Preeti U Murthy
> > 
> > static int decrementer_set_next_event(unsigned long evt,
> > 				      struct clock_event_device *dev)
> > {
> > 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> > 
> > 	/* Don't adjust the decrementer if some irq work is pending */
> > 	if (!test_irq_work_pending())
> > 		set_dec(evt);
> > 
> > 	return 0;
> > }
> > 
> > Along with a single occurrence of:
> > 
> > 	if (test_irq_work_pending())
> > 		set_dec(1);
> > 
> > At the end of __timer_interrupt(), outside if the current else {}
> > case, this should work, don't you think ?
> > 
> > What about this completely untested patch ?
> > 
> > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> > index 122a580..ba7e83b 100644
> > --- a/arch/powerpc/kernel/time.c
> > +++ b/arch/powerpc/kernel/time.c
> > @@ -503,12 +503,13 @@ void __timer_interrupt(void)
> >                 now = *next_tb - now;
> >                 if (now <= DECREMENTER_MAX)
> >                         set_dec((int)now);
> > -               /* We may have raced with new irq work */
> > -               if (test_irq_work_pending())
> > -                       set_dec(1);
> >                 __get_cpu_var(irq_stat).timer_irqs_others++;
> >         }
> > 
> > +       /* We may have raced with new irq work */
> > +       if (test_irq_work_pending())
> > +               set_dec(1);
> > +
> >  #ifdef CONFIG_PPC64
> >         /* collect purr register values often, for accurate calculations */
> >         if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
> > @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
> >  static int decrementer_set_next_event(unsigned long evt,
> >                                       struct clock_event_device *dev)
> >  {
> > -       /* Don't adjust the decrementer if some irq work is pending */
> > -       if (test_irq_work_pending())
> > -               return 0;
> >         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> > -       set_dec(evt);
> > 
> > -       /* We may have raced with new irq work */
> > -       if (test_irq_work_pending())
> > -               set_dec(1);
> > +       /* Don't adjust the decrementer if some irq work is pending */
> > +       if (!test_irq_work_pending())
> > +               set_dec(evt);
> > 
> >         return 0;
> >  }
> > 
> > 
> > 
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-10 22:25       ` Benjamin Herrenschmidt
@ 2014-05-11  8:15         ` Preeti U Murthy
  2014-05-11  8:37           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2014-05-11  8:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On 05/11/2014 03:55 AM, Benjamin Herrenschmidt wrote:
> On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
>> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>>>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>>>> matter what, before exiting timer interrupt handler we check for pending
>>>> irq work.
>>>
>>> We still need to make sure that set_next_event() doesn't move the
>>> dec beyond the next tick if there is a pending timer... maybe we
>>
>> Sorry, but didn't get this. s/if there is pending timer/if there is
>> pending irq work ?
> 
> Yes, sorry :-) That's what I meant.
> 
>>> can fix it like this:
>>
>> We can call set_next_event() from events like hrtimer_cancel() or
>> hrtimer_forward() as well. In that case we don't come to
>> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
>> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
>> below ), we might never set the decrementer to fire immediately right?
>>
>> Or does this scenario never arise?
> 
> So my proposed patch handles that no ?
> 
> With that patch, we do the set_dec(1) in two cases:
> 
>  - The existing arch_irq_work_raise() which is unchanged
> 
>  - At the end of __timer_interrupt() if an irq work is still pending
> 
> And the patch also makes decrementer_set_next_event() not modify the
> decrementer if an irq work is pending, but *still* adjust next_tb unlike
> what the code does now.
> 
> Thus the timer interrupt, when it happens, will re-adjust the dec
> properly using next_tb.
> 
> Do we still miss a case ?

I was thinking something like the below in decrementer_set_next_event().
See last line in particular :

 -       /* Don't adjust the decrementer if some irq work is pending */
 -       if (test_irq_work_pending())
 -               return 0;
         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 -       set_dec(evt);

 -       /* We may have raced with new irq work */
 -       if (test_irq_work_pending())
 -               set_dec(1);
 +       /* Don't adjust the decrementer if some irq work is pending */
 +       if (!test_irq_work_pending())
 +               set_dec(evt);
 +       else
 +               set_dec(1);

                  ^^^^^ your patch currently does not have this explicit
set_dec(1) here. Will that create a problem? If there is any irq work
pending at this point, will someone set the decrementer to fire
immediately after this point? The current code in
decrementer_set_next_event() sets set_dec(1) explicitly in case of
pending irq work.

Regards
Preeti U Murthy
> 
> Cheers,
> Ben.
> 
>> Regards
>> Preeti U Murthy
>>>
>>> static int decrementer_set_next_event(unsigned long evt,
>>> 				      struct clock_event_device *dev)
>>> {
>>> 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>>>
>>> 	/* Don't adjust the decrementer if some irq work is pending */
>>> 	if (!test_irq_work_pending())
>>> 		set_dec(evt);
>>>
>>> 	return 0;
>>> }
>>>
>>> Along with a single occurrence of:
>>>
>>> 	if (test_irq_work_pending())
>>> 		set_dec(1);
>>>
>>> At the end of __timer_interrupt(), outside if the current else {}
>>> case, this should work, don't you think ?
>>>
>>> What about this completely untested patch ?
>>>
>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>> index 122a580..ba7e83b 100644
>>> --- a/arch/powerpc/kernel/time.c
>>> +++ b/arch/powerpc/kernel/time.c
>>> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>>>                 now = *next_tb - now;
>>>                 if (now <= DECREMENTER_MAX)
>>>                         set_dec((int)now);
>>> -               /* We may have raced with new irq work */
>>> -               if (test_irq_work_pending())
>>> -                       set_dec(1);
>>>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>>>         }
>>>
>>> +       /* We may have raced with new irq work */
>>> +       if (test_irq_work_pending())
>>> +               set_dec(1);
>>> +
>>>  #ifdef CONFIG_PPC64
>>>         /* collect purr register values often, for accurate calculations */
>>>         if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
>>> @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
>>>  static int decrementer_set_next_event(unsigned long evt,
>>>                                       struct clock_event_device *dev)
>>>  {
>>> -       /* Don't adjust the decrementer if some irq work is pending */
>>> -       if (test_irq_work_pending())
>>> -               return 0;
>>>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>>> -       set_dec(evt);
>>>
>>> -       /* We may have raced with new irq work */
>>> -       if (test_irq_work_pending())
>>> -               set_dec(1);
>>> +       /* Don't adjust the decrementer if some irq work is pending */
>>> +       if (!test_irq_work_pending())
>>> +               set_dec(evt);
>>>
>>>         return 0;
>>>  }
>>>
>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-11  8:15         ` Preeti U Murthy
@ 2014-05-11  8:37           ` Benjamin Herrenschmidt
  2014-05-11  8:43             ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-11  8:37 UTC (permalink / raw)
  To: Preeti U Murthy; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
>  +       /* Don't adjust the decrementer if some irq work is pending
> */
>  +       if (!test_irq_work_pending())
>  +               set_dec(evt);
>  +       else
>  +               set_dec(1);
> 
>                   ^^^^^ your patch currently does not have this
> explicit
> set_dec(1) here. Will that create a problem? 
>
> If there is any irq work pending at this point, will someone set the
> decrementer to fire immediately after this point? The current code in
> decrementer_set_next_event() sets set_dec(1) explicitly in case of
> pending irq work.

Hrm, actually this is an interesting point. The problem isn't that
*someone* will do a set_dec, nobody else should that matters.

The problem is that irq_work can be triggered typically by NMIs or
similar, which means that it might be queued between the
test_irq_work_pending() and the set_dec(), thus causing a race.

So basically Anton's original patch is fine :-) I had missed that
we did a post-set_dec() test already in decrementer_next_event()
so as far as I can tell, removing the pre-test, which is what Anton
does, is really all we need.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-11  8:37           ` Benjamin Herrenschmidt
@ 2014-05-11  8:43             ` Preeti U Murthy
  2014-05-11  9:03               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2014-05-11  8:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On 05/11/2014 02:07 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
>>  +       /* Don't adjust the decrementer if some irq work is pending
>> */
>>  +       if (!test_irq_work_pending())
>>  +               set_dec(evt);
>>  +       else
>>  +               set_dec(1);
>>
>>                   ^^^^^ your patch currently does not have this
>> explicit
>> set_dec(1) here. Will that create a problem? 
>>
>> If there is any irq work pending at this point, will someone set the
>> decrementer to fire immediately after this point? The current code in
>> decrementer_set_next_event() sets set_dec(1) explicitly in case of
>> pending irq work.
> 
> Hrm, actually this is an interesting point. The problem isn't that
> *someone* will do a set_dec, nobody else should that matters.
> 
> The problem is that irq_work can be triggered typically by NMIs or
> similar, which means that it might be queued between the
> test_irq_work_pending() and the set_dec(), thus causing a race.
> 
> So basically Anton's original patch is fine :-) I had missed that
> we did a post-set_dec() test already in decrementer_next_event()
> so as far as I can tell, removing the pre-test, which is what Anton
> does, is really all we need.

Isn't this patch required too?

@@ -503,12 +503,13 @@ void __timer_interrupt(void)
                now = *next_tb - now;
                if (now <= DECREMENTER_MAX)
                        set_dec((int)now);
-               /* We may have raced with new irq work */
-               if (test_irq_work_pending())
-                       set_dec(1);
                __get_cpu_var(irq_stat).timer_irqs_others++;
        }

+       /* We may have raced with new irq work */
+       if (test_irq_work_pending())
+               set_dec(1);
+

The event_handler cannot be relied upon to call
decrementer_set_next_event() all the time. This is in the case where
there are no pending timers. In that case we need to have the check on
irq work pending at the end of __timer_interrupt() no?

Regards
Preeti U Murthy
> 
> Cheers,
> Ben.
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-11  8:43             ` Preeti U Murthy
@ 2014-05-11  9:03               ` Benjamin Herrenschmidt
  2014-05-11  9:07                 ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2014-05-11  9:03 UTC (permalink / raw)
  To: Preeti U Murthy; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
> 
> Isn't this patch required too?
> 
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>                 now = *next_tb - now;
>                 if (now <= DECREMENTER_MAX)
>                         set_dec((int)now);
> -               /* We may have raced with new irq work */
> -               if (test_irq_work_pending())
> -                       set_dec(1);
>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>         }
>
> +       /* We may have raced with new irq work */
> +       if (test_irq_work_pending())
> +               set_dec(1);
> +
> 
> The event_handler cannot be relied upon to call
> decrementer_set_next_event() all the time. This is in the case where
> there are no pending timers. In that case we need to have the check on
> irq work pending at the end of __timer_interrupt() no?

I don't think we need to move the test no. If there's a pending
irq_work, at that point, it will have done set_dec when being queued up.
So we only care about cases where we might change the decrementer.

If the event handler doesn't call decrementer_set_next_event() then
nothing will modify the decrementer and it will still trigger soon.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
  2014-05-11  9:03               ` Benjamin Herrenschmidt
@ 2014-05-11  9:07                 ` Preeti U Murthy
  0 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2014-05-11  9:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulmck, paulus, Anton Blanchard, linuxppc-dev

On 05/11/2014 02:33 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
>>
>> Isn't this patch required too?
>>
>> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>>                 now = *next_tb - now;
>>                 if (now <= DECREMENTER_MAX)
>>                         set_dec((int)now);
>> -               /* We may have raced with new irq work */
>> -               if (test_irq_work_pending())
>> -                       set_dec(1);
>>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>>         }
>>
>> +       /* We may have raced with new irq work */
>> +       if (test_irq_work_pending())
>> +               set_dec(1);
>> +
>>
>> The event_handler cannot be relied upon to call
>> decrementer_set_next_event() all the time. This is in the case where
>> there are no pending timers. In that case we need to have the check on
>> irq work pending at the end of __timer_interrupt() no?
> 
> I don't think we need to move the test no. If there's a pending
> irq_work, at that point, it will have done set_dec when being queued up.
> So we only care about cases where we might change the decrementer.
> 
> If the event handler doesn't call decrementer_set_next_event() then
> nothing will modify the decrementer and it will still trigger soon.

Hmm ok. Then Anton's patch covers all cases :)

Thanks!

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Regards
Preeti U Murthy
> 
> Cheers,
> Ben.
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-05-11  9:11 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-09  7:47 [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang Anton Blanchard
2014-05-09  9:52 ` Preeti U Murthy
2014-05-10  4:26   ` Benjamin Herrenschmidt
2014-05-10 15:36     ` Preeti U Murthy
2014-05-10 22:25       ` Benjamin Herrenschmidt
2014-05-11  8:15         ` Preeti U Murthy
2014-05-11  8:37           ` Benjamin Herrenschmidt
2014-05-11  8:43             ` Preeti U Murthy
2014-05-11  9:03               ` Benjamin Herrenschmidt
2014-05-11  9:07                 ` Preeti U Murthy
2014-05-09 13:41 ` Paul E. McKenney
2014-05-09 21:50   ` Gabriel Paubert
2014-05-09 22:08     ` Paul E. McKenney
2014-05-10  6:33       ` Paul Mackerras
2014-05-10 16:33         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).