* [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
@ 2015-05-06 17:27 pawandeep oza
  2015-05-07  3:22 ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: pawandeep oza @ 2015-05-06 17:27 UTC (permalink / raw)
  To: linux-kernel, malayasen rout, oza

Hi,

Linux version 3.10.17

Problem statement: timekeeping (do_timer) appears to have stopped, and
the core that is aborting (core0 in this case) is stuck in a loop that
relies on jiffies.


Root cause:

We run a tickless kernel, so a CPU that enters a deep idle state stops
its sched tick (tick_nohz_stop_sched_tick).

tick_sched_do_timer should then take over: whichever CPU is still
running transfers the jiffies-incrementing job to itself.


But when, say, core0 has raised BUG, ipi_cpu_stop makes the other CPUs
stop. clockevents_notify/tick_notify/hrtimer_cpu_notify appear to be
connected, eventually, through cpu_chain.

But that code belongs to hotplug: when cpu_down happens, it can
successfully call tick_handover_do_timer, which takes the duty over
from the dying CPU and assigns it to one that is online.

static void tick_handover_do_timer(int *cpup)
{
	if (*cpup == tick_do_timer_cpu) {
		int cpu = cpumask_first(cpu_online_mask);

		tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
			TICK_DO_TIMER_NONE;
	}
}


But since cpu_down is not called, this handover does not happen, and
tick_do_timer_cpu is left pointing at a dead CPU (1, 2 or 3).

core0 then waits forever wherever its code relies on jiffies
incrementing.
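
The failure mode described above can be sketched in a small userspace
model. This is an illustration only, not kernel code: struct tick_model
and the model_* helpers are hypothetical names, the CPU count of 4
matches the CPUs 1, 2 or 3 mentioned above, and the sketch deliberately
ignores the NO_HZ case in which an idle owner drops the duty to
TICK_DO_TIMER_NONE.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4
#define TICK_DO_TIMER_NONE (-1)

/* Simplified, hypothetical model of the state discussed above. */
struct tick_model {
	bool online[NR_CPUS];	/* stand-in for cpu_online_mask */
	bool running[NR_CPUS];	/* false once a CPU has been stopped */
	int tick_do_timer_cpu;	/* CPU that owns the jiffies update */
	unsigned long jiffies;
};

static void model_init(struct tick_model *m, int owner)
{
	assert(owner >= 0 && owner < NR_CPUS);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		m->online[cpu] = true;
		m->running[cpu] = true;
	}
	m->tick_do_timer_cpu = owner;
	m->jiffies = 0;
}

/* Mirrors the quoted handover: give the duty to the first online CPU. */
static void model_handover(struct tick_model *m, int dying_cpu)
{
	if (dying_cpu != m->tick_do_timer_cpu)
		return;
	m->tick_do_timer_cpu = TICK_DO_TIMER_NONE;
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (m->online[cpu]) {
			m->tick_do_timer_cpu = cpu;
			break;
		}
	}
}

/* Hotplug-style removal: the CPU leaves the mask, then the duty moves. */
static void model_cpu_down(struct tick_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->online[cpu] = false;
	m->running[cpu] = false;
	model_handover(m, cpu);
}

/* Stop-IPI-style removal: the CPU just stops; no handover takes place. */
static void model_cpu_stop(struct tick_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->online[cpu] = false;
	m->running[cpu] = false;
}

/* One tick on `cpu`: only the current owner advances jiffies. */
static void model_tick(struct tick_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (m->running[cpu] && m->tick_do_timer_cpu == cpu)
		m->jiffies++;
}
```

In this model, stopping CPUs 1 to 3 with model_cpu_stop() while CPU 1
owns the duty leaves jiffies frozen however often model_tick(&m, 0)
runs, whereas removing them with model_cpu_down() moves the duty to
CPU 0.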


What is the right way to approach this problem? At first glance it
looks like the kernel should hand the jiffies job over to another
online core independently of hotplug.

Regards,
Oza.


* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-06 17:27 [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17 pawandeep oza
@ 2015-05-07  3:22 ` Mike Galbraith
  2015-05-07  4:37   ` Oza (Pawandeep) Oza
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2015-05-07  3:22 UTC (permalink / raw)
  To: pawandeep oza; +Cc: linux-kernel, malayasen rout, oza

On Wed, 2015-05-06 at 22:57 +0530, pawandeep oza wrote:

> but when say core0 has raised BUG..
...

> what is the right way to approach this problem

Look at the spot BUG() printed?  BUG() means "Way to go slick, the code
you fed me (file:line) is toxic.  Have a nice day, your ex-buddy core0".

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  3:22 ` Mike Galbraith
@ 2015-05-07  4:37   ` Oza (Pawandeep) Oza
  2015-05-07  5:08     ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  4:37 UTC (permalink / raw)
  To: Mike Galbraith, pawandeep oza; +Cc: linux-kernel, malayasen rout

Hi Mike,

Let me explain the problem again.

Problem statement: timekeeping has stopped; do_timer is no longer a job of cpu0.

The reason: the variable "tick_do_timer_cpu" is not set to the correct CPU (cpu0).
When BUG() happens, tick_do_timer_cpu stays set to 1, 2 or 3 (we have 4 cores).
As a result, any code running on core0 that relies on jiffies incrementing doesn’t work, because nobody is left to increment jiffies.

There is tick_handover_do_timer, and if it is called then things are fine, but it is not called either, because it is tightly coupled with hotplug.
Since cpu_down is not called, the handover does not happen, and tick_do_timer_cpu is left pointing at a dead CPU (1, 2 or 3).
core0 then waits forever wherever its code relies on jiffies incrementing.

Regards,
-Oza


* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  4:37   ` Oza (Pawandeep) Oza
@ 2015-05-07  5:08     ` Mike Galbraith
  2015-05-07  5:12       ` Oza (Pawandeep) Oza
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2015-05-07  5:08 UTC (permalink / raw)
  To: Oza (Pawandeep) Oza; +Cc: pawandeep oza, linux-kernel, malayasen rout

On Thu, 2015-05-07 at 04:37 +0000, Oza (Pawandeep) Oza wrote:
> Problem Statement: the timkeeping is stopped, do_timer is no more a
> job of cpu0.
> 
> The reason: the variable "tick_do_timer_cpu" is not set to correct CPU
> (cpu0)
> And when BUG() happens, the tick_do_timer_cpu variable stay set to 1,
> 2 or 3 (we have 4 cores)

Solution Statement: Fix the UTTERLY DEADLY bug.

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  5:08     ` Mike Galbraith
@ 2015-05-07  5:12       ` Oza (Pawandeep) Oza
  2015-05-07  5:54         ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  5:12 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

> Solution Statement: Fix the UTTERLY DEADLY bug.

Oza:
That BUG() is LEGAL. The kernel is not the problem there.
Something outside the kernel/ARM (some other HW) hits the fault and signals the kernel that it is no longer alive.
So the kernel chooses to CRASH ITSELF.

So it is a legal, wanted crash.

But after the crash, jiffies do not increment.


Regards,
-Oza



* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  5:12       ` Oza (Pawandeep) Oza
@ 2015-05-07  5:54         ` Mike Galbraith
  2015-05-07  5:58           ` Oza (Pawandeep) Oza
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2015-05-07  5:54 UTC (permalink / raw)
  To: Oza (Pawandeep) Oza; +Cc: pawandeep oza, linux-kernel, malayasen rout

On Thu, 2015-05-07 at 05:12 +0000, Oza (Pawandeep) Oza wrote:

> But after Crash, jiffies do not increment.

Your kernel said "I'M DEAD", that's a good reason to believe it.

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  5:54         ` Mike Galbraith
@ 2015-05-07  5:58           ` Oza (Pawandeep) Oza
  2015-05-07  6:54             ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  5:58 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

Yes.
But a dying kernel doesn’t mean it CAN NOT INCREMENT jiffies.
do_timer should do its job until the kernel takes its last breath, or more precisely until CPU0 takes its last breath by halting itself as its last instruction.

Regards,
-Oza



* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  5:58           ` Oza (Pawandeep) Oza
@ 2015-05-07  6:54             ` Mike Galbraith
  2015-05-07  7:05               ` Oza (Pawandeep) Oza
  2015-05-07  8:19               ` Oza (Pawandeep) Oza
  0 siblings, 2 replies; 15+ messages in thread
From: Mike Galbraith @ 2015-05-07  6:54 UTC (permalink / raw)
  To: Oza (Pawandeep) Oza; +Cc: pawandeep oza, linux-kernel, malayasen rout

On Thu, 2015-05-07 at 05:58 +0000, Oza (Pawandeep) Oza wrote:
> Yes.
> But dying kernel doesn’t mean it CAN NOT INCREMENT jiffies.
> do_timer should do the job until kernel takes its last breathe and more precisely CPU0 take its last breathe by halting itself as its last instruction.

Feel free to add a redundant timer subsystem lest we BUG() in there, and
whatever else you need to guarantee a perfect orderly death for your
box.  I prefer live boxen, would make that BUG() go away.

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  6:54             ` Mike Galbraith
@ 2015-05-07  7:05               ` Oza (Pawandeep) Oza
  2015-05-07  8:29                 ` Mike Galbraith
  2015-05-07  8:19               ` Oza (Pawandeep) Oza
  1 sibling, 1 reply; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  7:05 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

: )

Well, I am not sure whether I communicated the problem clearly.
Let me try again.

If tick_do_timer_cpu = 0, things are fine.
If it has some other value, e.g. 1, 2 or 3, then core0 does not increment jiffies. (If, say, tick_do_timer_cpu = 1, then core1 increments jiffies.)

If smp_send_stop is sent to cpu1, 2 and 3, they are stopped as a result.

Now only cpu0 is alive, so cpu0 should increment jiffies on each timer tick.
For that, tick_do_timer_cpu should be set to 0.

That is not happening.

Regards,
-Oza



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  6:54             ` Mike Galbraith
  2015-05-07  7:05               ` Oza (Pawandeep) Oza
@ 2015-05-07  8:19               ` Oza (Pawandeep) Oza
  1 sibling, 0 replies; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  8:19 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

Mike,

Here is some code that shows what I mean to address.
It is just a WARN_ON for the case where all CPUs other than this one are offline and, at the same time, tick_do_timer_cpu is not set correctly.

Note: this patch is only meant to illustrate the problem (it is not an actual patch).

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9142591..3aa4c8c 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -112,6 +112,7 @@ static ktime_t tick_init_jiffy_update(void)
 static void tick_sched_do_timer(ktime_t now)
 {
        int cpu = smp_processor_id();
+       int other_cpu, is_cpu_online = 0;

 #ifdef CONFIG_NO_HZ_COMMON
        /*
@@ -125,6 +126,11 @@ static void tick_sched_do_timer(ktime_t now)
            && !tick_nohz_full_cpu(cpu))
                tick_do_timer_cpu = cpu;
 #endif
+       for (other_cpu = 0; other_cpu < nr_cpu_ids; other_cpu++) {
+               if (other_cpu != cpu)
+                       is_cpu_online += cpu_online(other_cpu);
+       }
+       WARN_ON((tick_do_timer_cpu != cpu) && !is_cpu_online);

        /* Check, if the jiffies need an update */
        if (tick_do_timer_cpu == cpu)
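
Stated outside the diff, the condition this WARN_ON is meant to flag can
be written as a small predicate. The sketch below is plain userspace C
under assumptions: tick_duty_orphaned() and the online array are
hypothetical stand-ins for the kernel's cpu_online() checks, and it is
not proposed as kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace restatement of the condition the WARN_ON is meant to catch.
 * `online` is a hypothetical stand-in for cpu_online_mask.
 */
static bool tick_duty_orphaned(const bool *online, size_t nr_cpus,
			       int tick_do_timer_cpu, int this_cpu)
{
	size_t others_online = 0;

	assert(online != NULL);
	assert(this_cpu >= 0 && (size_t)this_cpu < nr_cpus);

	/* Count every online CPU except the one running the check. */
	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu != (size_t)this_cpu && online[cpu])
			others_online++;
	}

	/* Nobody else is online, yet this CPU does not own the duty. */
	return others_online == 0 && tick_do_timer_cpu != this_cpu;
}
```

With only cpu0 online and tick_do_timer_cpu still set to 1, the
predicate is true; it is false once the duty belongs to cpu0 or while
any other CPU is still online.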

Regards,
-Oza



* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  7:05               ` Oza (Pawandeep) Oza
@ 2015-05-07  8:29                 ` Mike Galbraith
  2015-05-07  8:47                   ` Oza (Pawandeep) Oza
  2015-05-08  4:16                   ` Oza (Pawandeep) Oza
  0 siblings, 2 replies; 15+ messages in thread
From: Mike Galbraith @ 2015-05-07  8:29 UTC (permalink / raw)
  To: Oza (Pawandeep) Oza; +Cc: pawandeep oza, linux-kernel, malayasen rout

On Thu, 2015-05-07 at 07:05 +0000, Oza (Pawandeep) Oza wrote:
> : )
> 
> Well, I am not sure, if problem was communicated clearly from my side.

I understood.  I just don't understand why you'd care deeply whether
CPU0 halts or eternally waits.  Both render it harmless and useless.

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  8:29                 ` Mike Galbraith
@ 2015-05-07  8:47                   ` Oza (Pawandeep) Oza
  2015-05-08  4:16                   ` Oza (Pawandeep) Oza
  1 sibling, 0 replies; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-07  8:47 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

Oh, OK.
So the reason I care is this:

There is code in our base that relies on jiffies, but since jiffies are not incrementing, that code waits and loops forever.
Forward progress halts (on cpu0, since that is the only CPU still alive).

We have changed the code to use mdelay, and things move on.

But it means that, in the path I mentioned,
any code that relies on jiffies will get stuck forever and keep the rest of the code from executing, hence no forward progress,
especially if that code runs with preempt_disable().
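
The difference between the two kinds of wait can be shown with a
userspace model. All names here (struct wait_env, fake_delay_1ms,
wait_jiffies_deadline, wait_delay_steps) are hypothetical;
fake_delay_1ms() only stands in for a busy-wait such as mdelay(1), and
max_spins exists solely so that the model terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical test environment; none of this is a kernel API. */
struct wait_env {
	unsigned long jiffies;		/* frozen when nobody runs do_timer */
	bool jiffies_advance;		/* true: each delay step bumps jiffies */
	unsigned long delay_calls;	/* number of delay steps taken */
};

/* Stand-in for a fixed busy-wait such as mdelay(1). */
static void fake_delay_1ms(struct wait_env *env)
{
	env->delay_calls++;
	if (env->jiffies_advance)
		env->jiffies++;
}

/*
 * Wait until a jiffies deadline passes. Returns true if the loop exited
 * on its own, false if it was still spinning after max_spins steps (a
 * real loop of this shape would spin forever with frozen jiffies).
 */
static bool wait_jiffies_deadline(struct wait_env *env,
				  unsigned long timeout_jiffies,
				  unsigned long max_spins)
{
	assert(env != NULL);
	unsigned long deadline = env->jiffies + timeout_jiffies;

	for (unsigned long spin = 0; spin < max_spins; spin++) {
		/* Wrap-safe "deadline reached" comparison. */
		if ((long)(env->jiffies - deadline) >= 0)
			return true;
		fake_delay_1ms(env);
	}
	return false;
}

/* Wait a fixed number of delay steps; the bound never reads jiffies. */
static void wait_delay_steps(struct wait_env *env, unsigned long max_ms)
{
	assert(env != NULL);
	for (unsigned long ms = 0; ms < max_ms; ms++)
		fake_delay_1ms(env);
}
```

With jiffies_advance set to false, wait_jiffies_deadline() uses up all
of max_spins without ever reaching its deadline, while
wait_delay_steps() returns after exactly the requested number of steps.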

Regards,
-Oza



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-07  8:29                 ` Mike Galbraith
  2015-05-07  8:47                   ` Oza (Pawandeep) Oza
@ 2015-05-08  4:16                   ` Oza (Pawandeep) Oza
  2015-05-08  5:12                     ` Mike Galbraith
  1 sibling, 1 reply; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-08  4:16 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

So Mike, is this reason strong enough for you?

I understand your point (solve the BUG), and I do tend to agree with you.

But by design and implementation, BUG() is just the beginning of the end for a dying kernel.
And what happens between 'the beginning' and 'the end' is no less important.
(For example, on our platform we want a clean RAMDUMP to analyze what happened, and for that we need a clean reboot.)

Also,
if somebody's design is to legally crash the kernel (e.g. where the kernel is actually not at fault),
then I do expect the tick/timekeeping framework to do its job for as long as it can, and it should, because the kernel is not at fault.
But in this case it doesn’t hand over the jiffies-incrementing job sanely.

In other words,
"no one can rely on jiffies; or rather, code that is based on jiffies will never make forward progress in this path"

Regards,
-Oza



* Re: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-08  4:16                   ` Oza (Pawandeep) Oza
@ 2015-05-08  5:12                     ` Mike Galbraith
  2015-05-08  5:21                       ` Oza (Pawandeep) Oza
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2015-05-08  5:12 UTC (permalink / raw)
  To: Oza (Pawandeep) Oza; +Cc: pawandeep oza, linux-kernel, malayasen rout

On Fri, 2015-05-08 at 04:16 +0000, Oza (Pawandeep) Oza wrote:
> So Mike, is this reason strong enough for you ?

Nope.  I think you did the right thing in removing your dependency on
jiffies reliability in a dying box.  You don't have to convince me of
anything though, CC timer subsystem maintainer, see what he says.

> I understand your point: solve the BUG, and I do tend to agree with you.
> 
> But by design and implementation, the BUG() is just a beginning of the end for dying kernel.
> And what happens in between this 'the beginning' and 'the end' is not less important. 
> (because say,  on our platform we want to get clean RAMDUMP to analyze what happened, and for that we want to get clean reboot)

I don't see anybody else having any trouble getting crash dumps.  I
spent yet another long day just yesterday, rummaging through one.

> Also,
> If somebody's design is to legally Crash the kernel (e.g. where kernel is actually not faulty).
> Then, I do expect that tick/timekeeping framework do its job as long as it can do, and it should do, because kernel is not faulty.
> But in this case it doesn’t handover jiffies incrementing job sanely.

It seems odd to me to use BUG() for what you appear to be using it for..
not that I know exactly what that is mind you, but when you said when
some other gizmo in your box has a problem you crash the kernel, my head
tilted to the side - surely there's a more controlled response possible
than poking the big red self destruct button ;-)

	-Mike



* RE: [KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17
  2015-05-08  5:12                     ` Mike Galbraith
@ 2015-05-08  5:21                       ` Oza (Pawandeep) Oza
  0 siblings, 0 replies; 15+ messages in thread
From: Oza (Pawandeep) Oza @ 2015-05-08  5:21 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: pawandeep oza, linux-kernel, malayasen rout

> It seems odd to me to use BUG() for what you appear to be using it for..
> not that I know exactly what that is mind you, but when you said when
> some other gizmo in your box has a problem you crash the kernel, my head
> tilted to the side - surely there's a more controlled response possible
> than poking the big red self destruct button ;-)

Oza:
We have to keep the red button as our last resort. If we don’t press it, the moment passes and we miss the point where we can go back and debug.
So that is by design.

Regards,
-Oza


