All of lore.kernel.org
 help / color / mirror / Atom feed
* PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
@ 2017-10-07 21:26 David Kozub
  2017-10-09  8:39 ` Daniel Lezcano
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-07 21:26 UTC (permalink / raw)
  To: Daniel Lezcano, Thomas Gleixner; +Cc: linux-kernel

Hi all,

booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3 
(http://pcengines.ch/alix2c3.htm) dies with:

[    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0' registered
[    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
[    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
[    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
[    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock event, using IRQ 7
[    2.412687] BUG: unable to handle kernel NULL pointer dereference at   (null)
[    2.412698] IP:   (null)
[    2.412702] *pde = 00000000
[    2.412713] Oops: 0000 [#1]
[    2.412716] Modules linked in:
[    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted 4.14.0-rc3-humel-test17 #36
[    2.412739] task: c0010000 task.stack: c008e000
[    2.412744] EIP:   (null)
[    2.412749] EFLAGS: 00210093 CPU: 0
[    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
[    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
[    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
[    2.412793] Call Trace:
[    2.412800]  <IRQ>
[    2.412825]  ? mfgpt_tick+0x5d/0x81
[    2.412845]  __handle_irq_event_percpu+0x56/0xb6
[    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
[    2.412878]  handle_irq_event_percpu+0x17/0x3f
[    2.412892]  handle_irq_event+0x1d/0x29
[    2.412905]  handle_level_irq+0x57/0xc6
[    2.412924]  handle_irq+0x47/0x52
[    2.412929]  </IRQ>
[    2.412934]  <SOFTIRQ>
[    2.412949]  do_IRQ+0x32/0x9b
[    2.412963]  ? __irqentry_text_end+0x7/0x7
[    2.412976]  common_interrupt+0x2e/0x34
...

The problem seems to be in drivers/clocksource/cs5535-clockevt.c in 
mfgpt_tick() on line 132:
...
 	cs5535_clockevent.event_handler(&cs5535_clockevent);
...
cs5535_clockevent.event_handler is null.

Adding some more traces I see mfgpt_tick() gets called before 
clockevents_config_and_register() finishes (invoked from 
cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the 
event_handler, it's NULL. Wrapping the event_handler call on line 132 in a 
null pointer check results in a working system.

The issue is present at least also in kernel 4.13.5. In kernel versions <= 
4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never booted 
at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried 4.13.5 
recently), so I don't know when exactly this issue appeared.

My guess is that the timer interrupt is enabled too early. If I change the 
order of clockevents_config_and_register() and 
cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in 
cs5535_mfgpt_init() like this:
...
 	/* Set up the clock event */
 	printk(KERN_INFO DRV_NAME
 		": Registering MFGPT timer as a clock event, using IRQ %d\n",
 		timer_irq);
 	clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
 					0xF, 0xFFFE);

 	/* Set the clock scale and enable the event mode for CMP2 */
 	val = MFGPT_SCALE | (3 << 8);
 	cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
...
the system boots and seems to be OK.


Best regards,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-07 21:26 PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call David Kozub
@ 2017-10-09  8:39 ` Daniel Lezcano
  2017-10-09 19:33   ` David Kozub
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Lezcano @ 2017-10-09  8:39 UTC (permalink / raw)
  To: David Kozub, Thomas Gleixner; +Cc: linux-kernel

On 07/10/2017 23:26, David Kozub wrote:
> Hi all,
> 
> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
> (http://pcengines.ch/alix2c3.htm) dies with:
> 
> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
> registered
> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
> event, using IRQ 7
> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
> at   (null)
> [    2.412698] IP:   (null)
> [    2.412702] *pde = 00000000
> [    2.412713] Oops: 0000 [#1]
> [    2.412716] Modules linked in:
> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
> 4.14.0-rc3-humel-test17 #36
> [    2.412739] task: c0010000 task.stack: c008e000
> [    2.412744] EIP:   (null)
> [    2.412749] EFLAGS: 00210093 CPU: 0
> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
> [    2.412793] Call Trace:
> [    2.412800]  <IRQ>
> [    2.412825]  ? mfgpt_tick+0x5d/0x81
> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
> [    2.412892]  handle_irq_event+0x1d/0x29
> [    2.412905]  handle_level_irq+0x57/0xc6
> [    2.412924]  handle_irq+0x47/0x52
> [    2.412929]  </IRQ>
> [    2.412934]  <SOFTIRQ>
> [    2.412949]  do_IRQ+0x32/0x9b
> [    2.412963]  ? __irqentry_text_end+0x7/0x7
> [    2.412976]  common_interrupt+0x2e/0x34
> ...
> 
> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
> mfgpt_tick() on line 132:
> ...
>     cs5535_clockevent.event_handler(&cs5535_clockevent);
> ...
> cs5535_clockevent.event_handler is null.
> 
> Adding some more traces I see mfgpt_tick() gets called before
> clockevents_config_and_register() finishes (invoked from
> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
> event_handler, it's NULL. Wrapping the event_handler call on line 132 in
> a null pointer check results in a working system.
> 
> The issue is present at least also in kernel 4.13.5. In kernel versions
> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
> 4.13.5 recently), so I don't know when exactly this issue appeared.
> 
> My guess is that the timer interrupt is enabled too early. If I change
> the order of clockevents_config_and_register() and
> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
> cs5535_mfgpt_init() like this:
> ...
>     /* Set up the clock event */
>     printk(KERN_INFO DRV_NAME
>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>         timer_irq);
>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>                     0xF, 0xFFFE);
> 
>     /* Set the clock scale and enable the event mode for CMP2 */
>     val = MFGPT_SCALE | (3 << 8);
>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
> ...
> the system boots and seems to be OK.

What happens if instead of inverting those two lines, you add
mfgpt_shutdown() early in the init function ?


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-09  8:39 ` Daniel Lezcano
@ 2017-10-09 19:33   ` David Kozub
  2017-10-10 13:03     ` Daniel Lezcano
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-09 19:33 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6164 bytes --]

On Mon, 9 Oct 2017, Daniel Lezcano wrote:

> On 07/10/2017 23:26, David Kozub wrote:
>> Hi all,
>>
>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>> (http://pcengines.ch/alix2c3.htm) dies with:
>>
>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>> registered
>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>> event, using IRQ 7
>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>> at   (null)
>> [    2.412698] IP:   (null)
>> [    2.412702] *pde = 00000000
>> [    2.412713] Oops: 0000 [#1]
>> [    2.412716] Modules linked in:
>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>> 4.14.0-rc3-humel-test17 #36
>> [    2.412739] task: c0010000 task.stack: c008e000
>> [    2.412744] EIP:   (null)
>> [    2.412749] EFLAGS: 00210093 CPU: 0
>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>> [    2.412793] Call Trace:
>> [    2.412800]  <IRQ>
>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>> [    2.412892]  handle_irq_event+0x1d/0x29
>> [    2.412905]  handle_level_irq+0x57/0xc6
>> [    2.412924]  handle_irq+0x47/0x52
>> [    2.412929]  </IRQ>
>> [    2.412934]  <SOFTIRQ>
>> [    2.412949]  do_IRQ+0x32/0x9b
>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>> [    2.412976]  common_interrupt+0x2e/0x34
>> ...
>>
>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>> mfgpt_tick() on line 132:
>> ...
>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>> ...
>> cs5535_clockevent.event_handler is null.
>>
>> Adding some more traces I see mfgpt_tick() gets called before
>> clockevents_config_and_register() finishes (invoked from
>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>> event_handler, it's NULL. Wrapping the event_handler call on line 132 in
>> a null pointer check results in a working system.
>>
>> The issue is present at least also in kernel 4.13.5. In kernel versions
>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>
>> My guess is that the timer interrupt is enabled too early. If I change
>> the order of clockevents_config_and_register() and
>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>> cs5535_mfgpt_init() like this:
>> ...
>>     /* Set up the clock event */
>>     printk(KERN_INFO DRV_NAME
>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>         timer_irq);
>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>                     0xF, 0xFFFE);
>>
>>     /* Set the clock scale and enable the event mode for CMP2 */
>>     val = MFGPT_SCALE | (3 << 8);
>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>> ...
>> the system boots and seems to be OK.
>
> What happens if instead of inverting those two lines, you add
> mfgpt_shutdown() early in the init function ?

Hi Daniel,

I can't call mfgpt_shutdown() before clockevents_config_and_register() as 
mfgpt_shutdown() wants a struct clock_event_device* which is only 
available after clockevents_config_and_register(). But I tried calling 
disable_timer():

@@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
         }
         cs5535_event_clock = timer;

+       disable_timer(cs5535_event_clock);
+
         /* Set up the IRQ on the MFGPT side */
         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",

which should do the same thing. Unfortunately I still get the same BUG.

I added some more traces and I think the IRQ comes just after:
 	ret = setup_irq(timer_irq, &mfgptirq);

(And my reordering of cs5535_mfgpt_write and 
clockevents_config_and_register was just lucky, because the confing and 
register occured sooner than the IRQ.)

Attached is a patch that shows where I added the traces. It produces the 
following output (I also added a bunch of sleeps in between to be sure 
which step triggers the mfgpt_tick call):

[    3.229612] cs5535-clockevt: enabling verbose handler
[    3.234699] cs5535-clockevt: sleeping(1)...
[    4.320051] cs5535-clockevt: done(1)
[    4.323641] cs5535-clockevt: calling cs5535_mfgpt_alloc_timer
[    4.329402] cs5535-mfgpt cs5535-mfgpt: registered timer 1
[    4.334835] cs5535-clockevt: sleeping(2)...
[    5.360051] cs5535-clockevt: done(2)
[    5.363632] cs5535-clockevt: calling disable_timer
[    5.368425] cs5535-clockevt: sleeping(3)...
[    6.400048] cs5535-clockevt: done(3)
[    6.403631] cs5535-clockevt: calling cs5535_mfgpt_setup_irq
[    6.409214] cs5535-clockevt: sleeping(4)...
[    7.440044] cs5535-clockevt: done(4)
[    7.443623] cs5535-clockevt: calling setup_irq
[    7.448088] cs5535-clockevt: mfgpt_tick called
[    7.452579] cs5535-clockevt: sleeping(5)...
[    8.480049] cs5535-clockevt: done(5)
[    8.483634] cs5535-clockevt: doing write
[    8.487559] cs5535-clockevt: sleeping(6)...
[    9.520051] cs5535-clockevt: done(6)
[    9.523635] cs5535-clockevt: Registering MFGPT timer as a clock event, 
using IRQ 7
[    9.531263] cs5535-clockevt: clockevents_config_and_register returned
[    9.537706] cs5535-clockevt: disabling verbose handler

(I observed the boot 3 times and the sequence is always the same.)

Best regards,
David

[-- Attachment #2: Type: text/plain, Size: 3762 bytes --]

diff --git a/drivers/clocksource/cs5535-clockevt.c b/drivers/clocksource/cs5535-clockevt.c
index a1df588343f2..446bb595030b 100644
--- a/drivers/clocksource/cs5535-clockevt.c
+++ b/drivers/clocksource/cs5535-clockevt.c
@@ -18,9 +18,12 @@
 #include <linux/module.h>
 #include <linux/cs5535.h>
 #include <linux/clockchips.h>
+#include <linux/delay.h>
 
 #define DRV_NAME "cs5535-clockevt"
 
+static int verbose_handler = 0;
+
 static int timer_irq;
 module_param_hw_named(irq, timer_irq, int, irq, 0644);
 MODULE_PARM_DESC(irq, "Which IRQ to use for the clock source MFGPT ticks.");
@@ -129,7 +132,14 @@ static irqreturn_t mfgpt_tick(int irq, void *dev_id)
 		cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP,
 				MFGPT_SETUP_CNTEN | MFGPT_SETUP_CMP2);
 
-	cs5535_clockevent.event_handler(&cs5535_clockevent);
+	if (verbose_handler)
+		printk(KERN_INFO DRV_NAME ": mfgpt_tick called\n");
+
+	if (!verbose_handler && cs5535_clockevent.event_handler == NULL)
+		printk(KERN_ERR DRV_NAME ": mfgpt_tick with null event_handler while not verbose!\n");
+
+	if (cs5535_clockevent.event_handler != NULL)
+		cs5535_clockevent.event_handler(&cs5535_clockevent);
 	return IRQ_HANDLED;
 }
 
@@ -145,6 +155,14 @@ static int __init cs5535_mfgpt_init(void)
 	int ret;
 	uint16_t val;
 
+	printk(KERN_INFO DRV_NAME ": enabling verbose handler\n");
+	WRITE_ONCE(verbose_handler, 1);
+
+	printk(KERN_INFO DRV_NAME ": sleeping(1)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(1)\n");
+
+	printk(KERN_INFO DRV_NAME ": calling cs5535_mfgpt_alloc_timer\n");
 	timer = cs5535_mfgpt_alloc_timer(MFGPT_TIMER_ANY, MFGPT_DOMAIN_WORKING);
 	if (!timer) {
 		printk(KERN_ERR DRV_NAME ": Could not allocate MFGPT timer\n");
@@ -152,6 +170,18 @@ static int __init cs5535_mfgpt_init(void)
 	}
 	cs5535_event_clock = timer;
 
+	printk(KERN_INFO DRV_NAME ": sleeping(2)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(2)\n");
+
+	printk(KERN_INFO DRV_NAME ": calling disable_timer\n");
+	disable_timer(cs5535_event_clock);
+
+	printk(KERN_INFO DRV_NAME ": sleeping(3)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(3)\n");
+
+	printk(KERN_INFO DRV_NAME ": calling cs5535_mfgpt_setup_irq\n");
 	/* Set up the IRQ on the MFGPT side */
 	if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
 		printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
@@ -159,6 +189,11 @@ static int __init cs5535_mfgpt_init(void)
 		goto err_timer;
 	}
 
+	printk(KERN_INFO DRV_NAME ": sleeping(4)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(4)\n");
+
+	printk(KERN_INFO DRV_NAME ": calling setup_irq\n");
 	/* And register it with the kernel */
 	ret = setup_irq(timer_irq, &mfgptirq);
 	if (ret) {
@@ -166,17 +201,30 @@ static int __init cs5535_mfgpt_init(void)
 		goto err_irq;
 	}
 
+	printk(KERN_INFO DRV_NAME ": sleeping(5)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(5)\n");
+
 	/* Set the clock scale and enable the event mode for CMP2 */
 	val = MFGPT_SCALE | (3 << 8);
 
+	printk(KERN_INFO DRV_NAME ": doing write\n");
 	cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
 
+	printk(KERN_INFO DRV_NAME ": sleeping(6)...\n");
+	msleep(1000);
+	printk(KERN_INFO DRV_NAME ": done(6)\n");
+
 	/* Set up the clock event */
 	printk(KERN_INFO DRV_NAME
 		": Registering MFGPT timer as a clock event, using IRQ %d\n",
 		timer_irq);
 	clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
 					0xF, 0xFFFE);
+	printk(KERN_INFO DRV_NAME ": clockevents_config_and_register returned\n");
+
+	printk(KERN_INFO DRV_NAME ": disabling verbose handler\n");
+	WRITE_ONCE(verbose_handler, 0);
 
 	return 0;
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-09 19:33   ` David Kozub
@ 2017-10-10 13:03     ` Daniel Lezcano
  2017-10-10 21:19       ` David Kozub
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Lezcano @ 2017-10-10 13:03 UTC (permalink / raw)
  To: David Kozub; +Cc: Thomas Gleixner, linux-kernel

On 09/10/2017 21:33, David Kozub wrote:
> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
> 
>> On 07/10/2017 23:26, David Kozub wrote:
>>> Hi all,
>>>
>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>
>>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>> registered
>>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>> event, using IRQ 7
>>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>>> at   (null)
>>> [    2.412698] IP:   (null)
>>> [    2.412702] *pde = 00000000
>>> [    2.412713] Oops: 0000 [#1]
>>> [    2.412716] Modules linked in:
>>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>> 4.14.0-rc3-humel-test17 #36
>>> [    2.412739] task: c0010000 task.stack: c008e000
>>> [    2.412744] EIP:   (null)
>>> [    2.412749] EFLAGS: 00210093 CPU: 0
>>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>> [    2.412793] Call Trace:
>>> [    2.412800]  <IRQ>
>>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>>> [    2.412892]  handle_irq_event+0x1d/0x29
>>> [    2.412905]  handle_level_irq+0x57/0xc6
>>> [    2.412924]  handle_irq+0x47/0x52
>>> [    2.412929]  </IRQ>
>>> [    2.412934]  <SOFTIRQ>
>>> [    2.412949]  do_IRQ+0x32/0x9b
>>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>>> [    2.412976]  common_interrupt+0x2e/0x34
>>> ...
>>>
>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>> mfgpt_tick() on line 132:
>>> ...
>>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>>> ...
>>> cs5535_clockevent.event_handler is null.
>>>
>>> Adding some more traces I see mfgpt_tick() gets called before
>>> clockevents_config_and_register() finishes (invoked from
>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>> event_handler, it's NULL. Wrapping the event_handler call on line 132 in
>>> a null pointer check results in a working system.
>>>
>>> The issue is present at least also in kernel 4.13.5. In kernel versions
>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>
>>> My guess is that the timer interrupt is enabled too early. If I change
>>> the order of clockevents_config_and_register() and
>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>> cs5535_mfgpt_init() like this:
>>> ...
>>>     /* Set up the clock event */
>>>     printk(KERN_INFO DRV_NAME
>>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>         timer_irq);
>>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>                     0xF, 0xFFFE);
>>>
>>>     /* Set the clock scale and enable the event mode for CMP2 */
>>>     val = MFGPT_SCALE | (3 << 8);
>>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>> ...
>>> the system boots and seems to be OK.
>>
>> What happens if instead of inverting those two lines, you add
>> mfgpt_shutdown() early in the init function ?
> 
> Hi Daniel,
> 
> I can't call mfgpt_shutdown() before clockevents_config_and_register()
> as mfgpt_shutdown() wants a struct clock_event_device* which is only
> available after clockevents_config_and_register(). But I tried calling
> disable_timer():
> 
> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>         }
>         cs5535_event_clock = timer;
> 
> +       disable_timer(cs5535_event_clock);
> +
>         /* Set up the IRQ on the MFGPT side */
>         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
> 
> which should do the same thing. Unfortunately I still get the same BUG.
> 
> I added some more traces and I think the IRQ comes just after:
>     ret = setup_irq(timer_irq, &mfgptirq);

Mmh, the interrupt flag tells it is a shared interrupt, bad :/

The interrupt handler checks if the interrupt was for the timer.

So here, there are two options:

 - the timer is in an inconsistent state, with an interrupt line up and
enabled. In this case, it should be reseted correctly before going
further in the init function (disable, ack irq if any, and then request
irq and register).

 .. or ..

 - the check at the beginning of the interrupt handler:
        /* See if the interrupt was for us */
        if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
		MFGPT_SETUP_CMP1)))
                return IRQ_NONE;

   does not work and we try to handle an interrupt which is not for us.


That could be any of these two (or both) and I suspect the commit
8f9327cbb brings to light this problem as before the interrupt handler
had the check:
	if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
		return IRQ_HANDLED;

with:

static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;







-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-10 13:03     ` Daniel Lezcano
@ 2017-10-10 21:19       ` David Kozub
  2017-10-11  8:26         ` Daniel Lezcano
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-10 21:19 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7725 bytes --]

On Tue, 10 Oct 2017, Daniel Lezcano wrote:

> On 09/10/2017 21:33, David Kozub wrote:
>> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>>
>>> On 07/10/2017 23:26, David Kozub wrote:
>>>> Hi all,
>>>>
>>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>>
>>>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>>> registered
>>>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>>> event, using IRQ 7
>>>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>>>> at   (null)
>>>> [    2.412698] IP:   (null)
>>>> [    2.412702] *pde = 00000000
>>>> [    2.412713] Oops: 0000 [#1]
>>>> [    2.412716] Modules linked in:
>>>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>>> 4.14.0-rc3-humel-test17 #36
>>>> [    2.412739] task: c0010000 task.stack: c008e000
>>>> [    2.412744] EIP:   (null)
>>>> [    2.412749] EFLAGS: 00210093 CPU: 0
>>>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>>> [    2.412793] Call Trace:
>>>> [    2.412800]  <IRQ>
>>>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>>>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>>>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>>>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>>>> [    2.412892]  handle_irq_event+0x1d/0x29
>>>> [    2.412905]  handle_level_irq+0x57/0xc6
>>>> [    2.412924]  handle_irq+0x47/0x52
>>>> [    2.412929]  </IRQ>
>>>> [    2.412934]  <SOFTIRQ>
>>>> [    2.412949]  do_IRQ+0x32/0x9b
>>>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>>>> [    2.412976]  common_interrupt+0x2e/0x34
>>>> ...
>>>>
>>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>>> mfgpt_tick() on line 132:
>>>> ...
>>>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>>>> ...
>>>> cs5535_clockevent.event_handler is null.
>>>>
>>>> Adding some more traces I see mfgpt_tick() gets called before
>>>> clockevents_config_and_register() finishes (invoked from
>>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>>> event_handler, it's NULL. Wrapping the event_handler call on line 132 in
>>>> a null pointer check results in a working system.
>>>>
>>>> The issue is present at least also in kernel 4.13.5. In kernel versions
>>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>>
>>>> My guess is that the timer interrupt is enabled too early. If I change
>>>> the order of clockevents_config_and_register() and
>>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>>> cs5535_mfgpt_init() like this:
>>>> ...
>>>>     /* Set up the clock event */
>>>>     printk(KERN_INFO DRV_NAME
>>>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>>         timer_irq);
>>>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>>                     0xF, 0xFFFE);
>>>>
>>>>     /* Set the clock scale and enable the event mode for CMP2 */
>>>>     val = MFGPT_SCALE | (3 << 8);
>>>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>> ...
>>>> the system boots and seems to be OK.
>>>
>>> What happens if instead of inverting those two lines, you add
>>> mfgpt_shutdown() early in the init function ?
>>
>> Hi Daniel,
>>
>> I can't call mfgpt_shutdown() before clockevents_config_and_register()
>> as mfgpt_shutdown() wants a struct clock_event_device* which is only
>> available after clockevents_config_and_register(). But I tried calling
>> disable_timer():
>>
>> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>>         }
>>         cs5535_event_clock = timer;
>>
>> +       disable_timer(cs5535_event_clock);
>> +
>>         /* Set up the IRQ on the MFGPT side */
>>         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>
>> which should do the same thing. Unfortunately I still get the same BUG.
>>
>> I added some more traces and I think the IRQ comes just after:
>>     ret = setup_irq(timer_irq, &mfgptirq);
>
> Mmh, the interrupt flag tells it is a shared interrupt, bad :/

While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not really 
shared:

$ cat /proc/interrupts
            CPU0
   0:       9959    XT-PIC      timer
   2:          0    XT-PIC      cascade
   4:       1763    XT-PIC      ttyS0
   7:       4648    XT-PIC      cs5535-clockevt
   9:          0    XT-PIC      ath
  10:        151    XT-PIC      eth0
  11:          0    XT-PIC      eth1
  12:          1    XT-PIC      ehci_hcd:usb1, ohci_hcd:usb2
  14:       4063    XT-PIC      pata_cs5536
  15:          0    XT-PIC      eth2

(if this is sufficient proof)

> The interrupt handler checks if the interrupt was for the timer.
>
> So here, there are two options:
>
> - the timer is in an inconsistent state, with an interrupt line up and
> enabled. In this case, it should be reseted correctly before going
> further in the init function (disable, ack irq if any, and then request
> irq and register).
>
> .. or ..
>
> - the check at the beginning of the interrupt handler:
>        /* See if the interrupt was for us */
>        if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
> 		MFGPT_SETUP_CMP1)))
>                return IRQ_NONE;
>
>   does not work and we try to handle an interrupt which is not for us.

If this interrupt is not shared, this option can be ruled out, right?

> That could be any of these two (or both) and I suspect the commit
> 8f9327cbb brings to light this problem as before the interrupt handler
> had the check:
> 	if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
> 		return IRQ_HANDLED;
>
> with:
>
> static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;

I agree, in the previous version any early calls would be ignored.

I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup options": 
"M: toggles MFGPT workaround – may be required to support high speed 
timer. See AMD CS5536 data book section 5.16.3 for the gory details. The 
system may hang during boot if you get it wrong…"

I tried toggling the option and it didn't seem to have any effect on this 
issue. Furthermore I did a BIOS update and now the menu option is gone.

The CD5536 Data Book is here: 
https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3 
issue doesn't seem to be relevant here.

So I think the following must be happening:

> - the timer is in an inconsistent state, with an interrupt line up and
> enabled. In this case, it should be reseted correctly before going
> further in the init function (disable, ack irq if any, and then request
> irq and register).

Should I do something more to verify that it is this case?

Could you please give me more detailed info on how to reset the IRQ to a 
sane sate? Being a kernel noob that summary is not sufficient for me.

Best regards,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-10 21:19       ` David Kozub
@ 2017-10-11  8:26         ` Daniel Lezcano
  2017-10-11 20:48           ` David Kozub
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Lezcano @ 2017-10-11  8:26 UTC (permalink / raw)
  To: David Kozub; +Cc: Thomas Gleixner, linux-kernel

On 10/10/2017 23:19, David Kozub wrote:
> On Tue, 10 Oct 2017, Daniel Lezcano wrote:
> 
>> On 09/10/2017 21:33, David Kozub wrote:
>>> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>>>
>>>> On 07/10/2017 23:26, David Kozub wrote:
>>>>> Hi all,
>>>>>
>>>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>>>
>>>>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>>>> registered
>>>>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>>>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>>>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>>>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>>>> event, using IRQ 7
>>>>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>>>>> at   (null)
>>>>> [    2.412698] IP:   (null)
>>>>> [    2.412702] *pde = 00000000
>>>>> [    2.412713] Oops: 0000 [#1]
>>>>> [    2.412716] Modules linked in:
>>>>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>>>> 4.14.0-rc3-humel-test17 #36
>>>>> [    2.412739] task: c0010000 task.stack: c008e000
>>>>> [    2.412744] EIP:   (null)
>>>>> [    2.412749] EFLAGS: 00210093 CPU: 0
>>>>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>>>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>>>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>>>> [    2.412793] Call Trace:
>>>>> [    2.412800]  <IRQ>
>>>>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>>>>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>>>>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>>>>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>>>>> [    2.412892]  handle_irq_event+0x1d/0x29
>>>>> [    2.412905]  handle_level_irq+0x57/0xc6
>>>>> [    2.412924]  handle_irq+0x47/0x52
>>>>> [    2.412929]  </IRQ>
>>>>> [    2.412934]  <SOFTIRQ>
>>>>> [    2.412949]  do_IRQ+0x32/0x9b
>>>>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>>>>> [    2.412976]  common_interrupt+0x2e/0x34
>>>>> ...
>>>>>
>>>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>>>> mfgpt_tick() on line 132:
>>>>> ...
>>>>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>>>>> ...
>>>>> cs5535_clockevent.event_handler is null.
>>>>>
>>>>> Adding some more traces I see mfgpt_tick() gets called before
>>>>> clockevents_config_and_register() finishes (invoked from
>>>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>>>> event_handler, it's NULL. Wrapping the event_handler call on line
>>>>> 132 in
>>>>> a null pointer check results in a working system.
>>>>>
>>>>> The issue is present at least also in kernel 4.13.5. In kernel
>>>>> versions
>>>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>>>
>>>>> My guess is that the timer interrupt is enabled too early. If I change
>>>>> the order of clockevents_config_and_register() and
>>>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>>>> cs5535_mfgpt_init() like this:
>>>>> ...
>>>>>     /* Set up the clock event */
>>>>>     printk(KERN_INFO DRV_NAME
>>>>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>>>         timer_irq);
>>>>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>>>                     0xF, 0xFFFE);
>>>>>
>>>>>     /* Set the clock scale and enable the event mode for CMP2 */
>>>>>     val = MFGPT_SCALE | (3 << 8);
>>>>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>>> ...
>>>>> the system boots and seems to be OK.
>>>>
>>>> What happens if instead of inverting those two lines, you add
>>>> mfgpt_shutdown() early in the init function ?
>>>
>>> Hi Daniel,
>>>
>>> I can't call mfgpt_shutdown() before clockevents_config_and_register()
>>> as mfgpt_shutdown() wants a struct clock_event_device* which is only
>>> available after clockevents_config_and_register(). But I tried calling
>>> disable_timer():
>>>
>>> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>>>         }
>>>         cs5535_event_clock = timer;
>>>
>>> +       disable_timer(cs5535_event_clock);
>>> +
>>>         /* Set up the IRQ on the MFGPT side */
>>>         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>>                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>>
>>> which should do the same thing. Unfortunately I still get the same BUG.
>>>
>>> I added some more traces and I think the IRQ comes just after:
>>>     ret = setup_irq(timer_irq, &mfgptirq);
>>
>> Mmh, the interrupt flag tells it is a shared interrupt, bad :/
> 
> While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not
> really shared:
> 
> $ cat /proc/interrupts
>            CPU0
>   0:       9959    XT-PIC      timer
>   2:          0    XT-PIC      cascade
>   4:       1763    XT-PIC      ttyS0
>   7:       4648    XT-PIC      cs5535-clockevt
>   9:          0    XT-PIC      ath
>  10:        151    XT-PIC      eth0
>  11:          0    XT-PIC      eth1
>  12:          1    XT-PIC      ehci_hcd:usb1, ohci_hcd:usb2
>  14:       4063    XT-PIC      pata_cs5536
>  15:          0    XT-PIC      eth2
> 
> (if this is sufficient proof)

After digging a bit, the interrupt can be shared. There is a comment in
the drivers/misc/cs5535-mfgpt.c:cs5535_mfgpt_set_irq() function.

Moreover, the kernel can boot with a module parameter to change the irq
number. By config, the default is 7 but it could be another interrupt
shared with a device.

But your setup, if I refer the output of /proc/interrupts, does not show
the interrupt timer is shared.

>> The interrupt handler checks if the interrupt was for the timer.
>>
>> So here, there are two options:
>>
>> - the timer is in an inconsistent state, with an interrupt line up and
>> enabled. In this case, it should be reseted correctly before going
>> further in the init function (disable, ack irq if any, and then request
>> irq and register).
>>
>> .. or ..
>>
>> - the check at the beginning of the interrupt handler:
>>        /* See if the interrupt was for us */
>>        if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
>>         MFGPT_SETUP_CMP1)))
>>                return IRQ_NONE;
>>
>>   does not work and we try to handle an interrupt which is not for us.
> 
> If this interrupt is not shared, this option can be ruled out, right?

Yes.

>> That could be any of these two (or both) and I suspect the commit
>> 8f9327cbb brings to light this problem as before the interrupt handler
>> had the check:
>>     if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
>>         return IRQ_HANDLED;
>>
>> with:
>>
>> static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;
> 
> I agree, in the previous version any early calls would be ignored.
> 
> I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup
> options": "M: toggles MFGPT workaround – may be required to support high
> speed timer. See AMD CS5536 data book section 5.16.3 for the gory
> details. The system may hang during boot if you get it wrong…"
> 
> I tried toggling the option and it didn't seem to have any effect on
> this issue. Furthermore I did a BIOS update and now the menu option is
> gone.
> 
> The CD5536 Data Book is here:
> https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3
> issue doesn't seem to be relevant here.
> 
> So I think the following must be happening:
> 
>> - the timer is in an inconsistent state, with an interrupt line up and
>> enabled. In this case, it should be reseted correctly before going
>> further in the init function (disable, ack irq if any, and then request
>> irq and register).
> 
> Should I do something more to verify that it is this case?
> 
> Could you please give me more detailed info on how to reset the IRQ to a
> sane sate? Being a kernel noob that summary is not sufficient for me.

The code should be already there in the driver.

Usually, in order to prevent this kind issue, before calling
clockevent_config_and_register(), we disable the timer and ack
unconditionally any interrupt.

So at the first glance, I would add:

diff --git a/drivers/clocksource/cs5535-clockevt.c
b/drivers/clocksource/cs5535-clockevt.c
index a1df588..87e9581 100644
--- a/drivers/clocksource/cs5535-clockevt.c
+++ b/drivers/clocksource/cs5535-clockevt.c
@@ -152,6 +152,9 @@ static int __init cs5535_mfgpt_init(void)
        }
        cs5535_event_clock = timer;

+       disable_timer(timer);
+       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
+
        /* Set up the IRQ on the MFGPT side */
        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",





-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-11  8:26         ` Daniel Lezcano
@ 2017-10-11 20:48           ` David Kozub
  2017-10-12 12:47             ` Daniel Lezcano
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-11 20:48 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Thomas Gleixner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 10103 bytes --]

On Wed, 11 Oct 2017, Daniel Lezcano wrote:

> On 10/10/2017 23:19, David Kozub wrote:
>> On Tue, 10 Oct 2017, Daniel Lezcano wrote:
>>
>>> On 09/10/2017 21:33, David Kozub wrote:
>>>> On Mon, 9 Oct 2017, Daniel Lezcano wrote:
>>>>
>>>>> On 07/10/2017 23:26, David Kozub wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> booting up kernel 4.14-rc3 with CS5535_CLOCK_EVENT_SRC on an ALIX 2c3
>>>>>> (http://pcengines.ch/alix2c3.htm) dies with:
>>>>>>
>>>>>> [    2.313086] cs5535-smb cs5535-smb: SCx200 device 'CS5535 ACB0'
>>>>>> registered
>>>>>> [    2.338711] cs5535-mfgpt cs5535-mfgpt: registered timer 0
>>>>>> [    2.355676] ledtrig-cpu: registered to indicate activity on CPUs
>>>>>> [    2.373745] cs5535-mfgpt cs5535-mfgpt: registered timer 1
>>>>>> [    2.389976] cs5535-clockevt: Registering MFGPT timer as a clock
>>>>>> event, using IRQ 7
>>>>>> [    2.412687] BUG: unable to handle kernel NULL pointer dereference
>>>>>> at   (null)
>>>>>> [    2.412698] IP:   (null)
>>>>>> [    2.412702] *pde = 00000000
>>>>>> [    2.412713] Oops: 0000 [#1]
>>>>>> [    2.412716] Modules linked in:
>>>>>> [    2.412732] CPU: 0 PID: 1 Comm: swapper Not tainted
>>>>>> 4.14.0-rc3-humel-test17 #36
>>>>>> [    2.412739] task: c0010000 task.stack: c008e000
>>>>>> [    2.412744] EIP:   (null)
>>>>>> [    2.412749] EFLAGS: 00210093 CPU: 0
>>>>>> [    2.412758] EAX: c05fb8c0 EBX: c05fb880 ECX: 00000000 EDX: 0000620c
>>>>>> [    2.412769] ESI: c0009fd4 EDI: c014e14e EBP: c0009fac ESP: c0009fa8
>>>>>> [    2.412780]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
>>>>>> [    2.412790] CR0: 80050033 CR2: 00000000 CR3: 0064f000 CR4: 00000090
>>>>>> [    2.412793] Call Trace:
>>>>>> [    2.412800]  <IRQ>
>>>>>> [    2.412825]  ? mfgpt_tick+0x5d/0x81
>>>>>> [    2.412845]  __handle_irq_event_percpu+0x56/0xb6
>>>>>> [    2.412864]  ? handle_fasteoi_irq+0xf3/0xf3
>>>>>> [    2.412878]  handle_irq_event_percpu+0x17/0x3f
>>>>>> [    2.412892]  handle_irq_event+0x1d/0x29
>>>>>> [    2.412905]  handle_level_irq+0x57/0xc6
>>>>>> [    2.412924]  handle_irq+0x47/0x52
>>>>>> [    2.412929]  </IRQ>
>>>>>> [    2.412934]  <SOFTIRQ>
>>>>>> [    2.412949]  do_IRQ+0x32/0x9b
>>>>>> [    2.412963]  ? __irqentry_text_end+0x7/0x7
>>>>>> [    2.412976]  common_interrupt+0x2e/0x34
>>>>>> ...
>>>>>>
>>>>>> The problem seems to be in drivers/clocksource/cs5535-clockevt.c in
>>>>>> mfgpt_tick() on line 132:
>>>>>> ...
>>>>>>     cs5535_clockevent.event_handler(&cs5535_clockevent);
>>>>>> ...
>>>>>> cs5535_clockevent.event_handler is null.
>>>>>>
>>>>>> Adding some more traces I see mfgpt_tick() gets called before
>>>>>> clockevents_config_and_register() finishes (invoked from
>>>>>> cs5535_mfgpt_init() on line 178). So when mfgpt_tick() accessess the
>>>>>> event_handler, it's NULL. Wrapping the event_handler call on line
>>>>>> 132 in
>>>>>> a null pointer check results in a working system.
>>>>>>
>>>>>> The issue is present at least also in kernel 4.13.5. In kernel
>>>>>> versions
>>>>>> <= 4.1-rc6 the cs5535_clockevent worked OK. Kernels >= 4.1-rc7 never
>>>>>> booted at all on my ALIX 2c3 (last I tried 4.5, then gave up and tried
>>>>>> 4.13.5 recently), so I don't know when exactly this issue appeared.
>>>>>>
>>>>>> My guess is that the timer interrupt is enabled too early. If I change
>>>>>> the order of clockevents_config_and_register() and
>>>>>> cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val) in
>>>>>> cs5535_mfgpt_init() like this:
>>>>>> ...
>>>>>>     /* Set up the clock event */
>>>>>>     printk(KERN_INFO DRV_NAME
>>>>>>         ": Registering MFGPT timer as a clock event, using IRQ %d\n",
>>>>>>         timer_irq);
>>>>>>     clockevents_config_and_register(&cs5535_clockevent, MFGPT_HZ,
>>>>>>                     0xF, 0xFFFE);
>>>>>>
>>>>>>     /* Set the clock scale and enable the event mode for CMP2 */
>>>>>>     val = MFGPT_SCALE | (3 << 8);
>>>>>>     cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>>>> ...
>>>>>> the system boots and seems to be OK.
>>>>>
>>>>> What happens if instead of inverting those two lines, you add
>>>>> mfgpt_shutdown() early in the init function ?
>>>>
>>>> Hi Daniel,
>>>>
>>>> I can't call mfgpt_shutdown() before clockevents_config_and_register()
>>>> as mfgpt_shutdown() wants a struct clock_event_device* which is only
>>>> available after clockevents_config_and_register(). But I tried calling
>>>> disable_timer():
>>>>
>>>> @@ -152,6 +152,8 @@ static int __init cs5535_mfgpt_init(void)
>>>>         }
>>>>         cs5535_event_clock = timer;
>>>>
>>>> +       disable_timer(cs5535_event_clock);
>>>> +
>>>>         /* Set up the IRQ on the MFGPT side */
>>>>         if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>>>                 printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>>>
>>>> which should do the same thing. Unfortunately I still get the same BUG.
>>>>
>>>> I added some more traces and I think the IRQ comes just after:
>>>>     ret = setup_irq(timer_irq, &mfgptirq);
>>>
>>> Mmh, the interrupt flag tells it is a shared interrupt, bad :/
>>
>> While there is IRQF_SHARED, I think on the ALIX 2c3 the IRQ is not
>> really shared:
>>
>> $ cat /proc/interrupts
>>            CPU0
>>   0:       9959    XT-PIC      timer
>>   2:          0    XT-PIC      cascade
>>   4:       1763    XT-PIC      ttyS0
>>   7:       4648    XT-PIC      cs5535-clockevt
>>   9:          0    XT-PIC      ath
>>  10:        151    XT-PIC      eth0
>>  11:          0    XT-PIC      eth1
>>  12:          1    XT-PIC      ehci_hcd:usb1, ohci_hcd:usb2
>>  14:       4063    XT-PIC      pata_cs5536
>>  15:          0    XT-PIC      eth2
>>
>> (if this is sufficient proof)
>
> After digging a bit, the interrupt can be shared. There is a comment in
> the drivers/misc/cs5535-mfgpt.c:cs5535_mfgpt_set_irq() function.
>
> Moreover, the kernel can boot with a module parameter to change the irq
> number. By config, the default is 7 but it could be another interrupt
> shared with a device.
>
> But your setup, if I refer the output of /proc/interrupts, does not show
> the interrupt timer is shared.
>
>>> The interrupt handler checks if the interrupt was for the timer.
>>>
>>> So here, there are two options:
>>>
>>> - the timer is in an inconsistent state, with an interrupt line up and
>>> enabled. In this case, it should be reseted correctly before going
>>> further in the init function (disable, ack irq if any, and then request
>>> irq and register).
>>>
>>> .. or ..
>>>
>>> - the check at the beginning of the interrupt handler:
>>>        /* See if the interrupt was for us */
>>>        if (!(val & (MFGPT_SETUP_SETUP | MFGPT_SETUP_CMP2 |
>>>         MFGPT_SETUP_CMP1)))
>>>                return IRQ_NONE;
>>>
>>>   does not work and we try to handle an interrupt which is not for us.
>>
>> If this interrupt is not shared, this option can be ruled out, right?
>
> Yes.
>
>>> That could be any of these two (or both) and I suspect the commit
>>> 8f9327cbb brings to light this problem as before the interrupt handler
>>> had the check:
>>>     if (cs5535_tick_mode == CLOCK_EVT_MODE_SHUTDOWN)
>>>         return IRQ_HANDLED;
>>>
>>> with:
>>>
>>> static unsigned int cs5535_tick_mode = CLOCK_EVT_MODE_SHUTDOWN;
>>
>> I agree, in the previous version any early calls would be ignored.
>>
>> I see in http://pcengines.ch/pdf/alix2.pdf on page 10 in "Setup
>> options": "M: toggles MFGPT workaround – may be required to support high
>> speed timer. See AMD CS5536 data book section 5.16.3 for the gory
>> details. The system may hang during boot if you get it wrong…"
>>
>> I tried toggling the option and it didn't seem to have any effect on
>> this issue. Furthermore I did a BIOS update and now the menu option is
>> gone.
>>
>> The CD5536 Data Book is here:
>> https://support.amd.com/TechDocs/33238G_cs5536_db.pdf and that 5.16.3
>> issue doesn't seem to be relevant here.
>>
>> So I think the following must be happening:
>>
>>> - the timer is in an inconsistent state, with an interrupt line up and
>>> enabled. In this case, it should be reseted correctly before going
>>> further in the init function (disable, ack irq if any, and then request
>>> irq and register).
>>
>> Should I do something more to verify that it is this case?
>>
>> Could you please give me more detailed info on how to reset the IRQ to a
>> sane sate? Being a kernel noob that summary is not sufficient for me.
>
> The code should be already there in the driver.
>
> Usually, in order to prevent this kind issue, before calling
> clockevent_config_and_register(), we disable the timer and ack
> unconditionally any interrupt.
>
> So at the first glance, I would add:
>
> diff --git a/drivers/clocksource/cs5535-clockevt.c
> b/drivers/clocksource/cs5535-clockevt.c
> index a1df588..87e9581 100644
> --- a/drivers/clocksource/cs5535-clockevt.c
> +++ b/drivers/clocksource/cs5535-clockevt.c
> @@ -152,6 +152,9 @@ static int __init cs5535_mfgpt_init(void)
>        }
>        cs5535_event_clock = timer;
>
> +       disable_timer(timer);
> +       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
> +
>        /* Set up the IRQ on the MFGPT side */
>        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",

I tried that and the handler is still called. So I did some more random 
experiments and I found out that if I call disable_timer(timer) twice, 
then the issue is resolved (the handler is not called before the 
registration is finished.) And I don't have to set MFGPT_REG_COUNTER to 0.

I have no idea why do I have to call disable_timer twice.

Best regards,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-11 20:48           ` David Kozub
@ 2017-10-12 12:47             ` Daniel Lezcano
  2017-10-12 15:29               ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Lezcano @ 2017-10-12 12:47 UTC (permalink / raw)
  To: David Kozub; +Cc: Thomas Gleixner, linux-kernel

On 11/10/2017 22:48, David Kozub wrote:

[ ... ]

>>
>> +       disable_timer(timer);
>> +       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
>> +
>>        /* Set up the IRQ on the MFGPT side */
>>        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
> 
> I tried that and the handler is still called. So I did some more random
> experiments and I found out that if I call disable_timer(timer) twice,
> then the issue is resolved (the handler is not called before the
> registration is finished.) And I don't have to set MFGPT_REG_COUNTER to 0.

Aha! we are close to a fix.

> I have no idea why do I have to call disable_timer twice.

For testing purpose, can you try by adding mmiowb() and/or wmb() after
disable_timer()?




-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-12 12:47             ` Daniel Lezcano
@ 2017-10-12 15:29               ` Thomas Gleixner
  2017-10-12 20:53                 ` David Kozub
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2017-10-12 15:29 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: David Kozub, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1441 bytes --]

On Thu, 12 Oct 2017, Daniel Lezcano wrote:

> On 11/10/2017 22:48, David Kozub wrote:
> 
> [ ... ]
> 
> >>
> >> +       disable_timer(timer);
> >> +       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
> >> +
> >>        /* Set up the IRQ on the MFGPT side */
> >>        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
> >>                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
> > 
> > I tried that and the handler is still called. So I did some more random
> > experiments and I found out that if I call disable_timer(timer) twice,
> > then the issue is resolved (the handler is not called before the
> > registration is finished.) And I don't have to set MFGPT_REG_COUNTER to 0.
> 
> Aha! we are close to a fix.
> 
> > I have no idea why do I have to call disable_timer twice.
> 
> For testing purpose, can you try by adding mmiowb() and/or wmb() after
> disable_timer()?

The real question is why

        /* Set the clock scale and enable the event mode for CMP2 */
        val = MFGPT_SCALE | (3 << 8);

        cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);

is in the setup code at all.

The obvious place for this is in mfgpt_set_periodic() which gets called
when the clock event and the handler is set up in the core code. Up to that
point the interrupt handler is protected against shared interrupts via the
is_shutdown() check.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-12 15:29               ` Thomas Gleixner
@ 2017-10-12 20:53                 ` David Kozub
  2017-10-16  9:30                   ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-12 20:53 UTC (permalink / raw)
  To: Thomas Gleixner, Daniel Lezcano; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1864 bytes --]

On Thu, 12 Oct 2017, Thomas Gleixner wrote:

> On Thu, 12 Oct 2017, Daniel Lezcano wrote:
>
>> On 11/10/2017 22:48, David Kozub wrote:
>>
>> [ ... ]
>>
>>>>
>>>> +       disable_timer(timer);
>>>> +       cs5535_mfgpt_write(timer, MFGPT_REG_COUNTER, 0);
>>>> +
>>>>        /* Set up the IRQ on the MFGPT side */
>>>>        if (cs5535_mfgpt_setup_irq(timer, MFGPT_CMP2, &timer_irq)) {
>>>>                printk(KERN_ERR DRV_NAME ": Could not set up IRQ %d\n",
>>>
>>> I tried that and the handler is still called. So I did some more random
>>> experiments and I found out that if I call disable_timer(timer) twice,
>>> then the issue is resolved (the handler is not called before the
>>> registration is finished.) And I don't have to set MFGPT_REG_COUNTER to 0.
>>
>> Aha! we are close to a fix.
>>
>>> I have no idea why do I have to call disable_timer twice.
>>
>> For testing purpose, can you try by adding mmiowb() and/or wmb() after
>> disable_timer()?

I tried one, the other, both. Still, the unexpected handler call happens. 
I (again) added a msleep(1000) after the disable (and the barriers). If 
the trouble was a missing barrier, the sleep would most likely also make 
the isaue go away (even if in a hacky way).

> The real question is why
>
>        /* Set the clock scale and enable the event mode for CMP2 */
>        val = MFGPT_SCALE | (3 << 8);
>
>        cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>
> is in the setup code at all.
>
> The obvious place for this is in mfgpt_set_periodic() which gets called
> when the clock event and the handler is set up in the core code. Up to that
> point the interrupt handler is protected against shared interrupts via the
> is_shutdown() check.

It is, but only after clockevents_config_and_register. But mfgpt_tick is 
called before that (just after setup_irq).

Best regards,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-12 20:53                 ` David Kozub
@ 2017-10-16  9:30                   ` Thomas Gleixner
  2017-10-16 20:41                     ` David Kozub
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2017-10-16  9:30 UTC (permalink / raw)
  To: David Kozub; +Cc: Daniel Lezcano, linux-kernel

On Thu, 12 Oct 2017, David Kozub wrote:
> On Thu, 12 Oct 2017, Thomas Gleixner wrote:
> > On Thu, 12 Oct 2017, Daniel Lezcano wrote:
> > The real question is why
> > 
> >        /* Set the clock scale and enable the event mode for CMP2 */
> >        val = MFGPT_SCALE | (3 << 8);
> > 
> >        cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
> > 
> > is in the setup code at all.
> > 
> > The obvious place for this is in mfgpt_set_periodic() which gets called
> > when the clock event and the handler is set up in the core code. Up to that
> > point the interrupt handler is protected against shared interrupts via the
> > is_shutdown() check.
> 
> It is, but only after clockevents_config_and_register. But mfgpt_tick is
> called before that (just after setup_irq).

Right, but the protection against the spurious interrupt there is not
sufficient. See patch below.

Thanks,

	tglx

8<-------------------

--- a/drivers/clocksource/cs5535-clockevt.c
+++ b/drivers/clocksource/cs5535-clockevt.c
@@ -117,7 +117,8 @@ static irqreturn_t mfgpt_tick(int irq, void *dev_id)
 	/* Turn off the clock (and clear the event) */
 	disable_timer(cs5535_event_clock);
 
-	if (clockevent_state_shutdown(&cs5535_clockevent))
+	if (clockevent_state_detached(&cs5535_clockevent) ||
+	    clockevent_state_shutdown(&cs5535_clockevent))
 		return IRQ_HANDLED;
 
 	/* Clear the counter */

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-16  9:30                   ` Thomas Gleixner
@ 2017-10-16 20:41                     ` David Kozub
  2017-10-18 18:23                       ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: David Kozub @ 2017-10-16 20:41 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Daniel Lezcano, linux-kernel

On Mon, 16 Oct 2017, Thomas Gleixner wrote:

> On Thu, 12 Oct 2017, David Kozub wrote:
>> On Thu, 12 Oct 2017, Thomas Gleixner wrote:
>>> On Thu, 12 Oct 2017, Daniel Lezcano wrote:
>>> The real question is why
>>>
>>>        /* Set the clock scale and enable the event mode for CMP2 */
>>>        val = MFGPT_SCALE | (3 << 8);
>>>
>>>        cs5535_mfgpt_write(cs5535_event_clock, MFGPT_REG_SETUP, val);
>>>
>>> is in the setup code at all.
>>>
>>> The obvious place for this is in mfgpt_set_periodic() which gets called
>>> when the clock event and the handler is set up in the core code. Up to that
>>> point the interrupt handler is protected against shared interrupts via the
>>> is_shutdown() check.
>>
>> It is, but only after clockevents_config_and_register. But mfgpt_tick is
>> called before that (just after setup_irq).
>
> Right, but the protection against the spurious interrupt there is not
> sufficient. See patch below.
>
> Thanks,
>
> 	tglx
>
> 8<-------------------
>
> --- a/drivers/clocksource/cs5535-clockevt.c
> +++ b/drivers/clocksource/cs5535-clockevt.c
> @@ -117,7 +117,8 @@ static irqreturn_t mfgpt_tick(int irq, void *dev_id)
> 	/* Turn off the clock (and clear the event) */
> 	disable_timer(cs5535_event_clock);
>
> -	if (clockevent_state_shutdown(&cs5535_clockevent))
> +	if (clockevent_state_detached(&cs5535_clockevent) ||
> +	    clockevent_state_shutdown(&cs5535_clockevent))
> 		return IRQ_HANDLED;
>
> 	/* Clear the counter */

Hi Thomas,

Thanks for the patch. I can confirm that this resolves the issue.

I still think the fact that the handler is triggered at all is a bit 
strange but maybe ensuring the handler doesn't do anything bad is 
sufficient to declare this issue fixed. After all I'm probbly the last 
person in the known universe using this driver. :)

Would it not be nicer to also explicitly set cs5535_clockevent's 
state_use_accessors to CLOCK_EVT_STATE_DETACHED (in the initialization of 
cs5535_clockevent)? Maybe that field is not to be touched directly from 
outside, on the other hand in the current code the check in the handler 
relies on CLOCK_EVT_STATE_DETACHED being 0 (and so cs5535_clockevent's 
state_use_accessors being CLOCK_EVT_STATE_DETACHED by default).

I was also wondering how is it ensured that the handler can never see an 
inconsistent state of state_use_accessors but after going through the 
code I think the key is clockevents_register_device's 
raw_spin_lock_irqsave(&clockevents_lock, flags); (even though the set to 
DETACHED happens outside, but we're already starting in DETACHED state). I 
also wonder if this would still be safe if there were multiple CPUs - but 
this is not happening on the ALIX as it has just one CPU.

Best regards,
David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call
  2017-10-16 20:41                     ` David Kozub
@ 2017-10-18 18:23                       ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2017-10-18 18:23 UTC (permalink / raw)
  To: David Kozub; +Cc: Daniel Lezcano, linux-kernel

On Mon, 16 Oct 2017, David Kozub wrote:
> On Mon, 16 Oct 2017, Thomas Gleixner wrote:
> > --- a/drivers/clocksource/cs5535-clockevt.c
> > +++ b/drivers/clocksource/cs5535-clockevt.c
> > @@ -117,7 +117,8 @@ static irqreturn_t mfgpt_tick(int irq, void *dev_id)
> > 	/* Turn off the clock (and clear the event) */
> > 	disable_timer(cs5535_event_clock);
> > 
> > -	if (clockevent_state_shutdown(&cs5535_clockevent))
> > +	if (clockevent_state_detached(&cs5535_clockevent) ||
> > +	    clockevent_state_shutdown(&cs5535_clockevent))
> > 		return IRQ_HANDLED;
> > 
> > 	/* Clear the counter */
> 
> Hi Thomas,
> 
> Thanks for the patch. I can confirm that this resolves the issue.
> 
> I still think the fact that the handler is triggered at all is a bit strange
> but maybe ensuring the handler doesn't do anything bad is sufficient to
> declare this issue fixed. After all I'm probbly the last person in the known
> universe using this driver. :)

Well, any handler should be able to handle spurious interrupts sanely,
especially on legacy x86 interrupts.

> Would it not be nicer to also explicitly set cs5535_clockevent's
> state_use_accessors to CLOCK_EVT_STATE_DETACHED (in the initialization of
> cs5535_clockevent)?

Well, it's explicitely the first state to ensure that zero initialization
sets it to CLOCK_EVT_STATE_DETACHED.

> Maybe that field is not to be touched directly from outside, on the other
> hand in the current code the check in the handler relies on
> CLOCK_EVT_STATE_DETACHED being 0 (and so cs5535_clockevent's
> state_use_accessors being CLOCK_EVT_STATE_DETACHED by default).

There is lots of other code which would break if we violate that
assumption. But, yes you are right as well. The data structure is visible
before clockevents_register_device() is invoked because its statically
allocated, so we might as well explicitely set the state in the static
initializer. Though I don't think it's worth it.

> I was also wondering how is it ensured that the handler can never see an
> inconsistent state of state_use_accessors but after going through the code I
> think the key is clockevents_register_device's
> raw_spin_lock_irqsave(&clockevents_lock, flags); (even though the set to
> DETACHED happens outside, but we're already starting in DETACHED state). I
> also wonder if this would still be safe if there were multiple CPUs - but this
> is not happening on the ALIX as it has just one CPU.

In a case like this the core code cannot do much about serialization
vs. interrupts escpecially when the interrupt is enabled prior to clock
event registration. Especially not when the interrupt is shared. That's a
hen egg problem outside the scope of the core code.

There are lots of convoluted ways to prevent the interrupt from happening
early, but the best way to handle this particular issue is to make it
robust vs. both spurious and shared interrupt invocations.

Care to submit a patch with a proper changelog? Take author ship as the
complex part of that is writing a good changelog. Feel free to add

Suggested-by: Thomas .....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-10-18 18:23 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-07 21:26 PROBLEM: Kernel BUG in mfgpt_tick (cs5535-clockevt.c) on ALIX 2c3 - null call David Kozub
2017-10-09  8:39 ` Daniel Lezcano
2017-10-09 19:33   ` David Kozub
2017-10-10 13:03     ` Daniel Lezcano
2017-10-10 21:19       ` David Kozub
2017-10-11  8:26         ` Daniel Lezcano
2017-10-11 20:48           ` David Kozub
2017-10-12 12:47             ` Daniel Lezcano
2017-10-12 15:29               ` Thomas Gleixner
2017-10-12 20:53                 ` David Kozub
2017-10-16  9:30                   ` Thomas Gleixner
2017-10-16 20:41                     ` David Kozub
2017-10-18 18:23                       ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.