linux-acpi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
@ 2022-10-10 14:16 Mel Gorman
  2022-10-10 14:47 ` Rafael J. Wysocki
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2022-10-10 14:16 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-rtc, linux-acpi, linux-kernel

Hi Rafael,

I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
early in boot. It fails to boot 5 times after 10 boot attempts and I've
only observed it on one machine so far. Either a revert or the patch below
fixes it but it's unlikely it is the correct fix.

--- drivers/rtc/rtc-cmos.c.orig	2022-10-10 15:11:50.335756567 +0200
+++ drivers/rtc/rtc-cmos.c	2022-10-10 15:11:53.211756691 +0200
@@ -1209,7 +1209,7 @@
 	 * Or else, ACPI SCI is enabled during suspend/resume only,
 	 * update rtc irq in that case.
 	 */
-	if (cmos_use_acpi_alarm())
+	if (cmos_use_acpi_alarm() && cmos)
 		cmos_interrupt(0, (void *)cmos->rtc);
 	else {
 		/* Fix me: can we use cmos_interrupt() here as well? */

Boot failure looks like the below, it's not a vanilla kernel but the
applied patch is not relevant and it's known to fail on a vanilla kernel.
The machine has a E5-2698 v4 CPU plugged into a SGI C2112-4GP3 platform
with a X10DRT-P-Series motherboard.

[   10.924167][    C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   10.928016][    C1] #PF: supervisor read access in kernel mode
[   10.928016][    C1] #PF: error_code(0x0000) - not-present page
[   10.928016][    C1] PGD 0 P4D 0 
[   10.928016][    C1] Oops: 0000 [#1] PREEMPT SMP PTI
[   10.928016][    C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
[   10.928016][    C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
[   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.928016][    C1] Call Trace:
[   10.928016][    C1]  <IRQ>
[   10.928016][    C1]  acpi_ev_fixed_event_detect+0x14a/0x18c
[   10.928016][    C1]  acpi_ev_sci_xrupt_handler+0x2c/0x6e
[   10.928016][    C1]  acpi_irq+0x18/0x40
[   10.928016][    C1]  __handle_irq_event_percpu+0x3e/0x2d0
[   10.928016][    C1]  handle_irq_event_percpu+0xf/0x40
[   10.928016][    C1]  handle_irq_event+0x34/0x60
[   10.928016][    C1]  handle_fasteoi_irq+0x7b/0x140
[   10.928016][    C1]  __common_interrupt+0x4b/0x100
[   10.928016][    C1]  common_interrupt+0x58/0xa0
[   10.928016][    C1]  </IRQ>
[   10.928016][    C1]  <TASK>
[   10.928016][    C1]  asm_common_interrupt+0x22/0x40
[   10.928016][    C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
[   10.928016][    C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
[   10.928016][    C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
[   10.928016][    C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
[   10.928016][    C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
[   10.928016][    C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
[   10.928016][    C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
[   10.928016][    C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
[   10.928016][    C1]  ? rdinit_setup+0x2f/0x2f
[   10.928016][    C1]  ? cmos_do_probe+0x570/0x570
[   10.928016][    C1]  ? cmos_wake_setup.part.9+0x2a/0x120
[   10.928016][    C1]  cmos_pnp_probe+0x6c/0xa0
[   10.928016][    C1]  pnp_device_probe+0x5b/0xb0
[   10.928016][    C1]  ? driver_sysfs_add+0x75/0xe0
[   10.928016][    C1]  really_probe+0x109/0x3e0
[   10.928016][    C1]  ? pm_runtime_barrier+0x4f/0xa0
[   10.928016][    C1]  __driver_probe_device+0x79/0x170
[   10.928016][    C1]  driver_probe_device+0x1f/0xa0
[   10.928016][    C1]  __driver_attach+0x11e/0x180
[   10.928016][    C1]  ? __device_attach_driver+0x110/0x110
[   10.928016][    C1]  bus_for_each_dev+0x79/0xc0
[   10.928016][    C1]  bus_add_driver+0x1ba/0x250
[   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
[   10.928016][    C1]  driver_register+0x5f/0x100
[   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
[   10.928016][    C1]  cmos_init+0x12/0x70
[   10.928016][    C1]  do_one_initcall+0x5b/0x310
[   10.928016][    C1]  ? rcu_read_lock_held_common+0xe/0x50
[   10.928016][    C1]  ? rcu_read_lock_sched_held+0x23/0x80
[   10.928016][    C1]  kernel_init_freeable+0x2b7/0x319
[   10.928016][    C1]  ? rest_init+0x1b0/0x1b0
[   10.928016][    C1]  kernel_init+0x16/0x140
[   10.928016][    C1]  ret_from_fork+0x22/0x30
[   10.928016][    C1]  </TASK>
[   10.928016][    C1] Modules linked in:
[   10.928016][    C1] CR2: 0000000000000000
[   10.928016][    C1] ---[ end trace 0000000000000000 ]---
[   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
[   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.928016][    C1] Kernel panic - not syncing: Fatal exception in interrupt
[   10.928016][    C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   10.928016][    C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-10-10 14:16 Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1) Mel Gorman
@ 2022-10-10 14:47 ` Rafael J. Wysocki
  2022-10-10 17:45   ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Rafael J. Wysocki @ 2022-10-10 14:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Rafael J. Wysocki, linux-rtc, linux-acpi, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7986 bytes --]

Hi Mel,

Thanks for the report!

On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Hi Rafael,
>
> I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
> early in boot. It fails to boot 5 times after 10 boot attempts and I've
> only observed it on one machine so far. Either a revert or the patch below
> fixes it but it's unlikely it is the correct fix.
>
> --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> +++ drivers/rtc/rtc-cmos.c      2022-10-10 15:11:53.211756691 +0200
> @@ -1209,7 +1209,7 @@
>          * Or else, ACPI SCI is enabled during suspend/resume only,
>          * update rtc irq in that case.
>          */
> -       if (cmos_use_acpi_alarm())
> +       if (cmos_use_acpi_alarm() && cmos)
>                 cmos_interrupt(0, (void *)cmos->rtc);
>         else {
>                 /* Fix me: can we use cmos_interrupt() here as well? */

It looks like I've exposed a race condition there.

Generally speaking, it is misguided to install an event handler that
is not ready to handle the event at that time before making sure that
the event is disabled.

Does the attached patch help?

>
> Boot failure looks like the below, it's not a vanilla kernel but the
> applied patch is not relevant and it's known to fail on a vanilla kernel.
> The machine has a E5-2698 v4 CPU plugged into a SGI C2112-4GP3 platform
> with a X10DRT-P-Series motherboard.
>
> [   10.924167][    C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [   10.928016][    C1] #PF: supervisor read access in kernel mode
> [   10.928016][    C1] #PF: error_code(0x0000) - not-present page
> [   10.928016][    C1] PGD 0 P4D 0
> [   10.928016][    C1] Oops: 0000 [#1] PREEMPT SMP PTI
> [   10.928016][    C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
> [   10.928016][    C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
> [   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   10.928016][    C1] Call Trace:
> [   10.928016][    C1]  <IRQ>
> [   10.928016][    C1]  acpi_ev_fixed_event_detect+0x14a/0x18c
> [   10.928016][    C1]  acpi_ev_sci_xrupt_handler+0x2c/0x6e
> [   10.928016][    C1]  acpi_irq+0x18/0x40
> [   10.928016][    C1]  __handle_irq_event_percpu+0x3e/0x2d0
> [   10.928016][    C1]  handle_irq_event_percpu+0xf/0x40
> [   10.928016][    C1]  handle_irq_event+0x34/0x60
> [   10.928016][    C1]  handle_fasteoi_irq+0x7b/0x140
> [   10.928016][    C1]  __common_interrupt+0x4b/0x100
> [   10.928016][    C1]  common_interrupt+0x58/0xa0
> [   10.928016][    C1]  </IRQ>
> [   10.928016][    C1]  <TASK>
> [   10.928016][    C1]  asm_common_interrupt+0x22/0x40
> [   10.928016][    C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
> [   10.928016][    C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
> [   10.928016][    C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
> [   10.928016][    C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
> [   10.928016][    C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
> [   10.928016][    C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
> [   10.928016][    C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
> [   10.928016][    C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
> [   10.928016][    C1]  ? rdinit_setup+0x2f/0x2f
> [   10.928016][    C1]  ? cmos_do_probe+0x570/0x570
> [   10.928016][    C1]  ? cmos_wake_setup.part.9+0x2a/0x120
> [   10.928016][    C1]  cmos_pnp_probe+0x6c/0xa0
> [   10.928016][    C1]  pnp_device_probe+0x5b/0xb0
> [   10.928016][    C1]  ? driver_sysfs_add+0x75/0xe0
> [   10.928016][    C1]  really_probe+0x109/0x3e0
> [   10.928016][    C1]  ? pm_runtime_barrier+0x4f/0xa0
> [   10.928016][    C1]  __driver_probe_device+0x79/0x170
> [   10.928016][    C1]  driver_probe_device+0x1f/0xa0
> [   10.928016][    C1]  __driver_attach+0x11e/0x180
> [   10.928016][    C1]  ? __device_attach_driver+0x110/0x110
> [   10.928016][    C1]  bus_for_each_dev+0x79/0xc0
> [   10.928016][    C1]  bus_add_driver+0x1ba/0x250
> [   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
> [   10.928016][    C1]  driver_register+0x5f/0x100
> [   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
> [   10.928016][    C1]  cmos_init+0x12/0x70
> [   10.928016][    C1]  do_one_initcall+0x5b/0x310
> [   10.928016][    C1]  ? rcu_read_lock_held_common+0xe/0x50
> [   10.928016][    C1]  ? rcu_read_lock_sched_held+0x23/0x80
> [   10.928016][    C1]  kernel_init_freeable+0x2b7/0x319
> [   10.928016][    C1]  ? rest_init+0x1b0/0x1b0
> [   10.928016][    C1]  kernel_init+0x16/0x140
> [   10.928016][    C1]  ret_from_fork+0x22/0x30
> [   10.928016][    C1]  </TASK>
> [   10.928016][    C1] Modules linked in:
> [   10.928016][    C1] CR2: 0000000000000000
> [   10.928016][    C1] ---[ end trace 0000000000000000 ]---
> [   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
> [   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   10.928016][    C1] Kernel panic - not syncing: Fatal exception in interrupt
> [   10.928016][    C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [   10.928016][    C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> --
> Mel Gorman
> SUSE Labs

[-- Attachment #2: rtc-handler-wake-setup-debug.patch --]
[-- Type: text/x-patch, Size: 938 bytes --]

---
 drivers/rtc/rtc-cmos.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Index: linux-pm/drivers/rtc/rtc-cmos.c
===================================================================
--- linux-pm.orig/drivers/rtc/rtc-cmos.c
+++ linux-pm/drivers/rtc/rtc-cmos.c
@@ -1233,13 +1233,12 @@ static u32 rtc_handler(void *context)
 
 static inline void rtc_wake_setup(struct device *dev)
 {
-	acpi_install_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler, dev);
 	/*
-	 * After the RTC handler is installed, the Fixed_RTC event should
-	 * be disabled. Only when the RTC alarm is set will it be enabled.
+	 * Disable the Fixed_RTC event before installing the RTC handler.  It
+	 * can be enabled only when the RTC alarm is set.
 	 */
-	acpi_clear_event(ACPI_EVENT_RTC);
 	acpi_disable_event(ACPI_EVENT_RTC, 0);
+	acpi_install_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler, dev);
 }
 
 static void rtc_wake_on(struct device *dev)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-10-10 14:47 ` Rafael J. Wysocki
@ 2022-10-10 17:45   ` Mel Gorman
  2022-10-10 18:29     ` Rafael J. Wysocki
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2022-10-10 17:45 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, linux-rtc, linux-acpi, linux-kernel

On Mon, Oct 10, 2022 at 04:47:50PM +0200, Rafael J. Wysocki wrote:
> Hi Mel,
> 
> Thanks for the report!
> 
> On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > Hi Rafael,
> >
> > I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> > Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
> > early in boot. It fails to boot 5 times after 10 boot attempts and I've
> > only observed it on one machine so far. Either a revert or the patch below
> > fixes it but it's unlikely it is the correct fix.
> >
> > --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> > +++ drivers/rtc/rtc-cmos.c      2022-10-10 15:11:53.211756691 +0200
> > @@ -1209,7 +1209,7 @@
> >          * Or else, ACPI SCI is enabled during suspend/resume only,
> >          * update rtc irq in that case.
> >          */
> > -       if (cmos_use_acpi_alarm())
> > +       if (cmos_use_acpi_alarm() && cmos)
> >                 cmos_interrupt(0, (void *)cmos->rtc);
> >         else {
> >                 /* Fix me: can we use cmos_interrupt() here as well? */
> 
> It looks like I've exposed a race condition there.
> 
> Generally speaking, it is misguided to install an event handler that
> is not ready to handle the event at that time before making sure that
> the event is disabled.
> 
> Does the attached patch help?
> 

It failed 3/10 times. That's less than the previous 5/10 failures but I
cannot be certain it helped without running a lot more boot tests. The
failure happens in the same function as before.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-10-10 17:45   ` Mel Gorman
@ 2022-10-10 18:29     ` Rafael J. Wysocki
  2022-10-11  9:20       ` Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Rafael J. Wysocki @ 2022-10-10 18:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rafael J. Wysocki, Rafael J. Wysocki, linux-rtc, linux-acpi,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Mon, Oct 10, 2022 at 7:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Oct 10, 2022 at 04:47:50PM +0200, Rafael J. Wysocki wrote:
> > Hi Mel,
> >
> > Thanks for the report!
> >
> > On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > Hi Rafael,
> > >
> > > I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> > > Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
> > > early in boot. It fails to boot 5 times after 10 boot attempts and I've
> > > only observed it on one machine so far. Either a revert or the patch below
> > > fixes it but it's unlikely it is the correct fix.
> > >
> > > --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> > > +++ drivers/rtc/rtc-cmos.c      2022-10-10 15:11:53.211756691 +0200
> > > @@ -1209,7 +1209,7 @@
> > >          * Or else, ACPI SCI is enabled during suspend/resume only,
> > >          * update rtc irq in that case.
> > >          */
> > > -       if (cmos_use_acpi_alarm())
> > > +       if (cmos_use_acpi_alarm() && cmos)
> > >                 cmos_interrupt(0, (void *)cmos->rtc);
> > >         else {
> > >                 /* Fix me: can we use cmos_interrupt() here as well? */
> >
> > It looks like I've exposed a race condition there.
> >
> > Generally speaking, it is misguided to install an event handler that
> > is not ready to handle the event at that time before making sure that
> > the event is disabled.
> >
> > Does the attached patch help?
> >
>
> It failed 3/10 times.

This is still not acceptable.

> That's less than the previous 5/10 failures but I
> cannot be certain it helped without running a lot more boot tests. The
> failure happens in the same function as before.

I've overlooked the fact that acpi_install_fixed_event_handler()
enables the event on success, so it is a bug to call it when the
handler is not ready.

It should help to only enable the event after running cmos_do_probe()
where the driver data pointer is set, so please try the attached
patch.

[-- Attachment #2: rtc-handler-wake-setup-debug.patch --]
[-- Type: text/x-patch, Size: 1763 bytes --]

---
 drivers/rtc/rtc-cmos.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux-pm/drivers/rtc/rtc-cmos.c
===================================================================
--- linux-pm.orig/drivers/rtc/rtc-cmos.c
+++ linux-pm/drivers/rtc/rtc-cmos.c
@@ -1352,7 +1352,7 @@ static void cmos_check_acpi_rtc_status(s
 
 static int cmos_pnp_probe(struct pnp_dev *pnp, const struct pnp_device_id *id)
 {
-	cmos_wake_setup(&pnp->dev);
+	int ret;
 
 	if (pnp_port_start(pnp, 0) == 0x70 && !pnp_irq_valid(pnp, 0)) {
 		unsigned int irq = 0;
@@ -1364,13 +1364,20 @@ static int cmos_pnp_probe(struct pnp_dev
 		if (nr_legacy_irqs())
 			irq = RTC_IRQ;
 #endif
-		return cmos_do_probe(&pnp->dev,
+		ret = cmos_do_probe(&pnp->dev,
 				pnp_get_resource(pnp, IORESOURCE_IO, 0), irq);
 	} else {
-		return cmos_do_probe(&pnp->dev,
+		ret = cmos_do_probe(&pnp->dev,
 				pnp_get_resource(pnp, IORESOURCE_IO, 0),
 				pnp_irq(pnp, 0));
 	}
+
+	if (ret)
+		return ret;
+
+	cmos_wake_setup(&pnp->dev);
+
+	return 0;
 }
 
 static void cmos_pnp_remove(struct pnp_dev *pnp)
@@ -1454,10 +1461,9 @@ static inline void cmos_of_init(struct p
 static int __init cmos_platform_probe(struct platform_device *pdev)
 {
 	struct resource *resource;
-	int irq;
+	int irq, ret;
 
 	cmos_of_init(pdev);
-	cmos_wake_setup(&pdev->dev);
 
 	if (RTC_IOMAPPED)
 		resource = platform_get_resource(pdev, IORESOURCE_IO, 0);
@@ -1467,7 +1473,13 @@ static int __init cmos_platform_probe(st
 	if (irq < 0)
 		irq = -1;
 
-	return cmos_do_probe(&pdev->dev, resource, irq);
+	ret = cmos_do_probe(&pdev->dev, resource, irq);
+	if (ret)
+		return ret;
+
+	cmos_wake_setup(&pdev->dev);
+
+	return 0;
 }
 
 static int cmos_platform_remove(struct platform_device *pdev)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-10-10 18:29     ` Rafael J. Wysocki
@ 2022-10-11  9:20       ` Mel Gorman
  2022-12-12 18:24         ` Mathieu Chouquet-Stringer
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2022-10-11  9:20 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Rafael J. Wysocki, linux-rtc, linux-acpi, linux-kernel

On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > It failed 3/10 times.
> 
> This is still not acceptable.
> 

Agreed.

> > That's less than the previous 5/10 failures but I
> > cannot be certain it helped without running a lot more boot tests. The
> > failure happens in the same function as before.
> 
> I've overlooked the fact that acpi_install_fixed_event_handler()
> enables the event on success, so it is a bug to call it when the
> handler is not ready.
> 
> It should help to only enable the event after running cmos_do_probe()
> where the driver data pointer is set, so please try the attached
> patch.

Looks good and it booted 10 times successfully.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-10-11  9:20       ` Mel Gorman
@ 2022-12-12 18:24         ` Mathieu Chouquet-Stringer
  2022-12-12 18:34           ` Rafael J. Wysocki
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Chouquet-Stringer @ 2022-12-12 18:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mel Gorman, Rafael J. Wysocki, linux-rtc, linux-acpi,
	linux-kernel, stable, gregkh

	Hello Rafael,

On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > That's less than the previous 5/10 failures but I
> > > cannot be certain it helped without running a lot more boot tests. The
> > > failure happens in the same function as before.
> > 
> > I've overlooked the fact that acpi_install_fixed_event_handler()
> > enables the event on success, so it is a bug to call it when the
> > handler is not ready.
> > 
> > It should help to only enable the event after running cmos_do_probe()
> > where the driver data pointer is set, so please try the attached
> > patch.

I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
looking at the stable tree I see this hasn't been merged... I just got
bitten by this on 6.0.12.

Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
0782b66ed2fb to the 6.0.y tree.

Thank you in advance.

Cheers,

-- 
Mathieu Chouquet-Stringer                             me@mathieu.digital
            The sun itself sees not till heaven clears.
	             -- William Shakespeare --

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)
  2022-12-12 18:24         ` Mathieu Chouquet-Stringer
@ 2022-12-12 18:34           ` Rafael J. Wysocki
  0 siblings, 0 replies; 7+ messages in thread
From: Rafael J. Wysocki @ 2022-12-12 18:34 UTC (permalink / raw)
  To: Mathieu Chouquet-Stringer, Rafael J. Wysocki, Mel Gorman,
	Rafael J. Wysocki, linux-rtc, linux-acpi, linux-kernel, stable,
	gregkh

On Mon, Dec 12, 2022 at 7:25 PM Mathieu Chouquet-Stringer
<me@mathieu.digital> wrote:
>
>         Hello Rafael,
>
> On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> > On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > > That's less than the previous 5/10 failures but I
> > > > cannot be certain it helped without running a lot more boot tests. The
> > > > failure happens in the same function as before.
> > >
> > > I've overlooked the fact that acpi_install_fixed_event_handler()
> > > enables the event on success, so it is a bug to call it when the
> > > handler is not ready.
> > >
> > > It should help to only enable the event after running cmos_do_probe()
> > > where the driver data pointer is set, so please try the attached
> > > patch.
>
> I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
> looking at the stable tree I see this hasn't been merged... I just got
> bitten by this on 6.0.12.
>
> Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
> 0782b66ed2fb to the 6.0.y tree.

This is fine with me, please send an inclusion request to Greg and the
"stable" list.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-12-12 18:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-10 14:16 Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1) Mel Gorman
2022-10-10 14:47 ` Rafael J. Wysocki
2022-10-10 17:45   ` Mel Gorman
2022-10-10 18:29     ` Rafael J. Wysocki
2022-10-11  9:20       ` Mel Gorman
2022-12-12 18:24         ` Mathieu Chouquet-Stringer
2022-12-12 18:34           ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).