Kernel Panic on 5.4.69

From: Marc Smith @ 2020-10-30 15:16 UTC
  To: linux-bcache

Hi,

I'm using Linux 5.4.69 with the following two patches applied for bcache:
commit 125d98edd114 ("bcache: remove member accessed from struct btree")
commit d5c9c470b011 ("bcache: reap c->btree_cache_freeable from the
tail in bch_mca_scan()")

I'm using bcache in write-back mode. The cache device is a RAID1
mirror set using NVMe drives, and several backing devices are
attached to that cache device. While driving I/O, I hit the
following kernel panic:
  SYSTEM MAP: /home/marc.smith/Downloads/System.map-esos.prod
DEBUG KERNEL: /home/marc.smith/Downloads/vmlinux-esos.prod (5.4.69-esos.prod)
    DUMPFILE: /home/marc.smith/Downloads/dumpfile-1604062993
        CPUS: 8
        DATE: Fri Oct 30 09:02:56 2020
      UPTIME: 2 days, 12:38:15
LOAD AVERAGE: 9.48, 8.89, 7.69
       TASKS: 980
    NODENAME: node-10cccd-2
     RELEASE: 5.4.69-esos.prod
     VERSION: #1 SMP Thu Oct 22 19:45:11 UTC 2020
     MACHINE: x86_64  (2799 Mhz)
      MEMORY: 24 GB
       PANIC: "Oops: 0002 [#1] SMP NOPTI" (check log for details)
         PID: 18272
     COMMAND: "kworker/2:13"
        TASK: ffff88841d9e8000  [THREAD_INFO: ffff88841d9e8000]
         CPU: 2
       STATE: TASK_UNINTERRUPTIBLE (PANIC)

crash> bt
PID: 18272  TASK: ffff88841d9e8000  CPU: 2   COMMAND: "kworker/2:13"
 #0 [ffffc90000100938] machine_kexec at ffffffff8103d6b5
 #1 [ffffc90000100980] __crash_kexec at ffffffff8110d37b
 #2 [ffffc90000100a48] crash_kexec at ffffffff8110e07d
 #3 [ffffc90000100a58] oops_end at ffffffff8101a9de
 #4 [ffffc90000100a78] no_context at ffffffff81045e99
 #5 [ffffc90000100ae0] async_page_fault at ffffffff81e010cf
    [exception RIP: atomic_try_cmpxchg+2]
    RIP: ffffffff810d3e3b  RSP: ffffc90000100b98  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000003  RCX: 0000000000080006
    RDX: 0000000000000001  RSI: ffffc90000100ba4  RDI: 0000000000000a6c
    RBP: 0000000000000010   R8: 0000000000000001   R9: ffffffffa0418d4e
    R10: ffff88841c8b3000  R11: ffff88841c8b3000  R12: 0000000000000046
    R13: 0000000000000000  R14: ffff8885a3a0a000  R15: 0000000000000a6c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffc90000100b98] _raw_spin_lock_irqsave at ffffffff81cf7d7d
 #7 [ffffc90000100bb8] try_to_wake_up at ffffffff810c1624
 #8 [ffffc90000100c08] closure_sync_fn at ffffffffa040fb07 [bcache]
 #9 [ffffc90000100c10] clone_endio at ffffffff81aac48c
#10 [ffffc90000100c40] call_bio_endio at ffffffff81a78e20
#11 [ffffc90000100c58] raid_end_bio_io at ffffffff81a78e69
#12 [ffffc90000100c88] raid1_end_write_request at ffffffff81a79ad9
#13 [ffffc90000100cf8] blk_update_request at ffffffff814c3ab1
#14 [ffffc90000100d38] blk_mq_end_request at ffffffff814caaf2
#15 [ffffc90000100d50] blk_mq_complete_request at ffffffff814c91c1
#16 [ffffc90000100d78] nvme_complete_cqes at ffffffffa002fb03 [nvme]
#17 [ffffc90000100db8] nvme_irq at ffffffffa002fb7f [nvme]
#18 [ffffc90000100de0] __handle_irq_event_percpu at ffffffff810e0d60
#19 [ffffc90000100e20] handle_irq_event_percpu at ffffffff810e0e65
#20 [ffffc90000100e48] handle_irq_event at ffffffff810e0ecb
#21 [ffffc90000100e60] handle_edge_irq at ffffffff810e494d
#22 [ffffc90000100e78] do_IRQ at ffffffff81e01900
#23 [ffffc90000100eb0] common_interrupt at ffffffff81e00a0a
#24 [ffffc90000100f38] __softirqentry_text_start at ffffffff8200006a
#25 [ffffc90000100fc8] irq_exit at ffffffff810a3f6a
#26 [ffffc90000100fd0] smp_apic_timer_interrupt at ffffffff81e020b2
bt: invalid kernel virtual address: ffffc90000101000  type: "pt_regs"
crash>

In the call trace, closure_sync_fn() is the bcache function that calls
into try_to_wake_up() right before the fault. I found one patch from a
year ago that touches closure_sync_fn(), but it is already applied in
5.4.69:
https://lore.kernel.org/patchwork/patch/1086698/

I haven't encountered this panic in any prior testing, so it appears
to be rare so far. What else could I do to debug this?
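
Two details can be read directly from the dump above. The sketch below
decodes the error code in "Oops: 0002" using the standard x86 page-fault
error-code bits, and checks whether the value in RDI (0xa6c, assumed here
to be the lock pointer passed to atomic_try_cmpxchg, per the x86-64
calling convention) falls inside the first page. The helper names
decode_oops_code and looks_like_null_offset are made up for illustration,
and the 4 KiB page size is an assumption.

```python
# Bits of the x86 page-fault error code as printed after "Oops:".
PF_PROT = 0x01   # 0: page not present, 1: protection violation
PF_WRITE = 0x02  # 0: read access, 1: write access
PF_USER = 0x04   # 0: kernel mode, 1: user mode
PF_RSVD = 0x08   # reserved bit set in a page-table entry
PF_INSTR = 0x10  # fault was an instruction fetch

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def decode_oops_code(code):
    """Return a human-readable description of a page-fault error code."""
    parts = [
        "protection violation" if code & PF_PROT else "page not present",
        "write" if code & PF_WRITE else "read",
        "user mode" if code & PF_USER else "kernel mode",
    ]
    if code & PF_RSVD:
        parts.append("reserved bit set")
    if code & PF_INSTR:
        parts.append("instruction fetch")
    return ", ".join(parts)


def looks_like_null_offset(addr):
    """True if addr lies in the first page, i.e. NULL plus a small offset."""
    return 0 <= addr < PAGE_SIZE


if __name__ == "__main__":
    print(decode_oops_code(0x0002))        # error code from the PANIC line
    print(looks_like_null_offset(0x0A6C))  # RDI from the exception frame
```

If that reading is right, the fault was a kernel-mode write to an
unmapped address just above NULL, which would be consistent with
try_to_wake_up() being handed a NULL or already-cleared task pointer.
That is only a hypothesis; the dump alone does not confirm it.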

I'll continue testing with heavy I/O to see if this can be reproduced.

--Marc
