Problem with SPCC 256GB NVMe 1.3 drive - refcount_t: underflow; use-after-free.

* Problem with SPCC 256GB NVMe 1.3 drive - refcount_t: underflow; use-after-free.
@ 2021-01-17 18:58 Bradley Chapman
  2021-01-18  4:36 ` Chaitanya Kulkarni
  0 siblings, 1 reply; 17+ messages in thread
From: Bradley Chapman @ 2021-01-17 18:58 UTC (permalink / raw)
  To: linux-nvme

All,

I recently plugged a 256GB SPCC NVMe 1.3 drive into the secondary slot 
on my Asus X570-P motherboard, running a Ryzen 5 3600 CPU. After 
partitioning and formatting the drive, it is detected thusly by the 
5.9.15 and 5.10.6 kernels:

[    1.653074] nvme nvme1: pci function 0000:04:00.0
[    1.657181] nvme nvme1: missing or invalid SUBNQN field.
[    1.662294] nvme nvme1: allocated 64 MiB host memory buffer.
[    1.663105] nvme nvme1: 15/0/0 default/read/poll queues
[    1.665815]  nvme1n1: p1

However, any I/O to the drive (including mounting its filesystem) causes 
the following errors to appear in the dmesg. These errors occur with 
both the 5.9.15 kernel and the 5.10.6 kernel, and with X570-P BIOS 
version 1406 and version 3001. I have modified the BIOS settings to 
specify that a GEN 3 device is plugged into the M.2_2 slot instead of 
allowing the BIOS to auto-detect the drive.

[ 2745.659502] refcount_t: underflow; use-after-free.
[ 2745.659510] WARNING: CPU: 2 PID: 0 at lib/refcount.c:28 
refcount_warn_saturate+0xab/0xf0
[ 2745.659510] Modules linked in: rfcomm(E) cmac(E) bnep(E) 
binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E) 
usb_storage(E) btusb(E) btrtl(E) crct10dif_pclmul(E) btbcm(E) 
crc32_pclmul(E) btintel(E) ghash_clmulni_intel(E) bluetooth(E) rfkill(E) 
aesni_intel(E) crypto_simd(E) cryptd(E) glue_helper(E) efi_pstore(E) 
jitterentropy_rng(E) drbg(E) ccp(E) ansi_cprng(E) ecdh_generic(E) ecc(E) 
acpi_cpufreq(E) nft_counter(E) efivarfs(E) crc32c_intel(E)
[ 2745.659527] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E 
5.10.6-BET #1
[ 2745.659528] Hardware name: System manufacturer System Product 
Name/PRIME X570-P, BIOS 3001 12/04/2020
[ 2745.659529] RIP: 0010:refcount_warn_saturate+0xab/0xf0
[ 2745.659530] Code: 05 af d2 72 01 01 e8 7a 06 87 00 0f 0b c3 80 3d 9d 
d2 72 01 00 75 90 48 c7 c7 78 60 44 af c6 05 8d d2 72 01 01 e8 5b 06 87 
00 <0f> 0b c3 80 3d 7c d2 72 01 00 0f 85 6d ff ff ff 48 c7 c7 d0 60 44
[ 2745.659531] RSP: 0018:ffffaf1880298f30 EFLAGS: 00010086
[ 2745.659532] RAX: 0000000000000000 RBX: ffff9873cf3bc300 RCX: 
0000000000000027
[ 2745.659533] RDX: 0000000000000027 RSI: ffff987acea92e80 RDI: 
ffff987acea92e88
[ 2745.659533] RBP: ffff9873d0e661f0 R08: 0000000000000000 R09: 
c0000000ffffdfff
[ 2745.659534] R10: ffffaf1880298d50 R11: ffffaf1880298d48 R12: 
0000000000000001
[ 2745.659534] R13: ffff9873d0f98580 R14: ffff9873cdf8ac00 R15: 
0000000000000000
[ 2745.659535] FS:  0000000000000000(0000) GS:ffff987acea80000(0000) 
knlGS:0000000000000000
[ 2745.659536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2745.659536] CR2: 00005588763902c8 CR3: 0000000107248000 CR4: 
0000000000350ee0
[ 2745.659537] Call Trace:
[ 2745.659538]  <IRQ>
[ 2745.659541]  nvme_irq+0x104/0x190
[ 2745.659543]  __handle_irq_event_percpu+0x2e/0xd0
[ 2745.659545]  handle_irq_event_percpu+0x33/0x80
[ 2745.659545]  handle_irq_event+0x39/0x70
[ 2745.659547]  handle_edge_irq+0x7c/0x1a0
[ 2745.659549]  asm_call_irq_on_stack+0x12/0x20
[ 2745.659549]  </IRQ>
[ 2745.659551]  common_interrupt+0xd7/0x160
[ 2745.659552]  asm_common_interrupt+0x1e/0x40
[ 2745.659554] RIP: 0010:cpuidle_enter_state+0xd2/0x2e0
[ 2745.659555] Code: e8 93 22 6a ff 31 ff 49 89 c5 e8 29 2c 6a ff 45 84 
ff 74 12 9c 58 f6 c4 02 0f 85 c4 01 00 00 31 ff e8 a2 d8 6f ff fb 45 85 
f6 <0f> 88 c9 00 00 00 49 63 ce be 68 00 00 00 4c 2b 2c 24 48 89 ca 48
[ 2745.659556] RSP: 0018:ffffaf188014fe80 EFLAGS: 00000202
[ 2745.659557] RAX: ffff987acea9ce00 RBX: 0000000000000002 RCX: 
000000000000001f
[ 2745.659557] RDX: 0000027f460f1f90 RSI: 00000000239f5229 RDI: 
0000000000000000
[ 2745.659558] RBP: ffff9873c1a4e800 R08: 0000000000000002 R09: 
000000000001c600
[ 2745.659558] R10: 0000090da145abf0 R11: ffff987acea9be24 R12: 
ffffffffaf6d38e0
[ 2745.659559] R13: 0000027f460f1f90 R14: 0000000000000002 R15: 
0000000000000000
[ 2745.659561]  cpuidle_enter+0x30/0x50
[ 2745.659562]  do_idle+0x24f/0x290
[ 2745.659564]  cpu_startup_entry+0x1b/0x20
[ 2745.659566]  start_secondary+0x10b/0x150
[ 2745.659567]  secondary_startup_64_no_verify+0xb0/0xbb
[ 2745.659569] ---[ end trace be84281f034198f3 ]---
[ 2776.138874] nvme nvme1: I/O 414 QID 3 timeout, aborting
[ 2776.138886] nvme nvme1: I/O 415 QID 3 timeout, aborting
[ 2776.138891] nvme nvme1: I/O 416 QID 3 timeout, aborting
[ 2776.138895] nvme nvme1: I/O 417 QID 3 timeout, aborting
[ 2776.138912] nvme nvme1: Abort status: 0x0
[ 2776.138921] nvme nvme1: I/O 428 QID 3 timeout, aborting
[ 2776.138922] nvme nvme1: Abort status: 0x0
[ 2776.138925] nvme nvme1: Abort status: 0x0
[ 2776.138974] nvme nvme1: Abort status: 0x0
[ 2776.138977] nvme nvme1: Abort status: 0x0
[ 2806.346792] nvme nvme1: I/O 414 QID 3 timeout, reset controller
[ 2806.363566] nvme nvme1: 15/0/0 default/read/poll queues
[ 2836.554298] nvme nvme1: I/O 415 QID 3 timeout, disable controller
[ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037 
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672121] nvme nvme1: failed to mark controller live state
[ 2836.672123] nvme nvme1: Removing after probe failure status: -19
[ 2836.689016] Aborting journal on device dm-0-8.
[ 2836.689024] Buffer I/O error on dev dm-0, logical block 25198592, 
lost sync page write
[ 2836.689027] JBD2: Error -5 detected when updating journal superblock 
for dm-0-8.

[ 2836.723821] percpu ref (hd_struct_free) <= 0 (-28) after switching to 
atomic
[ 2836.723828] WARNING: CPU: 8 PID: 0 at lib/percpu-refcount.c:196 
percpu_ref_switch_to_atomic_rcu+0x139/0x140
[ 2836.723828] Modules linked in: rfcomm(E) cmac(E) bnep(E) 
binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E) 
usb_storage(E) btusb(E) btrtl(E) crct10dif_pclmul(E) btbcm(E) 
crc32_pclmul(E) btintel(E) ghash_clmulni_intel(E) bluetooth(E) rfkill(E) 
aesni_intel(E) crypto_simd(E) cryptd(E) glue_helper(E) efi_pstore(E) 
jitterentropy_rng(E) drbg(E) ccp(E) ansi_cprng(E) ecdh_generic(E) ecc(E) 
acpi_cpufreq(E) nft_counter(E) efivarfs(E) crc32c_intel(E)
[ 2836.723844] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G        W   E 
5.10.6-BET #1
[ 2836.723845] Hardware name: System manufacturer System Product 
Name/PRIME X570-P, BIOS 3001 12/04/2020
[ 2836.723847] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x139/0x140
[ 2836.723848] Code: 80 3d f9 f0 72 01 00 0f 85 52 ff ff ff 49 8b 54 24 
e0 49 8b 74 24 e8 48 c7 c7 88 5f 44 af c6 05 db f0 72 01 01 e8 ad 24 87 
00 <0f> 0b e9 2e ff ff ff 41 55 49 89 f5 41 54 55 48 89 fd 53 48 83 ec
[ 2836.723849] RSP: 0018:ffffaf18803a0f20 EFLAGS: 00010282
[ 2836.723850] RAX: 0000000000000000 RBX: 7fffffffffffffe3 RCX: 
0000000000000027
[ 2836.723850] RDX: 0000000000000027 RSI: ffff987acec12e80 RDI: 
ffff987acec12e88
[ 2836.723851] RBP: 0000369db0c0e3c8 R08: 0000000000000000 R09: 
c0000000ffffdfff
[ 2836.723851] R10: ffffaf18803a0d40 R11: ffffaf18803a0d38 R12: 
ffff9873c0bbbda0
[ 2836.723852] R13: ffffffffaf765f10 R14: 0000000000000202 R15: 
ffffffffaf6060c0
[ 2836.723853] FS:  0000000000000000(0000) GS:ffff987acec00000(0000) 
knlGS:0000000000000000
[ 2836.723853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2836.723854] CR2: 0000558899b414c0 CR3: 0000000101cd2000 CR4: 
0000000000350ee0
[ 2836.723854] Call Trace:
[ 2836.723855]  <IRQ>
[ 2836.723859]  rcu_core+0x196/0x420
[ 2836.723862]  __do_softirq+0xc9/0x214
[ 2836.723863]  asm_call_irq_on_stack+0x12/0x20
[ 2836.723864]  </IRQ>
[ 2836.723866]  do_softirq_own_stack+0x31/0x40
[ 2836.723867]  irq_exit_rcu+0x9a/0xa0
[ 2836.723869]  sysvec_apic_timer_interrupt+0x2c/0x80
[ 2836.723870]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2836.723872] RIP: 0010:cpuidle_enter_state+0xd2/0x2e0
[ 2836.723873] Code: e8 93 22 6a ff 31 ff 49 89 c5 e8 29 2c 6a ff 45 84 
ff 74 12 9c 58 f6 c4 02 0f 85 c4 01 00 00 31 ff e8 a2 d8 6f ff fb 45 85 
f6 <0f> 88 c9 00 00 00 49 63 ce be 68 00 00 00 4c 2b 2c 24 48 89 ca 48
[ 2836.723874] RSP: 0018:ffffaf188017fe80 EFLAGS: 00000202
[ 2836.723874] RAX: ffff987acec1ce00 RBX: 0000000000000002 RCX: 
000000000000001f
[ 2836.723875] RDX: 0000029479ea3f98 RSI: 00000000239f5229 RDI: 
0000000000000000
[ 2836.723875] RBP: ffff9873c1a4ec00 R08: 0000000000000002 R09: 
000000000001c600
[ 2836.723876] R10: 00000959d0ea6498 R11: ffff987acec1be24 R12: 
ffffffffaf6d38e0
[ 2836.723876] R13: 0000029479ea3f98 R14: 0000000000000002 R15: 
0000000000000000
[ 2836.723878]  cpuidle_enter+0x30/0x50
[ 2836.723880]  do_idle+0x24f/0x290
[ 2836.723882]  cpu_startup_entry+0x1b/0x20
[ 2836.723884]  start_secondary+0x10b/0x150
[ 2836.723885]  secondary_startup_64_no_verify+0xb0/0xbb
[ 2836.723887] ---[ end trace be84281f034198f4 ]---

After these errors are generated, the device becomes inaccessible and 
unmounting its filesystem (which does not hang in D state) generates 
additional errors:

[ 2868.181018] Buffer I/O error on dev dm-0, logical block 0, lost sync 
page write
[ 2868.181022] EXT4-fs (dm-0): I/O error while writing superblock

After the filesystem is unmounted the device no longer appears in the 
output of lsblk(8) and its device node(s) disappear after the kernel 
removes the device. Prior to the I/O failures, the nvme error-log 
command returns no error entries for any of the 64 log entries present.

nvme fw-log and nvme smart-log return the following output:

Firmware Log for device:nvme1
afi  : 0x20

Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 48 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 234
data_units_written                  : 2,149
host_read_commands                  : 4,202
host_write_commands                 : 421,917
controller_busy_time                : 0
power_cycles                        : 7
power_on_hours                      : 11
unsafe_shutdowns                    : 0
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 48 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

I've checked the kernel change logs and I know that the refcount_t error 
has been occurring in other kernel subsystems and was subsequently fixed 
in recent kernel point releases, so I will be trying to reproduce this 
error with the most recent 5.10 and 5.11-rc kernels.

Any suggestions on what else to try next?

Thanks!

Brad

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 17+ messages in thread