nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed

* nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-15  1:38 J. Hart
  2022-12-15  8:23 ` Christoph Hellwig
  2022-12-16 23:16 ` Keith Busch
  0 siblings, 2 replies; 27+ messages in thread
From: J. Hart @ 2022-12-15  1:38 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi

I am attempting to load an nvme device (nvme0n1) to use as main system 
drive using the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu 
--exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu 
--exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu 
--exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu 
/mnt/root_new 2>&1 | tee root.log

The total transfer would be approximately 50 GB.  This is being done at 
run level 1, and only the kernel threads and the root shell are observed 
to be active.

The following log messages appear after a minute or so, and rsync hangs. 
The nvme drive cannot be unmounted without a reboot.

dmesg reports the following:

[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19

I have also observed file system corruption on the source drive of the 
transfer.  I would not normally think this to be related, except that 
after the first time I observed it, I made certain that I corrected the 
file content before any additional attempts, but have seen this again 
after every attempt.  The modification dates and file sizes did not 
change, but the file content on the source drive did.  I confirmed this 
using the "diff" utility, and again using a rsync dry run with the check 
sum test enabled.

kernel/distro:

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
custom distribution built entirely from source

nvme controller:

MZHOU M.2 NVME SSD-PCIe 4.0 X4 adaptor
Key-M NGFF PCI-E 3.0、2.0 or 1.0 controller expansion cards
(2230 2242 2260 2280 22110 M.2 SSD）

02:00.0 Non-Volatile memory controller: Kingston Technologies Device 
500f (rev 03) (prog-if 02)
         Subsystem: Kingston Technologies Device 500f
         Flags: bus master, fast devsel, latency 0, IRQ 16
         Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
         Capabilities: [40] Power Management version 3
         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
         Capabilities: [70] Express Endpoint, MSI 00
         Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
         Kernel driver in use: nvme

nvme drive:

Model Number:                       KINGSTON SNVSE500G
Serial Number:                      50026B7685D8EE42
Firmware Version:                   S8542105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 685d8ee425
Local Time is:                      Tue Nov 29 20:31:21 2022 JST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero 
Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

CPU (quad core, cpu 0 shown, others the same):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
stepping	: 7
microcode	: 0x705
cpu MHz		: 1999.839
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni 
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 
lahf_lm pti tpr_shadow vnmi flexpriority vpid dtherm
vmx flags	: vnmi flexpriority tsc_offset vtpr vapic
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds 
swapgs itlb_multihit mmio_unknown
bogomips	: 5666.43
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

^ permalink raw reply	[flat|nested] 27+ messages in thread