* nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-15  1:38 J. Hart
  2022-12-15  8:23 ` Christoph Hellwig
  2022-12-16 23:16 ` Keith Busch
  0 siblings, 2 replies; 27+ messages in thread
From: J. Hart @ 2022-12-15  1:38 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi

I am attempting to load an nvme device (nvme0n1) for use as the main 
system drive, using the following command:

rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu 
--exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu 
--exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu 
--exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu 
/mnt/root_new 2>&1 | tee root.log

The total transfer would be approximately 50 GB.  This is being done at 
run level 1, and only the kernel threads and the root shell are observed 
to be active.

The following log messages appear after a minute or so, and rsync hangs. 
The nvme drive cannot be unmounted without a reboot.

dmesg reports the following:

[Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
[Dec14 19:25] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719985] nvme nvme0: I/O 8 QID 0 timeout, reset controller
[Dec14 19:28] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.031803] nvme nvme0: Abort status: 0x371
[Dec14 19:30] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000019] nvme nvme0: Removing after probe failure status: -19

I have also observed file system corruption on the source drive of the 
transfer.  I would not normally think this to be related, except that 
after the first time I observed it, I made certain to correct the file 
content before any additional attempts, yet have seen it again after 
every attempt.  The modification dates and file sizes did not change, 
but the file content on the source drive did.  I confirmed this using 
the "diff" utility, and again using an rsync dry run with the checksum 
test enabled.
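[Editorial note: the three-way content check described above can be sketched as follows; the scratch paths here are made up for illustration, standing in for the real source files and their reference copies.]

```shell
# Sketch of the content check: diff, rsync checksum dry run, and an
# independent hash.  Scratch files stand in for the real paths.
mkdir -p /tmp/chk
printf 'original content\n' > /tmp/chk/src.file
printf 'original content\n' > /tmp/chk/ref.file

# 1. byte-level comparison
diff /tmp/chk/src.file /tmp/chk/ref.file && echo "diff: identical"

# 2. rsync dry run with the checksum test: itemizes a file only if its
#    content differs (guarded in case rsync is not installed)
command -v rsync >/dev/null && \
    rsync -n --checksum --itemize-changes /tmp/chk/ref.file /tmp/chk/src.file

# 3. independent hash comparison
md5sum /tmp/chk/src.file /tmp/chk/ref.file
```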

kernel/distro:

Linux DellXPS 6.1.0 #1 SMP Tue Dec 13 21:48:51 JST 2022 x86_64 GNU/Linux
custom distribution built entirely from source

nvme controller:

MZHOU M.2 NVME SSD-PCIe 4.0 X4 adaptor
Key-M NGFF PCI-E 3.0, 2.0 or 1.0 controller expansion cards 
(2230 2242 2260 2280 22110 M.2 SSD)

02:00.0 Non-Volatile memory controller: Kingston Technologies Device 
500f (rev 03) (prog-if 02)
         Subsystem: Kingston Technologies Device 500f
         Flags: bus master, fast devsel, latency 0, IRQ 16
         Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
         Capabilities: [40] Power Management version 3
         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
         Capabilities: [70] Express Endpoint, MSI 00
         Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
         Kernel driver in use: nvme

nvme drive:

Model Number:                       KINGSTON SNVSE500G
Serial Number:                      50026B7685D8EE42
Firmware Version:                   S8542105
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x0026b7
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 685d8ee425
Local Time is:                      Tue Nov 29 20:31:21 2022 JST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero 
Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

CPU (quad core, cpu 0 shown, others the same):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
stepping	: 7
microcode	: 0x705
cpu MHz		: 1999.839
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni 
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 
lahf_lm pti tpr_shadow vnmi flexpriority vpid dtherm
vmx flags	: vnmi flexpriority tsc_offset vtpr vapic
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds 
swapgs itlb_multihit mmio_unknown
bogomips	: 5666.43
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  1:38 nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed J. Hart
@ 2022-12-15  8:23 ` Christoph Hellwig
  2022-12-15  9:07   ` J. Hart
  2022-12-16 23:16 ` Keith Busch
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2022-12-15  8:23 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, kbusch, axboe, hch, sagi

On Thu, Dec 15, 2022 at 10:38:33AM +0900, J. Hart wrote:
> I am attempting to load an nvme device (nvme0n1) to use as main system 
> drive using the following command:
>
> rsync -axvH /. --exclude=/lost+found --exclude=/var/log.bu 
> --exclude=/usr/var/log.bu --exclude=/usr/X11R6/var/log.bu 
> --exclude=/home/jhart/.cache/mozilla/firefox/are7uokl.default-release/cache2.bu 
> --exclude=/home/jhart/.cache/thunderbird/7zsnqnss.default/cache2.bu 
> /mnt/root_new 2>&1 | tee root.log
>
> The total transfer would be approximately 50 GB.  This is being done at run 
> level 1, and only the kernel threads and the root shell are observed to be 
> active.
>
> The following log messages appear after a minute or so, and rsync hangs. 
> The nvme drive cannot be unmounted without a reboot.

Ok, this looks like the drive has firmware / hardware problems and
can't cope with the load.

>
> dmesg reports the following:

nvme0 is the destination drive I guess?

>
> [Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting

Can you enable CONFIG_NVME_VERBOSE_ERRORS so that we can see what
commands are hanging?

> I have also observed file system corruption on the source drive of the 
> transfer.  I would not normally think this to be related, except that after 
> the first time I observed it, I made certain that I corrected the file 
> content before any additional attempts, but have seen this again after 
> every attempt.  The modification dates and file sizes did not change, but 
> the file content on the source drive did.  I confirmed this using the 
> "diff" utility, and again using a rsync dry run with the check sum test 
> enabled.

Ok, that's really odd.  The only way I could think of that happening
is if the driver does stay DMAs, which would be really grave.

Do you have CONFIG_INTEL_IOMMU and CONFIG_INTEL_IOMMU_DEFAULT_ON enabled?
If not, it would be good to enable those to see if the iommu catches
any stray DMAs.
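[Editorial note: a quick way to confirm whether a running kernel was built with these options is sketched below; the config file locations are the usual conventions and may differ on a hand-built distribution, and /proc/config.gz requires CONFIG_IKCONFIG_PROC.]

```shell
# Check the running kernel's IOMMU-related config options.
if [ -r /proc/config.gz ]; then
    zgrep -E 'CONFIG_INTEL_IOMMU(_DEFAULT_ON)?=' /proc/config.gz
elif [ -r "/boot/config-$(uname -r)" ]; then
    grep -E 'CONFIG_INTEL_IOMMU(_DEFAULT_ON)?=' "/boot/config-$(uname -r)"
fi

# Even without CONFIG_INTEL_IOMMU_DEFAULT_ON, the IOMMU can be turned
# on from the kernel command line with:
#     intel_iommu=on
# After boot, stray DMAs blocked by the IOMMU appear as DMAR faults:
dmesg | grep -iE 'DMAR|IOMMU' | head
```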



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  8:23 ` Christoph Hellwig
@ 2022-12-15  9:07   ` J. Hart
  2022-12-15  9:09     ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-15  9:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, kbusch, axboe, sagi, jfhart085

On 12/15/22 5:23 PM, Christoph Hellwig wrote:
>>
>> dmesg reports the following:
> 
> nvme0 is the destination drive I guess?

That is correct.  The device /dev/nvme0n1p3 is the destination partition 
on the NVME drive, and was mounted at /mnt/root_new.

>> [Dec14 19:24] nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting
> 
> Can you enable CONFIG_NVME_VERBOSE_ERRORS so that we can see what
> commands are hanging?

I will do this and run the requisite testing right away.  I'll reply 
with the results as soon as I have them.

>> I have also observed file system corruption on the source drive of the
>> transfer.  I would not normally think this to be related, except that after
>> the first time I observed it, I made certain that I corrected the file
>> content before any additional attempts, but have seen this again after
>> every attempt.  The modification dates and file sizes did not change, but
>> the file content on the source drive did.  I confirmed this using the
>> "diff" utility, and again using a rsync dry run with the check sum test
>> enabled.

I should also note that I did a third test using md5sum and confirmed 
that the sums obtained thereby were different.

> Ok, that's really odd.  The only way I could think of that happening
> is if the driver does stay DMAs, which would be really grave.

My apologies....I am not sure what is meant by "stay DMAs".  Is there 
something I can look for here ?

> Do you have CONFIG_INTEL_IOMMU and CONFIG_INTEL_IOMMU_DEFAULT_ON enabled?
> If not, it would be good to enable those to see if the iommu catches
> any stray DMAs.

I will enable these as well and reply with the test results.  Thanks 
very much for your generous assistance.

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  9:07   ` J. Hart
@ 2022-12-15  9:09     ` Christoph Hellwig
  2022-12-15  9:15       ` J. Hart
  2022-12-15 13:33       ` J. Hart
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Hellwig @ 2022-12-15  9:09 UTC (permalink / raw)
  To: J. Hart; +Cc: Christoph Hellwig, linux-nvme, kbusch, axboe, sagi

On Thu, Dec 15, 2022 at 06:07:32PM +0900, J. Hart wrote:
> My apologies....I am not sure what is meant by "stay DMAs".  Is there 
> something I can look for here ?

I mean stray, sorry.  There isn't really anything you can look for.
Either it really is a device problem, in which case the IOMMU should
catch it.  Or it is a kernel problem somewhere, in which case
CONFIG_KASAN would catch it.  So maybe enable that as well, but it will
slow down the kernel a LOT.
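[Editorial note: for a kernel built from source, KASAN can be enabled with the tree's own config helper; a sketch, using generic KASAN and the option names from current Kconfig.]

```shell
# Run from the top of the kernel source tree: enable generic KASAN in
# .config, resolve new dependencies, then rebuild.  Expect the large
# runtime slowdown noted above.
scripts/config -e KASAN -e KASAN_GENERIC
make olddefconfig
make -j"$(nproc)"
```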



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  9:09     ` Christoph Hellwig
@ 2022-12-15  9:15       ` J. Hart
  2022-12-15 13:33       ` J. Hart
  1 sibling, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-15  9:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, kbusch, axboe, sagi, jfhart085

On 12/15/22 6:09 PM, Christoph Hellwig wrote:
> On Thu, Dec 15, 2022 at 06:07:32PM +0900, J. Hart wrote:
>> My apologies....I am not sure what is meant by "stay DMAs".  Is there
>> something I can look for here ?
> 
> I mean stray, sorry.  There isn't really anything you can look for.
> Either it really is a device problem, in which case the IOMMU should
> catch it.  Or it is a kernel problem somewhere, in which case
> CONFIG_KASAN would catch.  So maybe enable that as well, but it will
> slow down the kernel a LOT.

Understood...  I will enable the following and do the test right away :

CONFIG_NVME_VERBOSE_ERRORS
CONFIG_INTEL_IOMMU
CONFIG_INTEL_IOMMU_DEFAULT

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  9:09     ` Christoph Hellwig
  2022-12-15  9:15       ` J. Hart
@ 2022-12-15 13:33       ` J. Hart
  2022-12-15 17:34         ` Keith Busch
  1 sibling, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-15 13:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, kbusch, axboe, sagi, jfhart085

Here are the test results.  My apologies for the delay, as I was being 
rather careful.  The test also appears to have resulted in very serious 
filesystem damage on the system's main drive, which had to be dealt with.

These results were obtained as before, at run level 1 with all 
processes terminated except for the root shell under init.  I then 
started syslogd and then klogd.  I ran vim under the root shell, and a 
terminal from within vim.  The rsync was run from that terminal, and 
its condition was monitored via top in another vim terminal.

The settings I enabled were as follows:

CONFIG_NVME_VERBOSE_ERRORS=y
CONFIG_INTEL_IOMMU=y

You also mentioned the following:
CONFIG_INTEL_IOMMU_DEFAULT

I didn't have that parameter, but I think you meant this one:
CONFIG_INTEL_IOMMU_DEFAULT_ON=y

The rsync invocation hung as before, and had to be forcibly terminated.
The system was forced down via the magic SysReq key sequence.

I have appended the log extracts here.  If these are unusable, I can 
send the extracts as attachments or upload them as separate files.


dmesg log extract:

[Dec15 21:06] printk: udevd: 5 output lines suppressed due to ratelimiting
[Dec15 21:14] EXT4-fs (nvme0n1p3): mounted filesystem with ordered data 
mode. Quota mode: disabled.
[Dec15 21:16] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #53085366: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[Dec15 21:17] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #5506059: comm rsync: pblk 0 bad header/extent: invalid eh_entries 
- magic f30a, entries 17, max 4(4), depth 0(0)
[  +8.145700] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #5506082: comm rsync: pblk 0 bad header/extent: invalid eh_entries 
- magic f30a, entries 17, max 4(4), depth 0(0)
[  +7.668302] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41180419: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[  +0.081111] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41180440: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[  +0.037526] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41180229: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[  +0.011292] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41182425: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[  +0.031661] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41178319: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[Dec15 21:32] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #38936868: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[  +0.000629] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #38936901: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[Dec15 21:34] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41029460: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[ +22.029135] EXT4-fs error (device dm-0): ext4_ext_check_inode:520: 
inode #41029305: comm rsync: pblk 0 bad header/extent: invalid 
eh_entries - magic f30a, entries 17, max 4(4), depth 0(0)
[ +26.890018] nvme nvme0: I/O 0 (Write) QID 1 timeout, aborting
[Dec15 21:35] nvme nvme0: I/O 0 QID 1 timeout, reset controller
[ +30.719998] nvme nvme0: I/O 13 QID 0 timeout, reset controller
[Dec15 21:38] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.014796] nvme nvme0: Abort status: 0x371
[Dec15 21:40] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000024] nvme nvme0: Removing after probe failure status: -19
[Dec15 21:42] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
[  +0.000324] nvme0n1: detected capacity change from 976773168 to 0
[  +0.000010] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558258 starting block 27121876)
[  +0.000046] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558302 starting block 27335746)
[  +0.000061] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558257 starting block 27199501)
[  +0.000012] Buffer I/O error on device nvme0n1p3, logical block 27104980
[  +0.000006] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558301 starting block 26915262)
[  +0.000010] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558287 starting block 27335718)
[  +0.000006] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558288 starting block 27335719)
[  +0.000003] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558289 starting block 27335721)
[  +0.000005] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558290 starting block 27335723)
[  +0.000006] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558291 starting block 27335724)
[  +0.000017] Buffer I/O error on device nvme0n1p3, logical block 27104981
[  +0.000006] EXT4-fs warning (device nvme0n1p3): ext4_end_bio:343: I/O 
error 10 writing to inode 6558257 starting block 27198798)
[  +0.000193] Buffer I/O error on device nvme0n1p3, logical block 27104982
[  +0.000018] Buffer I/O error on device nvme0n1p3, logical block 27104983
[  +0.000019] Buffer I/O error on device nvme0n1p3, logical block 27104984
[  +0.000031] Buffer I/O error on device nvme0n1p3, logical block 27318850
[  +0.000028] Buffer I/O error on device nvme0n1p3, logical block 26898366
[  +0.000020] Buffer I/O error on device nvme0n1p3, logical block 26898367
[  +0.000018] Buffer I/O error on device nvme0n1p3, logical block 26898368
[  +0.000019] Buffer I/O error on device nvme0n1p3, logical block 26898369
[  +0.000092] Aborting journal on device nvme0n1p3-8.
[  +0.000057] Buffer I/O error on dev nvme0n1p3, logical block 60850176, 
lost sync page write
[  +0.000057] EXT4-fs error (device nvme0n1p3): 
ext4_journal_check_start:83: comm kworker/u8:2: Detected aborted journal
[  +0.000022] JBD2: I/O error when updating journal superblock for 
nvme0n1p3-8.
[  +0.000093] Buffer I/O error on dev nvme0n1p3, logical block 0, lost 
sync page write
[  +0.000029] EXT4-fs (nvme0n1p3): I/O error while writing superblock
[  +0.000017] EXT4-fs (nvme0n1p3): Remounting filesystem read-only
[  +0.000018] EXT4-fs (nvme0n1p3): ext4_writepages: jbd2_start: 12288 
pages, ino 6558303; err -30
[  +0.000377] Buffer I/O error on dev nvme0n1p3, logical block 26214412, 
lost async page write
[  +0.000030] Buffer I/O error on dev nvme0n1p3, logical block 26214705, 
lost async page write
[  +0.000023] Buffer I/O error on dev nvme0n1p3, logical block 26214706, 
lost async page write
[  +0.000026] Buffer I/O error on dev nvme0n1p3, logical block 26214707, 
lost async page write
[  +0.000024] Buffer I/O error on dev nvme0n1p3, logical block 26214708, 
lost async page write
[  +0.000021] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558291, error -30)
[  +0.000011] Buffer I/O error on dev nvme0n1p3, logical block 26214709, 
lost async page write
[  +0.000026] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558256, error -30)
[  +0.000018] Buffer I/O error on dev nvme0n1p3, logical block 26214710, 
lost async page write
[  +0.000047] Buffer I/O error on dev nvme0n1p3, logical block 26214711, 
lost async page write
[  +0.000305] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558292, error -30)
[  +0.000044] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558293, error -30)
[  +0.000036] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558294, error -30)
[  +0.000049] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558295, error -30)
[  +0.000047] EXT4-fs (nvme0n1p3): failed to convert unwritten extents 
to written extents -- potential data loss!  (inode 6558296, error -30)

=====================================================================

syslog extract:

Dec 15 21:08:13 DellXPS kernel: printk: udevd: 5 output lines suppressed 
due to ratelimiting
Dec 15 21:14:30 DellXPS kernel: EXT4-fs (nvme0n1p3): mounted filesystem 
with ordered data mode. Quota mode: disabled.
Dec 15 21:17:00 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #53085366: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:42 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #5506059: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:51 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #5506082: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:58 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41180419: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:58 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41180440: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:58 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41180229: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:58 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41182425: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:17:58 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41178319: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:32:43 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #38936868: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:32:43 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #38936901: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:34:03 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41029460: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:34:25 DellXPS kernel: EXT4-fs error (device dm-0): 
ext4_ext_check_inode:520: inode #41029305: comm rsync: pblk 0 bad 
header/extent: invalid eh_entries - magic f30a, entries 17, max 4(4), 
depth 0(0)
Dec 15 21:34:52 DellXPS kernel: nvme nvme0: I/O 0 (Write) QID 1 timeout, 
aborting
Dec 15 21:35:23 DellXPS kernel: nvme nvme0: I/O 0 QID 1 timeout, reset 
controller
Dec 15 21:35:54 DellXPS kernel: nvme nvme0: I/O 13 QID 0 timeout, reset 
controller
Dec 15 21:38:25 DellXPS kernel: nvme nvme0: Device not ready; aborting 
reset, CSTS=0x1
Dec 15 21:38:25 DellXPS kernel: nvme nvme0: Abort status: 0x371
Dec 15 21:40:25 DellXPS kernel: nvme nvme0: Device not ready; aborting 
reset, CSTS=0x1
Dec 15 21:40:25 DellXPS kernel: nvme nvme0: Removing after probe failure 
status: -19
Dec 15 21:42:26 DellXPS kernel: nvme nvme0: Device not ready; aborting 
reset, CSTS=0x1
Dec 15 21:42:26 DellXPS kernel: nvme0n1: detected capacity change from 
976773168 to 0
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558258 starting block 
27121876)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558302 starting block 
27335746)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558257 starting block 
27199501)
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27104980
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558301 starting block 
26915262)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558287 starting block 
27335718)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558288 starting block 
27335719)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558289 starting block 
27335721)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558290 starting block 
27335723)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558291 starting block 
27335724)
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27104981
Dec 15 21:42:26 DellXPS kernel: EXT4-fs warning (device nvme0n1p3): 
ext4_end_bio:343: I/O error 10 writing to inode 6558257 starting block 
27198798)
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27104982
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27104983
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27104984
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 27318850
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 26898366
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 26898367
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 26898368
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on device nvme0n1p3, 
logical block 26898369
Dec 15 21:42:26 DellXPS kernel: Aborting journal on device nvme0n1p3-8.
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 60850176, lost sync page write
Dec 15 21:42:26 DellXPS kernel: EXT4-fs error (device nvme0n1p3): 
ext4_journal_check_start:83: comm kworker/u8:2: Detected aborted journal
Dec 15 21:42:26 DellXPS kernel: JBD2: I/O error when updating journal 
superblock for nvme0n1p3-8.
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 0, lost sync page write
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): I/O error while 
writing superblock
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): Remounting 
filesystem read-only
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): ext4_writepages: 
jbd2_start: 12288 pages, ino 6558303; err -30
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214412, lost async page write
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214705, lost async page write
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214706, lost async page write
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214707, lost async page write
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214708, lost async page write
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558291, error -30)
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214709, lost async page write
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558256, error -30)
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214710, lost async page write
Dec 15 21:42:26 DellXPS kernel: Buffer I/O error on dev nvme0n1p3, 
logical block 26214711, lost async page write
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558292, error -30)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558293, error -30)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558294, error -30)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558295, error -30)
Dec 15 21:42:26 DellXPS kernel: EXT4-fs (nvme0n1p3): failed to convert 
unwritten extents to written extents -- potential data loss!  (inode 
6558296, error -30)


On 12/15/22 6:09 PM, Christoph Hellwig wrote:
> On Thu, Dec 15, 2022 at 06:07:32PM +0900, J. Hart wrote:
>> My apologies....I am not sure what is meant by "stay DMAs".  Is there
>> something I can look for here ?
> 
> I mean stray, sorry.  There isn't really anything you can look for.
> Either it really is a device problem, in which case the IOMMU should
> catch it.  Or it is a kernel problem somewhere, in which case
> CONFIG_KASAN would catch.  So maybe enable that as well, but it will
> slow down the kernel a LOT.

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15 13:33       ` J. Hart
@ 2022-12-15 17:34         ` Keith Busch
  2022-12-15 22:30           ` J. Hart
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2022-12-15 17:34 UTC (permalink / raw)
  To: J. Hart; +Cc: Christoph Hellwig, linux-nvme, axboe, sagi

On Thu, Dec 15, 2022 at 10:33:30PM +0900, J. Hart wrote:
> [ +26.890018] nvme nvme0: I/O 0 (Write) QID 1 timeout, aborting
> [Dec15 21:35] nvme nvme0: I/O 0 QID 1 timeout, reset controller
> [ +30.719998] nvme nvme0: I/O 13 QID 0 timeout, reset controller
> [Dec15 21:38] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [  +0.014796] nvme nvme0: Abort status: 0x371
> [Dec15 21:40] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [  +0.000024] nvme nvme0: Removing after probe failure status: -19
> [Dec15 21:42] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [  +0.000324] nvme0n1: detected capacity change from 976773168 to 0

This looks like your device is completely unresponsive: no ack to IO
commands, admin commands, or reset sequences. Unfortunately these are
typically firmware bugs. Without additional guidance from the vendor,
we don't really have many options to try from the driver: just disabling
some optional power and performance capabilities, though that often
doesn't help either.
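[Editorial note: the power/performance knobs usually meant here are NVMe APST and PCIe link power management; a commonly tried combination of kernel parameters, offered as a workaround attempt with no guarantee for this particular device, is sketched below.]

```shell
# Kernel command-line parameters often tried for NVMe devices that stop
# responding under load (workarounds, not fixes):
#
#   nvme_core.default_ps_max_latency_us=0   # disable APST power-state
#                                           # transitions
#   pcie_aspm=off                           # disable PCIe active-state
#                                           # power management
#
# After booting with them, confirm the module parameter took effect:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```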



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15 17:34         ` Keith Busch
@ 2022-12-15 22:30           ` J. Hart
  2022-12-16  6:39             ` Christoph Hellwig
  2023-01-18 10:27             ` Mark Ruijter
  0 siblings, 2 replies; 27+ messages in thread
From: J. Hart @ 2022-12-15 22:30 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, axboe, sagi

I've tried the obvious ones and that didn't help either.  I guess I'll 
have to give up on it and return it as defective.  I'll go back to 
normal operation and try to find a controller/device combination 
that works with the Linux driver, if there are any.

In any case, thanks again very much for your kind assistance.

J. Hart

On 12/16/22 2:34 AM, Keith Busch wrote:
> On Thu, Dec 15, 2022 at 10:33:30PM +0900, J. Hart wrote:
>> [ +26.890018] nvme nvme0: I/O 0 (Write) QID 1 timeout, aborting
>> [Dec15 21:35] nvme nvme0: I/O 0 QID 1 timeout, reset controller
>> [ +30.719998] nvme nvme0: I/O 13 QID 0 timeout, reset controller
>> [Dec15 21:38] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
>> [  +0.014796] nvme nvme0: Abort status: 0x371
>> [Dec15 21:40] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
>> [  +0.000024] nvme nvme0: Removing after probe failure status: -19
>> [Dec15 21:42] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
>> [  +0.000324] nvme0n1: detected capacity change from 976773168 to 0
> 
> This looks like your device is completely unresponsive: no ack to IO
> commands, admin commands, or reset sequences. Unfortunately these are
> typically firmware bugs. Without additional guidance from the vendor,
> we don't really have many options to try from the driver: just disabling
> some optional power and performance capabilities, though that often
> doesn't help either.




* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15 22:30           ` J. Hart
@ 2022-12-16  6:39             ` Christoph Hellwig
  2022-12-16 19:08               ` Keith Busch
  2023-01-18 10:27             ` Mark Ruijter
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2022-12-16  6:39 UTC (permalink / raw)
  To: J. Hart; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, axboe, sagi

On Fri, Dec 16, 2022 at 07:30:55AM +0900, J. Hart wrote:
> I've tried the obvious ones and that didn't help either.  I guess I'll have 
> to give up on it and return it as defective.  I'll go back to normal 
> operation and try to find a controller/device combination that works 
> with the Linux driver, if there are any.

So on the one hand I agree with Keith that the device seems really broken.
On the other hand, the fact that the source file system on another device
sees corruption even with the IOMMU enabled is something that looks
scary.  Even if ultimately caused by the device somehow, it seems
like the kernel is part of the corruption.  And I have absolutely no
idea how.  A KASAN run on the system might be helpful, but I'm also
reluctant to ask a reporter to run more reproducers of something that
corrupts his data.



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-16  6:39             ` Christoph Hellwig
@ 2022-12-16 19:08               ` Keith Busch
  0 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2022-12-16 19:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: J. Hart, linux-nvme, axboe, sagi

On Fri, Dec 16, 2022 at 07:39:53AM +0100, Christoph Hellwig wrote:
> On Fri, Dec 16, 2022 at 07:30:55AM +0900, J. Hart wrote:
> > I've tried the obvious ones and that didn't help either.  I guess I'll have 
> > to give up on it and return it as defective.  I'll go back to normal 
> > operation and try to find a controller/device combination that works 
> > with the Linux driver, if there are any.
> 
> So on the one hand I agree with Keith that the device seems really broken.
> On the other hand, the fact that the source file system on another device
> sees corruption even with the IOMMU enabled is something that looks
> scary.  Even if ultimately caused by the device somehow, it seems
> like the kernel is part of the corruption.  And I have absolutely no
> idea how.  A KASAN run on the system might be helpful, but I'm also
> reluctant to ask a reporter to run more reproducers of something that
> corrupts his data.

Oh, I assumed the source was a different partition on the same flakey
looking device. If not, yeah, that's pretty concerning.

How do you know enabling Intel IOMMU in the kernel config does anything
here? I didn't see anything confirming the kernel was actually using it.
I know this CPU model has VT-d capabilities, but I believe the platform
may disable it in BIOS.
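
One quick way to check whether the IOMMU actually came up (a sketch; the exact log strings vary by kernel version, and all commands are read-only):

```shell
# helper: report whether a kernel log mentions DMAR/IOMMU bring-up
has_iommu_lines() { grep -qiE 'DMAR|IOMMU' && echo yes || echo no; }

dmesg 2>/dev/null | has_iommu_lines          # "yes" suggests VT-d initialized
ls /sys/class/iommu/ 2>/dev/null || true     # empty when no IOMMU is active
grep -o 'intel_iommu=[^ ]*' /proc/cmdline 2>/dev/null || true
```

An empty /sys/class/iommu/ is a strong hint the BIOS (or the kernel command line) left VT-d off, regardless of the kernel config.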



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15  1:38 nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed J. Hart
  2022-12-15  8:23 ` Christoph Hellwig
@ 2022-12-16 23:16 ` Keith Busch
  2022-12-17  1:28   ` J. Hart
  1 sibling, 1 reply; 27+ messages in thread
From: Keith Busch @ 2022-12-16 23:16 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, axboe, hch, sagi

On Thu, Dec 15, 2022 at 10:38:33AM +0900, J. Hart wrote:
> 02:00.0 Non-Volatile memory controller: Kingston Technologies Device 500f
> (rev 03) (prog-if 02)
>         Subsystem: Kingston Technologies Device 500f
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
>         Capabilities: [70] Express Endpoint, MSI 00
>         Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
>         Kernel driver in use: nvme

Seems odd that the nvme driver is in use, but MSI/MSI-X are not. We
really don't have a lot of testing with legacy IRQ.

Could you add the output from 'lspci -vvv -s 02:00.0'?
 
> CPU (quad core, cpu 0 shown, others the same):
> 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 23
> model name	: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
> stepping	: 7

That's a pretty old processor for an M.2 slotted drive. Are you using a
retimer or some type of AIC adapter card?



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-16 23:16 ` Keith Busch
@ 2022-12-17  1:28   ` J. Hart
  2022-12-19 14:41     ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-17  1:28 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi

On 12/17/22 8:16 AM, Keith Busch wrote:

> Seems odd that the nvme driver is in use, but MSI/MSI-x are not. We
> really don't have a lot of testing with legacy IRQ.
> 
> Could you add the output from 'lspci -vvv -s 02:00.0'?

Here is what I have from "lspci -vvv -s 02:00.0":

02:00.0 Non-Volatile memory controller: Kingston Technologies Device 500f (rev 03) (prog-if 02)
         Subsystem: Kingston Technologies Device 500f
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 16
         Region 0: Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
         Capabilities: [40] Power Management version 3
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
                 Address: 0000000000000000  Data: 0000
                 Masking: 00000000  Pending: 00000000
         Capabilities: [70] Express (v2) Endpoint, MSI 00
                 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                         MaxPayload 128 bytes, MaxReadReq 512 bytes
                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                 LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
                         ClockPM+ Surprise- LLActRep- BwNot-
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                          Compliance De-emphasis: -6dB
                 LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
         Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
                 Vector table: BAR=0 offset=00002000
                 PBA: BAR=0 offset=00002100
         Kernel driver in use: nvme

>> CPU (quad core, cpu 0 shown, others the same):
>>
>> processor	: 0
>> vendor_id	: GenuineIntel
>> cpu family	: 6
>> model		: 23
>> model name	: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
>> stepping	: 7
> 
> That's a pretty old processor for an M.2 slotted drive. Are you using a
> retimer or some type of AIC adapter card?

The controller I have is this:

MZHOU M.2 NVME SSD-PCIe 4.0 X4 adapter
Key-M NGFF PCI-E 3.0, 2.0, or 1.0 controller expansion cards
(2230 2242 2260 2280 22110 M.2 SSD)

The processor I have is indeed an older one, but times are hard as I am not working at present and must economize....:-)

I should also note that I have a replacement nvme drive coming and will be replacing the Kingston SNVSE500G with
a Samsung 970 EVO Plus 500 GB. That model has been confirmed as working under Linux according to what I
have read.  I should have that today or tomorrow and will be sending back the Kingston device.

Please let me know if you need additional testing or results.

With Thanks and Best Regards for the holidays to you and yours,

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-17  1:28   ` J. Hart
@ 2022-12-19 14:41     ` Keith Busch
  2022-12-20  1:10       ` J. Hart
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2022-12-19 14:41 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, axboe, hch, sagi

On Sat, Dec 17, 2022 at 10:28:58AM +0900, J. Hart wrote:
> 02:00.0 Non-Volatile memory controller: Kingston Technologies Device 500f (rev 03) (prog-if 02)
>         Subsystem: Kingston Technologies Device 500f
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 16
>         Region 0: Memory at ef9fc000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
>                 Address: 0000000000000000  Data: 0000
>                 Masking: 00000000  Pending: 00000000
>         Capabilities: [70] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-

Given the potential flakiness of read corruption, I'd disable relaxed
ordering and see if that improves anything.
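
One way to do that from userspace, sketched below under the assumptions from the lspci dump above (device at 02:00.0, Express capability at [70]; Device Control is at capability offset 0x8 and Enable Relaxed Ordering is bit 4 per the PCIe spec). The change does not survive a controller reset or reboot:

```shell
# ro_clear: given Device Control as 4 hex digits, clear bit 4
# (Enable Relaxed Ordering) and print the new value.
ro_clear() { printf '%04x' $(( 0x$1 & ~0x0010 )); }

# only touch hardware when the tool and device are present (run as root)
if command -v setpci >/dev/null && [ -e /sys/bus/pci/devices/0000:02:00.0 ]; then
  cur=$(setpci -s 02:00.0 CAP_EXP+0x8.w)
  setpci -s 02:00.0 CAP_EXP+0x8.w="$(ro_clear "$cur")"
fi
```

setpci's CAP_EXP keyword resolves the Express capability offset itself, so this works even if the capability moves from [70] on another device.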

>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>                 LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
>                         ClockPM+ Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Something seems off if it's downtraining to Gen1 x1. I believe this
setup should be capable of Gen2 x4. It sounds like the links among these
components may not be reliable.

Your first post mentioned the total transfer was 50GB. If you have deep
enough queues, the tail latency will exceed the default timeout values
when you're limited to that kind of bandwidth. You'd probably be better
off from a performance standpoint with a cheaper SATA SSD on AHCI.
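
For scale: Gen1 x1 is about 250 MB/s raw (roughly 200 MB/s after 8b/10b encoding), versus ~2 GB/s raw for Gen2 x4. The negotiated link can be read back from sysfs (a sketch; assumes the device is still bound at 0000:02:00.0):

```shell
d=/sys/bus/pci/devices/0000:02:00.0
# read a link attribute, falling back when the device is not present
link_speed() { cat "$1/current_link_speed" 2>/dev/null || echo unknown; }
link_width() { cat "$1/current_link_width" 2>/dev/null || echo unknown; }

echo "negotiated: $(link_speed "$d") x$(link_width "$d")"
cat "$d/max_link_speed" "$d/max_link_width" 2>/dev/null || true
```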



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-19 14:41     ` Keith Busch
@ 2022-12-20  1:10       ` J. Hart
  2022-12-20 16:56         ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-20  1:10 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi

On 12/19/22 11:41 PM, Keith Busch wrote:
> Given the potential flakiness of read corruption, I'd disable relaxed
> ordering and see if that improves anything.

I am not familiar with this part.  How is this done ?

> 
>>                          MaxPayload 128 bytes, MaxReadReq 512 bytes
>>                  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>>                  LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
>>                          ClockPM+ Surprise- LLActRep- BwNot-
>>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-


> Something seems off if it's downtraining to Gen1 x1. I believe this
> setup should be capable of Gen2 x4. It sounds like the links among these
> components may not be reliable.
> 
> Your first post mentioned the total transfer was 50GB. If you have deep
> enough queues, the tail latency will exceed the default timeout values
> when you're limited to that kind of bandwidth. You'd probably be better
> off from a performance standpoint with a cheaper SATA SSD on AHCI.

It would be unfortunate I think if the linux driver could not be made to 
implement the NVME standards on the somewhat older equipment from 
perhaps ten or fifteen years ago.  Earlier than that is perhaps not 
terribly practical of course.  Equipment like that which is still 
operating does tend to be reliable, and it's something of a shame to 
have to waste it. Some of us also do lack the wherewithal to update 
equipment every two years, especially older people or those in areas 
where the economy is not so good.  As I think we all know, there's more 
of that these days than we'd like.....:-)

In any case, I'm very willing to run tests on this equipment if that 
will help.  I'm fairly familiar with building kernels, writing software 
and that sort of thing, but perhaps less so with fixing drivers.

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-20  1:10       ` J. Hart
@ 2022-12-20 16:56         ` Keith Busch
  2022-12-21  7:50           ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2022-12-20 16:56 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, axboe, hch, sagi

On Tue, Dec 20, 2022 at 10:10:30AM +0900, J. Hart wrote:
> On 12/19/22 11:41 PM, Keith Busch wrote:
> > Given the potential flakiness of read corruption, I'd disable relaxed
> > ordering and see if that improves anything.
> 
> I am not familiar with this part.  How is this done ?
> 
> > 
> > >                          MaxPayload 128 bytes, MaxReadReq 512 bytes
> > >                  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
> > >                  LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
> > >                          ClockPM+ Surprise- LLActRep- BwNot-
> > >                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> > >                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > >                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> 
> > Something seems off if it's downtraining to Gen1 x1. I believe this
> > setup should be capable of Gen2 x4. It sounds like the links among these
> > components may not be reliable.
> > 
> > Your first post mentioned the total transfer was 50GB. If you have
> > deep enough queues, the tail latency will exceed the default timeout
> > values when you're limited to that kind of bandwidth. You'd probably
> > be better off from a performance standpoint with a cheaper SATA SSD
> > on AHCI.
> 
> It would be unfortunate I think if the linux driver could not be made to
> implement the NVME standards on the somewhat older equipment from perhaps
> ten or fifteen years ago.  Earlier than that is perhaps not terribly
> practical of course.  Equipment like that which is still operating does tend
> to be reliable, and it's something of a shame to have to waste it. Some of
> us also do lack the wherewithal to update equipment every two years,
> especially older people or those in areas where the economy is not so good.
> As I think we all know, there's more of that these days than we'd
> like.....:-)
> 
> In any case, I'm very willing to run tests on this equipment if that will
> help.  I'm fairly familiar with building kernels, writing software and that
> sort of thing, but perhaps less so with fixing drivers.

For the record, the linux driver does implement the nvme standards and
works fine on older equipment capable of implementing it.

The problem you're describing sounds closer to the pcie phy layer, far
below the nvme protocol. There's really not a lot we can do at the
kernel layer to say for sure, though; you'd need something like an
expensive pcie protocol analyzer to really confirm. But even if we did
have that kind of data, it's unlikely to reveal a viable work-around.
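
Short of an analyzer, the error evidence the kernel and the device already latch is worth checking; the lspci dump earlier in the thread shows DevSta: CorrErr+, i.e. the endpoint has logged at least one correctable error. A read-only sketch:

```shell
# count AER / PCIe bus error lines in a kernel log read from stdin
count_pcie_errs() { grep -ciE 'AER|pcie bus error|Bad(TLP|DLLP)' || true; }

dmesg 2>/dev/null | count_pcie_errs
lspci -vvv -s 02:00.0 2>/dev/null | grep -E 'DevSta|CESta|UESta' || true
```

A nonzero count, or growing CESta bits between runs, would point at the link rather than the nvme protocol layer.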

Though I am skeptical, Christoph seemed to also think there was a
possibility you hit a real kernel issue with your setup, but I don't
know if he has any ideas other than enabling KASAN to see if that
catches anything.



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-20 16:56         ` Keith Busch
@ 2022-12-21  7:50           ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2022-12-21  7:50 UTC (permalink / raw)
  To: Keith Busch; +Cc: J. Hart, linux-nvme, axboe, hch, sagi

On Tue, Dec 20, 2022 at 09:56:23AM -0700, Keith Busch wrote:
> Though I am skeptical, Christoph seemed to also think there was a
> possibility you hit a real kernel issue with your setup, but I don't
> know if he has any ideas other than enabling KASAN to see if that
> catches anything.

Sorry for the delay, caught the nasty cold bugs circulating everywhere
and was mostly knocked out for a couple of days.

I can't really think of anything specific, but when we see random
memory corruption, there's basically two major options:

 - something DMAing where it should not.  In general an IOMMU should
   catch that if it is actually enabled.  I think Keith rightly questioned
   whether VT-d is actually running here and not disabled by the BIOS, and
   I don't remember a dmesg disproving that.  Even then, there could
   still be some devices opting out of the IOMMU in the BIOS.
 - the kernel overwriting random data.  This should be really rare, but
   could happen, and KASAN should catch it.  But I really have no idea
   what it would be.



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-15 22:30           ` J. Hart
  2022-12-16  6:39             ` Christoph Hellwig
@ 2023-01-18 10:27             ` Mark Ruijter
  1 sibling, 0 replies; 27+ messages in thread
From: Mark Ruijter @ 2023-01-18 10:27 UTC (permalink / raw)
  To: jfhart085, Keith Busch; +Cc: Christoph Hellwig, linux-nvme, axboe, sagi

For what it's worth, I see the exact same problem while running SUSE Linux Enterprise Server 15 SP3.

lithium:~ # dmesg | grep nvme4
[    3.371400] nvme nvme4: pci function 0000:21:00.0
[   41.333886] nvme nvme4: Device not ready; aborting reset, CSTS=0x9
[   41.334802] nvme nvme4: Removing after probe failure status: -19
[  759.291672] nvme nvme4: pci function 0000:21:00.0
[  797.300033] nvme nvme4: Device not ready; aborting reset, CSTS=0x9
[  797.300038] nvme nvme4: Removing after probe failure status: -19
lithium:~ #

Attempts to recover from this state by removing the drives from the PCI space and rescanning the PCI bus also fail.
Rebooting the system does solve it.

It's fairly easy to reproduce the problem on systems that contain >= 8 drives.

Thanks,

Mark Ruijter

On 15/12/2022, 23:31, "Linux-nvme on behalf of J. Hart" <linux-nvme-bounces@lists.infradead.org on behalf of jfhart085@gmail.com> wrote:

    I've tried the obvious ones and that didn't help either.  I guess I'll 
    have to give up on it and return it as defective.  I'll go back to 
    normal operation and try to find a controller/device combination 
    that works with the Linux driver, if there are any.

    In any case, thanks again very much for your kind assistance.

    J. Hart

    On 12/16/22 2:34 AM, Keith Busch wrote:
    > On Thu, Dec 15, 2022 at 10:33:30PM +0900, J. Hart wrote:
    >> [ +26.890018] nvme nvme0: I/O 0 (Write) QID 1 timeout, aborting
    >> [Dec15 21:35] nvme nvme0: I/O 0 QID 1 timeout, reset controller
    >> [ +30.719998] nvme nvme0: I/O 13 QID 0 timeout, reset controller
    >> [Dec15 21:38] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.014796] nvme nvme0: Abort status: 0x371
    >> [Dec15 21:40] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.000024] nvme nvme0: Removing after probe failure status: -19
    >> [Dec15 21:42] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.000324] nvme0n1: detected capacity change from 976773168 to 0
    > 
    > This looks like your device is completely unresponsive: no ack to IO
    > commands, admin commands, or reset sequences. Unfortunately these are
    > typically firmware bugs. Without additional guidance from the vendor,
    > we don't really have many options to try from the driver: just disabling
    > some optional power and performance capabilities, though that often
    > doesn't help either.





* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-19 23:40   ` J. Hart
@ 2022-12-20 18:10     ` Keith Busch
  0 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2022-12-20 18:10 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, axboe, hch, sagi

On Tue, Dec 20, 2022 at 08:40:58AM +0900, J. Hart wrote:
> My apologies on that last, it was a typo. I should have said /dev/nvme0n1p2.
> 
> There are two ext4 partitions on the nvme drive.
> One is /dev/nvme0n1p2, which is a 64 MB partition.
> The other is /dev/nvme0n1p3, which is the remainder of that 500GB drive.

And what about nvme0n1p1? What are the offsets and total sizes of these?
What is the logical block format size of your nvme namespace? I'm asking
because some drives are known to behave badly if accesses are not aligned
to their NAND page size. NVMe doesn't actually provide a way to know what
that size is, but if all partitions are aligned to a power of 2 of at
least 16k, then the partition setup is probably fine.
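
That alignment rule can be checked from sysfs (a sketch; assumes the drive is nvme0n1, and uses the fact that sysfs partition start offsets are reported in 512-byte units, so 16 KiB alignment means the start sector is a multiple of 32):

```shell
# aligned_16k: "yes" when a start sector (512 B units) is 16 KiB aligned
aligned_16k() { if [ $(( $1 % 32 )) -eq 0 ]; then echo yes; else echo no; fi; }

for s in /sys/block/nvme0n1/nvme0n1p*/start; do
  [ -r "$s" ] && echo "$s: $(aligned_16k "$(cat "$s")")" || true
done
cat /sys/block/nvme0n1/queue/logical_block_size 2>/dev/null || true
```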



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-19 14:45 ` Keith Busch
  2022-12-19 23:40   ` J. Hart
@ 2022-12-20 14:04   ` J. Hart
  1 sibling, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-20 14:04 UTC (permalink / raw)
  To: linux-nvme; +Cc: axboe, hch, sagi, Keith Busch


As I mentioned in an earlier note, I have updated the util-linux and 
e2fsprogs packages to see if that makes any difference.  The versions I 
was using were rather old.  As expected, it unfortunately did not help, 
although it was a much needed update.

I am out of things to try.

My apologies for being rather a pest about this, but I was hoping I 
wouldn't have to scrap the time and money I put into it. I'll leave you 
all undisturbed and make no more requests of you.

With Thanks for your kind attention,

J. Hart

On 12/19/22 11:45 PM, Keith Busch wrote:
> On Sun, Dec 18, 2022 at 09:08:19PM +0900, J. Hart wrote:
>> Here are some consecutive fsck runs done on /dev/nvme0n1p3:
>>
>> -bash-3.2# fsck -f -C /dev/nvme0n1p2
> 
> I'm having some trouble following some of this. You mentioned nvme0n1p3
> corruption, but then show nvme0n1p2 instead. Could you possibly relay
> your recreation steps starting from a freshly formatted nvme drive?
> Partition setup, device mappers, filesystems, mount options, etc.?




* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-19 14:45 ` Keith Busch
@ 2022-12-19 23:40   ` J. Hart
  2022-12-20 18:10     ` Keith Busch
  2022-12-20 14:04   ` J. Hart
  1 sibling, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-19 23:40 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi

My apologies on that last, it was a typo. I should have said /dev/nvme0n1p2.

There are two ext4 partitions on the nvme drive.
One is /dev/nvme0n1p2, which is a 64 MB partition.
The other is /dev/nvme0n1p3, which is the remainder of that 500GB drive.

The 50 GB transfer was to partition 3 (/dev/nvme0n1p3)

The consecutive fsck runs in the previous message were to the 64 MB 
partition 2 (/dev/nvme0n1p2).

I am presently updating the util-linux and e2fsprogs packages to see if 
that makes any difference.  The versions I was using were rather old.

On 12/19/22 11:45 PM, Keith Busch wrote:
> On Sun, Dec 18, 2022 at 09:08:19PM +0900, J. Hart wrote:
>> Here are some consecutive fsck runs done on /dev/nvme0n1p3:
>>
>> -bash-3.2# fsck -f -C /dev/nvme0n1p2
> 
> I'm having some trouble following some of this. You mentioned nvme0n1p3
> corruption, but then show nvme0n1p2 instead. Could you possibly relay
> your recreation steps starting from a freshly formatted nvme drive?
> Partition setup, device mappers, filesystems, mount options, etc.?




* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
  2022-12-18 12:08 J. Hart
@ 2022-12-19 14:45 ` Keith Busch
  2022-12-19 23:40   ` J. Hart
  2022-12-20 14:04   ` J. Hart
  0 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2022-12-19 14:45 UTC (permalink / raw)
  To: J. Hart; +Cc: linux-nvme, axboe, hch, sagi

On Sun, Dec 18, 2022 at 09:08:19PM +0900, J. Hart wrote:
> Here are some consecutive fsck runs done on /dev/nvme0n1p3:
> 
> -bash-3.2# fsck -f -C /dev/nvme0n1p2

I'm having some trouble following some of this. You mentioned nvme0n1p3
corruption, but then show nvme0n1p2 instead. Could you possibly relay
your recreation steps starting from a freshly formatted nvme drive?
Partition setup, device mappers, filesystems, mount options, etc.?



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-18 12:08 J. Hart
  2022-12-19 14:45 ` Keith Busch
  0 siblings, 1 reply; 27+ messages in thread
From: J. Hart @ 2022-12-18 12:08 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

Here are some consecutive fsck runs done on /dev/nvme0n1p3:

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -5716
Fix<y>? yes
 

/dev/nvme0n1p2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -5716
Fix<y>? yes

/dev/nvme0n1p2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 

-bash-3.2# fsck -f -C /dev/nvme0n1p2
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p2: 335/16384 files (0.6% non-contiguous), 7496/16384 blocks 


========================================================
There is a very similar partition (same contents) on a SATA drive on the 
system, and this does not happen there when I test it.

J. Hart



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-18  6:20 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-18  6:20 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

The following may be interesting to note:

This afternoon I also tried an fsck check on the smaller partition 
/dev/nvme0n1p2 (the 64MB one).  I made sure it was not mounted and 
repeatedly ran fsck on it with nothing else run between passes.  The 
results alternated between reporting no damage and reporting damage 
present; both outcomes occurred frequently.
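For anyone trying to reproduce this, the back-to-back runs can be 
scripted.  This is only a sketch under the assumption that a run is 
"damaged" exactly when e2fsck prints a "Block bitmap differences" line, 
as in the transcripts in this thread; the classifier is demonstrated on 
captured output rather than the live device.

```shell
# classify_fsck: read one fsck run's output on stdin and print "dirty"
# if it reported bitmap differences, "clean" otherwise.  A live test
# would pipe read-only runs into it in a loop, e.g.:
#   for i in 1 2 3 4 5; do fsck -fn /dev/nvme0n1p2 2>&1 | classify_fsck; done
classify_fsck() {
    if grep -q 'Block bitmap differences'; then
        echo dirty
    else
        echo clean
    fi
}

# Demonstration on two outputs captured in this thread:
printf 'Pass 5: Checking group summary information\nBlock bitmap differences:  -5716\n' | classify_fsck
printf 'Pass 5: Checking group summary information\n' | classify_fsck
```

Counting "dirty" versus "clean" lines over many iterations would quantify 
how often each outcome occurs.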

J. Hart

> I have done an fsck check on the /dev/nvme0n1p3 file system after the rsync invocation referenced earlier.  In the first run I found errors which fsck should have repaired if I understand correctly.  In repeating the fsck invocation immediately afterwards, I found errors again each time.
> This was done using the replacement Samsung nvme ssd (Samsung 970 EVO Plus 500G).




* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 21:57 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 21:57 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch


An additional note if I may:

Memory tests run overnight using memtest86plus-6.00 found no issues.
Please let me know if there is anything else you need from me.

With Thanks,

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 16:14 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 16:14 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch


> I also tried loading a smaller partition on the Samsung card at /dev/nvme0n1p3.  The copy stopped with a "no space left on device" error, which should not have been possible as the source device is a 32MB partition, and the destination partition on the nvme ssd is a 64MB partition.  The two files to be transferred were very small and could not have accounted for this as they totaled less than 5MB. I found file system damage on the nvme destination partition in this case as well. It also occurred repeatedly. I am still investigating this last case.
> 
> In no instance did I note any otherwise unusual log messages or errors from the nvme driver.
> 
> I do not yet know if there has been any damage to any other filesystems, but I will check.

A correction is in order above:
That smaller partition was at /dev/nvme0n1p2, not /dev/nvme0n1p3.  The 
former is a 64MB partition, the latter is much larger.

I have checked for damage in all the filesystems on all the non-NVME 
block devices on the system, and have found none since installing the 
Samsung ssd device.
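A sweep like that can be scripted; here is a hedged sketch where the 
filter over `lsblk -nrpo NAME,FSTYPE` output is an assumption about the 
device naming on this system, demonstrated on sample lsblk output rather 
than live devices.

```shell
# non_nvme_ext: keep only ext* partitions that are not on an NVMe
# device.  Real use would be:
#   lsblk -nrpo NAME,FSTYPE | non_nvme_ext | xargs -n1 fsck -fn
# (-n keeps the check read-only so nothing is modified).
non_nvme_ext() {
    awk '$2 ~ /^ext/ && $1 !~ /^\/dev\/nvme/ {print $1}'
}

# Sample lsblk output for demonstration:
printf '/dev/sda1 ext4\n/dev/nvme0n1p2 ext4\n/dev/sda2 swap\n' | non_nvme_ext
```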

I am presently unable to safely use an NVMe SSD drive as the behavior 
appears to be unstable.  I can still do testing with the Samsung drive 
if needed, but the Kingston has been removed and will be returned on 
Monday, local time (Japan Standard, as I'm in Kobe).

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 15:07 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 15:07 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

I have done an fsck check on the /dev/nvme0n1p3 file system after the 
rsync invocation referenced earlier.  In the first run fsck found errors 
which, if I understand correctly, it should have repaired.  On repeating 
the fsck invocation immediately afterwards, I found errors again each time.
This was done using the replacement Samsung nvme ssd (Samsung 970 EVO 
Plus 500G).

Here is the output :

-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts
Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -87187960 -122079572 
-122079736
Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts 

Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -122079736 

Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks
-bash-3.2# fsck -f -C /dev/nvme0n1p3
fsck from util-linux 2.34
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure 

Pass 3: Checking directory connectivity 

Pass 4: Checking reference counts
Pass 5: Checking group summary information 

Block bitmap differences:  -87163220 -87163384 -122079572 -122079736 

Fix<y>? yes
 

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 699563/30523392 files (0.1% non-contiguous), 
13841458/122079744 blocks


I also tried loading a smaller partition on the Samsung card at 
/dev/nvme0n1p3.  The copy stopped with a "no space left on device" 
error, which should not have been possible as the source device is a 
32MB partition, and the destination partition on the nvme ssd is a 64MB 
partition.  The two files to be transferred were very small and could 
not have accounted for this as they totaled less than 5MB. I found file 
system damage on the nvme destination partition in this case as well. 
It also occurred repeatedly. I am still investigating this last case.
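For reference, rough arithmetic on why ENOSPC should be impossible here; 
the metadata overhead figure is an assumption (on the live system 
`df -k` and `blockdev --getsize64` would give the exact numbers).

```shell
# Headroom check: a 64MB ext partition, minus a generous allowance for
# superblock/inode-table/reserved overhead, minus the ~5MB of files,
# should still leave tens of MB free.
part_mb=64
overhead_mb=8      # assumed filesystem metadata overhead
files_mb=5
echo "headroom_mb=$(( part_mb - overhead_mb - files_mb ))"
```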

In no instance did I note any otherwise unusual log messages or errors 
from the nvme driver.

I do not yet know if there has been any damage to any other filesystems, 
but I will check.

J. Hart



* Re: nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed
@ 2022-12-17 12:07 J. Hart
  0 siblings, 0 replies; 27+ messages in thread
From: J. Hart @ 2022-12-17 12:07 UTC (permalink / raw)
  To: linux-nvme; +Cc: jfhart085, kbusch, axboe, hch, sagi, hch

I have replaced the Kingston NV1-E 500 GB NVME SSD drive with a Samsung 
970 EVO Plus 500GB NVME SSD drive.  I retained the same PCIe controller 
(Mzhou M.2 NVME SSD-PCIe 4.0 X4 adapter).  I then attempted the same 
rsync transfer at single user run level as I had done before with the 
Kingston NVME SSD.  The transfer has apparently completed successfully 
and without incident.  No unusual log messages or corruption was observed.
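Given the earlier corruption, it may be worth verifying the completed 
transfer rather than trusting the clean rsync exit; a content-level 
comparison such as `diff -r`, or a second `rsync -nac --itemize-changes` 
pass, would catch silent mismatches.  A minimal self-contained sketch on 
temporary trees, standing in for the source filesystem and /mnt/root_new:

```shell
# Build two identical trees and compare them byte-for-byte; on the real
# system the operands would be the rsync source and /mnt/root_new.
src=$(mktemp -d); dst=$(mktemp -d)
echo hello > "$src/f"
cp -r "$src/." "$dst/"
if diff -r "$src" "$dst" > /dev/null; then
    echo "trees match"
else
    echo "trees differ"
fi
rm -rf "$src" "$dst"
```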

J. Hart



end of thread, other threads:[~2023-01-18 10:27 UTC | newest]

Thread overview: 27+ messages
2022-12-15  1:38 nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed J. Hart
2022-12-15  8:23 ` Christoph Hellwig
2022-12-15  9:07   ` J. Hart
2022-12-15  9:09     ` Christoph Hellwig
2022-12-15  9:15       ` J. Hart
2022-12-15 13:33       ` J. Hart
2022-12-15 17:34         ` Keith Busch
2022-12-15 22:30           ` J. Hart
2022-12-16  6:39             ` Christoph Hellwig
2022-12-16 19:08               ` Keith Busch
2023-01-18 10:27             ` Mark Ruijter
2022-12-16 23:16 ` Keith Busch
2022-12-17  1:28   ` J. Hart
2022-12-19 14:41     ` Keith Busch
2022-12-20  1:10       ` J. Hart
2022-12-20 16:56         ` Keith Busch
2022-12-21  7:50           ` Christoph Hellwig
2022-12-17 12:07 J. Hart
2022-12-17 15:07 J. Hart
2022-12-17 16:14 J. Hart
2022-12-17 21:57 J. Hart
2022-12-18  6:20 J. Hart
2022-12-18 12:08 J. Hart
2022-12-19 14:45 ` Keith Busch
2022-12-19 23:40   ` J. Hart
2022-12-20 18:10     ` Keith Busch
2022-12-20 14:04   ` J. Hart
