* Data corruption when using multiple devices with NVMEoF TCP
@ 2020-12-22 18:09 Hao Wang
  2020-12-22 19:29 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread

From: Hao Wang @ 2020-12-22 18:09 UTC (permalink / raw)
To: Linux-nvme

I'm using kernel 5.2.9 with the following related configs enabled:

CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
CONFIG_NVME_MULTIPATH=y
CONFIG_NVME_FABRICS=m
# CONFIG_NVME_FC is not set
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_LOOP=m
# CONFIG_NVME_TARGET_FC is not set
CONFIG_NVME_TARGET_TCP=m
CONFIG_RTC_NVMEM=y
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

On the target side, I exported 2 NVMe devices using tcp/ipv6:

[root@rtptest34337.prn2 ~/ext_nvme]# ll /sys/kernel/config/nvmet/ports/1/subsystems/
total 0
lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-1
lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-2

On the initiator side, I could successfully connect the 2 nvme devices, nvme1n1 & nvme2n1:

[root@rtptest34206.prn2 /]# nvme list
Node             SN           Model          Namespace Usage                      Format           FW Rev
---------------- ------------ -------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     ***********  INTEL *******  1         256.06 GB / 256.06 GB      512 B + 0 B      PSF119D
/dev/nvme1n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_
/dev/nvme2n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_

Then, for each of nvme1n1 & nvme2n1, I created a partition using fdisk; the partition type is "linux raid autodetect".
Next I created a RAID-0 volume, created a filesystem on it, and mounted it:

# mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128 /dev/nvme1n1p1 /dev/nvme2n1p1
# mkfs.xfs -f /dev/md5
# mkdir /flash
# mount -o rw,noatime,discard /dev/md5 /flash/

Now, when I copy a large directory into /flash/, a lot of files under /flash/ are corrupted.
Specifically, that large directory has a lot of .gz files, and gunzip fails on many of them;
a diff against the original files also shows they are different, although the file sizes are exactly the same.

I also found that if I don't create a RAID-0 array, and instead just make a filesystem directly on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no data corruption.

I'm wondering if there is a known issue, or if I'm doing something not really supported. Thanks!

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 23+ messages in thread
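[Editorial note: corruption of the kind described above (same size, different bytes) can be confirmed mechanically instead of waiting for gunzip to fail. A small sketch of such a check, assuming the source tree that was copied is still available for byte-wise comparison; the helper name and the example paths are invented for illustration, and absolute paths should be used:]

```shell
# count_corrupt SRC DST: byte-compare every file under SRC against the
# file at the same relative path under DST, and print the relative path
# of each file that differs.  SRC and DST are placeholders, not paths
# from the report; pass absolute paths (the function cd's into SRC).
count_corrupt() {
    src=$1; dst=$2
    ( cd "$src" && find . -type f -exec sh -c \
        'cmp -s "$1" "$2/$1" || echo "CORRUPT: $1"' _ {} "$dst" \; )
}
# e.g.  count_corrupt /data/original /flash/copy | wc -l
```

Running a check like this after each copy makes it easy to tell whether a given tuning change (scheduler, merge settings, request size) actually eliminates the corruption or merely makes it rarer.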
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 18:09 Data corruption when using multiple devices with NVMEoF TCP Hao Wang
@ 2020-12-22 19:29 ` Sagi Grimberg
  2020-12-22 19:58   ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-22 19:29 UTC (permalink / raw)
To: Hao Wang, Linux-nvme

Hey Hao,

> I'm using kernel 5.2.9 with following related configs enabled:
> CONFIG_NVME_CORE=y
> CONFIG_BLK_DEV_NVME=y
> CONFIG_NVME_MULTIPATH=y
> CONFIG_NVME_FABRICS=m
> # CONFIG_NVME_FC is not set
> CONFIG_NVME_TCP=m
> CONFIG_NVME_TARGET=m
> CONFIG_NVME_TARGET_LOOP=m
> # CONFIG_NVME_TARGET_FC is not set
> CONFIG_NVME_TARGET_TCP=m
> CONFIG_RTC_NVMEM=y
> CONFIG_NVMEM=y
> CONFIG_NVMEM_SYSFS=y
>
> On target side, I exported 2 NVMe devices using tcp/ipv6:
> [root@rtptest34337.prn2 ~/ext_nvme]# ll /sys/kernel/config/nvmet/ports/1/subsystems/
> total 0
> lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-1
> lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-2
>
> On initiator side, I could successfully connect the 2 nvme devices,
> nvme1n1 & nvme2n1;
> [root@rtptest34206.prn2 /]# nvme list
> Node             SN           Model          Namespace Usage                      Format           FW Rev
> ---------------- ------------ -------------- --------- -------------------------- ---------------- --------
> /dev/nvme0n1     ***********  INTEL *******  1         256.06 GB / 256.06 GB      512 B + 0 B      PSF119D
> /dev/nvme1n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_
> /dev/nvme2n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_
>
> Then for each of nvme1n1 & nvme2n1, I created a partition using fdisk;
> type is "linux raid autodetect";
> Next I created a RAID-0 volume, created a filesystem on it, and mounted it:
> # mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128 /dev/nvme1n1p1 /dev/nvme2n1p1
> # mkfs.xfs -f /dev/md5
> # mkdir /flash
> # mount -o rw,noatime,discard /dev/md5 /flash/
>
> Now, when I copy a large directory into /flash/, a lot of files under
> /flash/ are corrupted.
> Specifically, that large directory has a lot of .gz files, and gunzip
> fails on many of them;
> also diff with original files does show they are different, although
> the file size is exactly the same.

Sounds strange to me. Nothing forbids mounting a fs on a raid0 volume.

> Also I found that if I don't create a RAID-0 array, instead just make
> a filesystem on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no
> data corruption.
>
> I'm wondering if there is a known issue, or I'm doing something not
> really supported.

Did you try to run the same test locally on the target side without
having nvme-tcp/nvmet-tcp target in between?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 19:29 ` Sagi Grimberg
@ 2020-12-22 19:58   ` Hao Wang
  2020-12-23 8:41      ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread

From: Hao Wang @ 2020-12-22 19:58 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Linux-nvme

Also really strange to me. This has been burning me 16+ hours a day for
2 days now.

And for your question, yes I did.
Locally on the target side, no data corruption happens, with the same
process of creating a partition on each device, creating a 2-device
raid-0 volume, and creating a filesystem.
I have also tested on multiple sets of machines, with the same result.

Another point I should've mentioned is that the corruption does not
always happen. Sometimes if I only copy one .gz file (~100MB), it seems
fine. But whenever I copy a large directory with many .gz files (~100GB
in total), there are always some .gz files corrupted.

Hao

On Tue, Dec 22, 2020 at 11:29 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > I'm using kernel 5.2.9 with following related configs enabled:
> > CONFIG_NVME_CORE=y
> > CONFIG_BLK_DEV_NVME=y
> > CONFIG_NVME_MULTIPATH=y
> > CONFIG_NVME_FABRICS=m
> > # CONFIG_NVME_FC is not set
> > CONFIG_NVME_TCP=m
> > CONFIG_NVME_TARGET=m
> > CONFIG_NVME_TARGET_LOOP=m
> > # CONFIG_NVME_TARGET_FC is not set
> > CONFIG_NVME_TARGET_TCP=m
> > CONFIG_RTC_NVMEM=y
> > CONFIG_NVMEM=y
> > CONFIG_NVMEM_SYSFS=y
> >
> > On target side, I exported 2 NVMe devices using tcp/ipv6:
> > [root@rtptest34337.prn2 ~/ext_nvme]# ll /sys/kernel/config/nvmet/ports/1/subsystems/
> > total 0
> > lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-1
> > lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 -> ../../../../nvmet/subsystems/nvmet-rtptest34337-2
> >
> > On initiator side, I could successfully connect the 2 nvme devices,
> > nvme1n1 & nvme2n1;
> > [root@rtptest34206.prn2 /]# nvme list
> > Node             SN           Model          Namespace Usage                      Format           FW Rev
> > ---------------- ------------ -------------- --------- -------------------------- ---------------- --------
> > /dev/nvme0n1     ***********  INTEL *******  1         256.06 GB / 256.06 GB      512 B + 0 B      PSF119D
> > /dev/nvme1n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_
> > /dev/nvme2n1     ***********  Linux          1         900.19 GB / 900.19 GB      4 KiB + 0 B      5.2.9-0_
> >
> > Then for each of nvme1n1 & nvme2n1, I created a partition using fdisk;
> > type is "linux raid autodetect";
> > Next I created a RAID-0 volume, created a filesystem on it, and mounted it:
> > # mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128 /dev/nvme1n1p1 /dev/nvme2n1p1
> > # mkfs.xfs -f /dev/md5
> > # mkdir /flash
> > # mount -o rw,noatime,discard /dev/md5 /flash/
> >
> > Now, when I copy a large directory into /flash/, a lot of files under
> > /flash/ are corrupted.
> > Specifically, that large directory has a lot of .gz files, and gunzip
> > fails on many of them;
> > also diff with original files does show they are different, although
> > the file size is exactly the same.
>
> Sounds strange to me. Nothing forbids mounting a fs on a raid0 volume.
>
> > Also I found that if I don't create a RAID-0 array, instead just make
> > a filesystem on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no
> > data corruption.
> >
> > I'm wondering if there is a known issue, or I'm doing something not
> > really supported.
>
> Did you try to run the same test locally on the target side without
> having nvme-tcp/nvmet-tcp target in between?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 19:58 ` Hao Wang
@ 2020-12-23 8:41    ` Sagi Grimberg
  2020-12-23 8:43      ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-23 8:41 UTC (permalink / raw)
To: Hao Wang; +Cc: Linux-nvme

> Also really strange to me. This has been burning me 16+ hours a day
> for 2 days now.
>
> And for your question, yes I did.
> Locally on the target side, no data corruption happening, with the
> same process of creating a partition on each device, creating a
> 2-device raid-0 volume, and creating a filesystem.
> I have also tested on multiple sets of machines, with the same result.
>
> Another point I should've mentioned is that corruption does not always
> happen. Sometimes if I only copy one .gz file (~100MB), it seems fine.
> But whenever I copy a large directory with many .gz files (~100GB in
> total), there are always some .gz files corrupted.

OK, interesting. Can you retry the test with setting max_sectors_kb to 512:

echo 512 > /sys/block/nvmeXnY/queue/max_sectors_kb

I'm trying to understand if there is an issue related to large IOs.

^ permalink raw reply	[flat|nested] 23+ messages in thread
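[Editorial note: the max_sectors_kb suggestion above has to be applied per block device. A hedged sketch of a helper that does this for several namespaces at once; the device names in the usage line are examples, and the sysfs root is overridable via SYSFS so the loop can be exercised against a scratch directory instead of a live /sys:]

```shell
# Write the requested max_sectors_kb value into each named device's
# queue directory.  SYSFS defaults to /sys but can be pointed at a
# scratch tree for a dry run; device names are illustrative only.
set_max_sectors_kb() {
    kb=$1; shift
    for dev in "$@"; do
        echo "$kb" > "${SYSFS:-/sys}/block/$dev/queue/max_sectors_kb"
    done
}
# e.g.  set_max_sectors_kb 512 nvme1n1 nvme2n1
```

Note the setting is per-namespace and does not persist across reboots or reconnects, so it has to be reapplied after each `nvme connect`.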
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23 8:41 ` Sagi Grimberg
@ 2020-12-23 8:43    ` Christoph Hellwig
  2020-12-23 21:23     ` Sagi Grimberg
  2020-12-24 1:51      ` Hao Wang
  0 siblings, 2 replies; 23+ messages in thread

From: Christoph Hellwig @ 2020-12-23 8:43 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Hao Wang, Linux-nvme

Wouldn't testing with a not completely outdated kernel be a better first
step?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23 8:43 ` Christoph Hellwig
@ 2020-12-23 21:23   ` Sagi Grimberg
  2020-12-23 22:23     ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-23 21:23 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Hao Wang, Linux-nvme

> Wouldn't testing with a not completely outdated kernel be a better first
> step?

Right, didn't notice that. Hao, would it be possible to test whether this
happens with the latest upstream kernel (or something close to that)?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23 21:23 ` Sagi Grimberg
@ 2020-12-23 22:23   ` Hao Wang
  0 siblings, 0 replies; 23+ messages in thread

From: Hao Wang @ 2020-12-23 22:23 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Sure, I will try to build a new kernel. This is in an enterprise
environment, so it's only really convenient for me to run v5.2 or v5.6;
for a newer kernel I will have to build it myself. But I will give it a
try.

Regarding max_sectors_kb, there seems to be something interesting.

On the target side, I see:

# cat /sys/block/nvme1n1/queue/max_sectors_kb
256
# cat /sys/block/nvme2n1/queue/max_sectors_kb
256

On the initiator side:
* first, there are both /sys/block/nvme1c1n1 and /sys/block/nvme1n1
* and their max_sectors_kb is 1280

Then when I create a raid-0 volume with mdadm:

# cat /sys/block/md5/queue/max_sectors_kb
128

I'm not an expert on storage, but do you see any potential problem here?

Hao

On Wed, Dec 23, 2020 at 1:23 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Wouldn't testing with a not completely outdated kernel be a better first
> > step?
>
> Right, didn't notice that. Hao, would it be possible to test whether this
> happens with the latest upstream kernel (or something close to that)?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23 8:43 ` Christoph Hellwig
  2020-12-23 21:23 ` Sagi Grimberg
@ 2020-12-24 1:51  ` Hao Wang
  2020-12-24 2:57    ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread

From: Hao Wang @ 2020-12-24 1:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Linux-nvme

Okay, tried both v5.10 and latest 58cf05f597b0.

And the same behavior:
- data corruption on the initiator side when creating a raid-0 volume
  using 2 nvme-tcp devices;
- no data corruption either on the local target side, or on the
  initiator side when using only 1 nvme-tcp device.

A difference I can see in max_sectors_kb is that now, on the target
side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes 1280.

On Wed, Dec 23, 2020 at 12:43 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Wouldn't testing with a not completely outdated kernel be a better first
> step?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 1:51 ` Hao Wang
@ 2020-12-24 2:57   ` Sagi Grimberg
  2020-12-24 10:28     ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-24 2:57 UTC (permalink / raw)
To: Hao Wang, Christoph Hellwig; +Cc: Linux-nvme

> Okay, tried both v5.10 and latest 58cf05f597b0.
>
> And the same behavior:
> - data corruption on the initiator side when creating a raid-0 volume
>   using 2 nvme-tcp devices;
> - no data corruption either on the local target side, or on the
>   initiator side when using only 1 nvme-tcp device.
>
> A difference I can see in max_sectors_kb is that now, on the target
> side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes 1280.

Thanks Hao,

I'm thinking we maybe have an issue with bio splitting/merge/cloning.

Question, if you build the raid0 in the target and expose that over
nvmet-tcp (with a single namespace), does the issue happen?

Also, would be interesting to add this patch and see if the following
print pops up, and if it correlates when you see the issue:

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 979ee31b8dd1..d0a68cdb374f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -243,6 +243,9 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
                nsegs = bio_segments(bio);
                size = bio->bi_iter.bi_size;
                offset = bio->bi_iter.bi_bvec_done;
+               if (rq->bio != rq->biotail)
+                       pr_info("rq %d (%s) contains multiple bios bvec: nsegs %d size %d offset %ld\n",
+                               rq->tag, dir == WRITE ? "WRITE" : "READ", nsegs, size, offset);
        }

        iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--

I'll try to look further to understand if we have an issue there.

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 2:57 ` Sagi Grimberg
@ 2020-12-24 10:28  ` Hao Wang
  2020-12-24 17:56    ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread

From: Hao Wang @ 2020-12-24 10:28 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Sagi, thanks a lot for helping look into this.

> Question, if you build the raid0 in the target and expose that over
> nvmet-tcp (with a single namespace), does the issue happen?

No, it works fine in that case.
Actually with this setup, initially the latency was pretty bad, and it
seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
I'm not exactly sure though, as I've changed too many things and didn't
specifically test for this setup. Could you help confirm that?

And after applying your patch:
- With the problematic setup, i.e. creating a 2-device raid0, I did see
  numerous prints popping up in dmesg; a few lines are pasted below.
- With the good setup, i.e. only using 1 device, this line also pops up,
  but a lot less frequently.

[  390.240595] nvme_tcp: rq 10 (WRITE) contains multiple bios bvec: nsegs 25 size 102400 offset 0
[  390.243146] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec: nsegs 7 size 28672 offset 4096
[  390.246893] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec: nsegs 25 size 102400 offset 4096
[  390.250631] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec: nsegs 4 size 16384 offset 16384
[  390.254374] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec: nsegs 7 size 28672 offset 0
[  390.256869] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec: nsegs 25 size 102400 offset 12288
[  390.266877] nvme_tcp: rq 57 (READ) contains multiple bios bvec: nsegs 4 size 16384 offset 118784
[  390.269444] nvme_tcp: rq 58 (READ) contains multiple bios bvec: nsegs 4 size 16384 offset 118784
[  390.273281] nvme_tcp: rq 59 (READ) contains multiple bios bvec: nsegs 4 size 16384 offset 0
[  390.275776] nvme_tcp: rq 60 (READ) contains multiple bios bvec: nsegs 4 size 16384 offset 118784

On Wed, Dec 23, 2020 at 6:57 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Okay, tried both v5.10 and latest 58cf05f597b0.
> >
> > And the same behavior:
> > - data corruption on the initiator side when creating a raid-0 volume
> >   using 2 nvme-tcp devices;
> > - no data corruption either on the local target side, or on the
> >   initiator side when using only 1 nvme-tcp device.
> >
> > A difference I can see in max_sectors_kb is that now, on the target
> > side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes 1280.
>
> Thanks Hao,
>
> I'm thinking we maybe have an issue with bio splitting/merge/cloning.
>
> Question, if you build the raid0 in the target and expose that over
> nvmet-tcp (with a single namespace), does the issue happen?
>
> Also, would be interesting to add this patch and see if the following
> print pops up, and if it correlates when you see the issue:
>
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 979ee31b8dd1..d0a68cdb374f 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -243,6 +243,9 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>                 nsegs = bio_segments(bio);
>                 size = bio->bi_iter.bi_size;
>                 offset = bio->bi_iter.bi_bvec_done;
> +               if (rq->bio != rq->biotail)
> +                       pr_info("rq %d (%s) contains multiple bios bvec: nsegs %d size %d offset %ld\n",
> +                               rq->tag, dir == WRITE ? "WRITE" : "READ", nsegs, size, offset);
>         }
>
>         iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
> --
>
> I'll try to look further to understand if we have an issue there.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 10:28 ` Hao Wang
@ 2020-12-24 17:56   ` Sagi Grimberg
  2020-12-25 7:49      ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-24 17:56 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

> Sagi, thanks a lot for helping look into this.
>
>> Question, if you build the raid0 in the target and expose that over
>> nvmet-tcp (with a single namespace), does the issue happen?
> No, it works fine in that case.
> Actually with this setup, initially the latency was pretty bad, and it
> seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
> I'm not exactly sure though as I've changed too many things and didn't
> specifically test for this setup.
> Could you help confirm that?
>
> And after applying your patch:
> - With the problematic setup, i.e. creating a 2-device raid0, I did
>   see numerous prints popping up in dmesg; a few lines are pasted
>   below.
> - With the good setup, i.e. only using 1 device, this line also pops
>   up, but a lot less frequently.

Hao, question, what is the io-scheduler in-use for the nvme-tcp devices?

Can you try to reproduce this issue when disabling merges on the
nvme-tcp devices?

echo 2 > /sys/block/nvmeXnY/queue/nomerges

I want to see if this is an issue with merged bios.

^ permalink raw reply	[flat|nested] 23+ messages in thread
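[Editorial note: like the earlier max_sectors_kb change, the nomerges setting above is per device. A hedged sketch of a helper that applies it to several namespaces and reads each value back; device names are examples, and the sysfs root is overridable via SYSFS so the loop can be tried against a scratch directory:]

```shell
# Disable merges (nomerges=2 turns off all merging) on each named
# device and echo the value back for confirmation.  SYSFS defaults to
# /sys; device names are illustrative only.
disable_merges() {
    for dev in "$@"; do
        q="${SYSFS:-/sys}/block/$dev/queue"
        echo 2 > "$q/nomerges"
        echo "$dev: nomerges=$(cat "$q/nomerges")"
    done
}
# e.g.  disable_merges nvme1n1 nvme2n1
```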
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 17:56 ` Sagi Grimberg
@ 2020-12-25 7:49    ` Hao Wang
  2020-12-25 9:05      ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread

From: Hao Wang @ 2020-12-25 7:49 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

In my current setup, on the initiator side, nvme3n1 & nvme4n1 are the 2
nvme-tcp devices; the schedulers for nvme3 are:
- cat /sys/block/nvme3n1/queue/scheduler: "none"
- cat /sys/block/nvme3c3n1/queue/scheduler: "[none] mq-deadline kyber"
Not sure what nvme3c3n1 is here?

And disabling merges on the nvme-tcp devices solves the data corruption
issue!

Hao

On Thu, Dec 24, 2020 at 9:56 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > Sagi, thanks a lot for helping look into this.
> >
> >> Question, if you build the raid0 in the target and expose that over
> >> nvmet-tcp (with a single namespace), does the issue happen?
> > No, it works fine in that case.
> > Actually with this setup, initially the latency was pretty bad, and it
> > seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
> > I'm not exactly sure though as I've changed too many things and didn't
> > specifically test for this setup.
> > Could you help confirm that?
> >
> > And after applying your patch:
> > - With the problematic setup, i.e. creating a 2-device raid0, I did
> >   see numerous prints popping up in dmesg; a few lines are pasted
> >   below.
> > - With the good setup, i.e. only using 1 device, this line also pops
> >   up, but a lot less frequently.
>
> Hao, question, what is the io-scheduler in-use for the nvme-tcp devices?
>
> Can you try to reproduce this issue when disabling merges on the
> nvme-tcp devices?
>
> echo 2 > /sys/block/nvmeXnY/queue/nomerges
>
> I want to see if this is an issue with merged bios.
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-25 7:49 ` Hao Wang
@ 2020-12-25 9:05   ` Sagi Grimberg
  [not found]          ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
  0 siblings, 1 reply; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-25 9:05 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

> In my current setup, on the initiator side, nvme3n1 & nvme4n1 are the 2
> nvme-tcp devices; the schedulers for nvme3 are:
> - cat /sys/block/nvme3n1/queue/scheduler: "none"
> - cat /sys/block/nvme3c3n1/queue/scheduler: "[none] mq-deadline kyber"
> Not sure what nvme3c3n1 is here?
>
> And disabling merges on the nvme-tcp devices solves the data corruption
> issue!

Well, it actually confirms that we have an issue when we get a
multi-bio request that was merged. I'm assuming you also do not see the
prints I added in this case...

Would you mind adding these prints (they will probably overload the log,
but may be useful to shed some light on this):

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 979ee31b8dd1..5a611ddb22ea 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -243,6 +243,16 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
                nsegs = bio_segments(bio);
                size = bio->bi_iter.bi_size;
                offset = bio->bi_iter.bi_bvec_done;
+               if (rq->bio != rq->biotail) {
+                       int bio_num = 1;
+                       struct bio *ptr = rq->bio;
+                       while (ptr != bio) {
+                               ptr = ptr->bi_next;
+                               bio_num++;
+                       };
+                       pr_info("rq %d (%s) data_len %d bio[%d/%d] sector %llx bvec: nsegs %d size %d offset %ld\n",
+                               rq->tag, dir == WRITE ? "WRITE" : "READ", req->data_len, bio_num, blk_rq_count_bios(rq), bio->bi_iter.bi_sector, nsegs, size, offset);
+               }
        }

        iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--

Thank you for helping to isolate this issue.
^ permalink raw reply related	[flat|nested] 23+ messages in thread
[parent not found: <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>]
* Re: Data corruption when using multiple devices with NVMEoF TCP
  [not found] ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
@ 2020-12-29 1:25   ` Sagi Grimberg
  2021-01-06 1:53   ` Sagi Grimberg
  1 sibling, 0 replies; 23+ messages in thread

From: Sagi Grimberg @ 2020-12-29 1:25 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

> Okay, will do that in a few days. Something else just popped up and I
> have a limited time window to use some machines.

Understood, I'm still trying to understand what can cause a problem in a
multi-bio merged request, which used to work AFAIR...

> BTW, what is the performance implication of disabling merge? My usage
> pattern is mostly sequential read and write, and write bandwidth is
> pretty high.

Well, if your writes are bigger in nature to begin with, there is not a
lot of gain, but if they aren't, then there is a potential gain here.
Some of the log messages could also help understand what the I/O
pattern is.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  [not found] ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
  2020-12-29 1:25 ` Sagi Grimberg
@ 2021-01-06 1:53 ` Sagi Grimberg
  2021-01-06 8:21    ` Hao Wang
  2021-01-11 8:56    ` Hao Wang
  1 sibling, 2 replies; 23+ messages in thread

From: Sagi Grimberg @ 2021-01-06 1:53 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

Hey Hao,

> Okay, will do that in a few days. Something else just popped up and I
> have a limited time window to use some machines.
>
> BTW, what is the performance implication of disabling merge? My usage
> pattern is mostly sequential read and write, and write bandwidth is
> pretty high.

Did you get a chance to look into this?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-06 1:53 ` Sagi Grimberg
@ 2021-01-06 8:21   ` Hao Wang
  2021-01-11 8:56   ` Hao Wang
  0 siblings, 0 replies; 23+ messages in thread

From: Hao Wang @ 2021-01-06 8:21 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Plan to get to this over the coming weekend. Sorry for the delay.

On Tue, Jan 5, 2021 at 5:53 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > Okay, will do that in a few days. Something else just popped up and I
> > have a limited time window to use some machines.
> >
> > BTW, what is the performance implication of disabling merge? My usage
> > pattern is mostly sequential read and write, and write bandwidth is
> > pretty high.
>
> Did you get a chance to look into this?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-06 1:53 ` Sagi Grimberg
  2021-01-06 8:21 ` Hao Wang
@ 2021-01-11 8:56 ` Hao Wang
  2021-01-11 10:11    ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread

From: Hao Wang @ 2021-01-11 8:56 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Hey Sagi,

I exported 4 devices to the initiator, created a raid-0 array, and
copied a 98G directory with many ~100MB .gz files.
With the patch you gave on top of 58cf05f597b0 (fairly new), I saw
about 24K prints in dmesg. Below are some of them:

[ 3775.256547] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector a388200 bvec: nsegs 19 size 77824 offset 0
[ 3775.256768] nvme_tcp: rq 19 (READ) data_len 131072 bio[1/2] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256774] nvme_tcp: rq 20 (READ) data_len 131072 bio[1/2] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256787] nvme_tcp: rq 5 (READ) data_len 131072 bio[1/2] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256791] nvme_tcp: rq 6 (READ) data_len 131072 bio[1/2] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256794] nvme_tcp: rq 117 (READ) data_len 131072 bio[1/2] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256797] nvme_tcp: rq 118 (READ) data_len 131072 bio[1/2] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256800] nvme_tcp: rq 5 (READ) data_len 262144 bio[1/4] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.257002] nvme_tcp: rq 21 (READ) data_len 131072 bio[1/2] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257006] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257009] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257012] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257014] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257017] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257020] nvme_tcp: rq 6 (READ) data_len 262144 bio[1/4] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.262587] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector a388200 bvec: nsegs 19 size 77824 offset 0
[ 3775.262600] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[2/2] sector a388298 bvec: nsegs 13 size 53248 offset 0
[ 3775.262610] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[1/4] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.262617] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[2/4] sector a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.262623] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[3/4] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.262629] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[4/4] sector a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.262635] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[1/4] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.262641] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[2/4] sector a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.262647] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[3/4] sector a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.262653] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[4/4] sector a388698 bvec: nsegs 13 size 53248 offset 0
[ 3775.263009] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[1/2] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.263019] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[2/2] sector a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.263027] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[1/2] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.263034] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[2/2] sector a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.263040] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[1/2] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.263047] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[2/2] sector a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.263052] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[1/2] sector a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.263059] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[2/2] sector a388698 bvec: nsegs 13 size 53248 offset 0
[ 3775.264341] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[1/2] sector a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.264353] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[2/2] sector a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.264361] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[1/2] sector a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.264369] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[2/2] sector a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.264380] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[1/2] sector a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.264387] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[2/2] sector a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.264410] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector a388600 bvec: nsegs 19 size 77824 offset 0

Hao

On Tue, Jan 5, 2021 at 5:53 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > Okay, will do that in a few days. Something else just popped up and I
> > have a limited time window to use some machines.
> >
> > BTW, what is the performance implication of disabling merge? My usage
> > pattern is mostly sequential read and write, and write bandwidth is
> > pretty high.
>
> Did you get a chance to look into this?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-11  8:56 ` Hao Wang
@ 2021-01-11 10:11 ` Sagi Grimberg
  [not found] ` <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-11 10:11 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

> Hey Sagi,

Hey Hao,

> I exported 4 devices to the initiator, created a raid-0 array, and
> copied a 98G directory with many ~100MB .gz files.
> With the patch you gave on top of 58cf05f597b0 (fairly new), I saw
> about 24K prints from dmesg. Below are some of them:

Yes, I understand it generated tons of prints, but something here looks
strange.

> [ 3775.256547] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector
> a388200 bvec: nsegs 19 size 77824 offset 0

This is a read request with 2 bios: one spans 19 4K buffers (starting
from sector a388200) and the second probably spans 13 4K buffers. The
host is asking the target to send 128K (data_len 131072), but I don't
see anywhere the host receiving the residual of the data transfer.
It should show up in the form of:

nvme_tcp: rq 22 (READ) data_len 131072 bio[2/2] sector 0xa388298 bvec: nsegs 13 size 53248 offset 0

In your entire log, do you see any (READ) print for a bio that is not
[1/x]? E.g. a read that spans other bios in the request (like [2/2],
[2/3], etc.)?
> [ 3775.256768] nvme_tcp: rq 19 (READ) data_len 131072 bio[1/2] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256774] nvme_tcp: rq 20 (READ) data_len 131072 bio[1/2] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256787] nvme_tcp: rq 5 (READ) data_len 131072 bio[1/2] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256791] nvme_tcp: rq 6 (READ) data_len 131072 bio[1/2] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256794] nvme_tcp: rq 117 (READ) data_len 131072 bio[1/2] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256797] nvme_tcp: rq 118 (READ) data_len 131072 bio[1/2] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.256800] nvme_tcp: rq 5 (READ) data_len 262144 bio[1/4] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257002] nvme_tcp: rq 21 (READ) data_len 131072 bio[1/2] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257006] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector > a388600 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257009] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257012] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector > a388600 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257014] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257017] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector > a388600 bvec: nsegs 19 size 77824 offset 0 > [ 3775.257020] nvme_tcp: rq 6 (READ) data_len 262144 bio[1/4] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262587] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector > a388200 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262600] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[2/2] sector > a388298 bvec: nsegs 13 size 53248 offset 0 For (WRITE) request we see the desired sequence, we first write the content of the first bio (19 4K segments) and then the content 
of the second bio (13 4K segments). > [ 3775.262610] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[1/4] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262617] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[2/4] sector > a388398 bvec: nsegs 13 size 53248 offset 0 > [ 3775.262623] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[3/4] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262629] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[4/4] sector > a388498 bvec: nsegs 13 size 53248 offset 0 Same here and on... > [ 3775.262635] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[1/4] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262641] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[2/4] sector > a388598 bvec: nsegs 13 size 53248 offset 0 > [ 3775.262647] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[3/4] sector > a388600 bvec: nsegs 19 size 77824 offset 0 > [ 3775.262653] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[4/4] sector > a388698 bvec: nsegs 13 size 53248 offset 0 > [ 3775.263009] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[1/2] sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.263019] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[2/2] sector > a388398 bvec: nsegs 13 size 53248 offset 0 > [ 3775.263027] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[1/2] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.263034] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[2/2] sector > a388498 bvec: nsegs 13 size 53248 offset 0 > [ 3775.263040] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[1/2] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.263047] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[2/2] sector > a388598 bvec: nsegs 13 size 53248 offset 0 > [ 3775.263052] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[1/2] sector > a388600 bvec: nsegs 19 size 77824 offset 0 > [ 3775.263059] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[2/2] sector > a388698 bvec: nsegs 13 size 53248 offset 0 > [ 3775.264341] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[1/2] 
sector > a388300 bvec: nsegs 19 size 77824 offset 0 > [ 3775.264353] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[2/2] sector > a388398 bvec: nsegs 13 size 53248 offset 0 > [ 3775.264361] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[1/2] sector > a388400 bvec: nsegs 19 size 77824 offset 0 > [ 3775.264369] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[2/2] sector > a388498 bvec: nsegs 13 size 53248 offset 0 > [ 3775.264380] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[1/2] sector > a388500 bvec: nsegs 19 size 77824 offset 0 > [ 3775.264387] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[2/2] sector > a388598 bvec: nsegs 13 size 53248 offset 0 > [ 3775.264410] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector > a388600 bvec: nsegs 19 size 77824 offset 0 From the code it seems like it should do the right thing assuming that the data does arrive, will look deeper. Thanks for helping to dissect this issue. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>]
* Re: Data corruption when using multiple devices with NVMEoF TCP
  [not found] ` <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>
@ 2021-01-12  0:36 ` Sagi Grimberg
  2021-01-12  1:29 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-12 0:36 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

Hey Hao,

> Here is the entire log (and it's a new one, i.e. above snippet not
> included):
> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
>
> What I found is that the data corruption does not always happen,
> especially when I copy a small directory. So I guess a lot of log
> entries should just look fine.

So this seems to be a breakage that has existed for some time now with
multipage bvecs, and you are the first to report it. It appears to be
related to bio merges; it is strange to me that this is only coming up
now, perhaps it is the combination with raid0 that triggers it, I'm
not sure.

IIUC, this should resolve your issue, care to give it a go?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 973d5d683180..6bceadc204a8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)

 static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
 {
-	return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
-			req->pdu_len - req->pdu_sent);
+	return min_t(size_t, req->iter.count,
+		min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
+			req->pdu_len - req->pdu_sent));
 }

 static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
@@ -223,7 +224,7 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
 	struct request *rq = blk_mq_rq_from_pdu(req);
 	struct bio_vec *vec;
 	unsigned int size;
-	int nsegs;
+	int nsegs = 0;
 	size_t offset;

 	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
@@ -233,11 +234,15 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
 		offset = 0;
 	} else {
 		struct bio *bio = req->curr_bio;
+		struct bvec_iter bi;
+		struct bio_vec bv;

 		vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-		nsegs = bio_segments(bio);
+		bio_for_each_bvec(bv, bio, bi) {
+			nsegs++;
+		}
 		size = bio->bi_iter.bi_size;
-		offset = mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) - vec->bv_offset;
 	}

 	iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--
* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-12  0:36 ` Sagi Grimberg
@ 2021-01-12  1:29 ` Sagi Grimberg
  2021-01-12  2:22 ` Ming Lei
  2021-01-12  8:55 ` Hao Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-12 1:29 UTC (permalink / raw)
To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

> Hey Hao,
>
>> Here is the entire log (and it's a new one, i.e. above snippet not
>> included):
>> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
>>
>> What I found is the data corruption does not always happen, especially
>> when I copy a small directory. So I guess a lot of log entries should
>> just look fine.
>
> So this seems to be a breakage that existed for some time now with
> multipage bvecs that you have been the first one to report. This
> seems to be related to bio merges, and it is strange to me why this
> just now comes up, perhaps it is the combination with raid0 that
> triggers this, I'm not sure.

OK, I think I understand what is going on. With multipage bvecs, bios
can split in the middle of a bvec entry and then merge back with
another bio. The issue is that we are not capping the send-length
calculation for the last bvec entry accordingly.
I think that just this can also resolve the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 973d5d683180..c6b0a189a494 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)

 static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
 {
-	return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
-			req->pdu_len - req->pdu_sent);
+	return min_t(size_t, req->iter.count,
+		min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
+			req->pdu_len - req->pdu_sent));
 }

 static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
--
* Re: Data corruption when using multiple devices with NVMEoF TCP 2021-01-12 1:29 ` Sagi Grimberg @ 2021-01-12 2:22 ` Ming Lei 2021-01-12 6:49 ` Sagi Grimberg 2021-01-12 8:55 ` Hao Wang 1 sibling, 1 reply; 23+ messages in thread From: Ming Lei @ 2021-01-12 2:22 UTC (permalink / raw) To: Sagi Grimberg; +Cc: Christoph Hellwig, Hao Wang, linux-nvme On Tue, Jan 12, 2021 at 9:33 AM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > Hey Hao, > > > >> Here is the entire log (and it's a new one, i.e. above snippet not > >> included): > >> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing > >> > >> > >> What I found is the data corruption does not always happen, especially > >> when I copy a small directory. So I guess a lot of log entries should > >> just look fine. > > > > So this seems to be a breakage that existed for some time now with > > multipage bvecs that you have been the first one to report. This > > seems to be related to bio merges, which is seems strange to me > > why this just now comes up, perhaps it is the combination with > > raid0 that triggers this, I'm not sure. > > OK, I think I understand what is going on. With multipage bvecs > bios can split in the middle of a bvec entry, and then merge > back with another bio. IMO, bio split can be done in the middle of a bvec even though the bvec is single page. The split may just be triggered in case of raid over nvme-tcp, and I guess it might be triggered by device mapper too. Thanks, Ming _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP 2021-01-12 2:22 ` Ming Lei @ 2021-01-12 6:49 ` Sagi Grimberg 0 siblings, 0 replies; 23+ messages in thread From: Sagi Grimberg @ 2021-01-12 6:49 UTC (permalink / raw) To: Ming Lei; +Cc: Christoph Hellwig, Hao Wang, linux-nvme >>> Hey Hao, >>> >>>> Here is the entire log (and it's a new one, i.e. above snippet not >>>> included): >>>> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing >>>> >>>> >>>> What I found is the data corruption does not always happen, especially >>>> when I copy a small directory. So I guess a lot of log entries should >>>> just look fine. >>> >>> So this seems to be a breakage that existed for some time now with >>> multipage bvecs that you have been the first one to report. This >>> seems to be related to bio merges, which is seems strange to me >>> why this just now comes up, perhaps it is the combination with >>> raid0 that triggers this, I'm not sure. >> >> OK, I think I understand what is going on. With multipage bvecs >> bios can split in the middle of a bvec entry, and then merge >> back with another bio. > > IMO, bio split can be done in the middle of a bvec even though the bvec > is single page. The split may just be triggered in case of raid over nvme-tcp, > and I guess it might be triggered by device mapper too. Yes, but I couldn't find a case where it cannot happen, but it only triggered with mdraid. I'll wait for Hao to verify and send a formal patch. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Data corruption when using multiple devices with NVMEoF TCP 2021-01-12 1:29 ` Sagi Grimberg 2021-01-12 2:22 ` Ming Lei @ 2021-01-12 8:55 ` Hao Wang 1 sibling, 0 replies; 23+ messages in thread From: Hao Wang @ 2021-01-12 8:55 UTC (permalink / raw) To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme Yes, this patch fixes the problem! Thanks! Tested on top of a0d54b4f5b21. Hao On Mon, Jan 11, 2021 at 5:29 PM Sagi Grimberg <sagi@grimberg.me> wrote: > > > > Hey Hao, > > > >> Here is the entire log (and it's a new one, i.e. above snippet not > >> included): > >> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing > >> > >> > >> What I found is the data corruption does not always happen, especially > >> when I copy a small directory. So I guess a lot of log entries should > >> just look fine. > > > > So this seems to be a breakage that existed for some time now with > > multipage bvecs that you have been the first one to report. This > > seems to be related to bio merges, which is seems strange to me > > why this just now comes up, perhaps it is the combination with > > raid0 that triggers this, I'm not sure. > > OK, I think I understand what is going on. With multipage bvecs > bios can split in the middle of a bvec entry, and then merge > back with another bio. > > The issue is that we are not capping the last bvec entry send length > calculation in that. 
> > I think that just this can also resolve the issue: > -- > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c > index 973d5d683180..c6b0a189a494 100644 > --- a/drivers/nvme/host/tcp.c > +++ b/drivers/nvme/host/tcp.c > @@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct > nvme_tcp_request *req) > > static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req) > { > - return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset, > - req->pdu_len - req->pdu_sent); > + return min_t(size_t, req->iter.count, > + min_t(size_t, req->iter.bvec->bv_len - > req->iter.iov_offset, > + req->pdu_len - req->pdu_sent)); > } > > static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req) > -- _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads: [~2021-01-12 8:56 UTC | newest]

Thread overview: 23+ messages
2020-12-22 18:09 Data corruption when using multiple devices with NVMEoF TCP  Hao Wang
2020-12-22 19:29 ` Sagi Grimberg
2020-12-22 19:58 ` Hao Wang
2020-12-23  8:41 ` Sagi Grimberg
2020-12-23  8:43 ` Christoph Hellwig
2020-12-23 21:23 ` Sagi Grimberg
2020-12-23 22:23 ` Hao Wang
2020-12-24  1:51 ` Hao Wang
2020-12-24  2:57 ` Sagi Grimberg
2020-12-24 10:28 ` Hao Wang
2020-12-24 17:56 ` Sagi Grimberg
2020-12-25  7:49 ` Hao Wang
2020-12-25  9:05 ` Sagi Grimberg
[not found] ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
2020-12-29  1:25 ` Sagi Grimberg
2021-01-06  1:53 ` Sagi Grimberg
2021-01-06  8:21 ` Hao Wang
2021-01-11  8:56 ` Hao Wang
2021-01-11 10:11 ` Sagi Grimberg
[not found] ` <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>
2021-01-12  0:36 ` Sagi Grimberg
2021-01-12  1:29 ` Sagi Grimberg
2021-01-12  2:22 ` Ming Lei
2021-01-12  6:49 ` Sagi Grimberg
2021-01-12  8:55 ` Hao Wang