* Data corruption when using multiple devices with NVMEoF TCP
@ 2020-12-22 18:09 Hao Wang
  2020-12-22 19:29 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Wang @ 2020-12-22 18:09 UTC (permalink / raw)
  To: Linux-nvme

I'm using kernel 5.2.9 with the following related configs enabled:
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
CONFIG_NVME_MULTIPATH=y
CONFIG_NVME_FABRICS=m
# CONFIG_NVME_FC is not set
CONFIG_NVME_TCP=m
CONFIG_NVME_TARGET=m
CONFIG_NVME_TARGET_LOOP=m
# CONFIG_NVME_TARGET_FC is not set
CONFIG_NVME_TARGET_TCP=m
CONFIG_RTC_NVMEM=y
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y

On target side, I exported 2 NVMe devices using tcp/ipv6:
[root@rtptest34337.prn2 ~/ext_nvme]# ll
/sys/kernel/config/nvmet/ports/1/subsystems/
total 0
lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 ->
../../../../nvmet/subsystems/nvmet-rtptest34337-1
lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 ->
../../../../nvmet/subsystems/nvmet-rtptest34337-2
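
(For reference, an export like this is driven through nvmet configfs; a
minimal sketch of the sequence for one device is below. The subsystem
name, backing device and IPv6 address are placeholders, not the exact
values used on this target; the same steps are repeated for the second
subsystem/device.)

# sketch: export one local NVMe namespace over TCP/IPv6 via nvmet configfs
modprobe nvmet-tcp
cd /sys/kernel/config/nvmet
mkdir subsystems/nvmet-example-1
echo 1 > subsystems/nvmet-example-1/attr_allow_any_host
mkdir subsystems/nvmet-example-1/namespaces/1
echo -n /dev/nvme0n1 > subsystems/nvmet-example-1/namespaces/1/device_path
echo 1 > subsystems/nvmet-example-1/namespaces/1/enable
mkdir ports/1
echo tcp > ports/1/addr_trtype
echo ipv6 > ports/1/addr_adrfam
echo -n 2001:db8::1 > ports/1/addr_traddr   # placeholder address
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nvmet-example-1 ports/1/subsystems/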

On initiator side, I could successfully connect the 2 nvme devices,
nvme1n1 & nvme2n1;
[root@rtptest34206.prn2 /]# nvme list
Node             SN                   Model
Namespace          Usage                      Format           FW Rev
---------------- --------------------
---------------------------------------- ---------
-------------------------- ---------------- --------
/dev/nvme0n1     ***********     INTEL *******          1
256.06  GB / 256.06  GB    512   B +  0 B    PSF119D
/dev/nvme1n1     ***********     Linux                       1
900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
/dev/nvme2n1     ***********     Linux                       1
900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
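
(The remote namespaces were attached with nvme-cli over TCP/IPv6;
roughly something like the following sketch, where the IPv6 address is
a placeholder and the subsystem names are the ones shown on the target
above.)

# sketch: attach the two exported subsystems from the initiator
modprobe nvme-tcp
nvme connect -t tcp -a 2001:db8::1 -s 4420 -n nvmet-rtptest34337-1
nvme connect -t tcp -a 2001:db8::1 -s 4420 -n nvmet-rtptest34337-2
nvme list   # the remote namespaces then show up as /dev/nvme1n1 and /dev/nvme2n1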

Then for each of nvme1n1 & nvme2n1, I created a partition using fdisk;
type is "linux raid autodetect";
Next I created a RAID-0 volume using mdadm, created a filesystem on
it, and mounted it:
# mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128
/dev/nvme1n1p1 /dev/nvme2n1p1
# mkfs.xfs -f /dev/md5
# mkdir /flash
# mount -o rw,noatime,discard /dev/md5 /flash/

Now, when I copy a large directory into /flash/, a lot of files under
/flash/ are corrupted.
Specifically, that large directory has a lot of .gz files, and unzip
will fail on many of them;
also diff with original files does show they are different, although
the file size is exactly the same.
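
(The corruption check itself is nothing special; a sketch of the kind
of verification described above, with placeholder paths:)

# sketch: copy the data set and compare it against the source
cp -a /data/source /flash/copy
for f in /flash/copy/*.gz; do gzip -t "$f" || echo "corrupt: $f"; done
diff -qr /data/source /flash/copy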

Also I found that if I don't create a RAID-0 array, instead just make
a filesystem on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no
data corruption.

I'm wondering if there is a known issue, or I'm doing something not
really supported.
Thanks!


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 18:09 Data corruption when using multiple devices with NVMEoF TCP Hao Wang
@ 2020-12-22 19:29 ` Sagi Grimberg
  2020-12-22 19:58   ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-22 19:29 UTC (permalink / raw)
  To: Hao Wang, Linux-nvme

Hey Hao,

> I'm using kernel 5.2.9 with the following related configs enabled:
> CONFIG_NVME_CORE=y
> CONFIG_BLK_DEV_NVME=y
> CONFIG_NVME_MULTIPATH=y
> CONFIG_NVME_FABRICS=m
> # CONFIG_NVME_FC is not set
> CONFIG_NVME_TCP=m
> CONFIG_NVME_TARGET=m
> CONFIG_NVME_TARGET_LOOP=m
> # CONFIG_NVME_TARGET_FC is not set
> CONFIG_NVME_TARGET_TCP=m
> CONFIG_RTC_NVMEM=y
> CONFIG_NVMEM=y
> CONFIG_NVMEM_SYSFS=y
> 
> On target side, I exported 2 NVMe devices using tcp/ipv6:
> [root@rtptest34337.prn2 ~/ext_nvme]# ll
> /sys/kernel/config/nvmet/ports/1/subsystems/
> total 0
> lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 ->
> ../../../../nvmet/subsystems/nvmet-rtptest34337-1
> lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 ->
> ../../../../nvmet/subsystems/nvmet-rtptest34337-2
> 
> On initiator side, I could successfully connect the 2 nvme devices,
> nvme1n1 & nvme2n1;
> [root@rtptest34206.prn2 /]# nvme list
> Node             SN                   Model
> Namespace          Usage                      Format           FW Rev
> ---------------- --------------------
> ---------------------------------------- ---------
> -------------------------- ---------------- --------
> /dev/nvme0n1     ***********     INTEL *******          1
> 256.06  GB / 256.06  GB    512   B +  0 B    PSF119D
> /dev/nvme1n1     ***********     Linux                       1
> 900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
> /dev/nvme2n1     ***********     Linux                       1
> 900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
> 
> Then for each of nvme1n1 & nvme2n1, I created a partition using fdisk;
> type is "linux raid autodetect";
> Next I created a RAID-0 volume using mdadm, created a filesystem on
> it, and mounted it:
> # mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128
> /dev/nvme1n1p1 /dev/nvme2n1p1
> # mkfs.xfs -f /dev/md5
> # mkdir /flash
> # mount -o rw,noatime,discard /dev/md5 /flash/
> 
> Now, when I copy a large directory into /flash/, a lot of files under
> /flash/ are corrupted.
> Specifically, that large directory has a lot of .gz files, and unzip
> will fail on many of them;
> also diff with original files does show they are different, although
> the file size is exactly the same.

Sounds strange to me. Nothing forbids mounting a fs on a raid0 volume.

> Also I found that if I don't create a RAID-0 array, instead just make
> a filesystem on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no
> data corruption.
> 
> I'm wondering if there is a known issue, or I'm doing something not
> really supported.

Did you try to run the same test locally on the target side without
having nvme-tcp/nvmet-tcp target in between?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 19:29 ` Sagi Grimberg
@ 2020-12-22 19:58   ` Hao Wang
  2020-12-23  8:41     ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Wang @ 2020-12-22 19:58 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Linux-nvme

Also really strange to me. This has been burning me 16+ hours a day
for the past 2 days.

And for your question, yes I did.
Locally on the target side, no data corruption happening, with the
same process of creating a partition on each device, creating a
2-device raid-0 volume, and creating a filesystem.
I have also tested on multiple sets of machines, but no luck.

Another point I should've mentioned is that corruption does not always
happen. Sometimes if I only copy one .gz file (~100MB), it seems fine.
But whenever I copy a large directory with many .gz files (~100GB in
total), there are always some .gz files corrupted.

Hao

On Tue, Dec 22, 2020 at 11:29 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > I'm using kernel 5.2.9 with the following related configs enabled:
> > CONFIG_NVME_CORE=y
> > CONFIG_BLK_DEV_NVME=y
> > CONFIG_NVME_MULTIPATH=y
> > CONFIG_NVME_FABRICS=m
> > # CONFIG_NVME_FC is not set
> > CONFIG_NVME_TCP=m
> > CONFIG_NVME_TARGET=m
> > CONFIG_NVME_TARGET_LOOP=m
> > # CONFIG_NVME_TARGET_FC is not set
> > CONFIG_NVME_TARGET_TCP=m
> > CONFIG_RTC_NVMEM=y
> > CONFIG_NVMEM=y
> > CONFIG_NVMEM_SYSFS=y
> >
> > On target side, I exported 2 NVMe devices using tcp/ipv6:
> > [root@rtptest34337.prn2 ~/ext_nvme]# ll
> > /sys/kernel/config/nvmet/ports/1/subsystems/
> > total 0
> > lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-1 ->
> > ../../../../nvmet/subsystems/nvmet-rtptest34337-1
> > lrwxrwxrwx 1 root root 0 Dec 19 02:08 nvmet-rtptest34337-2 ->
> > ../../../../nvmet/subsystems/nvmet-rtptest34337-2
> >
> > On initiator side, I could successfully connect the 2 nvme devices,
> > nvme1n1 & nvme2n1;
> > [root@rtptest34206.prn2 /]# nvme list
> > Node             SN                   Model
> > Namespace          Usage                      Format           FW Rev
> > ---------------- --------------------
> > ---------------------------------------- ---------
> > -------------------------- ---------------- --------
> > /dev/nvme0n1     ***********     INTEL *******          1
> > 256.06  GB / 256.06  GB    512   B +  0 B    PSF119D
> > /dev/nvme1n1     ***********     Linux                       1
> > 900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
> > /dev/nvme2n1     ***********     Linux                       1
> > 900.19  GB / 900.19  GB      4 KiB +  0 B     5.2.9-0_
> >
> > Then for each of nvme1n1 & nvme2n1, I created a partition using fdisk;
> > type is "linux raid autodetect";
> > Next I created a RAID-0 volume using mdadm, created a filesystem on
> > it, and mounted it:
> > # mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128
> > /dev/nvme1n1p1 /dev/nvme2n1p1
> > # mkfs.xfs -f /dev/md5
> > # mkdir /flash
> > # mount -o rw,noatime,discard /dev/md5 /flash/
> >
> > Now, when I copy a large directory into /flash/, a lot of files under
> > /flash/ are corrupted.
> > Specifically, that large directory has a lot of .gz files, and unzip
> > will fail on many of them;
> > also diff with original files does show they are different, although
> > the file size is exactly the same.
>
> Sounds strange to me. Nothing forbids mounting a fs on a raid0 volume.
>
> > Also I found that if I don't create a RAID-0 array, instead just make
> > a filesystem on either /dev/nvme1n1p1 or /dev/nvme2n1p1, there is no
> > data corruption.
> >
> > I'm wondering if there is a known issue, or I'm doing something not
> > really supported.
>
> Did you try to run the same test locally on the target side without
> having nvme-tcp/nvmet-tcp target in between?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-22 19:58   ` Hao Wang
@ 2020-12-23  8:41     ` Sagi Grimberg
  2020-12-23  8:43       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-23  8:41 UTC (permalink / raw)
  To: Hao Wang; +Cc: Linux-nvme


> Also really strange to me. This has been burning me 16+ hours a day
> for the past 2 days.
> 
> And for your question, yes I did.
> Locally on the target side, no data corruption happening, with the
> same process of creating a partition on each device, creating a
> 2-device raid-0 volume, and creating a filesystem.
> I have also tested on multiple sets of machines, but no luck.
> 
> Another point I should've mentioned is that corruption does not always
> happen. Sometimes if I only copy one .gz file (~100MB), it seems fine.
> But whenever I copy a large directory with many .gz files (~100GB in
> total), there are always some .gz files corrupted.

OK, interesting.

Can you retry the test with setting max_sectors_kb to 512:
echo 512 > /sys/block/nvmeXnY/queue/max_sectors_kb

I'm trying to understand if there is an issue related
to large IOs.


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23  8:41     ` Sagi Grimberg
@ 2020-12-23  8:43       ` Christoph Hellwig
  2020-12-23 21:23         ` Sagi Grimberg
  2020-12-24  1:51         ` Hao Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2020-12-23  8:43 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Hao Wang, Linux-nvme

Wouldn't testing with a not completely outdated kernel be a better
first step?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23  8:43       ` Christoph Hellwig
@ 2020-12-23 21:23         ` Sagi Grimberg
  2020-12-23 22:23           ` Hao Wang
  2020-12-24  1:51         ` Hao Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-23 21:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Hao Wang, Linux-nvme


> Wouldn't testing with a not completely outdated kernel be a better
> first step?

Right, didn't notice that. Hao, would it be possible to test whether
this happens with the latest upstream kernel (or something close to that)?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23 21:23         ` Sagi Grimberg
@ 2020-12-23 22:23           ` Hao Wang
  0 siblings, 0 replies; 23+ messages in thread
From: Hao Wang @ 2020-12-23 22:23 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Sure. I will try to build a new kernel.
This is in an enterprise environment, so it's only really convenient
for me to run v5.2 or v5.6; for a newer kernel I will have to build it
myself. But I will give it a try.

Regarding max_sectors_kb, there seems to be something interesting:
So on the target side, I see:
# cat /sys/block/nvme1n1/queue/max_sectors_kb
256
# cat /sys/block/nvme2n1/queue/max_sectors_kb
256

On the initiator side,
 * first, there is both /sys/block/nvme1c1n1 and /sys/block/nvme1n1
 * and their max_sectors_kb is 1280.

Then when I create a raid-0 volume with mdadm:
# cat /sys/block/md5/queue/max_sectors_kb
128
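
(A quick way to line these values up, as a sketch; run it on each side
and adjust the device names to whatever exists there:)

# sketch: dump max_sectors_kb for the relevant block devices in one shot
grep -H . /sys/block/nvme*n1/queue/max_sectors_kb /sys/block/md*/queue/max_sectors_kb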

I'm not an expert on storage, but do you see any potential problem here?

Hao

On Wed, Dec 23, 2020 at 1:23 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Wouldn't testing with a not completely outdated kernel be a better
> > first step?
>
> Right, didn't notice that. Hao, would it be possible to test whether
> this happens with the latest upstream kernel (or something close to that)?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-23  8:43       ` Christoph Hellwig
  2020-12-23 21:23         ` Sagi Grimberg
@ 2020-12-24  1:51         ` Hao Wang
  2020-12-24  2:57           ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread
From: Hao Wang @ 2020-12-24  1:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Linux-nvme

Okay, tried both v5.10 and latest 58cf05f597b0.

And the same behavior:
 - data corruption on the initiator side when creating a raid-0 volume
using 2 nvme-tcp devices;
 - no data corruption either on the local target side, or on the
initiator side when using only 1 nvme-tcp device.

A difference I can see on the max_sectors_kb is that, now on the
target side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes
1280.

On Wed, Dec 23, 2020 at 12:43 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Wouldn't testing with a not completely outdated kernel be a better
> first step?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24  1:51         ` Hao Wang
@ 2020-12-24  2:57           ` Sagi Grimberg
  2020-12-24 10:28             ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-24  2:57 UTC (permalink / raw)
  To: Hao Wang, Christoph Hellwig; +Cc: Linux-nvme


> Okay, tried both v5.10 and latest 58cf05f597b0.
> 
> And the same behavior:
>   - data corruption on the initiator side when creating a raid-0 volume
> using 2 nvme-tcp devices;
>   - no data corruption either on the local target side, or on the
> initiator side when using only 1 nvme-tcp device.
> 
> A difference I can see on the max_sectors_kb is that, now on the
> target side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes
> 1280.
> 

Thanks Hao,

I'm thinking we maybe have an issue with bio splitting/merge/cloning.

Question, if you build the raid0 in the target and expose that over
nvmet-tcp (with a single namespace), does the issue happen?
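
(To make the suggested test concrete: on the target this would look
roughly like the sketch below, with a placeholder subsystem name; the
md array is built from the same two partitions and exported as a
single namespace instead of the raw devices.)

# sketch: assemble raid0 on the target and export /dev/md5 as one nvmet namespace
mdadm --create /dev/md5 --level=0 --raid-devices=2 --chunk=128 /dev/nvme1n1p1 /dev/nvme2n1p1
mkdir -p /sys/kernel/config/nvmet/subsystems/nvmet-example-md/namespaces/1
echo -n /dev/md5 > /sys/kernel/config/nvmet/subsystems/nvmet-example-md/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/nvmet-example-md/namespaces/1/enable
ln -s /sys/kernel/config/nvmet/subsystems/nvmet-example-md /sys/kernel/config/nvmet/ports/1/subsystems/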

Also, would be interesting to add this patch and see if the following
print pops up, and if it correlates when you see the issue:

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 979ee31b8dd1..d0a68cdb374f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -243,6 +243,9 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
                 nsegs = bio_segments(bio);
                 size = bio->bi_iter.bi_size;
                 offset = bio->bi_iter.bi_bvec_done;
+               if (rq->bio != rq->biotail)
+                       if (rq->bio != rq->biotail)
         }

         iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--

I'll try to look further to understand if we have an issue there.


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24  2:57           ` Sagi Grimberg
@ 2020-12-24 10:28             ` Hao Wang
  2020-12-24 17:56               ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Wang @ 2020-12-24 10:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Sagi, thanks a lot for helping look into this.

> Question, if you build the raid0 in the target and expose that over nvmet-tcp (with a single namespace), does the issue happen?
No, it works fine in that case.
Actually with this setup, initially the latency was pretty bad, and it
seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
I'm not exactly sure though as I've changed too many things and didn't
specifically test for this setup.
Could you help confirm that?

And after applying your patch,
 - With the problematic setup, i.e. creating a 2-device raid0, I did
see numerous prints popping up in dmesg; a few lines are
pasted below:
 - With the good setup, i.e. only using 1 device, this line also pops
up, but a lot less frequent.

[  390.240595] nvme_tcp: rq 10 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 0
[  390.243146] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 7 size 28672 offset 4096
[  390.246893] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 4096
[  390.250631] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 4 size 16384 offset 16384
[  390.254374] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec:
nsegs 7 size 28672 offset 0
[  390.256869] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 12288
[  390.266877] nvme_tcp: rq 57 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784
[  390.269444] nvme_tcp: rq 58 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784
[  390.273281] nvme_tcp: rq 59 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 0
[  390.275776] nvme_tcp: rq 60 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784

On Wed, Dec 23, 2020 at 6:57 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Okay, tried both v5.10 and latest 58cf05f597b0.
> >
> > And the same behavior:
> >   - data corruption on the initiator side when creating a raid-0 volume
> > using 2 nvme-tcp devices;
> >   - no data corruption either on the local target side, or on the
> > initiator side when using only 1 nvme-tcp device.
> >
> > A difference I can see on the max_sectors_kb is that, now on the
> > target side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes
> > 1280.
> >
>
> Thanks Hao,
>
> I'm thinking we maybe have an issue with bio splitting/merge/cloning.
>
> Question, if you build the raid0 in the target and expose that over
> nvmet-tcp (with a single namespace), does the issue happen?
>
> Also, would be interesting to add this patch and see if the following
> print pops up, and if it correlates when you see the issue:
>
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 979ee31b8dd1..d0a68cdb374f 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -243,6 +243,9 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>                  nsegs = bio_segments(bio);
>                  size = bio->bi_iter.bi_size;
>                  offset = bio->bi_iter.bi_bvec_done;
> +               if (rq->bio != rq->biotail)
> +                       pr_info("rq %d (%s) contains multiple bios bvec: nsegs %d size %d offset %ld\n",
> +                               rq->tag, dir == WRITE ? "WRITE" : "READ", nsegs, size, offset);
>          }
>
>          iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
> --
>
> I'll try to look further to understand if we have an issue there.


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 10:28             ` Hao Wang
@ 2020-12-24 17:56               ` Sagi Grimberg
  2020-12-25  7:49                 ` Hao Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-24 17:56 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme


> Sagi, thanks a lot for helping look into this.
> 
>> Question, if you build the raid0 in the target and expose that over nvmet-tcp (with a single namespace), does the issue happen?
> No, it works fine in that case.
> Actually with this setup, initially the latency was pretty bad, and it
> seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
> I'm not exactly sure though as I've changed too many things and didn't
> specifically test for this setup.
> Could you help confirm that?
> 
> And after applying your patch,
>   - With the problematic setup, i.e. creating a 2-device raid0, I did
> see numerous prints popping up in dmesg; a few lines are
> pasted below:
>   - With the good setup, i.e. only using 1 device, this line also pops
> up, but a lot less frequent.

Hao, question, what is the io-scheduler in-use for the nvme-tcp devices?

Can you try to reproduce this issue when disabling merges on the
nvme-tcp devices?

echo 2 > /sys/block/nvmeXnY/queue/nomerges

I want to see if this is an issue with merged bios.
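
(As a sketch, applied to both remote namespaces; the device names here
are examples, adjust them to match your host:)

# sketch: check the scheduler and disable merging on each nvme-tcp namespace
for d in nvme1n1 nvme2n1; do
        cat /sys/block/$d/queue/scheduler
        echo 2 > /sys/block/$d/queue/nomerges
done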


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-24 17:56               ` Sagi Grimberg
@ 2020-12-25  7:49                 ` Hao Wang
  2020-12-25  9:05                   ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Hao Wang @ 2020-12-25  7:49 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

In my current setup, on the initiator side, nvme3n1 & nvme4n1 are 2
nvme-tcp devices; the schedulers for nvme3 are:
 - cat /sys/block/nvme3n1/queue/scheduler: "none"
 - cat /sys/block/nvme3c3n1/queue/scheduler: "[none] mq-deadline kyber"
Not sure what nvme3c3n1 is here?

And disabling merges on nvme-tcp devices solves the data corruption issue!

Hao



On Thu, Dec 24, 2020 at 9:56 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Sagi, thanks a lot for helping look into this.
> >
> >> Question, if you build the raid0 in the target and expose that over nvmet-tcp (with a single namespace), does the issue happen?
> > No, it works fine in that case.
> > Actually with this setup, initially the latency was pretty bad, and it
> > seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
> > I'm not exactly sure though as I've changed too many things and didn't
> > specifically test for this setup.
> > Could you help confirm that?
> >
> > And after applying your patch,
> >   - With the problematic setup, i.e. creating a 2-device raid0, I did
> > see numerous prints popping up in dmesg; a few lines are
> > pasted below:
> >   - With the good setup, i.e. only using 1 device, this line also pops
> > up, but a lot less frequent.
>
> Hao, question, what is the io-scheduler in-use for the nvme-tcp devices?
>
> Can you try to reproduce this issue when disabling merges on the
> nvme-tcp devices?
>
> echo 2 > /sys/block/nvmeXnY/queue/nomerges
>
> I want to see if this is an issue with merged bios.


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2020-12-25  7:49                 ` Hao Wang
@ 2020-12-25  9:05                   ` Sagi Grimberg
       [not found]                     ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-25  9:05 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme


> In my current setup, on the initiator side, nvme3n1 & nvme4n1 are 2
> nvme-tcp devices; the schedulers for nvme3 are:
>   - cat /sys/block/nvme3n1/queue/scheduler: "none"
>   - cat /sys/block/nvme3c3n1/queue/scheduler: "[none] mq-deadline kyber"
> Not sure what nvme3c3n1 is here?
> 
> And disabling merges on nvme-tcp devices solves the data corruption issue!

Well, it actually confirms that we have an issue when we get a multi-bio
request that was merged. I'm assuming you also do not see the prints
I added in this case...

Would you mind adding these prints (they will probably flood the log,
but may be useful to shed some light on this):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 979ee31b8dd1..5a611ddb22ea 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -243,6 +243,16 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
                 nsegs = bio_segments(bio);
                 size = bio->bi_iter.bi_size;
                 offset = bio->bi_iter.bi_bvec_done;
+               if (rq->bio != rq->biotail) {
+                       int bio_num = 1;
+                       struct bio *ptr = rq->bio;
+                       while (ptr != bio) {
+                               ptr = ptr->bi_next;
+                               bio_num++;
+                       };
+                       pr_info("rq %d (%s) data_len %d bio[%d/%d] sector %llx bvec: nsegs %d size %d offset %ld\n",
+                               rq->tag, dir == WRITE ? "WRITE" : "READ", req->data_len, bio_num, blk_rq_count_bios(rq), bio->bi_iter.bi_sector, nsegs, size, offset);
+               }
         }

         iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--

Thank you for helping isolating this issue.


* Re: Data corruption when using multiple devices with NVMEoF TCP
       [not found]                     ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
@ 2020-12-29  1:25                       ` Sagi Grimberg
  2021-01-06  1:53                       ` Sagi Grimberg
  1 sibling, 0 replies; 23+ messages in thread
From: Sagi Grimberg @ 2020-12-29  1:25 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme


> Okay, will do that in a few days. Something else just popped up and I 
> have a limited time window to use some machines.

Understood, I'm still trying to understand what can cause a problem
in a multi-bio merge request, that used to work AFAIR...

> BTW, what is the performance implication of disabling merge? My usage 
> pattern is mostly sequential read and write, and write bandwidth is 
> pretty high.

Well, if your writes are bigger in nature to begin with, there is not
a lot of gain, but if they are not, then there is a potential gain here.

Some of the log messages could also help us understand what the
I/O pattern is.


* Re: Data corruption when using multiple devices with NVMEoF TCP
       [not found]                     ` <CAJS6Edgb+yCW5q5dA=MEkL0eYs4MXoopdiz72nhkxpkd5Fe_cA@mail.gmail.com>
  2020-12-29  1:25                       ` Sagi Grimberg
@ 2021-01-06  1:53                       ` Sagi Grimberg
  2021-01-06  8:21                         ` Hao Wang
  2021-01-11  8:56                         ` Hao Wang
  1 sibling, 2 replies; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-06  1:53 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

Hey Hao,

> Okay, will do that in a few days. Something else just popped up and I 
> have a limited time window to use some machines.
> 
> BTW, what is the performance implication of disabling merge? My usage 
> pattern is mostly sequential read and write, and write bandwidth is 
> pretty high.

Did you get a chance to look into this?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-06  1:53                       ` Sagi Grimberg
@ 2021-01-06  8:21                         ` Hao Wang
  2021-01-11  8:56                         ` Hao Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Hao Wang @ 2021-01-06  8:21 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Plan to get to this over the coming weekend. Sorry for the delay.


On Tue, Jan 5, 2021 at 5:53 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > Okay, will do that in a few days. Something else just popped up and I
> > have a limited time window to use some machines.
> >
> > BTW, what is the performance implication of disabling merge? My usage
> > pattern is mostly sequential read and write, and write bandwidth is
> > pretty high.
>
> Did you get a chance to look into this?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-06  1:53                       ` Sagi Grimberg
  2021-01-06  8:21                         ` Hao Wang
@ 2021-01-11  8:56                         ` Hao Wang
  2021-01-11 10:11                           ` Sagi Grimberg
  1 sibling, 1 reply; 23+ messages in thread
From: Hao Wang @ 2021-01-11  8:56 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Hey Sagi,

I exported 4 devices to the initiator, created a raid-0 array, and
copied a 98G directory with many ~100MB .gz files.
With the patch you gave on top of 58cf05f597b0 (fairly new), I saw
about 24K prints from dmesg. Below are some of them:
[ 3775.256547] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector
a388200 bvec: nsegs 19 size 77824 offset 0
[ 3775.256768] nvme_tcp: rq 19 (READ) data_len 131072 bio[1/2] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256774] nvme_tcp: rq 20 (READ) data_len 131072 bio[1/2] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256787] nvme_tcp: rq 5 (READ) data_len 131072 bio[1/2] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256791] nvme_tcp: rq 6 (READ) data_len 131072 bio[1/2] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256794] nvme_tcp: rq 117 (READ) data_len 131072 bio[1/2] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.256797] nvme_tcp: rq 118 (READ) data_len 131072 bio[1/2] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.256800] nvme_tcp: rq 5 (READ) data_len 262144 bio[1/4] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.257002] nvme_tcp: rq 21 (READ) data_len 131072 bio[1/2] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257006] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector
a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257009] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257012] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector
a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257014] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.257017] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector
a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.257020] nvme_tcp: rq 6 (READ) data_len 262144 bio[1/4] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.262587] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector
a388200 bvec: nsegs 19 size 77824 offset 0
[ 3775.262600] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[2/2] sector
a388298 bvec: nsegs 13 size 53248 offset 0
[ 3775.262610] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[1/4] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.262617] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[2/4] sector
a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.262623] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[3/4] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.262629] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[4/4] sector
a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.262635] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[1/4] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.262641] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[2/4] sector
a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.262647] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[3/4] sector
a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.262653] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[4/4] sector
a388698 bvec: nsegs 13 size 53248 offset 0
[ 3775.263009] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[1/2] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.263019] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[2/2] sector
a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.263027] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[1/2] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.263034] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[2/2] sector
a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.263040] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[1/2] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.263047] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[2/2] sector
a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.263052] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[1/2] sector
a388600 bvec: nsegs 19 size 77824 offset 0
[ 3775.263059] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[2/2] sector
a388698 bvec: nsegs 13 size 53248 offset 0
[ 3775.264341] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[1/2] sector
a388300 bvec: nsegs 19 size 77824 offset 0
[ 3775.264353] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[2/2] sector
a388398 bvec: nsegs 13 size 53248 offset 0
[ 3775.264361] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[1/2] sector
a388400 bvec: nsegs 19 size 77824 offset 0
[ 3775.264369] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[2/2] sector
a388498 bvec: nsegs 13 size 53248 offset 0
[ 3775.264380] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[1/2] sector
a388500 bvec: nsegs 19 size 77824 offset 0
[ 3775.264387] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[2/2] sector
a388598 bvec: nsegs 13 size 53248 offset 0
[ 3775.264410] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector
a388600 bvec: nsegs 19 size 77824 offset 0

Hao

On Tue, Jan 5, 2021 at 5:53 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Hey Hao,
>
> > Okay, will do that in a few days. Something else just popped up and I
> > have a limited time window to use some machines.
> >
> > BTW, what is the performance implication of disabling merge? My usage
> > pattern is mostly sequential read and write, and write bandwidth is
> > pretty high.
>
> Did you get a chance to look into this?


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-11  8:56                         ` Hao Wang
@ 2021-01-11 10:11                           ` Sagi Grimberg
       [not found]                             ` <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-11 10:11 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme


> Hey Sagi,

Hey Hao,

> I exported 4 devices to the initiator, created a raid-0 array, and
> copied a 98G directory with many ~100MB .gz files.
> With the patch you gave on top of 58cf05f597b0 (fairly new), I saw
> about 24K prints from dmesg. Below are some of them:

Yes, I understand it generated tons of prints, but it seems that
something is strange here.

> [ 3775.256547] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector
> a388200 bvec: nsegs 19 size 77824 offset 0

This is a read request and has 2 bios, one spans 19 4K buffers (starting
from sector a388200) and the second probably spans 13 4K buffers. The
host is asking the target to send 128K (data_len 131072), but I don't
see anywhere that the host is receiving the residual of the data
transfer..

Should be in the form of:

nvme_tcp: rq 22 (READ) data_len 131072 bio[2/2] sector 0xa388298 bvec: 
nsegs 13 size 53248 offset 0

In your entire log, do you see any (READ) print that spans bio that
is not [1/x]? e.g. a read that spans other bios in the request (like
[2/2], [2/3], etc..)?

> [ 3775.256768] nvme_tcp: rq 19 (READ) data_len 131072 bio[1/2] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256774] nvme_tcp: rq 20 (READ) data_len 131072 bio[1/2] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256787] nvme_tcp: rq 5 (READ) data_len 131072 bio[1/2] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256791] nvme_tcp: rq 6 (READ) data_len 131072 bio[1/2] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256794] nvme_tcp: rq 117 (READ) data_len 131072 bio[1/2] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256797] nvme_tcp: rq 118 (READ) data_len 131072 bio[1/2] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.256800] nvme_tcp: rq 5 (READ) data_len 262144 bio[1/4] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257002] nvme_tcp: rq 21 (READ) data_len 131072 bio[1/2] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257006] nvme_tcp: rq 22 (READ) data_len 131072 bio[1/2] sector
> a388600 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257009] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257012] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector
> a388600 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257014] nvme_tcp: rq 7 (READ) data_len 131072 bio[1/2] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257017] nvme_tcp: rq 8 (READ) data_len 131072 bio[1/2] sector
> a388600 bvec: nsegs 19 size 77824 offset 0
> [ 3775.257020] nvme_tcp: rq 6 (READ) data_len 262144 bio[1/4] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262587] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector
> a388200 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262600] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[2/2] sector
> a388298 bvec: nsegs 13 size 53248 offset 0

For the (WRITE) request we see the desired sequence: we first write the
content of the first bio (19 4K segments) and then the content of the
second bio (13 4K segments).

> [ 3775.262610] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[1/4] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262617] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[2/4] sector
> a388398 bvec: nsegs 13 size 53248 offset 0
> [ 3775.262623] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[3/4] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262629] nvme_tcp: rq 5 (WRITE) data_len 262144 bio[4/4] sector
> a388498 bvec: nsegs 13 size 53248 offset 0

Same here and on...

> [ 3775.262635] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[1/4] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262641] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[2/4] sector
> a388598 bvec: nsegs 13 size 53248 offset 0
> [ 3775.262647] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[3/4] sector
> a388600 bvec: nsegs 19 size 77824 offset 0
> [ 3775.262653] nvme_tcp: rq 6 (WRITE) data_len 262144 bio[4/4] sector
> a388698 bvec: nsegs 13 size 53248 offset 0
> [ 3775.263009] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[1/2] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.263019] nvme_tcp: rq 5 (WRITE) data_len 131072 bio[2/2] sector
> a388398 bvec: nsegs 13 size 53248 offset 0
> [ 3775.263027] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[1/2] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.263034] nvme_tcp: rq 6 (WRITE) data_len 131072 bio[2/2] sector
> a388498 bvec: nsegs 13 size 53248 offset 0
> [ 3775.263040] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[1/2] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.263047] nvme_tcp: rq 7 (WRITE) data_len 131072 bio[2/2] sector
> a388598 bvec: nsegs 13 size 53248 offset 0
> [ 3775.263052] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[1/2] sector
> a388600 bvec: nsegs 19 size 77824 offset 0
> [ 3775.263059] nvme_tcp: rq 8 (WRITE) data_len 131072 bio[2/2] sector
> a388698 bvec: nsegs 13 size 53248 offset 0
> [ 3775.264341] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[1/2] sector
> a388300 bvec: nsegs 19 size 77824 offset 0
> [ 3775.264353] nvme_tcp: rq 19 (WRITE) data_len 131072 bio[2/2] sector
> a388398 bvec: nsegs 13 size 53248 offset 0
> [ 3775.264361] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[1/2] sector
> a388400 bvec: nsegs 19 size 77824 offset 0
> [ 3775.264369] nvme_tcp: rq 20 (WRITE) data_len 131072 bio[2/2] sector
> a388498 bvec: nsegs 13 size 53248 offset 0
> [ 3775.264380] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[1/2] sector
> a388500 bvec: nsegs 19 size 77824 offset 0
> [ 3775.264387] nvme_tcp: rq 21 (WRITE) data_len 131072 bio[2/2] sector
> a388598 bvec: nsegs 13 size 53248 offset 0
> [ 3775.264410] nvme_tcp: rq 22 (WRITE) data_len 131072 bio[1/2] sector
> a388600 bvec: nsegs 19 size 77824 offset 0

From the code it seems like it should do the right thing assuming
that the data does arrive; I will look deeper.

Thanks for helping to dissect this issue.


* Re: Data corruption when using multiple devices with NVMEoF TCP
       [not found]                             ` <CAJS6Edi9Es1zR9QC+=kwVjAFAGYrEru4vibW42ffyWoMDutFhQ@mail.gmail.com>
@ 2021-01-12  0:36                               ` Sagi Grimberg
  2021-01-12  1:29                                 ` Sagi Grimberg
  0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-12  0:36 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme

Hey Hao,

> Here is the entire log (and it's a new one, i.e. above snippet not 
> included):
> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
> 
> What I found is the data corruption does not always happen, especially 
> when I copy a small directory. So I guess a lot of log entries should 
> just look fine.

So this seems to be a breakage that has existed for some time now with
multipage bvecs, and you are the first one to report it. It seems to be
related to bio merges, though it is strange to me that this is only
coming up now; perhaps it is the combination with raid0 that triggers
it, I'm not sure.

IIUC, this should resolve your issue, care to give it a go?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 973d5d683180..6bceadc204a8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)

  static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
  {
-       return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
-                       req->pdu_len - req->pdu_sent);
+       return min_t(size_t, req->iter.count,
+                       min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
+                               req->pdu_len - req->pdu_sent));
  }

  static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
@@ -223,7 +224,7 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
         struct request *rq = blk_mq_rq_from_pdu(req);
         struct bio_vec *vec;
         unsigned int size;
-       int nsegs;
+       int nsegs = 0;
         size_t offset;

         if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
@@ -233,11 +234,15 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
                 offset = 0;
         } else {
                 struct bio *bio = req->curr_bio;
+               struct bvec_iter bi;
+               struct bio_vec bv;

                 vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-               nsegs = bio_segments(bio);
+               bio_for_each_bvec(bv, bio, bi) {
+                       nsegs++;
+               }
                 size = bio->bi_iter.bi_size;
-               offset = bio->bi_iter.bi_bvec_done;
+               offset = mp_bvec_iter_offset(bio->bi_io_vec, bio->bi_iter) - vec->bv_offset;
         }

         iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
--


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-12  0:36                               ` Sagi Grimberg
@ 2021-01-12  1:29                                 ` Sagi Grimberg
  2021-01-12  2:22                                   ` Ming Lei
  2021-01-12  8:55                                   ` Hao Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-12  1:29 UTC (permalink / raw)
  To: Hao Wang; +Cc: Christoph Hellwig, Linux-nvme


> Hey Hao,
> 
>> Here is the entire log (and it's a new one, i.e. above snippet not 
>> included):
>> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing 
>>
>>
>> What I found is the data corruption does not always happen, especially 
>> when I copy a small directory. So I guess a lot of log entries should 
>> just look fine.
> 
> So this seems to be a breakage that has existed for some time now with
> multipage bvecs, and you are the first one to report it. It seems to be
> related to bio merges, though it is strange to me that this is only
> coming up now; perhaps it is the combination with raid0 that triggers
> it, I'm not sure.

OK, I think I understand what is going on. With multipage bvecs
bios can split in the middle of a bvec entry, and then merge
back with another bio.

The issue is that we are not capping the send length calculation for
the last bvec entry in that case.

I think that just this can also resolve the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 973d5d683180..c6b0a189a494 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)

  static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
  {
-       return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
-                       req->pdu_len - req->pdu_sent);
+       return min_t(size_t, req->iter.count,
+                       min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
+                               req->pdu_len - req->pdu_sent));
  }

  static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
--


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-12  1:29                                 ` Sagi Grimberg
@ 2021-01-12  2:22                                   ` Ming Lei
  2021-01-12  6:49                                     ` Sagi Grimberg
  2021-01-12  8:55                                   ` Hao Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Ming Lei @ 2021-01-12  2:22 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Hao Wang, linux-nvme

On Tue, Jan 12, 2021 at 9:33 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Hey Hao,
> >
> >> Here is the entire log (and it's a new one, i.e. above snippet not
> >> included):
> >> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
> >>
> >>
> >> What I found is the data corruption does not always happen, especially
> >> when I copy a small directory. So I guess a lot of log entries should
> >> just look fine.
> >
> > So this seems to be a breakage that has existed for some time now with
> > multipage bvecs, and you are the first one to report it. It seems to be
> > related to bio merges, though it is strange to me that this is only
> > coming up now; perhaps it is the combination with raid0 that triggers
> > it, I'm not sure.
>
> OK, I think I understand what is going on. With multipage bvecs
> bios can split in the middle of a bvec entry, and then merge
> back with another bio.

IMO, a bio split can be done in the middle of a bvec even when the bvec
is a single page. The split may just be triggered in the case of raid
over nvme-tcp, and I guess it might be triggered by device mapper too.


Thanks,
Ming


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-12  2:22                                   ` Ming Lei
@ 2021-01-12  6:49                                     ` Sagi Grimberg
  0 siblings, 0 replies; 23+ messages in thread
From: Sagi Grimberg @ 2021-01-12  6:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, Hao Wang, linux-nvme


>>> Hey Hao,
>>>
>>>> Here is the entire log (and it's a new one, i.e. above snippet not
>>>> included):
>>>> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
>>>>
>>>>
>>>> What I found is the data corruption does not always happen, especially
>>>> when I copy a small directory. So I guess a lot of log entries should
>>>> just look fine.
>>>
>>> So this seems to be a breakage that has existed for some time now with
>>> multipage bvecs, and you are the first one to report it. It seems to be
>>> related to bio merges, though it is strange to me that this is only
>>> coming up now; perhaps it is the combination with raid0 that triggers
>>> it, I'm not sure.
>>
>> OK, I think I understand what is going on. With multipage bvecs
>> bios can split in the middle of a bvec entry, and then merge
>> back with another bio.
> 
> IMO, a bio split can be done in the middle of a bvec even when the bvec
> is a single page. The split may just be triggered in the case of raid
> over nvme-tcp, and I guess it might be triggered by device mapper too.

Yes, and I couldn't find a case where it cannot happen, yet it was only
triggered with mdraid. I'll wait for Hao to verify and then send a
formal patch.


* Re: Data corruption when using multiple devices with NVMEoF TCP
  2021-01-12  1:29                                 ` Sagi Grimberg
  2021-01-12  2:22                                   ` Ming Lei
@ 2021-01-12  8:55                                   ` Hao Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Hao Wang @ 2021-01-12  8:55 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Linux-nvme

Yes, this patch fixes the problem! Thanks!

Tested on top of a0d54b4f5b21.

Hao

On Mon, Jan 11, 2021 at 5:29 PM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Hey Hao,
> >
> >> Here is the entire log (and it's a new one, i.e. above snippet not
> >> included):
> >> https://drive.google.com/file/d/16ArIs5-Jw4P2f17A_ftKLm1A4LQUFpmg/view?usp=sharing
> >>
> >>
> >> What I found is the data corruption does not always happen, especially
> >> when I copy a small directory. So I guess a lot of log entries should
> >> just look fine.
> >
> > So this seems to be a breakage that has existed for some time now with
> > multipage bvecs, and you are the first one to report it. It seems to be
> > related to bio merges, though it is strange to me that this is only
> > coming up now; perhaps it is the combination with raid0 that triggers
> > it, I'm not sure.
>
> OK, I think I understand what is going on. With multipage bvecs
> bios can split in the middle of a bvec entry, and then merge
> back with another bio.
>
> The issue is that we are not capping the send length calculation for
> the last bvec entry in that case.
>
> I think that just this can also resolve the issue:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 973d5d683180..c6b0a189a494 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -201,8 +201,9 @@ static inline size_t nvme_tcp_req_cur_offset(struct nvme_tcp_request *req)
>
>   static inline size_t nvme_tcp_req_cur_length(struct nvme_tcp_request *req)
>   {
> -       return min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
> -                       req->pdu_len - req->pdu_sent);
> +       return min_t(size_t, req->iter.count,
> +                       min_t(size_t, req->iter.bvec->bv_len - req->iter.iov_offset,
> +                               req->pdu_len - req->pdu_sent));
>   }
>
>   static inline size_t nvme_tcp_pdu_data_left(struct nvme_tcp_request *req)
> --


