Linux-RDMA Archive on lore.kernel.org
* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found] <CAAFE1bd9wuuobpe4VK7Ty175j7mWT+kRmHCNhVD+6R8MWEAqmw@mail.gmail.com>
@ 2019-11-28  1:57 ` Ming Lei
       [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-11-28  1:57 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-rdma,
	linux-scsi, target-devel, martin.petersen

Hello,

On Wed, Nov 27, 2019 at 02:38:42PM -0500, Stephen Rust wrote:
> Hi,
> 
> We recently began testing 5.4 in preparation for migration from 4.14. One
> of our tests found reproducible data corruption in 5.x kernels. The test
> consists of a few basic single-issue writes to an iSER attached ramdisk.
> The writes are subsequently verified with single-issue reads. We tracked
> the corruption down using git bisect. The issue appears to have started in
> 5.1 with the following commit:
> 
> 3d75ca0adef4280650c6690a0c4702a74a6f3c95 block: introduce multi-page bvec
> helpers
> 
> We wanted to bring this to your attention. A reproducer and the git bisect
> data follows below.
> 
> Our setup consists of two systems: A ramdisk exported in a LIO target from
> host A, iSCSI attached with iSER / RDMA from host B. Specific writes to the

Could you explain a bit what "iSCSI attached with iSER / RDMA" means? Is the
actual transport TCP over RDMA? Which target driver is involved?

> very end of the attached disk on B result in incorrect data being written
> to the remote disk. The writes appear to complete successfully on the
> client. We’ve also verified that the correct data is being sent over the
> network by tracing the RDMA flow. For reference, the tests were conducted
> on x86_64 Intel Skylake systems with Mellanox ConnectX5 NICs.

If I understand correctly, the LIO ramdisk doesn't generate any IO to the
block stack, see rd_execute_rw(), and the ramdisk should be one big/long
pre-allocated sgl, see rd_build_device_space().

This seems very strange, given that no bvec/bio is involved in the code
path from iscsi_target_rx_thread to rd_execute_rw. So far I have no idea
how commit 3d75ca0adef428065 could cause this issue, because that patch
only changes bvec/bio related code.
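
For background, a rough sketch of the bvec iterator semantics after the
multi-page bvec series (use_chunk() is a made-up helper, and this is only
context for readers, not a claim about where the bug lives): a single
bio_vec may now describe a physically contiguous buffer spanning several
pages, and the two iterators see it at different granularities.

static void walk_bio(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;

	/* multi-page granularity: one chunk may cover several pages */
	bio_for_each_bvec(bv, bio, iter)
		use_chunk(bv.bv_page, bv.bv_offset, bv.bv_len);

	/* single-page granularity: chunks never cross a page boundary, so
	 * bv_len is page-bounded and, when bv_offset isn't sector aligned,
	 * not necessarily a multiple of 512 either */
	bio_for_each_segment(bv, bio, iter)
		use_chunk(bv.bv_page, bv.bv_offset, bv.bv_len);
}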

> 
> The issue appears to lie on the target host side. The initiator kernel
> version does not appear to play a role. The target host exhibits the issue
> when running kernel version 5.1+.
> 
> To reproduce, given attached sda on client host B, write data at the end of
> the device:
> 
> 
> SIZE=$(blockdev --getsize64 /dev/sda)
> 
> SEEK=$((( $SIZE - 512 )))
> 
> # initialize device and seed data
> 
> dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> oflag=direct
> 
> dd if=/dev/urandom of=/tmp/random bs=512 count=1 oflag=direct
> 
> 
> # write the random data (note: not direct)
> 
> dd if=/tmp/random of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> 
> 
> # verify the data was written
> 
> dd if=/dev/sda of=/tmp/verify bs=512 count=1 skip=$SEEK iflag=skip_bytes
> iflag=direct
> 
> hexdump -xv /tmp/random > /tmp/random.hex
> 
> hexdump -xv /tmp/verify > /tmp/verify.hex
> 
> diff -u /tmp/random.hex /tmp/verify.hex

I just set up a LIO target exporting a ramdisk (2G) via iSCSI and ran the
above test via an iSCSI HBA, but still can't reproduce the issue.

> # first bad commit: [3d75ca0adef4280650c6690a0c4702a74a6f3c95] block:
> introduce multi-page bvec helpers
> 
> 
> Please advise. We have cycles and systems to help track down the issue. Let
> me know how best to assist.

Could you install bcc and collect the following trace on the target side
before you run the above test on the host side?

/usr/share/bcc/tools/stackcount -K rd_execute_rw


Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
@ 2019-11-28  2:58     ` Ming Lei
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-11-28  2:58 UTC (permalink / raw)
  To: Rob Townley
  Cc: Christoph Hellwig, Jens Axboe, Stephen Rust, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Wed, Nov 27, 2019 at 08:18:30PM -0600, Rob Townley wrote:
> On Wed, Nov 27, 2019 at 7:58 PM Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Hello,
> >
> > On Wed, Nov 27, 2019 at 02:38:42PM -0500, Stephen Rust wrote:
> > > Hi,
> > >
> > > We recently began testing 5.4 in preparation for migration from 4.14. One
> > > of our tests found reproducible data corruption in 5.x kernels. The test
> > > consists of a few basic single-issue writes to an iSER attached ramdisk.
> > > The writes are subsequently verified with single-issue reads. We tracked
> > > the corruption down using git bisect. The issue appears to have started
> > in
> > > 5.1 with the following commit:
> > >
> > > 3d75ca0adef4280650c6690a0c4702a74a6f3c95 block: introduce multi-page bvec
> > > helpers
> > >
> > > We wanted to bring this to your attention. A reproducer and the git
> > bisect
> > > data follows below.
> > >
> > > Our setup consists of two systems: A ramdisk exported in a LIO target
> > from
> > > host A, iSCSI attached with iSER / RDMA from host B. Specific writes to
> > the
> >
> > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > actual transport TCP over RDMA? What is related target driver involved?
> >
> > > very end of the attached disk on B result in incorrect data being written
> > > to the remote disk. The writes appear to complete successfully on the
> > > client. We’ve also verified that the correct data is being sent over the
> > > network by tracing the RDMA flow. For reference, the tests were conducted
> > > on x86_64 Intel Skylake systems with Mellanox ConnectX5 NICs.
> >
> > If I understand correctly, LIO ramdisk doesn't generate any IO to block
> > stack, see rd_execute_rw(), and the ramdisk should be one big/long
> > pre-allocated sgl, see rd_build_device_space().
> >
> > Seems very strange, given no bvec/bio is involved in this code
> > path from iscsi_target_rx_thread to rd_execute_rw. So far I have no idea
> > how commit 3d75ca0adef428065 causes this issue, because that patch
> > only changes bvec/bio related code.
> >
> > >
> > > The issue appears to lie on the target host side. The initiator kernel
> > > version does not appear to play a role. The target host exhibits the
> > issue
> > > when running kernel version 5.1+.
> > >
> > > To reproduce, given attached sda on client host B, write data at the end
> > of
> > > the device:
> > >
> > >
> > > SIZE=$(blockdev --getsize64 /dev/sda)
> > >
> > > SEEK=$((( $SIZE - 512 )))
> > >
> > > # initialize device and seed data
> > >
> > > dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> > > oflag=direct
> > >
> > > dd if=/dev/urandom of=/tmp/random bs=512 count=1 oflag=direct
> > >
> > >
> > > # write the random data (note: not direct)
> > >
> > > dd if=/tmp/random of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> > >
> > >
> > > # verify the data was written
> > >
> > > dd if=/dev/sda of=/tmp/verify bs=512 count=1 skip=$SEEK iflag=skip_bytes
> > > iflag=direct
> > >
> > > hexdump -xv /tmp/random > /tmp/random.hex
> > >
> > > hexdump -xv /tmp/verify > /tmp/verify.hex
> > >
> > > diff -u /tmp/random.hex /tmp/verify.hex
> >
> > I just setup one LIO for exporting ramdisk(2G) via iscsi, and run the
> > above test via iscsi HBA, still can't reproduce the issue.
> >
> > > # first bad commit: [3d75ca0adef4280650c6690a0c4702a74a6f3c95] block:
> > > introduce multi-page bvec helpers
> > >
> > >
> > > Please advise. We have cycles and systems to help track down the issue.
> > Let
> > > me know how best to assist.
> >
> > Could you install bcc and start to collect the following trace on target
> > side
> > before you run the above test in host side?
> >
> > /usr/share/bcc/tools/stackcount -K rd_execute_rw
> >
> >
> > Thanks,
> > Ming
> >
> 
> 
> Interesting case to follow, as there are many types of RamDisks.  The common
> tmpfs kind will use its RAM allocation and all free hard-drive space.
> 
> A RamDisk on CentOS 7 backed by LIO will overflow its size in RAM and fill
> up all remaining free space on spinning platters.  For example, with a 4GB
> RamDisk on a lightly used machine with 192GB of RAM and 16GB of free
> filesystem space, writes to the 4GB RamDisk will only error out at 21GB,
> when there is no space left on the filesystem.
> 
> dd if=/dev/zero of=/dev/iscsiRamDisk
> will keep writing way past 4GB and not stop until the hard drive is full,
> which is totally different from normal disks.
> 
> Wonder what exact kind of RamDisk is in that kernel?

In my test, it is the LIO built-in ramdisk:

/backstores/ramdisk> create rd0 2G
Created ramdisk rd0 with size 2G.
/backstores/ramdisk> ls
o- ramdisk ......................................................................... [Storage Objects: 1]
  o- rd0 ......................................................................... [(2.0GiB) deactivated]
    o- alua ............................................................................ [ALUA Groups: 1]
      o- default_tg_pt_gp ................................................ [ALUA state: Active/optimized]

Stephen, could you share with us how you set up the ramdisk in your test?

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
@ 2019-11-28  4:25         ` Stephen Rust
  2019-11-28  5:51           ` Rob Townley
  2019-11-28  9:12         ` Ming Lei
  1 sibling, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-11-28  4:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

[Apologies for dup, re-sending without text formatting to lists]

Hi,

Thanks for your reply.

I agree it does seem surprising that the git bisect pointed to this
particular commit when tracking down this issue.

> Stephen, could you share us how you setup the ramdisk in your test?

The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
/dev/ram*). We configure it as a "block" backstore in LIO, not using
the built-in LIO ramdisk.

LIO configuration is as follows:

  o- backstores .......................................................... [...]
  | o- block .............................................. [Storage Objects: 1]
  | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
[/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB)
write-thru activated]
  | |   o- alua ............................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ................... [ALUA state: Active/optimized]
  | o- fileio ............................................. [Storage Objects: 0]
  | o- pscsi .............................................. [Storage Objects: 0]
  | o- ramdisk ............................................ [Storage Objects: 0]
  o- iscsi ........................................................ [Targets: 1]
  | o- iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
 [TPGs: 1]
  |   o- tpg1 ...................................... [no-gen-acls, auth per-acl]
  |     o- acls ...................................................... [ACLs: 1]
  |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped LUNs: 1]
  |     |   o- mapped_lun0  [lun0
block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
  |     o- luns ...................................................... [LUNs: 1]
  |     | o- lun0
[block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
(/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2)
(default_tg_pt_gp)]
  |     o- portals ................................................ [Portals: 1]
  |       o- 0.0.0.0:3260 ............................................... [iser]

> > > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > > actual transport TCP over RDMA? What is related target driver involved?

iSER is the iSCSI extension for RDMA, and it is important to note that
we have _only_ reproduced this when the writes occur over RDMA, with
the target portal in LIO having enabled "iser". The iscsi client
(using iscsiadm) connects to the target directly over iSER. We use the
Mellanox ConnectX-5 Ethernet NICs (mlx5* module) for this purpose,
which utilizes RoCE (RDMA over Converged Ethernet) instead of TCP.

The identical ramdisk configuration using TCP/IP target in LIO has
_not_ reproduced this issue for us.

> > > /usr/share/bcc/tools/stackcount -K rd_execute_rw

I installed bcc and used the stackcount tool to trace rd_execute_rw,
but I suspect because we are not using the built-in LIO ramdisk this
did not catch anything. Are there other function traces we can provide
for you?

Thanks,
Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-11-28  4:25         ` Stephen Rust
@ 2019-11-28  5:51           ` Rob Townley
  0 siblings, 0 replies; 24+ messages in thread
From: Rob Townley @ 2019-11-28  5:51 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block, linux-rdma,
	linux-scsi, martin.petersen, target-devel

Interesting case to follow, as there are many types of RamDisks.  The
common tmpfs kind will use its RAM allocation and all free hard-drive
space.

A RamDisk on CentOS 7 backed by LIO will overflow its size in RAM and
fill up all remaining free space on spinning platters.  For example,
with a 4GB RamDisk on a lightly used machine with 192GB of RAM and 16GB
of free filesystem space, writes to the 4GB RamDisk will only error out
at 21GB, when there is no space left on the filesystem.

dd if=/dev/zero of=/dev/iscsiRamDisk
will keep writing way past 4GB and not stop until the hard drive is
full, which is totally different from normal disks.

Wonder what exact kind of RamDisk is in that kernel?

On Wed, Nov 27, 2019 at 10:26 PM Stephen Rust <srust@blockbridge.com> wrote:
>
> [Apologies for dup, re-sending without text formatting to lists]
>
> Hi,
>
> Thanks for your reply.
>
> I agree it does seem surprising that the git bisect pointed to this
> particular commit when tracking down this issue.
>
> > Stephen, could you share us how you setup the ramdisk in your test?
>
> The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
> /dev/ram*). We configure it as a "block" backstore in LIO, not using
> the built-in LIO ramdisk.
>
> LIO configuration is as follows:
>
>   o- backstores .......................................................... [...]
>   | o- block .............................................. [Storage Objects: 1]
>   | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> [/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB)
> write-thru activated]
>   | |   o- alua ............................................... [ALUA Groups: 1]
>   | |     o- default_tg_pt_gp ................... [ALUA state: Active/optimized]
>   | o- fileio ............................................. [Storage Objects: 0]
>   | o- pscsi .............................................. [Storage Objects: 0]
>   | o- ramdisk ............................................ [Storage Objects: 0]
>   o- iscsi ........................................................ [Targets: 1]
>   | o- iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
>  [TPGs: 1]
>   |   o- tpg1 ...................................... [no-gen-acls, auth per-acl]
>   |     o- acls ...................................................... [ACLs: 1]
>   |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped LUNs: 1]
>   |     |   o- mapped_lun0  [lun0
> block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
>   |     o- luns ...................................................... [LUNs: 1]
>   |     | o- lun0
> [block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> (/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2)
> (default_tg_pt_gp)]
>   |     o- portals ................................................ [Portals: 1]
>   |       o- 0.0.0.0:3260 ............................................... [iser]
>
> > > > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > > > actual transport TCP over RDMA? What is related target driver involved?
>
> iSER is the iSCSI extension for RDMA, and it is important to note that
> we have _only_ reproduced this when the writes occur over RDMA, with
> the target portal in LIO having enabled "iser". The iscsi client
> (using iscsiadm) connects to the target directly over iSER. We use the
> Mellanox ConnectX-5 Ethernet NICs (mlx5* module) for this purpose,
> which utilizes RoCE (RDMA over Converged Ethernet) instead of TCP.
>
> The identical ramdisk configuration using TCP/IP target in LIO has
> _not_ reproduced this issue for us.
>
> > > > /usr/share/bcc/tools/stackcount -K rd_execute_rw
>
> I installed bcc and used the stackcount tool to trace rd_execute_rw,
> but I suspect because we are not using the built-in LIO ramdisk this
> did not catch anything. Are there other function traces we can provide
> for you?
>
> Thanks,
> Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
  2019-11-28  4:25         ` Stephen Rust
@ 2019-11-28  9:12         ` Ming Lei
  2019-12-02 18:42           ` Stephen Rust
  1 sibling, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-11-28  9:12 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Wed, Nov 27, 2019 at 11:14:46PM -0500, Stephen Rust wrote:
> Hi,
> 
> Thanks for your reply.
> 
> I agree it does seem surprising that the git bisect pointed to this
> particular commit when tracking down this issue.
> 
> The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
> /dev/ram*). We configure it as a "block" backstore in LIO, not using the
> built-in LIO ramdisk.

Then it isn't strange any more, since the iblock code uses the bio interface.
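
For context, roughly how the iblock backstore feeds the command's
scatterlist into the bio layer; this is a paraphrase from memory rather
than the exact code (see iblock_execute_rw() in target_core_iblock.c), and
it shows why whatever offset/length the fabric driver put into the sgl
ends up directly in the bio's bvecs:

	for_each_sg(sgl, sg, sgl_nents, i) {
		while (bio_add_page(bio, sg_page(sg), sg->length, sg->offset)
				!= sg->length) {
			/* current bio is full: submit it, allocate a new
			 * bio and retry adding this sg element */
		}
	}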

> 
> LIO configuration is as follows:
> 
>   o- backstores ..........................................................
> [...]
>   | o- block .............................................. [Storage
> Objects: 1]
>   | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
>  [/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB) write-thru
> activated]
>   | |   o- alua ............................................... [ALUA
> Groups: 1]
>   | |     o- default_tg_pt_gp ................... [ALUA state:
> Active/optimized]
>   | o- fileio ............................................. [Storage
> Objects: 0]
>   | o- pscsi .............................................. [Storage
> Objects: 0]
>   | o- ramdisk ............................................ [Storage
> Objects: 0]
>   o- iscsi ........................................................
> [Targets: 1]
>   | o-
> iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
>  [TPGs: 1]
>   |   o- tpg1 ...................................... [no-gen-acls, auth
> per-acl]
>   |     o- acls ......................................................
> [ACLs: 1]
>   |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped
> LUNs: 1]
>   |     |   o- mapped_lun0  [lun0
> block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
>   |     o- luns ......................................................
> [LUNs: 1]
>   |     | o- lun0  [block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> (/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2) (default_tg_pt_gp)]
>   |     o- portals ................................................
> [Portals: 1]
>   |       o- 0.0.0.0:3260 ...............................................
> [iser]
> 
> 
> iSER is the iSCSI extension for RDMA, and it is important to note that we
> have _only_ reproduced this when the writes occur over RDMA, with the
> target portal in LIO having enabled "iser". The iscsi client (using
> iscsiadm) connects to the target directly over iSER. We use the Mellanox
> ConnectX-5 Ethernet NICs (mlx5* module) for this purpose, which utilizes
> RoCE (RDMA over Converged Ethernet) instead of TCP.

I may be able to get a machine with a Mellanox NIC. Is it easy to set up and
reproduce on a single machine (with both host and target on the same machine)?

> 
> The identical ramdisk configuration using TCP/IP target in LIO has _not_
> reproduced this issue for us.

Yeah, I just tried iblock over brd, and can't reproduce it.

> 
> I installed bcc and used the stackcount tool to trace rd_execute_rw, but I
> suspect because we are not using the built-in LIO ramdisk this did not
> catch anything. Are there other function traces we can provide for you?

Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.

[root@ktest-01 func]# cat ilo.bt
kprobe:iblock_execute_rw
{
    @start[tid]=1;
}

kretprobe:iblock_execute_rw
{
    @start[tid]=0;
}

kprobe:bio_add_page
/@start[tid]/
{
  printf("%d %d\n", arg2, arg3);
}
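
For reference when reading the probe output: arg2 and arg3 correspond to the
len and offset parameters of bio_add_page(), assuming the usual mainline
prototype, i.e. the length and the in-page offset of the buffer being added:

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int offset);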



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-11-28  9:12         ` Ming Lei
@ 2019-12-02 18:42           ` Stephen Rust
  2019-12-03  0:58             ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-12-02 18:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi Ming,

> I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> just in the local machine(both host and target are setup on same machine)?

Yes, I have reproduced locally on one machine (using the IP address of
the Mellanox NIC as the target IP), with iser enabled on the target,
and iscsiadm connected via iser.

e.g.:
target:
/iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
iSER enable now: True

  | |   o- portals
....................................................................................................
[Portals: 1]
  | |     o- 0.0.0.0:3260
...................................................................................................
[iser]

client:
# iscsiadm -m node -o update --targetname <target> -n
iface.transport_name -v iser
# iscsiadm -m node --targetname <target> --login
# iscsiadm -m session
iser: [3] 172.16.XX.XX:3260,1
iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)

> Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.

Here is the output of this trace from a failed run:

# bpftrace lio.bt
modprobe: FATAL: Module kheaders not found.
Attaching 3 probes...
512 76
4096 0
4096 0
4096 0
4096 76
512 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
^C

@start[14475]: 0
@start[14384]: 0
@start[6764]: 0
@start[14477]: 0
@start[7771]: 0
@start[13788]: 0
@start[6879]: 0
@start[11842]: 0
@start[7765]: 0
@start[7782]: 0
@start[14476]: 0
@start[14385]: 0
@start[14474]: 0
@start[11564]: 0
@start[7753]: 0
@start[7786]: 0
@start[7791]: 0
@start[6878]: 0
@start[7411]: 0
@start[14473]: 0
@start[11563]: 0
@start[7681]: 0
@start[7756]: 0


Thanks,
Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-02 18:42           ` Stephen Rust
@ 2019-12-03  0:58             ` Ming Lei
  2019-12-03  3:04               ` Stephen Rust
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-12-03  0:58 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 01:42:15PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> > I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> > just in the local machine(both host and target are setup on same machine)?
> 
> Yes, I have reproduced locally on one machine (using the IP address of
> the Mellanox NIC as the target IP), with iser enabled on the target,
> and iscsiadm connected via iser.
> 
> e.g.:
> target:
> /iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
> iSER enable now: True
> 
>   | |   o- portals
> ....................................................................................................
> [Portals: 1]
>   | |     o- 0.0.0.0:3260
> ...................................................................................................
> [iser]
> 
> client:
> # iscsiadm -m node -o update --targetname <target> -n
> iface.transport_name -v iser
> # iscsiadm -m node --targetname <target> --login
> # iscsiadm -m session
> iser: [3] 172.16.XX.XX:3260,1
> iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)
> 
> > Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.
> 
> Here is the output of this trace from a failed run:
> 
> # bpftrace lio.bt
> modprobe: FATAL: Module kheaders not found.
> Attaching 3 probes...
> 512 76
> 4096 0
> 4096 0
> 4096 0
> 4096 76

The above buffer might be the reason: 4096 is the length and 76 is the
offset, which means the added buffer crosses two pages and, at the same
time, isn't aligned.

We need to figure out why the magic offset of 76 is passed in from the
target or the driver.
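
To make the page-crossing concrete, a quick sketch of the arithmetic,
assuming PAGE_SIZE is 4096:

	/* a 4096-byte buffer starting at in-page offset 76 */
	unsigned int off = 76, len = 4096;
	unsigned int in_first_page  = 4096 - off;           /* 4020 bytes */
	unsigned int in_second_page = len - in_first_page;  /*   76 bytes */
	/* neither 4020 nor 76 is a multiple of 512 */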

Please install bcc and collect the following log:

/usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 512) != 0) "%d %d", arg3, arg4 '


Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  0:58             ` Ming Lei
@ 2019-12-03  3:04               ` Stephen Rust
  2019-12-03  3:14                 ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-12-03  3:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi Ming,

The log you requested with the (arg4 & 512 != 0) predicate did not
match anything. However, I checked specifically for the offset of "76"
and came up with the following stack traces:

# /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 == 76)) "%d %d",
arg3, arg4 '
PID     TID     COMM            FUNC             -
7782    7782    kworker/19:1H   bio_add_page     512 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

14475   14475   kworker/13:1H   bio_add_page     4096 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

Thanks,
Steve

On Mon, Dec 2, 2019 at 7:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Dec 02, 2019 at 01:42:15PM -0500, Stephen Rust wrote:
> > Hi Ming,
> >
> > > I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> > > just in the local machine(both host and target are setup on same machine)?
> >
> > Yes, I have reproduced locally on one machine (using the IP address of
> > the Mellanox NIC as the target IP), with iser enabled on the target,
> > and iscsiadm connected via iser.
> >
> > e.g.:
> > target:
> > /iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
> > iSER enable now: True
> >
> >   | |   o- portals
> > ....................................................................................................
> > [Portals: 1]
> >   | |     o- 0.0.0.0:3260
> > ...................................................................................................
> > [iser]
> >
> > client:
> > # iscsiadm -m node -o update --targetname <target> -n
> > iface.transport_name -v iser
> > # iscsiadm -m node --targetname <target> --login
> > # iscsiadm -m session
> > iser: [3] 172.16.XX.XX:3260,1
> > iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)
> >
> > > Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.
> >
> > Here is the output of this trace from a failed run:
> >
> > # bpftrace lio.bt
> > modprobe: FATAL: Module kheaders not found.
> > Attaching 3 probes...
> > 512 76
> > 4096 0
> > 4096 0
> > 4096 0
> > 4096 76
>
> The above buffer might be the reason, 4096 is length, and 76 is the
> offset, that means the added buffer crosses two pages, meantime the
> buffer isn't aligned.
>
> We need to figure out why the magic 76 offset is passed from target or
> driver.
>
> Please install bcc and collect the following log:
>
> /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 512) != 0) "%d %d", arg3, arg4 '
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:04               ` Stephen Rust
@ 2019-12-03  3:14                 ` Ming Lei
  2019-12-03  3:26                   ` Stephen Rust
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-12-03  3:14 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 10:04:20PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> The log you requested with the (arg4 & 512 != 0) predicate did not

oops, it should have been (arg4 & 511) != 0.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:14                 ` Ming Lei
@ 2019-12-03  3:26                   ` Stephen Rust
  2019-12-03  3:50                     ` Stephen Rust
  2019-12-03  4:15                     ` Ming Lei
  0 siblings, 2 replies; 24+ messages in thread
From: Stephen Rust @ 2019-12-03  3:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

> oops, it should have been (arg4 & 511) != 0.

Yep, there they are:

# /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
%d", arg3, arg4'
PID     TID     COMM            FUNC             -
7411    7411    kworker/31:1H   bio_add_page     512 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

7753    7753    kworker/26:1H   bio_add_page     4096 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:26                   ` Stephen Rust
@ 2019-12-03  3:50                     ` Stephen Rust
  2019-12-03 12:45                       ` Ming Lei
  2019-12-03  4:15                     ` Ming Lei
  1 sibling, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-12-03  3:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi,

Another datapoint.

I enabled "isert_debug" tracing and re-ran the test. Here is a small
snippet of the debug data. FWIW, the "length of 76" in the "lkey
mismatch" is a pattern that repeats quite often during the exchange.


Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6457000, iSCSI opcode: 0x01, ITT: 0x00000023, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000009bbe54ca lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce45010 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000238f9047
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000238f9047
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a645a000, iSCSI opcode: 0x01, ITT: 0x00000024, flags: 0xa1 dlen:
512
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: isert_handle_scsi_cmd: Transfer
Immediate imm_data_len: 512
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000004b902cb9 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce55b70 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 0000000069929548
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 0000000069929548
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a645d000, iSCSI opcode: 0x01, ITT: 0x00000025, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000006d694fe9 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce56140 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000a666ae3c
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000a666ae3c
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6460000, iSCSI opcode: 0x01, ITT: 0x00000026, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 00000000dd22ea75 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce5e6f0 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 000000009b63dcb0
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 000000009b63dcb0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6463000, iSCSI opcode: 0x01, ITT: 0x00000027, flags: 0xc1 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER_RSV:
read_stag: 0x4000009a read_va: 0xac29e6800
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: isert_put_datain: Cmd:
00000000fe3d39bf RDMA_WRITE data_length: 32
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 00000000f5f10cf7 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce56710 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_datain: Cmd:
00000000fe3d39bf posted RDMA_WRITE for iSER Data READ rc: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000fe3d39bf
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000fe3d39bf

[snip]

I could post the whole isert debug log somewhere if you'd like?

Thanks,
Steve

On Mon, Dec 2, 2019 at 10:26 PM Stephen Rust <srust@blockbridge.com> wrote:
>
> > oops, it should have been (arg4 & 511) != 0.
>
> Yep, there they are:
>
> # /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
> %d", arg3, arg4'
> PID     TID     COMM            FUNC             -
> 7411    7411    kworker/31:1H   bio_add_page     512 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]
>
> 7753    7753    kworker/26:1H   bio_add_page     4096 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:26                   ` Stephen Rust
  2019-12-03  3:50                     ` Stephen Rust
@ 2019-12-03  4:15                     ` Ming Lei
  1 sibling, 0 replies; 24+ messages in thread
From: Ming Lei @ 2019-12-03  4:15 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 10:26:28PM -0500, Stephen Rust wrote:
> > oops, it should have been (arg4 & 511) != 0.
> 
> Yep, there they are:
> 
> # /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
> %d", arg3, arg4'
> PID     TID     COMM            FUNC             -
> 7411    7411    kworker/31:1H   bio_add_page     512 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]
> 
> 7753    7753    kworker/26:1H   bio_add_page     4096 76

The issue should be in brd_make_request(), which assumes that
bvec.bv_len is a multiple of 512 bytes.

I will put together a patch for you tomorrow.
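
Roughly, the loop in question looks like this (paraphrased; the full context
is visible in the brd patch later in this thread):

	bio_for_each_segment(bvec, bio, iter) {
		unsigned int len = bvec.bv_len;

		err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
				  bio_op(bio), sector);
		...
		/* a len that is not a multiple of 512 silently loses its
		 * sub-sector remainder here */
		sector += len >> SECTOR_SHIFT;
	}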

Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:50                     ` Stephen Rust
@ 2019-12-03 12:45                       ` Ming Lei
  2019-12-03 19:56                         ` Stephen Rust
  0 siblings, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-12-03 12:45 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Mon, Dec 02, 2019 at 10:50:32PM -0500, Stephen Rust wrote:
> Hi,
> 
> Another datapoint.
> 
> I enabled "isert_debug" tracing and re-ran the test. Here is a small
> snippet of the debug data. FWIW, the "length of 76" in the "lkey
> mismatch" is a pattern that repeats quite often during the exchange.

That is because ISER_HEADERS_LEN is 76.

From our trace, 76 is bvec->bv_offset. Is it possible that the IO buffer
directly follows the iSER headers, assuming that iser applies zero-copy?
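
(A plausible breakdown of that value, quoting the header layout from memory
so it is worth double-checking: the iSER header is 28 bytes and the iSCSI
Basic Header Segment is 48 bytes, so ISER_HEADERS_LEN = 28 + 48 = 76. If
immediate data lands in the same registered buffer right after the headers,
the SCSI payload starts 76 bytes into a page, which would explain
bv_offset = 76.)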

BTW, you may try the attached test patch. If this patch fixes the issue,
that confirms it is really caused by an un-aligned buffer, and the iser
driver needs to be fixed.

From 0368ee8a756384116fa1d0415f51389d438a6e40 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 3 Dec 2019 20:00:53 +0800
Subject: [PATCH] brd: handle un-aligned bvec->bv_len

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/brd.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c548a5a6c1a0..9ea1894c820d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -190,13 +190,15 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
 static void copy_to_brd(struct brd_device *brd, const void *src,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned off_in_sec, size_t n)
 {
 	struct page *page;
 	void *dst;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	offset += off_in_sec;
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	BUG_ON(!page);
@@ -207,7 +209,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 
 	if (copy < n) {
 		src += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += (copy + off_in_sec) >> SECTOR_SHIFT;
 		copy = n - copy;
 		page = brd_lookup_page(brd, sector);
 		BUG_ON(!page);
@@ -222,13 +224,15 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
  * Copy n bytes to dst from the brd starting at sector. Does not sleep.
  */
 static void copy_from_brd(void *dst, struct brd_device *brd,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned off_in_sec, size_t n)
 {
 	struct page *page;
 	void *src;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	offset += off_in_sec;
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	if (page) {
@@ -240,7 +244,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 
 	if (copy < n) {
 		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += (copy + off_in_sec) >> SECTOR_SHIFT;
 		copy = n - copy;
 		page = brd_lookup_page(brd, sector);
 		if (page) {
@@ -257,7 +261,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
  */
 static int brd_do_bvec(struct brd_device *brd, struct page *page,
 			unsigned int len, unsigned int off, unsigned int op,
-			sector_t sector)
+			sector_t sector, unsigned int off_in_sec)
 {
 	void *mem;
 	int err = 0;
@@ -270,11 +274,11 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 
 	mem = kmap_atomic(page);
 	if (!op_is_write(op)) {
-		copy_from_brd(mem + off, brd, sector, len);
+		copy_from_brd(mem + off, brd, sector, off_in_sec, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		copy_to_brd(brd, mem + off, sector, len);
+		copy_to_brd(brd, mem + off, sector, off_in_sec, len);
 	}
 	kunmap_atomic(mem);
 
@@ -287,6 +291,7 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	struct brd_device *brd = bio->bi_disk->private_data;
 	struct bio_vec bvec;
 	sector_t sector;
+	unsigned offset_in_sec = 0;
 	struct bvec_iter iter;
 
 	sector = bio->bi_iter.bi_sector;
@@ -296,12 +301,14 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 		int err;
+		unsigned int secs = len >> SECTOR_SHIFT;
 
 		err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
-				  bio_op(bio), sector);
+				  bio_op(bio), sector, offset_in_sec);
 		if (err)
 			goto io_error;
-		sector += len >> SECTOR_SHIFT;
+		sector += secs;
+		offset_in_sec = len - (secs << SECTOR_SHIFT);
 	}
 
 	bio_endio(bio);
@@ -319,7 +326,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
-	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector);
+	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector, 0);
 	page_endio(page, op_is_write(op), err);
 	return err;
 }
-- 
2.20.1


Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 12:45                       ` Ming Lei
@ 2019-12-03 19:56                         ` Stephen Rust
  2019-12-04  1:05                           ` Ming Lei
  2019-12-04  2:39                           ` Ming Lei
  0 siblings, 2 replies; 24+ messages in thread
From: Stephen Rust @ 2019-12-03 19:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

Hi Ming,

Thanks very much for the patch.

> BTW, you may try the attached test patch. If the issue can be fixed by
> this patch, that means it is really caused by un-aligned buffer, and
> the iser driver needs to be fixed.

I have tried the patch, and re-run the test. Results are mixed.

To recap, our test writes the last bytes of an iser attached iscsi
device. The target device is a LIO iblock, backed by a brd ramdisk.
The client does a simple `dd`, doing a seek to "size - offset" of the
device, and writing a buffer of "length" which is equivalent to the
offset.

For example, to test a write at a 512 offset, seek to device "size -
512", and write a length of data 512 bytes.

WITHOUT the patch, writing data at the following offsets from the end
of the device failed to write all the correct data (rather, the write
succeeded, but reading the data back it was invalid):

- failed: 512, 1024, 2048, 4096, 8192

Anything larger worked fine.

WITH the patch applied, writing data up to an offset of 4096 all now
worked and verified correctly. However, offsets between 4096 and 8192
all still failed. I started at 512, and incremented by 512 all the way
up to 16384. The following offsets all failed to verify the write:

- failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192

Anything larger continues to work fine with the patch.

As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:

8192 76
4096 0
4096 0
8192 76
4096 0
4096 0
...
[snip]

What do you think the appropriate next steps are? Do you have an idea why
the specific "multi-page bvec helpers" commit could have exposed this
particular latent issue? Please let me know what else I can try, or what
additional data I can provide for you.

Thanks,
Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 19:56                         ` Stephen Rust
@ 2019-12-04  1:05                           ` Ming Lei
  2019-12-04 17:23                             ` Stephen Rust
  2019-12-04  2:39                           ` Ming Lei
  1 sibling, 1 reply; 24+ messages in thread
From: Ming Lei @ 2019-12-04  1:05 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Tue, Dec 03, 2019 at 02:56:08PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks very much for the patch.
> 
> > BTW, you may try the attached test patch. If the issue can be fixed by
> > this patch, that means it is really caused by un-aligned buffer, and
> > the iser driver needs to be fixed.
> 
> I have tried the patch, and re-run the test. Results are mixed.
> 
> To recap, our test writes the last bytes of an iser attached iscsi
> device. The target device is a LIO iblock, backed by a brd ramdisk.
> The client does a simple `dd`, doing a seek to "size - offset" of the
> device, and writing a buffer of "length" which is equivalent to the
> offset.
> 
> For example, to test a write at a 512 offset, seek to device "size -
> 512", and write a length of data 512 bytes.
> 
> WITHOUT the patch, writing data at the following offsets from the end
> of the device failed to write all the correct data (rather, the write
> succeeded, but reading the data back it was invalid):
> 
> - failed: 512,1024, 2048, 4096, 8192
> 
> Anything larger worked fine.
> 
> WITH the patch applied, writing data up to an offset of 4096 all now
> worked and verified correctly. However, offsets between 4096 and 8192
> all still failed. I started at 512, and incremented by 512 all the way
> up to 16384. The following offsets all failed to verify the write:
> 
> - failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
> 
> Anything larger continues to work fine with the patch.
> 
> As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:
> 
> 8192 76
> 4096 0
> 4096 0
> 8192 76
> 4096 0
> 4096 0
> ...
> [snip]
> 
> What do you think are appropriate next steps?

OK, so my guess appears to be correct, and the issue is related to the
un-aligned bvec->bv_offset.

So, first of all, I'd suggest investigating from the RDMA driver side to
see why an un-aligned buffer is passed to the block layer.

According to the previous discussion, a 512-byte aligned buffer should be
provided to the block layer.

So it looks like the driver needs to be fixed.
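
Purely as an illustration of the expectation described above (this helper
is hypothetical, not an existing kernel API):

	/* hypothetical check: for a 512-byte logical block device, both the
	 * in-page offset and the length of each bvec are expected to be
	 * sector aligned */
	static bool bvec_is_sector_aligned(const struct bio_vec *bv)
	{
		return ((bv->bv_offset | bv->bv_len) & 511) == 0;
	}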

> Do you think you have an
> idea on why the specific "multi-page bvec helpers" commit could have
> exposed this particular latent issue? Please let me know what else I
> can try, or additional data I can provide for you.
 
The patch might not cover the big-offset case. Could you collect a bpftrace
via the following script when you reproduce the issue with a >4096 offset?

kprobe:iblock_execute_rw
{
    @start[tid]=1;
}

kretprobe:iblock_execute_rw
{
    @start[tid]=0;
}

kprobe:bio_add_page
/@start[tid]/
{
  printf("%d %d\n", arg2, arg3);
}

kprobe:brd_do_bvec
{
  printf("%d %d %d %d\n", arg2, arg3, arg4, arg5);
}
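
For reading the output: with the test patch applied, brd_do_bvec() has the
following shape (as in the patch earlier in this thread), so the four
columns printed by the second probe are the bvec length, its in-page
offset, the op (1 for a write, 0 for a read) and the starting sector:

	static int brd_do_bvec(struct brd_device *brd, struct page *page,
			       unsigned int len, unsigned int off,
			       unsigned int op, sector_t sector,
			       unsigned int off_in_sec);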


Thanks,
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 19:56                         ` Stephen Rust
  2019-12-04  1:05                           ` Ming Lei
@ 2019-12-04  2:39                           ` Ming Lei
  1 sibling, 0 replies; 24+ messages in thread
From: Ming Lei @ 2019-12-04  2:39 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Tue, Dec 03, 2019 at 02:56:08PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks very much for the patch.
> 
> > BTW, you may try the attached test patch. If the issue can be fixed by
> > this patch, that means it is really caused by un-aligned buffer, and
> > the iser driver needs to be fixed.
> 
> I have tried the patch, and re-run the test. Results are mixed.
> 
> To recap, our test writes the last bytes of an iser attached iscsi
> device. The target device is a LIO iblock, backed by a brd ramdisk.
> The client does a simple `dd`, doing a seek to "size - offset" of the
> device, and writing a buffer of "length" which is equivalent to the
> offset.
> 
> For example, to test a write at a 512 offset, seek to device "size -
> 512", and write a length of data 512 bytes.
> 
> WITHOUT the patch, writing data at the following offsets from the end
> of the device failed to write all the correct data (rather, the write
> succeeded, but reading the data back it was invalid):
> 
> - failed: 512,1024, 2048, 4096, 8192
> 
> Anything larger worked fine.
> 
> WITH the patch applied, writing data up to an offset of 4096 all now
> worked and verified correctly. However, offsets between 4096 and 8192
> all still failed. I started at 512, and incremented by 512 all the way
> up to 16384. The following offsets all failed to verify the write:
> 
> - failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
> 
> Anything larger continues to work fine with the patch.
> 
> As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:
> 
> 8192 76
> 4096 0
> 4096 0
> 8192 76
> 4096 0
> 4096 0

The following delta change against the last patch should fix the issue
with bvec lengths larger than 4096:

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 9ea1894c820d..49e37a7dda63 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -308,7 +308,7 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
                if (err)
                        goto io_error;
                sector += secs;
-               offset_in_sec = len - (secs << SECTOR_SHIFT);
+               offset_in_sec += len - (secs << SECTOR_SHIFT);
        }

        bio_endio(bio);

However, the brd change is just a workaround to confirm the issue.
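
To spell out why the accumulation matters, a quick worked sketch using the
segment split seen in the earlier trace (an 8192-byte payload at page
offset 76 is split into 4020-, 4096- and 76-byte segments; S is the
starting sector):

	seg 1: len 4020 -> brd_do_bvec(sector S,    off_in_sec 0)    byte S*512
	       carry = 4020 - (7 << 9)         = 436
	seg 2: len 4096 -> brd_do_bvec(sector S+7,  off_in_sec 436)  byte S*512 + 4020
	       carry = 436 + (4096 - (8 << 9)) = 436
	seg 3: len   76 -> brd_do_bvec(sector S+15, off_in_sec 436)  byte S*512 + 8116

With '=' instead of '+=', the carry resets to 0 after segment 2, and
segment 3 lands 436 bytes too early.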


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04  1:05                           ` Ming Lei
@ 2019-12-04 17:23                             ` Stephen Rust
  2019-12-04 23:02                               ` Ming Lei
  0 siblings, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-12-04 17:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

Hi Ming,

I have tried your latest "workaround" patch in brd including the fix
for large offsets, and it does appear to work. I tried the same tests
and the data was written correctly for all offsets I tried. Thanks!

I include the updated additional bpftrace below.

> So firstly, I'd suggest to investigate from RDMA driver side to see why
> un-aligned buffer is passed to block layer.
>
> According to previous discussion, 512 aligned buffer should be provided
> to block layer.
>
> So looks the driver needs to be fixed.

If it does appear to be an RDMA driver issue, do you know who we
should follow up with directly from the RDMA driver side of the world?

Presumably non-brd devices, i.e. real SCSI devices, work for these test
cases because they accept un-aligned buffers?

> The patch might not cover the big offset case, could you collect bpftrace
> via the following script when you reproduce the issue with >4096 offset?

Here is the updated bpftrace output for an offset of 8192:

8192 76
4020 76 1 131056
4096 0 1 131063
76 0 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 130944
8192 76
4020 76 1 131056
4096 0 1 131063
76 0 1 131071
4096 0
4096 0 0 130808
4096 0
4096 0
4096 0 0 131056
4096 0 0 131064
[snip]

Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 17:23                             ` Stephen Rust
@ 2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
                                                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Ming Lei @ 2019-12-04 23:02 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On Wed, Dec 04, 2019 at 12:23:39PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> I have tried your latest "workaround" patch in brd including the fix
> for large offsets, and it does appear to work. I tried the same tests
> and the data was written correctly for all offsets I tried. Thanks!
> 
> I include the updated additional bpftrace below.
> 
> > So firstly, I'd suggest to investigate from RDMA driver side to see why
> > un-aligned buffer is passed to block layer.
> >
> > According to previous discussion, 512 aligned buffer should be provided
> > to block layer.
> >
> > So looks the driver needs to be fixed.
> 
> If it does appear to be an RDMA driver issue, do you know who we
> should follow up with directly from the RDMA driver side of the world?
> 
> Presumably non-brd devices, ie: real scsi devices work for these test
> cases because they accept un-aligned buffers?

Right, not every driver supports such un-aligned buffer.

I am not familiar with RDMA, but from the trace we have done so far,
it is highly related with iser driver. 


Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
@ 2019-12-05  0:16                                 ` Bart Van Assche
  2019-12-05 14:44                                   ` Stephen Rust
  2019-12-05  2:28                                 ` Stephen Rust
  2019-12-05  9:17                                 ` Sagi Grimberg
  2 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2019-12-05  0:16 UTC (permalink / raw)
  To: Ming Lei, Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On 12/4/19 3:02 PM, Ming Lei wrote:
> On Wed, Dec 04, 2019 at 12:23:39PM -0500, Stephen Rust wrote:
>> Presumably non-brd devices, ie: real scsi devices work for these test
>> cases because they accept un-aligned buffers?
> 
> Right, not every driver supports such un-aligned buffer.
> 
> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Hi Stephen,

Do you need the iSER protocol? I think that the NVMeOF and SRP drivers 
also support RoCE and that these align data buffers on a 512-byte boundary.

Bart.


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
@ 2019-12-05  2:28                                 ` Stephen Rust
  2019-12-05  3:05                                   ` Ming Lei
  2019-12-05  9:17                                 ` Sagi Grimberg
  2 siblings, 1 reply; 24+ messages in thread
From: Stephen Rust @ 2019-12-05  2:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

Hi Ming,

Thanks for all your help and insight. I really appreciate it.

> > Presumably non-brd devices, ie: real scsi devices work for these test
> > cases because they accept un-aligned buffers?
>
> Right, not every driver supports such un-aligned buffer.

Can you please clarify: does the block layer require that it is called
with 512-byte aligned buffers? If that is the case, would it make
sense for the block interface (bio_add_page() or other) to reject
buffers that are not aligned?

It seems that passing these buffers on to underlying drivers that
don't support un-aligned buffers can result in silent data corruption.
Perhaps it would be better to fail the I/O up front. This would also
help future proof the block interface when changes/new target drivers
are added.

I'm also curious how these same unaligned buffers from iSER made it to
brd and were written successfully in the pre "multi-page bvec" world.
(Just trying to understand, if you have any thoughts, as this same
test case worked fine in 4.14+ until 5.1)

> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Do you think it is fair to say that the iSER/block integration is
causing corruption by using un-aligned buffers?

Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  2:28                                 ` Stephen Rust
@ 2019-12-05  3:05                                   ` Ming Lei
  0 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2019-12-05  3:05 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On Wed, Dec 04, 2019 at 09:28:43PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks for all your help and insight. I really appreciate it.
> 
> > > Presumably non-brd devices, ie: real scsi devices work for these test
> > > cases because they accept un-aligned buffers?
> >
> > Right, not every driver supports such un-aligned buffer.
> 
> Can you please clarify: does the block layer require that it is called
> with 512-byte aligned buffers? If that is the case, would it make
> sense for the block interface (bio_add_page() or other) to reject
> buffers that are not aligned?

Things are a bit complicated; see the following xfs commits:

f8f9ee479439 xfs: add kmem_alloc_io()
d916275aa4dd xfs: get allocation alignment from the buftarg

These apply the request queue's DMA alignment limit, which may be
smaller than 512. Before this report, xfs was the only known user
passing un-aligned buffers.

So we can't add the check in bio_add_page(): the request queue may not
be available there, bio_add_page() is a really hot path, and people
hate adding unnecessary code to it.

IMO, it is better for all filesystems and other users of
bio_add_page() to pass 512-byte aligned buffers.
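
Just to illustrate, the check that would be needed looks roughly like
the following (a sketch only; no such helper exists, and as said above
bio_add_page() does not have the request queue to consult):

#include <linux/blkdev.h>

/*
 * Sketch: reject a buffer that does not satisfy the queue's DMA
 * alignment mask (511 for a 512-byte requirement, but it can be
 * smaller, e.g. 3 for some controllers).
 */
static inline bool io_buf_is_aligned(struct request_queue *q, void *buf)
{
	return ((unsigned long)buf & queue_dma_alignment(q)) == 0;
}

A submitter that does have the queue (for example via bdev_get_queue())
could use something like this before building the bio; the xfs commits
above effectively enforce the same thing at allocation time instead.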

> 
> It seems that passing these buffers on to underlying drivers that
> don't support un-aligned buffers can result in silent data corruption.
> Perhaps it would be better to fail the I/O up front. This would also
> help future proof the block interface when changes/new target drivers
> are added.

This is a brd device; strictly speaking it doesn't matter much whether
we fail the I/O or not, given that either way leads to data loss.

> 
> I'm also curious how these same unaligned buffers from iSER made it to
> brd and were written successfully in the pre "multi-page bvec" world.
> (Just trying to understand, if you have any thoughts, as this same
> test case worked fine in 4.14+ until 5.1)

I am pretty sure that brd has never supported un-aligned buffers, and
I have no idea how the 'multi-page bvec' helpers could cause this
issue. However, I am happy to investigate further if you can run the
previous trace on a pre-'multi-page bvec' kernel.

> 
> > I am not familiar with RDMA, but from the trace we have done so far,
> > it is highly related with iser driver.
> 
> Do you think it is fair to say that the iSER/block integration is
> causing corruption by using un-aligned buffers?

As you saw, XFS changed its un-aligned buffers into aligned ones to
avoid this issue, so I think it is fair to say that.
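
For reference, what the xfs change boils down to is roughly the
following (a sketch in the spirit of kmem_alloc_io(), not the actual
xfs code; the caller would free the buffer with kvfree()):

#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Try kmalloc() first; if the returned pointer does not satisfy the
 * required alignment mask (taken from the queue's dma alignment),
 * fall back to vmalloc(), which is page aligned and therefore
 * satisfies any sub-page mask.
 */
static void *alloc_io_buffer(size_t size, unsigned long align_mask)
{
	void *ptr = kmalloc(size, GFP_KERNEL);

	if (ptr) {
		if (!((unsigned long)ptr & align_mask))
			return ptr;
		kfree(ptr);
	}
	return vmalloc(size);
}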

Thanks, 
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
  2019-12-05  2:28                                 ` Stephen Rust
@ 2019-12-05  9:17                                 ` Sagi Grimberg
  2019-12-05 14:36                                   ` Stephen Rust
  2 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2019-12-05  9:17 UTC (permalink / raw)
  To: Ming Lei, Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Max Gurtovoy


>> Hi Ming,
>>
>> I have tried your latest "workaround" patch in brd including the fix
>> for large offsets, and it does appear to work. I tried the same tests
>> and the data was written correctly for all offsets I tried. Thanks!
>>
>> I include the updated additional bpftrace below.
>>
>>> So firstly, I'd suggest to investigate from RDMA driver side to see why
>>> un-aligned buffer is passed to block layer.
>>>
>>> According to previous discussion, 512 aligned buffer should be provided
>>> to block layer.
>>>
>>> So looks the driver needs to be fixed.
>>
>> If it does appear to be an RDMA driver issue, do you know who we
>> should follow up with directly from the RDMA driver side of the world?
>>
>> Presumably non-brd devices, ie: real scsi devices work for these test
>> cases because they accept un-aligned buffers?
> 
> Right, not every driver supports such un-aligned buffer.
> 
> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Hi guys,

Just got this one (Thanks for CCing me Ming, been extremely busy
lately).

So it looks from the report that these are the immediate-data and
unsolicited data-out flows, which indeed seem to violate the alignment
assumption. The reason is that isert posts a receive of a contiguous
rx_desc which holds both the headers and the data, and when it gets
immediate data it sets the data sg to rx_desc + offset, where the
offset is the length of the headers.
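
Roughly, the layout in question is (a sketch, not the actual struct;
the real definition is iser_rx_desc in
drivers/infiniband/ulp/isert/ib_isert.h):

/*
 * With ImmediateData the payload is received right behind the headers,
 * so the data SG ends up pointing 76 bytes into the buffer instead of
 * at a 512-byte aligned address, which matches the "76" offsets in the
 * bpftrace output earlier in the thread and would also explain why
 * offsets larger than 8192 were never affected.
 */
struct rx_desc_sketch {
	char	iser_header[28];	/* iSER header                  */
	char	iscsi_header[48];	/* iSCSI BHS                    */
	char	data[8192];		/* immediate data lands here,
					 * i.e. at offset 76            */
};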

Stephen,
As a work-around for now, you should turn off immediate-data in your LIO
target. I'll work on a fix.

Thanks for reporting!


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  9:17                                 ` Sagi Grimberg
@ 2019-12-05 14:36                                   ` Stephen Rust
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Rust @ 2019-12-05 14:36 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

Hi Sagi,

On Thu, Dec 5, 2019 at 4:17 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Just got this one (Thanks for CCing me Ming, been extremely busy
> lately).

No problem, thanks for looking into it!

> So it looks from the report that these are the immediate-data and
> unsolicited data-out flows, which indeed seem to violate the alignment
> assumption. The reason is that isert posts a receive of a contiguous
> rx_desc which holds both the headers and the data, and when it gets
> immediate data it sets the data sg to rx_desc + offset, where the
> offset is the length of the headers.
>
> Stephen,
> As a work-around for now, you should turn off immediate-data in your LIO
> target. I'll work on a fix.

I have confirmed that turning off ImmediateData in the target (and
reconnecting) is a successful workaround for this test case. All of
the I/O as reported by bio_add_page() is aligned.

Using the previously described bpftrace script with 512 offset:

# bpftrace lio.bt
Attaching 4 probes...
512 0
512 0 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 131064
4096 0
4096 0 1 131064
4096 0
4096 0 0 0
4096 0
4096 0 0 8
512 0
512 0 0 131071
4096 0
4096 0 0 130944
4096 0
4096 0 0 131056

> Thanks for reporting!

Please let me know if you need any additional information, or if I can
assist further. I would be happy to test any patches when you are
ready.

Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  0:16                                 ` Bart Van Assche
@ 2019-12-05 14:44                                   ` Stephen Rust
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Rust @ 2019-12-05 14:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Sagi Grimberg,
	Max Gurtovoy

On Wed, Dec 4, 2019 at 7:16 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Do you need the iSER protocol? I think that the NVMeOF and SRP drivers
> also support RoCE and that these align data buffers on a 512-byte boundary.

Hi Bart,

In this case we do. But thank you for the other references. Those
might be options for us for other use cases.

Thanks,
Steve


Thread overview: 24+ messages
     [not found] <CAAFE1bd9wuuobpe4VK7Ty175j7mWT+kRmHCNhVD+6R8MWEAqmw@mail.gmail.com>
2019-11-28  1:57 ` Data corruption in kernel 5.1+ with iSER attached ramdisk Ming Lei
     [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
2019-11-28  2:58     ` Ming Lei
     [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
2019-11-28  4:25         ` Stephen Rust
2019-11-28  5:51           ` Rob Townley
2019-11-28  9:12         ` Ming Lei
2019-12-02 18:42           ` Stephen Rust
2019-12-03  0:58             ` Ming Lei
2019-12-03  3:04               ` Stephen Rust
2019-12-03  3:14                 ` Ming Lei
2019-12-03  3:26                   ` Stephen Rust
2019-12-03  3:50                     ` Stephen Rust
2019-12-03 12:45                       ` Ming Lei
2019-12-03 19:56                         ` Stephen Rust
2019-12-04  1:05                           ` Ming Lei
2019-12-04 17:23                             ` Stephen Rust
2019-12-04 23:02                               ` Ming Lei
2019-12-05  0:16                                 ` Bart Van Assche
2019-12-05 14:44                                   ` Stephen Rust
2019-12-05  2:28                                 ` Stephen Rust
2019-12-05  3:05                                   ` Ming Lei
2019-12-05  9:17                                 ` Sagi Grimberg
2019-12-05 14:36                                   ` Stephen Rust
2019-12-04  2:39                           ` Ming Lei
2019-12-03  4:15                     ` Ming Lei
