* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found] <CAAFE1bd9wuuobpe4VK7Ty175j7mWT+kRmHCNhVD+6R8MWEAqmw@mail.gmail.com>
@ 2019-11-28  1:57 ` Ming Lei
       [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-11-28  1:57 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-rdma,
	linux-scsi, target-devel, martin.petersen

Hello,

On Wed, Nov 27, 2019 at 02:38:42PM -0500, Stephen Rust wrote:
> Hi,
> 
> We recently began testing 5.4 in preparation for migration from 4.14. One
> of our tests found reproducible data corruption in 5.x kernels. The test
> consists of a few basic single-issue writes to an iSER attached ramdisk.
> The writes are subsequently verified with single-issue reads. We tracked
> the corruption down using git bisect. The issue appears to have started in
> 5.1 with the following commit:
> 
> 3d75ca0adef4280650c6690a0c4702a74a6f3c95 block: introduce multi-page bvec
> helpers
> 
> We wanted to bring this to your attention. A reproducer and the git bisect
> data follows below.
> 
> Our setup consists of two systems: A ramdisk exported in a LIO target from
> host A, iSCSI attached with iSER / RDMA from host B. Specific writes to the

Could you explain a bit what "iSCSI attached with iSER / RDMA" means? Is the
actual transport TCP over RDMA? Which target driver is involved?

> very end of the attached disk on B result in incorrect data being written
> to the remote disk. The writes appear to complete successfully on the
> client. We’ve also verified that the correct data is being sent over the
> network by tracing the RDMA flow. For reference, the tests were conducted
> on x86_64 Intel Skylake systems with Mellanox ConnectX5 NICs.

If I understand correctly, the LIO ramdisk doesn't generate any I/O to the
block stack (see rd_execute_rw()), and the ramdisk should be one big,
pre-allocated sgl (see rd_build_device_space()).

This seems very strange, given that no bvec/bio is involved in the code
path from iscsi_target_rx_thread to rd_execute_rw. So far I have no idea
how commit 3d75ca0adef428065 causes this issue, because that patch
only changes bvec/bio-related code.
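
As a minimal, userspace-only sketch of the idea described above (illustration
only, not taken from target_core_rd.c, and the names are made up): the
built-in ramdisk is essentially a pre-allocated buffer plus memcpy(), so no
bio or bvec is ever constructed on this path.

#include <stdint.h>
#include <string.h>

struct ramdisk {
	uint8_t  *mem;         /* pre-allocated backing store */
	uint64_t  nr_sectors;  /* capacity in 512-byte sectors */
};

static int rd_rw(struct ramdisk *rd, uint64_t lba, void *buf,
		 size_t len, int is_write)
{
	if (lba * 512 + len > rd->nr_sectors * 512)
		return -1;                      /* out of range */
	if (is_write)
		memcpy(rd->mem + lba * 512, buf, len);
	else
		memcpy(buf, rd->mem + lba * 512, len);
	return 0;
}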

> 
> The issue appears to lie on the target host side. The initiator kernel
> version does not appear to play a role. The target host exhibits the issue
> when running kernel version 5.1+.
> 
> To reproduce, given attached sda on client host B, write data at the end of
> the device:
> 
> 
> SIZE=$(blockdev --getsize64 /dev/sda)
> 
> SEEK=$((( $SIZE - 512 )))
> 
> # initialize device and seed data
> 
> dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> oflag=direct
> 
> dd if=/dev/urandom of=/tmp/random bs=512 count=1 oflag=direct
> 
> 
> # write the random data (note: not direct)
> 
> dd if=/tmp/random of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> 
> 
> # verify the data was written
> 
> dd if=/dev/sda of=/tmp/verify bs=512 count=1 skip=$SEEK iflag=skip_bytes
> iflag=direct
> 
> hexdump -xv /tmp/random > /tmp/random.hex
> 
> hexdump -xv /tmp/verify > /tmp/verify.hex
> 
> diff -u /tmp/random.hex /tmp/verify.hex

I just set up a LIO target exporting a 2G ramdisk via iscsi and ran the
above test via an iscsi HBA, but still can't reproduce the issue.

> # first bad commit: [3d75ca0adef4280650c6690a0c4702a74a6f3c95] block:
> introduce multi-page bvec helpers
> 
> 
> Please advise. We have cycles and systems to help track down the issue. Let
> me know how best to assist.

Could you install bcc and collect the following trace on the target side
before you run the above test on the host side?

/usr/share/bcc/tools/stackcount -K rd_execute_rw


Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
@ 2019-11-28  2:58     ` Ming Lei
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-11-28  2:58 UTC (permalink / raw)
  To: Rob Townley
  Cc: Christoph Hellwig, Jens Axboe, Stephen Rust, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Wed, Nov 27, 2019 at 08:18:30PM -0600, Rob Townley wrote:
> On Wed, Nov 27, 2019 at 7:58 PM Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Hello,
> >
> > On Wed, Nov 27, 2019 at 02:38:42PM -0500, Stephen Rust wrote:
> > > Hi,
> > >
> > > We recently began testing 5.4 in preparation for migration from 4.14. One
> > > of our tests found reproducible data corruption in 5.x kernels. The test
> > > consists of a few basic single-issue writes to an iSER attached ramdisk.
> > > The writes are subsequently verified with single-issue reads. We tracked
> > > the corruption down using git bisect. The issue appears to have started
> > in
> > > 5.1 with the following commit:
> > >
> > > 3d75ca0adef4280650c6690a0c4702a74a6f3c95 block: introduce multi-page bvec
> > > helpers
> > >
> > > We wanted to bring this to your attention. A reproducer and the git
> > bisect
> > > data follows below.
> > >
> > > Our setup consists of two systems: A ramdisk exported in a LIO target
> > from
> > > host A, iSCSI attached with iSER / RDMA from host B. Specific writes to
> > the
> >
> > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > actual transport TCP over RDMA? What is related target driver involved?
> >
> > > very end of the attached disk on B result in incorrect data being written
> > > to the remote disk. The writes appear to complete successfully on the
> > > client. We’ve also verified that the correct data is being sent over the
> > > network by tracing the RDMA flow. For reference, the tests were conducted
> > > on x86_64 Intel Skylake systems with Mellanox ConnectX5 NICs.
> >
> > If I understand correctly, LIO ramdisk doesn't generate any IO to block
> > stack, see rd_execute_rw(), and the ramdisk should be one big/long
> > pre-allocated sgl, see rd_build_device_space().
> >
> > Seems very strange, given no bvec/bio is involved in this code
> > path from iscsi_target_rx_thread to rd_execute_rw. So far I have no idea
> > how commit 3d75ca0adef428065 causes this issue, because that patch
> > only changes bvec/bio related code.
> >
> > >
> > > The issue appears to lie on the target host side. The initiator kernel
> > > version does not appear to play a role. The target host exhibits the
> > issue
> > > when running kernel version 5.1+.
> > >
> > > To reproduce, given attached sda on client host B, write data at the end
> > of
> > > the device:
> > >
> > >
> > > SIZE=$(blockdev --getsize64 /dev/sda)
> > >
> > > SEEK=$((( $SIZE - 512 )))
> > >
> > > # initialize device and seed data
> > >
> > > dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> > > oflag=direct
> > >
> > > dd if=/dev/urandom of=/tmp/random bs=512 count=1 oflag=direct
> > >
> > >
> > > # write the random data (note: not direct)
> > >
> > > dd if=/tmp/random of=/dev/sda bs=512 count=1 seek=$SEEK oflag=seek_bytes
> > >
> > >
> > > # verify the data was written
> > >
> > > dd if=/dev/sda of=/tmp/verify bs=512 count=1 skip=$SEEK iflag=skip_bytes
> > > iflag=direct
> > >
> > > hexdump -xv /tmp/random > /tmp/random.hex
> > >
> > > hexdump -xv /tmp/verify > /tmp/verify.hex
> > >
> > > diff -u /tmp/random.hex /tmp/verify.hex
> >
> > I just setup one LIO for exporting ramdisk(2G) via iscsi, and run the
> > above test via iscsi HBA, still can't reproduce the issue.
> >
> > > # first bad commit: [3d75ca0adef4280650c6690a0c4702a74a6f3c95] block:
> > > introduce multi-page bvec helpers
> > >
> > >
> > > Please advise. We have cycles and systems to help track down the issue.
> > Let
> > > me know how best to assist.
> >
> > Could you install bcc and start to collect the following trace on target
> > side
> > before you run the above test in host side?
> >
> > /usr/share/bcc/tools/stackcount -K rd_execute_rw
> >
> >
> > Thanks,
> > Ming
> >
> 
> 
> Interesting case to follow as there are many types of RamDisks.  The common
> tmpfs kind will use its RAM allocation and all free harddrive space.
> 
> The ramdisk in CentOS 7 backed by LIO will overflow its size in RAM and
> fill up all remaining free space on spinning platters.  So if the RamDisk
> is 4GB out of 192GB RAM in the lightly used machine. Free filesystem space
> is 16GB.  Writes to the 4GB RamDisk will only error out at 21GB when there
> is no space left on filesystem.
> 
> dd if=/dev/zero of=/dev/iscsiRamDisk
> Will keep writing way past 4GB and not stop till hardrive is full which is
> totally different than normal disks.
> 
> Wonder what exact kind of RamDisk is in that kernel?

In my test, it is the LIO built-in ramdisk:

/backstores/ramdisk> create rd0 2G
Created ramdisk rd0 with size 2G.
/backstores/ramdisk> ls
o- ramdisk ......................................................................... [Storage Objects: 1]
  o- rd0 ......................................................................... [(2.0GiB) deactivated]
    o- alua ............................................................................ [ALUA Groups: 1]
      o- default_tg_pt_gp ................................................ [ALUA state: Active/optimized]

Stephen, could you share how you set up the ramdisk in your test?

Thanks, 
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
@ 2019-11-28  4:25         ` Stephen Rust
  2019-11-28  5:51           ` Rob Townley
  2019-11-28  9:12         ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-11-28  4:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

[Apologies for dup, re-sending without text formatting to lists]

Hi,

Thanks for your reply.

I agree it does seem surprising that the git bisect pointed to this
particular commit when tracking down this issue.

> Stephen, could you share us how you setup the ramdisk in your test?

The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
/dev/ram*). We configure it as a "block" backstore in LIO, not using
the built-in LIO ramdisk.

LIO configuration is as follows:

  o- backstores .......................................................... [...]
  | o- block .............................................. [Storage Objects: 1]
  | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
[/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB)
write-thru activated]
  | |   o- alua ............................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ................... [ALUA state: Active/optimized]
  | o- fileio ............................................. [Storage Objects: 0]
  | o- pscsi .............................................. [Storage Objects: 0]
  | o- ramdisk ............................................ [Storage Objects: 0]
  o- iscsi ........................................................ [Targets: 1]
  | o- iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
 [TPGs: 1]
  |   o- tpg1 ...................................... [no-gen-acls, auth per-acl]
  |     o- acls ...................................................... [ACLs: 1]
  |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped LUNs: 1]
  |     |   o- mapped_lun0  [lun0
block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
  |     o- luns ...................................................... [LUNs: 1]
  |     | o- lun0
[block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
(/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2)
(default_tg_pt_gp)]
  |     o- portals ................................................ [Portals: 1]
  |       o- 0.0.0.0:3260 ............................................... [iser]

> > > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > > actual transport TCP over RDMA? What is related target driver involved?

iSER is the iSCSI extension for RDMA, and it is important to note that
we have _only_ reproduced this when the writes occur over RDMA, with
the target portal in LIO having enabled "iser". The iscsi client
(using iscsiadm) connects to the target directly over iSER. We use the
Mellanox ConnectX-5 Ethernet NICs (mlx5* module) for this purpose,
which utilizes RoCE (RDMA over Converged Ethernet) instead of TCP.

The identical ramdisk configuration using TCP/IP target in LIO has
_not_ reproduced this issue for us.

> > > /usr/share/bcc/tools/stackcount -K rd_execute_rw

I installed bcc and used the stackcount tool to trace rd_execute_rw,
but I suspect because we are not using the built-in LIO ramdisk this
did not catch anything. Are there other function traces we can provide
for you?

Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-11-28  4:25         ` Stephen Rust
@ 2019-11-28  5:51           ` Rob Townley
  0 siblings, 0 replies; 30+ messages in thread
From: Rob Townley @ 2019-11-28  5:51 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Christoph Hellwig, Jens Axboe, linux-block, linux-rdma,
	linux-scsi, martin.petersen, target-devel

Interesting case to follow, as there are many types of RamDisks.  The
common tmpfs kind will use its RAM allocation and all free hard drive
space.

The ramdisk in CentOS 7 backed by LIO will overflow its size in RAM
and fill up all remaining free space on spinning platters.  So if the
RamDisk is 4GB on a lightly used machine with 192GB of RAM and 16GB of
free filesystem space, writes to the 4GB RamDisk will only error out
at 21GB, when there is no space left on the filesystem.

dd if=/dev/zero of=/dev/iscsiRamDisk
will keep writing way past 4GB and not stop until the hard drive is
full, which is totally different from normal disks.

Wonder what exact kind of RamDisk is in that kernel?

On Wed, Nov 27, 2019 at 10:26 PM Stephen Rust <srust@blockbridge.com> wrote:
>
> [Apologies for dup, re-sending without text formatting to lists]
>
> Hi,
>
> Thanks for your reply.
>
> I agree it does seem surprising that the git bisect pointed to this
> particular commit when tracking down this issue.
>
> > Stephen, could you share us how you setup the ramdisk in your test?
>
> The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
> /dev/ram*). We configure it as a "block" backstore in LIO, not using
> the built-in LIO ramdisk.
>
> LIO configuration is as follows:
>
>   o- backstores .......................................................... [...]
>   | o- block .............................................. [Storage Objects: 1]
>   | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> [/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB)
> write-thru activated]
>   | |   o- alua ............................................... [ALUA Groups: 1]
>   | |     o- default_tg_pt_gp ................... [ALUA state: Active/optimized]
>   | o- fileio ............................................. [Storage Objects: 0]
>   | o- pscsi .............................................. [Storage Objects: 0]
>   | o- ramdisk ............................................ [Storage Objects: 0]
>   o- iscsi ........................................................ [Targets: 1]
>   | o- iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
>  [TPGs: 1]
>   |   o- tpg1 ...................................... [no-gen-acls, auth per-acl]
>   |     o- acls ...................................................... [ACLs: 1]
>   |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped LUNs: 1]
>   |     |   o- mapped_lun0  [lun0
> block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
>   |     o- luns ...................................................... [LUNs: 1]
>   |     | o- lun0
> [block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> (/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2)
> (default_tg_pt_gp)]
>   |     o- portals ................................................ [Portals: 1]
>   |       o- 0.0.0.0:3260 ............................................... [iser]
>
> > > > Could you explain a bit what is iSCSI attached with iSER / RDMA? Is the
> > > > actual transport TCP over RDMA? What is related target driver involved?
>
> iSER is the iSCSI extension for RDMA, and it is important to note that
> we have _only_ reproduced this when the writes occur over RDMA, with
> the target portal in LIO having enabled "iser". The iscsi client
> (using iscsiadm) connects to the target directly over iSER. We use the
> Mellanox ConnectX-5 Ethernet NICs (mlx5* module) for this purpose,
> which utilizes RoCE (RDMA over Converged Ethernet) instead of TCP.
>
> The identical ramdisk configuration using TCP/IP target in LIO has
> _not_ reproduced this issue for us.
>
> > > > /usr/share/bcc/tools/stackcount -K rd_execute_rw
>
> I installed bcc and used the stackcount tool to trace rd_execute_rw,
> but I suspect because we are not using the built-in LIO ramdisk this
> did not catch anything. Are there other function traces we can provide
> for you?
>
> Thanks,
> Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
  2019-11-28  4:25         ` Stephen Rust
@ 2019-11-28  9:12         ` Ming Lei
  2019-12-02 18:42           ` Stephen Rust
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-11-28  9:12 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Wed, Nov 27, 2019 at 11:14:46PM -0500, Stephen Rust wrote:
> Hi,
> 
> Thanks for your reply.
> 
> I agree it does seem surprising that the git bisect pointed to this
> particular commit when tracking down this issue.
> 
> The ramdisk we export in LIO is a standard "brd" module ramdisk (ie:
> /dev/ram*). We configure it as a "block" backstore in LIO, not using the
> built-in LIO ramdisk.

Then it isn't strange any more, since the iblock code uses the bio interface.

> 
> LIO configuration is as follows:
> 
>   o- backstores ..........................................................
> [...]
>   | o- block .............................................. [Storage
> Objects: 1]
>   | | o- Blockbridge-952f0334-2535-5fae-9581-6c6524165067
>  [/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2 (16.0MiB) write-thru
> activated]
>   | |   o- alua ............................................... [ALUA
> Groups: 1]
>   | |     o- default_tg_pt_gp ................... [ALUA state:
> Active/optimized]
>   | o- fileio ............................................. [Storage
> Objects: 0]
>   | o- pscsi .............................................. [Storage
> Objects: 0]
>   | o- ramdisk ............................................ [Storage
> Objects: 0]
>   o- iscsi ........................................................
> [Targets: 1]
>   | o-
> iqn.2009-12.com.blockbridge:rda:1:952f0334-2535-5fae-9581-6c6524165067:rda
>  [TPGs: 1]
>   |   o- tpg1 ...................................... [no-gen-acls, auth
> per-acl]
>   |     o- acls ......................................................
> [ACLs: 1]
>   |     | o- iqn.1994-05.com.redhat:115ecc56a5c .. [mutual auth, Mapped
> LUNs: 1]
>   |     |   o- mapped_lun0  [lun0
> block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067 (rw)]
>   |     o- luns ......................................................
> [LUNs: 1]
>   |     | o- lun0  [block/Blockbridge-952f0334-2535-5fae-9581-6c6524165067
> (/dev/ram-bb.952f0334-2535-5fae-9581-6c6524165067.cm2) (default_tg_pt_gp)]
>   |     o- portals ................................................
> [Portals: 1]
>   |       o- 0.0.0.0:3260 ...............................................
> [iser]
> 
> 
> iSER is the iSCSI extension for RDMA, and it is important to note that we
> have _only_ reproduced this when the writes occur over RDMA, with the
> target portal in LIO having enabled "iser". The iscsi client (using
> iscsiadm) connects to the target directly over iSER. We use the Mellanox
> ConnectX-5 Ethernet NICs (mlx5* module) for this purpose, which utilizes
> RoCE (RDMA over Converged Ethernet) instead of TCP.

I may be able to get a machine with a Mellanox NIC. Is it easy to set up and
reproduce on a single machine (both host and target on the same machine)?

> 
> The identical ramdisk configuration using TCP/IP target in LIO has _not_
> reproduced this issue for us.

Yeah, I just tried iblock over brd, and can't reproduce it.

> 
> I installed bcc and used the stackcount tool to trace rd_execute_rw, but I
> suspect because we are not using the built-in LIO ramdisk this did not
> catch anything. Are there other function traces we can provide for you?

Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.

[root@ktest-01 func]# cat ilo.bt
kprobe:iblock_execute_rw
{
    @start[tid]=1;
}

kretprobe:iblock_execute_rw
{
    @start[tid]=0;
}

kprobe:bio_add_page
/@start[tid]/
{
  printf("%d %d\n", arg2, arg3);
}
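
For reference when reading the resulting trace output, bio_add_page() is
declared as

int bio_add_page(struct bio *bio, struct page *page,
		 unsigned int len, unsigned int offset);

so arg2 and arg3 in the kprobe above are the segment length and the offset
within the page, i.e. each printed line reads "len offset".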



Thanks, 
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-11-28  9:12         ` Ming Lei
@ 2019-12-02 18:42           ` Stephen Rust
  2019-12-03  0:58             ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-12-02 18:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi Ming,

> I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> just in the local machine(both host and target are setup on same machine)?

Yes, I have reproduced locally on one machine (using the IP address of
the Mellanox NIC as the target IP), with iser enabled on the target,
and iscsiadm connected via iser.

e.g.:
target:
/iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
iSER enable now: True

  | |   o- portals
....................................................................................................
[Portals: 1]
  | |     o- 0.0.0.0:3260
...................................................................................................
[iser]

client:
# iscsiadm -m node -o update --targetname <target> -n
iface.transport_name -v iser
# iscsiadm -m node --targetname <target> --login
# iscsiadm -m session
iser: [3] 172.16.XX.XX:3260,1
iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)

> Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.

Here is the output of this trace from a failed run:

# bpftrace lio.bt
modprobe: FATAL: Module kheaders not found.
Attaching 3 probes...
512 76
4096 0
4096 0
4096 0
4096 76
512 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
4096 0
^C

@start[14475]: 0
@start[14384]: 0
@start[6764]: 0
@start[14477]: 0
@start[7771]: 0
@start[13788]: 0
@start[6879]: 0
@start[11842]: 0
@start[7765]: 0
@start[7782]: 0
@start[14476]: 0
@start[14385]: 0
@start[14474]: 0
@start[11564]: 0
@start[7753]: 0
@start[7786]: 0
@start[7791]: 0
@start[6878]: 0
@start[7411]: 0
@start[14473]: 0
@start[11563]: 0
@start[7681]: 0
@start[7756]: 0


Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-02 18:42           ` Stephen Rust
@ 2019-12-03  0:58             ` Ming Lei
  2019-12-03  3:04               ` Stephen Rust
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-12-03  0:58 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 01:42:15PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> > I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> > just in the local machine(both host and target are setup on same machine)?
> 
> Yes, I have reproduced locally on one machine (using the IP address of
> the Mellanox NIC as the target IP), with iser enabled on the target,
> and iscsiadm connected via iser.
> 
> e.g.:
> target:
> /iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
> iSER enable now: True
> 
>   | |   o- portals
> ....................................................................................................
> [Portals: 1]
>   | |     o- 0.0.0.0:3260
> ...................................................................................................
> [iser]
> 
> client:
> # iscsiadm -m node -o update --targetname <target> -n
> iface.transport_name -v iser
> # iscsiadm -m node --targetname <target> --login
> # iscsiadm -m session
> iser: [3] 172.16.XX.XX:3260,1
> iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)
> 
> > Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.
> 
> Here is the output of this trace from a failed run:
> 
> # bpftrace lio.bt
> modprobe: FATAL: Module kheaders not found.
> Attaching 3 probes...
> 512 76
> 4096 0
> 4096 0
> 4096 0
> 4096 76

The above buffer might be the reason: 4096 is the length and 76 is the
offset, which means the added buffer crosses two pages and, moreover, the
buffer isn't 512-byte aligned.

We need to figure out why the magic offset of 76 is passed in from the
target or the driver.
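
A quick sketch of the arithmetic behind that observation, using the numbers
from the trace above (PAGE_SIZE = 4096 and a 512-byte sector size are assumed
here, the usual x86-64 values):

/* from the "4096 76" trace line: */
unsigned int len = 4096, off = 76;

/* off + len = 4172 > PAGE_SIZE: the segment spans two pages.         */
/* off % 512 = 76: the data does not start on a sector boundary, so   */
/* once the segment is split at the page boundary the per-page pieces */
/* (4096 - 76 = 4020 bytes and 76 bytes) are not multiples of 512.    */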

Please install bcc and collect the following log:

/usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 512) != 0) "%d %d", arg3, arg4 '


Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  0:58             ` Ming Lei
@ 2019-12-03  3:04               ` Stephen Rust
  2019-12-03  3:14                 ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-12-03  3:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi Ming,

The log you requested with the (arg4 & 512 != 0) predicate did not
match anything. However, I checked specifically for the offset of "76"
and came up with the following stack traces:

# /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 == 76)) "%d %d",
arg3, arg4 '
PID     TID     COMM            FUNC             -
7782    7782    kworker/19:1H   bio_add_page     512 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

14475   14475   kworker/13:1H   bio_add_page     4096 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

Thanks,
Steve

On Mon, Dec 2, 2019 at 7:59 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Dec 02, 2019 at 01:42:15PM -0500, Stephen Rust wrote:
> > Hi Ming,
> >
> > > I may get one machine with Mellanox NIC, is it easy to setup & reproduce
> > > just in the local machine(both host and target are setup on same machine)?
> >
> > Yes, I have reproduced locally on one machine (using the IP address of
> > the Mellanox NIC as the target IP), with iser enabled on the target,
> > and iscsiadm connected via iser.
> >
> > e.g.:
> > target:
> > /iscsi/iqn.20.../0.0.0.0:3260> enable_iser true
> > iSER enable now: True
> >
> >   | |   o- portals
> > ....................................................................................................
> > [Portals: 1]
> >   | |     o- 0.0.0.0:3260
> > ...................................................................................................
> > [iser]
> >
> > client:
> > # iscsiadm -m node -o update --targetname <target> -n
> > iface.transport_name -v iser
> > # iscsiadm -m node --targetname <target> --login
> > # iscsiadm -m session
> > iser: [3] 172.16.XX.XX:3260,1
> > iqn.2003-01.org.linux-iscsi.x8664:sn.c46c084919b0 (non-flash)
> >
> > > Please try to trace bio_add_page() a bit via 'bpftrace ./ilo.bt'.
> >
> > Here is the output of this trace from a failed run:
> >
> > # bpftrace lio.bt
> > modprobe: FATAL: Module kheaders not found.
> > Attaching 3 probes...
> > 512 76
> > 4096 0
> > 4096 0
> > 4096 0
> > 4096 76
>
> The above buffer might be the reason, 4096 is length, and 76 is the
> offset, that means the added buffer crosses two pages, meantime the
> buffer isn't aligned.
>
> We need to figure out why the magic 76 offset is passed from target or
> driver.
>
> Please install bcc and collect the following log:
>
> /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 512) != 0) "%d %d", arg3, arg4 '
>
>
> Thanks,
> Ming
>


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:04               ` Stephen Rust
@ 2019-12-03  3:14                 ` Ming Lei
  2019-12-03  3:26                   ` Stephen Rust
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-12-03  3:14 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 10:04:20PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> The log you requested with the (arg4 & 512 != 0) predicate did not

oops, it should have been (arg4 & 511) != 0.

Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:14                 ` Ming Lei
@ 2019-12-03  3:26                   ` Stephen Rust
  2019-12-03  3:50                     ` Stephen Rust
  2019-12-03  4:15                     ` Ming Lei
  0 siblings, 2 replies; 30+ messages in thread
From: Stephen Rust @ 2019-12-03  3:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

> oops, it should have been (arg4 & 511) != 0.

Yep, there they are:

# /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
%d", arg3, arg4'
PID     TID     COMM            FUNC             -
7411    7411    kworker/31:1H   bio_add_page     512 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]

7753    7753    kworker/26:1H   bio_add_page     4096 76
        bio_add_page+0x1 [kernel]
        sbc_execute_rw+0x28 [kernel]
        __target_execute_cmd+0x2e [kernel]
        target_execute_cmd+0x1c1 [kernel]
        iscsit_execute_cmd+0x1e7 [kernel]
        iscsit_sequence_cmd+0xdc [kernel]
        isert_recv_done+0x780 [kernel]
        __ib_process_cq+0x78 [kernel]
        ib_cq_poll_work+0x29 [kernel]
        process_one_work+0x179 [kernel]
        worker_thread+0x4f [kernel]
        kthread+0x105 [kernel]
        ret_from_fork+0x1f [kernel]


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:26                   ` Stephen Rust
@ 2019-12-03  3:50                     ` Stephen Rust
  2019-12-03 12:45                       ` Ming Lei
  2019-12-03  4:15                     ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-12-03  3:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

Hi,

Another datapoint.

I enabled "isert_debug" tracing and re-ran the test. Here is a small
snippet of the debug data. FWIW, the "length of 76" in the "lkey
mismatch" is a pattern that repeats quite often during the exchange.


Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6457000, iSCSI opcode: 0x01, ITT: 0x00000023, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000009bbe54ca lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce45010 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000238f9047
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000238f9047
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a645a000, iSCSI opcode: 0x01, ITT: 0x00000024, flags: 0xa1 dlen:
512
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: isert_handle_scsi_cmd: Transfer
Immediate imm_data_len: 512
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000004b902cb9 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce55b70 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 0000000069929548
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 0000000069929548
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a645d000, iSCSI opcode: 0x01, ITT: 0x00000025, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 000000006d694fe9 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce56140 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000a666ae3c
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000a666ae3c
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6460000, iSCSI opcode: 0x01, ITT: 0x00000026, flags: 0x81 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 00000000dd22ea75 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce5e6f0 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_response: Posting SCSI Response
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 000000009b63dcb0
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 000000009b63dcb0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: DMA:
0x10a6463000, iSCSI opcode: 0x01, ITT: 0x00000027, flags: 0xc1 dlen: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER_RSV:
read_stag: 0x4000009a read_va: 0xac29e6800
Dec 03 03:41:12 host-2 kernel: isert: isert_recv_done: ISER ISCSI_CTRL PDU
Dec 03 03:41:12 host-2 kernel: isert: isert_put_datain: Cmd:
00000000fe3d39bf RDMA_WRITE data_length: 32
Dec 03 03:41:12 host-2 kernel: isert: __isert_create_send_desc:
tx_desc 00000000f5f10cf7 lkey mismatch, fixing
Dec 03 03:41:12 host-2 kernel: isert: isert_init_tx_hdrs: Setup
tx_sg[0].addr: 0x4ce56710 length: 76 lkey: 0x80480
Dec 03 03:41:12 host-2 kernel: isert: isert_put_datain: Cmd:
00000000fe3d39bf posted RDMA_WRITE for iSER Data READ rc: 0
Dec 03 03:41:12 host-2 kernel: isert: isert_send_done: Cmd 00000000fe3d39bf
Dec 03 03:41:12 host-2 kernel: isert: isert_unmap_tx_desc: unmap
single for tx_desc->dma_addr
Dec 03 03:41:12 host-2 kernel: isert: isert_put_cmd: Cmd 00000000fe3d39bf

[snip]

I could post the whole isert debug log somewhere if you'd like?

Thanks,
Steve

On Mon, Dec 2, 2019 at 10:26 PM Stephen Rust <srust@blockbridge.com> wrote:
>
> > oops, it should have been (arg4 & 511) != 0.
>
> Yep, there they are:
>
> # /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
> %d", arg3, arg4'
> PID     TID     COMM            FUNC             -
> 7411    7411    kworker/31:1H   bio_add_page     512 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]
>
> 7753    7753    kworker/26:1H   bio_add_page     4096 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:26                   ` Stephen Rust
  2019-12-03  3:50                     ` Stephen Rust
@ 2019-12-03  4:15                     ` Ming Lei
  1 sibling, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-12-03  4:15 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel

On Mon, Dec 02, 2019 at 10:26:28PM -0500, Stephen Rust wrote:
> > oops, it should have been (arg4 & 511) != 0.
> 
> Yep, there they are:
> 
> # /usr/share/bcc/tools/trace -K 'bio_add_page ((arg4 & 511) != 0) "%d
> %d", arg3, arg4'
> PID     TID     COMM            FUNC             -
> 7411    7411    kworker/31:1H   bio_add_page     512 76
>         bio_add_page+0x1 [kernel]
>         sbc_execute_rw+0x28 [kernel]
>         __target_execute_cmd+0x2e [kernel]
>         target_execute_cmd+0x1c1 [kernel]
>         iscsit_execute_cmd+0x1e7 [kernel]
>         iscsit_sequence_cmd+0xdc [kernel]
>         isert_recv_done+0x780 [kernel]
>         __ib_process_cq+0x78 [kernel]
>         ib_cq_poll_work+0x29 [kernel]
>         process_one_work+0x179 [kernel]
>         worker_thread+0x4f [kernel]
>         kthread+0x105 [kernel]
>         ret_from_fork+0x1f [kernel]
> 
> 7753    7753    kworker/26:1H   bio_add_page     4096 76

The issue should be in brd_make_request(), which assumes that
bvec.bv_len is 512-byte aligned.

I will figure out one patch for you tomorrow.
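
To make that assumption concrete, here is a simplified sketch of the loop in
brd_make_request() at the time (condensed from the code visible in the patch
context later in this thread; error handling omitted):

bio_for_each_segment(bvec, bio, iter) {
	unsigned int len = bvec.bv_len;

	brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
		    bio_op(bio), sector);
	/* drops len % 512 bytes of positioning; only correct if
	 * bv_len is a multiple of 512 */
	sector += len >> SECTOR_SHIFT;
}

With multi-page bvecs, a segment such as (len = 4020, offset = 76) advances
the sector by only 7, so the remaining 436 bytes of positioning are lost and
the following segments land in the wrong place.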

Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03  3:50                     ` Stephen Rust
@ 2019-12-03 12:45                       ` Ming Lei
  2019-12-03 19:56                         ` Stephen Rust
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-12-03 12:45 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Mon, Dec 02, 2019 at 10:50:32PM -0500, Stephen Rust wrote:
> Hi,
> 
> Another datapoint.
> 
> I enabled "isert_debug" tracing and re-ran the test. Here is a small
> snippet of the debug data. FWIW, the "length of 76" in the "lkey
> mismatch" is a pattern that repeats quite often during the exchange.

That is because ISER_HEADERS_LEN is 76.
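
Assuming the usual in-tree definitions of the time, that constant works out
as follows (stated as an assumption, not quoted from the thread):

/* ib_isert.h: */
#define ISER_HEADERS_LEN  (sizeof(struct iser_ctrl) + sizeof(struct iscsi_hdr))
/*                         28 bytes                 + 48 bytes  =  76         */

i.e. immediate data received by the target sits in the same buffer directly
after the iSER and iSCSI headers, which would explain a data segment that
starts at page offset 76.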

From our trace, 76 is bvec->bv_offset. Is it possible that the I/O buffer
immediately follows the iSER headers, assuming that iser applies zero-copy?

BTW, you may try the attached test patch. If the issue is fixed by this
patch, that means it is really caused by the un-aligned buffer, and the
iser driver needs to be fixed.

From 0368ee8a756384116fa1d0415f51389d438a6e40 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Tue, 3 Dec 2019 20:00:53 +0800
Subject: [PATCH] brd: handle un-aligned bvec->bv_len

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/brd.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c548a5a6c1a0..9ea1894c820d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -190,13 +190,15 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n)
  * Copy n bytes from src to the brd starting at sector. Does not sleep.
  */
 static void copy_to_brd(struct brd_device *brd, const void *src,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned off_in_sec, size_t n)
 {
 	struct page *page;
 	void *dst;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	offset += off_in_sec;
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	BUG_ON(!page);
@@ -207,7 +209,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 
 	if (copy < n) {
 		src += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += (copy + off_in_sec) >> SECTOR_SHIFT;
 		copy = n - copy;
 		page = brd_lookup_page(brd, sector);
 		BUG_ON(!page);
@@ -222,13 +224,15 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
  * Copy n bytes to dst from the brd starting at sector. Does not sleep.
  */
 static void copy_from_brd(void *dst, struct brd_device *brd,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned off_in_sec, size_t n)
 {
 	struct page *page;
 	void *src;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
+	offset += off_in_sec;
+
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	if (page) {
@@ -240,7 +244,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 
 	if (copy < n) {
 		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += (copy + off_in_sec) >> SECTOR_SHIFT;
 		copy = n - copy;
 		page = brd_lookup_page(brd, sector);
 		if (page) {
@@ -257,7 +261,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
  */
 static int brd_do_bvec(struct brd_device *brd, struct page *page,
 			unsigned int len, unsigned int off, unsigned int op,
-			sector_t sector)
+			sector_t sector, unsigned int off_in_sec)
 {
 	void *mem;
 	int err = 0;
@@ -270,11 +274,11 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 
 	mem = kmap_atomic(page);
 	if (!op_is_write(op)) {
-		copy_from_brd(mem + off, brd, sector, len);
+		copy_from_brd(mem + off, brd, sector, off_in_sec, len);
 		flush_dcache_page(page);
 	} else {
 		flush_dcache_page(page);
-		copy_to_brd(brd, mem + off, sector, len);
+		copy_to_brd(brd, mem + off, sector, off_in_sec, len);
 	}
 	kunmap_atomic(mem);
 
@@ -287,6 +291,7 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	struct brd_device *brd = bio->bi_disk->private_data;
 	struct bio_vec bvec;
 	sector_t sector;
+	unsigned offset_in_sec = 0;
 	struct bvec_iter iter;
 
 	sector = bio->bi_iter.bi_sector;
@@ -296,12 +301,14 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
 		int err;
+		unsigned int secs = len >> SECTOR_SHIFT;
 
 		err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
-				  bio_op(bio), sector);
+				  bio_op(bio), sector, offset_in_sec);
 		if (err)
 			goto io_error;
-		sector += len >> SECTOR_SHIFT;
+		sector += secs;
+		offset_in_sec = len - (secs << SECTOR_SHIFT);
 	}
 
 	bio_endio(bio);
@@ -319,7 +326,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
-	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector);
+	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, op, sector, 0);
 	page_endio(page, op_is_write(op), err);
 	return err;
 }
-- 
2.20.1


Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 12:45                       ` Ming Lei
@ 2019-12-03 19:56                         ` Stephen Rust
  2019-12-04  1:05                           ` Ming Lei
  2019-12-04  2:39                           ` Ming Lei
  0 siblings, 2 replies; 30+ messages in thread
From: Stephen Rust @ 2019-12-03 19:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

Hi Ming,

Thanks very much for the patch.

> BTW, you may try the attached test patch. If the issue can be fixed by
> this patch, that means it is really caused by un-aligned buffer, and
> the iser driver needs to be fixed.

I have tried the patch, and re-run the test. Results are mixed.

To recap, our test writes the last bytes of an iser-attached iscsi
device. The target device is a LIO iblock, backed by a brd ramdisk.
The client does a simple `dd`, seeking to "size - offset" of the device
and writing a buffer of length equal to the offset.

For example, to test a write at a 512-byte offset, seek to device
"size - 512" and write 512 bytes of data.

WITHOUT the patch, writing data at the following offsets from the end
of the device failed to write all the correct data (rather, the write
succeeded, but reading the data back it was invalid):

- failed: 512, 1024, 2048, 4096, 8192

Anything larger worked fine.

WITH the patch applied, writing data up to an offset of 4096 all now
worked and verified correctly. However, offsets between 4096 and 8192
all still failed. I started at 512, and incremented by 512 all the way
up to 16384. The following offsets all failed to verify the write:

- failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192

Anything larger continues to work fine with the patch.

As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:

8192 76
4096 0
4096 0
8192 76
4096 0
4096 0
...
[snip]

What do you think the appropriate next steps are? Do you have an idea why
the specific "multi-page bvec helpers" commit could have exposed this
particular latent issue? Please let me know what else I can try, or what
additional data I can provide for you.

Thanks,
Steve


* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 19:56                         ` Stephen Rust
@ 2019-12-04  1:05                           ` Ming Lei
  2019-12-04 17:23                             ` Stephen Rust
  2019-12-04  2:39                           ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2019-12-04  1:05 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Tue, Dec 03, 2019 at 02:56:08PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks very much for the patch.
> 
> > BTW, you may try the attached test patch. If the issue can be fixed by
> > this patch, that means it is really caused by un-aligned buffer, and
> > the iser driver needs to be fixed.
> 
> I have tried the patch, and re-run the test. Results are mixed.
> 
> To recap, our test writes the last bytes of an iser attached iscsi
> device. The target device is a LIO iblock, backed by a brd ramdisk.
> The client does a simple `dd`, doing a seek to "size - offset" of the
> device, and writing a buffer of "length" which is equivalent to the
> offset.
> 
> For example, to test a write at a 512 offset, seek to device "size -
> 512", and write a length of data 512 bytes.
> 
> WITHOUT the patch, writing data at the following offsets from the end
> of the device failed to write all the correct data (rather, the write
> succeeded, but reading the data back it was invalid):
> 
> - failed: 512,1024, 2048, 4096, 8192
> 
> Anything larger worked fine.
> 
> WITH the patch applied, writing data up to an offset of 4096 all now
> worked and verified correctly. However, offsets between 4096 and 8192
> all still failed. I started at 512, and incremented by 512 all the way
> up to 16384. The following offsets all failed to verify the write:
> 
> - failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
> 
> Anything larger continues to work fine with the patch.
> 
> As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:
> 
> 8192 76
> 4096 0
> 4096 0
> 8192 76
> 4096 0
> 4096 0
> ...
> [snip]
> 
> What do you think are appropriate next steps?

OK, my guess appears to be correct: the issue is related to the un-aligned
bvec->bv_offset.

So first, I'd suggest investigating from the RDMA driver side to see why
an un-aligned buffer is passed to the block layer.

According to the previous discussion, a 512-byte-aligned buffer should be
provided to the block layer.

So it looks like the driver needs to be fixed.

> Do you think you have an
> idea on why the specific "multi-page bvec helpers" commit could have
> exposed this particular latent issue? Please let me know what else I
> can try, or additional data I can provide for you.
 
The patch might not cover the big-offset case. Could you collect a bpftrace
via the following script when you reproduce the issue with a >4096 offset?

kprobe:iblock_execute_rw
{
    @start[tid]=1;
}

kretprobe:iblock_execute_rw
{
    @start[tid]=0;
}

kprobe:bio_add_page
/@start[tid]/
{
  printf("%d %d\n", arg2, arg3);
}

kprobe:brd_do_bvec
{
  printf("%d %d %d %d\n", arg2, arg3, arg4, arg5);
}
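
For reference when reading the output of this script: with the test patch
applied, brd_do_bvec() takes

static int brd_do_bvec(struct brd_device *brd, struct page *page,
		       unsigned int len, unsigned int off, unsigned int op,
		       sector_t sector, unsigned int off_in_sec);

so the four columns printed by the brd_do_bvec probe are len, page offset,
op (1 = write, 0 = read) and sector, while the bio_add_page lines remain
"len offset".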


Thanks,
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-03 19:56                         ` Stephen Rust
  2019-12-04  1:05                           ` Ming Lei
@ 2019-12-04  2:39                           ` Ming Lei
  1 sibling, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-12-04  2:39 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

On Tue, Dec 03, 2019 at 02:56:08PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks very much for the patch.
> 
> > BTW, you may try the attached test patch. If the issue can be fixed by
> > this patch, that means it is really caused by un-aligned buffer, and
> > the iser driver needs to be fixed.
> 
> I have tried the patch, and re-run the test. Results are mixed.
> 
> To recap, our test writes the last bytes of an iser attached iscsi
> device. The target device is a LIO iblock, backed by a brd ramdisk.
> The client does a simple `dd`, doing a seek to "size - offset" of the
> device, and writing a buffer of "length" which is equivalent to the
> offset.
> 
> For example, to test a write at a 512 offset, seek to device "size -
> 512", and write a length of data 512 bytes.
> 
> WITHOUT the patch, writing data at the following offsets from the end
> of the device failed to write all the correct data (rather, the write
> succeeded, but reading the data back it was invalid):
> 
> - failed: 512,1024, 2048, 4096, 8192
> 
> Anything larger worked fine.
> 
> WITH the patch applied, writing data up to an offset of 4096 all now
> worked and verified correctly. However, offsets between 4096 and 8192
> all still failed. I started at 512, and incremented by 512 all the way
> up to 16384. The following offsets all failed to verify the write:
> 
> - failed: 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
> 
> Anything larger continues to work fine with the patch.
> 
> As an example, for the failed 8192 case, the `bpftrace lio.bt` trace shows:
> 
> 8192 76
> 4096 0
> 4096 0
> 8192 76
> 4096 0
> 4096 0

The following delta change against the last patch should fix the issue
with bvec lengths larger than 4096:

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 9ea1894c820d..49e37a7dda63 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -308,7 +308,7 @@ static blk_qc_t brd_make_request(struct request_queue *q, struct bio *bio)
                if (err)
                        goto io_error;
                sector += secs;
-               offset_in_sec = len - (secs << SECTOR_SHIFT);
+               offset_in_sec += len - (secs << SECTOR_SHIFT);
        }

        bio_endio(bio);

However, the change to brd is just a workaround for confirming the
issue.
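
A worked example with the 8192-byte write reported earlier (a single segment
of length 8192 at page offset 76) shows why the carry has to accumulate
rather than be reassigned; the per-page split is a derivation, stated here
as an assumption:

/* bio_for_each_segment() splits the (8192, 76) segment at page boundaries */
/* into single-page pieces of 4020, 4096 and 76 bytes:                     */
/*                                                                         */
/* segment 1: len 4020 -> 7 full sectors, carry = 4020 - 7*512 = 436       */
/* segment 2: len 4096 -> starts 436 bytes into the next sector;           */
/*                        8 full sectors consumed, carry stays 436         */
/* segment 3: len   76 -> should start 436 bytes into the last sector,     */
/*                        completing it (436 + 76 = 512)                   */

With plain assignment the carry is reset to 4096 - 8*512 = 0 after segment 2,
so the final 76 bytes are written to the start of the last sector instead of
byte 436 of it.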


Thanks, 
Ming



* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04  1:05                           ` Ming Lei
@ 2019-12-04 17:23                             ` Stephen Rust
  2019-12-04 23:02                               ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-12-04 17:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe

Hi Ming,

I have tried your latest "workaround" patch in brd including the fix
for large offsets, and it does appear to work. I tried the same tests
and the data was written correctly for all offsets I tried. Thanks!

I include the updated additional bpftrace below.

> So firstly, I'd suggest to investigate from RDMA driver side to see why
> un-aligned buffer is passed to block layer.
>
> According to previous discussion, 512 aligned buffer should be provided
> to block layer.
>
> So looks the driver needs to be fixed.

If it does appear to be an RDMA driver issue, do you know who we
should follow up with directly from the RDMA driver side of the world?

Presumably non-brd devices (i.e. real SCSI devices) work for these test
cases because they accept un-aligned buffers?

> The patch might not cover the big offset case, could you collect bpftrace
> via the following script when you reproduce the issue with >4096 offset?

Here is the updated bpftrace output for an offset of 8192:

8192 76
4020 76 1 131056
4096 0 1 131063
76 0 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 130944
8192 76
4020 76 1 131056
4096 0 1 131063
76 0 1 131071
4096 0
4096 0 0 130808
4096 0
4096 0
4096 0 0 131056
4096 0 0 131064
[snip]

Thanks,
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 17:23                             ` Stephen Rust
@ 2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
                                                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Ming Lei @ 2019-12-04 23:02 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On Wed, Dec 04, 2019 at 12:23:39PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> I have tried your latest "workaround" patch in brd including the fix
> for large offsets, and it does appear to work. I tried the same tests
> and the data was written correctly for all offsets I tried. Thanks!
> 
> I include the updated additional bpftrace below.
> 
> > So firstly, I'd suggest to investigate from RDMA driver side to see why
> > un-aligned buffer is passed to block layer.
> >
> > According to previous discussion, 512 aligned buffer should be provided
> > to block layer.
> >
> > So looks the driver needs to be fixed.
> 
> If it does appear to be an RDMA driver issue, do you know who we
> should follow up with directly from the RDMA driver side of the world?
> 
> Presumably non-brd devices, ie: real scsi devices work for these test
> cases because they accept un-aligned buffers?

Right, not every driver supports such un-aligned buffer.

I am not familiar with RDMA, but from the trace we have done so far,
it is highly related with iser driver. 


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
@ 2019-12-05  0:16                                 ` Bart Van Assche
  2019-12-05 14:44                                   ` Stephen Rust
  2019-12-05  2:28                                 ` Stephen Rust
  2019-12-05  9:17                                 ` Sagi Grimberg
  2 siblings, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2019-12-05  0:16 UTC (permalink / raw)
  To: Ming Lei, Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On 12/4/19 3:02 PM, Ming Lei wrote:
> On Wed, Dec 04, 2019 at 12:23:39PM -0500, Stephen Rust wrote:
>> Presumably non-brd devices, ie: real scsi devices work for these test
>> cases because they accept un-aligned buffers?
> 
> Right, not every driver supports such un-aligned buffer.
> 
> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Hi Stephen,

Do you need the iSER protocol? I think that the NVMeOF and SRP drivers 
also support RoCE and that these align data buffers on a 512 byte boundary.

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
@ 2019-12-05  2:28                                 ` Stephen Rust
  2019-12-05  3:05                                   ` Ming Lei
  2019-12-05  9:17                                 ` Sagi Grimberg
  2 siblings, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2019-12-05  2:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

Hi Ming,

Thanks for all your help and insight. I really appreciate it.

> > Presumably non-brd devices, ie: real scsi devices work for these test
> > cases because they accept un-aligned buffers?
>
> Right, not every driver supports such un-aligned buffer.

Can you please clarify: does the block layer require that it is called
with 512-byte aligned buffers? If that is the case, would it make
sense for the block interface (bio_add_page() or other) to reject
buffers that are not aligned?

It seems that passing these buffers on to underlying drivers that
don't support un-aligned buffers can result in silent data corruption.
Perhaps it would be better to fail the I/O up front. This would also
help future proof the block interface when changes/new target drivers
are added.

I'm also curious how these same unaligned buffers from iSER made it to
brd and were written successfully in the pre "multi-page bvec" world.
(Just trying to understand, if you have any thoughts, as this same
test case worked fine in 4.14+ until 5.1)

> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Do you think it is fair to say that the iSER/block integration is
causing corruption by using un-aligned buffers?

Thanks,
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  2:28                                 ` Stephen Rust
@ 2019-12-05  3:05                                   ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2019-12-05  3:05 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Sagi Grimberg, Max Gurtovoy

On Wed, Dec 04, 2019 at 09:28:43PM -0500, Stephen Rust wrote:
> Hi Ming,
> 
> Thanks for all your help and insight. I really appreciate it.
> 
> > > Presumably non-brd devices, ie: real scsi devices work for these test
> > > cases because they accept un-aligned buffers?
> >
> > Right, not every driver supports such un-aligned buffer.
> 
> Can you please clarify: does the block layer require that it is called
> with 512-byte aligned buffers? If that is the case, would it make
> sense for the block interface (bio_add_page() or other) to reject
> buffers that are not aligned?

Things are a bit complicated; see the following xfs commits:

f8f9ee479439 xfs: add kmem_alloc_io()
d916275aa4dd xfs: get allocation alignment from the buftarg

These apply the request queue's dma alignment limit, which may be
smaller than 512. Before this report, xfs was the only known user
passing un-aligned buffers.

So we can't add such a check in bio_add_page(): the request queue may
not be available there, bio_add_page() is a really hot path, and people
hate adding unnecessary code to this function.

IMO, it is better for all filesystems and other users of bio_add_page()
to pass 512-byte aligned buffers.
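
Roughly, the idea behind those commits is the following (illustrative
sketch only, not the actual xfs code; the helper name and signature are
made up here):

	/*
	 * Retry with a page-aligned allocation when the slab allocation
	 * does not satisfy the queue's dma alignment mask (e.g. 511 for
	 * a 512-byte logical block size).  Callers free with kvfree().
	 */
	static void *alloc_io_buffer(size_t size, unsigned long align_mask)
	{
		void *buf = kmalloc(size, GFP_KERNEL);

		if (buf && ((unsigned long)buf & align_mask)) {
			kfree(buf);
			buf = vmalloc(size);	/* page aligned */
		}
		return buf;
	}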

> 
> It seems that passing these buffers on to underlying drivers that
> don't support un-aligned buffers can result in silent data corruption.
> Perhaps it would be better to fail the I/O up front. This would also
> help future proof the block interface when changes/new target drivers
> are added.

It is a brd device, so strictly speaking it doesn't matter whether we
fail the I/O or not, given that either way the data is lost.

> 
> I'm also curious how these same unaligned buffers from iSER made it to
> brd and were written successfully in the pre "multi-page bvec" world.
> (Just trying to understand, if you have any thoughts, as this same
> test case worked fine in 4.14+ until 5.1)

I am pretty sure that brd has never supported un-aligned buffers, and I
have no idea how the 'multi-page bvec' helpers could cause this issue.
However, I am happy to investigate further if you can run the previous
trace on a pre 'multi-page bvec' kernel.

> 
> > I am not familiar with RDMA, but from the trace we have done so far,
> > it is highly related with iser driver.
> 
> Do you think it is fair to say that the iSER/block integration is
> causing corruption by using un-aligned buffers?

As you saw, XFS changed the un-aligned buffer into an aligned one to
avoid the issue, so I think it is pretty fair to say that.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-04 23:02                               ` Ming Lei
  2019-12-05  0:16                                 ` Bart Van Assche
  2019-12-05  2:28                                 ` Stephen Rust
@ 2019-12-05  9:17                                 ` Sagi Grimberg
  2019-12-05 14:36                                   ` Stephen Rust
       [not found]                                   ` <CAAFE1beqFBQS_zVYEXFTD2qu8PAF9hBSW4j1k9ZD6MhU_gWg5Q@mail.gmail.com>
  2 siblings, 2 replies; 30+ messages in thread
From: Sagi Grimberg @ 2019-12-05  9:17 UTC (permalink / raw)
  To: Ming Lei, Stephen Rust
  Cc: Rob Townley, Christoph Hellwig, Jens Axboe, linux-block,
	linux-rdma, linux-scsi, martin.petersen, target-devel,
	Doug Ledford, Jason Gunthorpe, Max Gurtovoy


>> Hi Ming,
>>
>> I have tried your latest "workaround" patch in brd including the fix
>> for large offsets, and it does appear to work. I tried the same tests
>> and the data was written correctly for all offsets I tried. Thanks!
>>
>> I include the updated additional bpftrace below.
>>
>>> So firstly, I'd suggest to investigate from RDMA driver side to see why
>>> un-aligned buffer is passed to block layer.
>>>
>>> According to previous discussion, 512 aligned buffer should be provided
>>> to block layer.
>>>
>>> So looks the driver needs to be fixed.
>>
>> If it does appear to be an RDMA driver issue, do you know who we
>> should follow up with directly from the RDMA driver side of the world?
>>
>> Presumably non-brd devices, ie: real scsi devices work for these test
>> cases because they accept un-aligned buffers?
> 
> Right, not every driver supports such un-aligned buffer.
> 
> I am not familiar with RDMA, but from the trace we have done so far,
> it is highly related with iser driver.

Hi guys,

Just got this one (Thanks for CCing me Ming, been extremely busy
lately).

So it looks from the report that this is the immediate-data and
unsolicited data-out flows, which indeed seem to violate the alignment
assumption. The reason is that isert posts a recv with a single
contiguous rx_desc that holds both the headers and the data, so when it
gets immediate data it sets the data sg to rx_desc + offset (the offset
being the size of the headers).
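
In code terms, the current layout looks roughly like this (simplified
from the driver; the full definitions appear in the patch further down
the thread):

	/* headers and data share one contiguous, __packed receive buffer */
	struct iser_rx_desc {
		struct iser_ctrl iser_header;
		struct iscsi_hdr iscsi_header;
		char data[ISCSI_DEF_MAX_RECV_SEG_LEN];	/* starts at offset 76 */
		/* ... */
	};

	/* immediate-data path: the SCSI data sg points straight at it */
	sg_set_buf(&isert_cmd->sg, &rx_desc->data[0], imm_data_len);

So the data buffer handed down starts ISER_HEADERS_LEN (76) bytes into
the allocation, which is exactly the un-aligned offset showing up in the
traces above.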

Stephen,
As a work-around for now, you should turn off immediate-data in your LIO
target. I'll work on a fix.

Thanks for reporting!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  9:17                                 ` Sagi Grimberg
@ 2019-12-05 14:36                                   ` Stephen Rust
       [not found]                                   ` <CAAFE1beqFBQS_zVYEXFTD2qu8PAF9hBSW4j1k9ZD6MhU_gWg5Q@mail.gmail.com>
  1 sibling, 0 replies; 30+ messages in thread
From: Stephen Rust @ 2019-12-05 14:36 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

Hi Sagi,

On Thu, Dec 5, 2019 at 4:17 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
> Just got this one (Thanks for CCing me Ming, been extremely busy
> lately).

No problem, thanks for looking into it!

> So it looks from the report that this is the immediate-data and
> unsolicited data-out flows, which indeed seem to violate the alignment
> assumption. The reason is that isert posts a recv with a single
> contiguous rx_desc that holds both the headers and the data, so when it
> gets immediate data it sets the data sg to rx_desc + offset (the offset
> being the size of the headers).
>
> Stephen,
> As a work-around for now, you should turn off immediate-data in your LIO
> target. I'll work on a fix.

I have confirmed that turning off ImmediateData in the target (and
reconnecting) is a successful workaround for this test case. All of
the I/O as reported by bio_add_page() is aligned.

Using the previously described bpftrace script with 512 offset:

# bpftrace lio.bt
Attaching 4 probes...
512 0
512 0 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 131064
4096 0
4096 0 1 131064
4096 0
4096 0 0 0
4096 0
4096 0 0 8
512 0
512 0 0 131071
4096 0
4096 0 0 130944
4096 0
4096 0 0 131056

> Thanks for reporting!

Please let me know if you need any additional information, or if I can
assist further. I would be happy to test any patches when you are
ready.

Thanks,
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2019-12-05  0:16                                 ` Bart Van Assche
@ 2019-12-05 14:44                                   ` Stephen Rust
  0 siblings, 0 replies; 30+ messages in thread
From: Stephen Rust @ 2019-12-05 14:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Sagi Grimberg,
	Max Gurtovoy

On Wed, Dec 4, 2019 at 7:16 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Do you need the iSER protocol? I think that the NVMeOF and SRP drivers
> also support RoCE and that these align data buffers on a 512 byte boundary.

Hi Bart,

In this case we do. But thank you for the other references. Those
might be options for us for other use cases.

Thanks,
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
       [not found]                                   ` <CAAFE1beqFBQS_zVYEXFTD2qu8PAF9hBSW4j1k9ZD6MhU_gWg5Q@mail.gmail.com>
@ 2020-03-25  0:15                                     ` Sagi Grimberg
  2020-03-30 17:08                                       ` Stephen Rust
  0 siblings, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-03-25  0:15 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 1240 bytes --]

> Hi Sagi,

Hey Stephen,

>> So it looks from the report that this is the immediate-data and
>> unsolicited data-out flows, which indeed seem to violate the alignment
>> assumption. The reason is that isert posts a recv with a single
>> contiguous rx_desc that holds both the headers and the data, so when it
>> gets immediate data it sets the data sg to rx_desc + offset (the offset
>> being the size of the headers).
>>
>> Stephen,
>> As a work-around for now, you should turn off immediate-data in your LIO
>> target. I'll work on a fix.
> 
> I'm writing to see if you have had a chance to work on a fix for this yet?

Sorry for the late reply, lost track of this.

> We have a couple of followups as well:
> 
> - Based on Ming's previous comments, is it safe to assume all non-brd 
> drivers, and nvme in particular, are able to handle unaligned data? We are 
> hesitant to enable ImmediateData in general with this bug outstanding.
> 
> - We are seeing a 5-10% performance degradation when using 
> ImmediateData=No for our workloads. That is to say, having this issue 
> fixed is important to us. If there is anything we can help with, testing 
> or otherwise, please let us know.

Can you try the attached patch and see if it solves your issue?
WARNING: very lightly tested...

[-- Attachment #2: 0001-IB-isert-fix-unaligned-immediate-data-handling.patch --]
[-- Type: text/x-patch, Size: 14144 bytes --]

From ca7b6374d22efa8592a35d9d191672e7cbf84a44 Mon Sep 17 00:00:00 2001
From: Sagi Grimberg <sagi@grimberg.me>
Date: Tue, 24 Mar 2020 11:14:44 -0700
Subject: [PATCH] IB/isert: fix unaligned immediate-data handling

Currently we allocate rx buffers as a single contiguous buffer
holding the headers (iser and iscsi) and the data trailer. This means
that the data most likely starts at an offset of 76 bytes (the size of
both headers), i.e. not on a 512-byte boundary.

This worked fine for years, but at some point this broke.
To fix this, we should avoid passing unaligned buffers for
I/O.

The fix is rather simple (although a bit invasive). Simply
allocate a buffer for the headers and a buffer for the data
and post a 2-entry recv work request.

Reported-by: Stephen Rust <srust@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/infiniband/ulp/isert/ib_isert.c | 207 +++++++++++++-----------
 drivers/infiniband/ulp/isert/ib_isert.h |  18 +--
 2 files changed, 123 insertions(+), 102 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index a1a035270cab..bd27679d2a1b 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -129,7 +129,7 @@ isert_create_qp(struct isert_conn *isert_conn,
 	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
 	attr.cap.max_rdma_ctxs = ISCSI_DEF_XMIT_CMDS_MAX;
 	attr.cap.max_send_sge = device->ib_device->attrs.max_send_sge;
-	attr.cap.max_recv_sge = 1;
+	attr.cap.max_recv_sge = 2;
 	attr.sq_sig_type = IB_SIGNAL_REQ_WR;
 	attr.qp_type = IB_QPT_RC;
 	if (device->pi_capable)
@@ -163,15 +163,74 @@ isert_conn_setup_qp(struct isert_conn *isert_conn, struct rdma_cm_id *cma_id)
 	return ret;
 }
 
+static int num_pages(int len)
+{
+	return 1 + (((len - 1) & PAGE_MASK) >> PAGE_SHIFT);
+}
+
+static void
+isert_free_rx_descriptor(struct iser_rx_desc *rx_desc, struct ib_device *ib_dev)
+{
+	struct ib_sge *rx_sg;
+
+	rx_sg = &rx_desc->rx_sg[1];
+	ib_dma_unmap_single(ib_dev, rx_sg->addr,
+			ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
+	free_pages((unsigned long)rx_desc->data,
+			num_pages(ISCSI_DEF_MAX_RECV_SEG_LEN));
+	rx_sg = &rx_desc->rx_sg[0];
+	ib_dma_unmap_single(ib_dev, rx_sg->addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+}
+
+static int
+isert_alloc_rx_descriptor(struct iser_rx_desc *rx_desc, struct isert_device *device)
+{
+	struct ib_device *ib_dev = device->ib_device;
+	struct ib_sge *rx_sg;
+
+	/* headers */
+	rx_sg = &rx_desc->rx_sg[0];
+	rx_sg->addr = ib_dma_map_single(ib_dev, (void *)rx_desc,
+				ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(ib_dev, rx_sg->addr))
+		return -ENOMEM;
+
+	rx_sg->length = ISER_HEADERS_LEN;
+	rx_sg->lkey = device->pd->local_dma_lkey;
+
+	/* data */
+	rx_sg = &rx_desc->rx_sg[1];
+	rx_desc->data = (char *)__get_free_pages(GFP_KERNEL,
+				num_pages(ISCSI_DEF_MAX_RECV_SEG_LEN));
+	if (!rx_desc->data)
+		goto alloc_fail;
+
+	rx_sg->addr = ib_dma_map_single(ib_dev, (void *)rx_desc->data,
+				ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
+	if (ib_dma_mapping_error(ib_dev, rx_sg->addr))
+		goto data_map_fail;
+
+	rx_sg->length = ISCSI_DEF_MAX_RECV_SEG_LEN;
+	rx_sg->lkey = device->pd->local_dma_lkey;
+
+	return 0;
+
+data_map_fail:
+	kfree(rx_desc->data);
+alloc_fail:
+	rx_sg = &rx_desc->rx_sg[0];
+	ib_dma_unmap_single(ib_dev, rx_sg->addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	return -ENOMEM;
+}
+
 static int
 isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 {
 	struct isert_device *device = isert_conn->device;
 	struct ib_device *ib_dev = device->ib_device;
-	struct iser_rx_desc *rx_desc;
-	struct ib_sge *rx_sg;
-	u64 dma_addr;
-	int i, j;
+	int i;
 
 	isert_conn->rx_descs = kcalloc(ISERT_QP_MAX_RECV_DTOS,
 				       sizeof(struct iser_rx_desc),
@@ -179,32 +238,16 @@ isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 	if (!isert_conn->rx_descs)
 		return -ENOMEM;
 
-	rx_desc = isert_conn->rx_descs;
-
-	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++, rx_desc++)  {
-		dma_addr = ib_dma_map_single(ib_dev, (void *)rx_desc,
-					ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-		if (ib_dma_mapping_error(ib_dev, dma_addr))
-			goto dma_map_fail;
-
-		rx_desc->dma_addr = dma_addr;
-
-		rx_sg = &rx_desc->rx_sg;
-		rx_sg->addr = rx_desc->dma_addr;
-		rx_sg->length = ISER_RX_PAYLOAD_SIZE;
-		rx_sg->lkey = device->pd->local_dma_lkey;
-		rx_desc->rx_cqe.done = isert_recv_done;
+	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++)  {
+		if (isert_alloc_rx_descriptor(&isert_conn->rx_descs[i], device))
+			goto alloc_fail;
 	}
 
 	return 0;
 
-dma_map_fail:
-	rx_desc = isert_conn->rx_descs;
-	for (j = 0; j < i; j++, rx_desc++) {
-		ib_dma_unmap_single(ib_dev, rx_desc->dma_addr,
-				    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-	}
-	kfree(isert_conn->rx_descs);
+alloc_fail:
+	while (--i)
+		isert_free_rx_descriptor(&isert_conn->rx_descs[i], ib_dev);
 	isert_conn->rx_descs = NULL;
 	isert_err("conn %p failed to allocate rx descriptors\n", isert_conn);
 	return -ENOMEM;
@@ -214,17 +257,13 @@ static void
 isert_free_rx_descriptors(struct isert_conn *isert_conn)
 {
 	struct ib_device *ib_dev = isert_conn->device->ib_device;
-	struct iser_rx_desc *rx_desc;
 	int i;
 
 	if (!isert_conn->rx_descs)
 		return;
 
-	rx_desc = isert_conn->rx_descs;
-	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++, rx_desc++)  {
-		ib_dma_unmap_single(ib_dev, rx_desc->dma_addr,
-				    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-	}
+	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++)
+		isert_free_rx_descriptor(&isert_conn->rx_descs[i], ib_dev);
 
 	kfree(isert_conn->rx_descs);
 	isert_conn->rx_descs = NULL;
@@ -407,11 +446,7 @@ isert_free_login_buf(struct isert_conn *isert_conn)
 	ib_dma_unmap_single(ib_dev, isert_conn->login_rsp_dma,
 			    ISER_RX_PAYLOAD_SIZE, DMA_TO_DEVICE);
 	kfree(isert_conn->login_rsp_buf);
-
-	ib_dma_unmap_single(ib_dev, isert_conn->login_req_dma,
-			    ISER_RX_PAYLOAD_SIZE,
-			    DMA_FROM_DEVICE);
-	kfree(isert_conn->login_req_buf);
+	isert_free_rx_descriptor(&isert_conn->login_desc, ib_dev);
 }
 
 static int
@@ -420,25 +455,15 @@ isert_alloc_login_buf(struct isert_conn *isert_conn,
 {
 	int ret;
 
-	isert_conn->login_req_buf = kzalloc(sizeof(*isert_conn->login_req_buf),
-			GFP_KERNEL);
-	if (!isert_conn->login_req_buf)
-		return -ENOMEM;
-
-	isert_conn->login_req_dma = ib_dma_map_single(ib_dev,
-				isert_conn->login_req_buf,
-				ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-	ret = ib_dma_mapping_error(ib_dev, isert_conn->login_req_dma);
-	if (ret) {
-		isert_err("login_req_dma mapping error: %d\n", ret);
-		isert_conn->login_req_dma = 0;
-		goto out_free_login_req_buf;
-	}
+	ret = isert_alloc_rx_descriptor(&isert_conn->login_desc,
+					isert_conn->device);
+	if (ret)
+		return ret;
 
 	isert_conn->login_rsp_buf = kzalloc(ISER_RX_PAYLOAD_SIZE, GFP_KERNEL);
 	if (!isert_conn->login_rsp_buf) {
 		ret = -ENOMEM;
-		goto out_unmap_login_req_buf;
+		goto out_free_login_desc;
 	}
 
 	isert_conn->login_rsp_dma = ib_dma_map_single(ib_dev,
@@ -455,11 +480,8 @@ isert_alloc_login_buf(struct isert_conn *isert_conn,
 
 out_free_login_rsp_buf:
 	kfree(isert_conn->login_rsp_buf);
-out_unmap_login_req_buf:
-	ib_dma_unmap_single(ib_dev, isert_conn->login_req_dma,
-			    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-out_free_login_req_buf:
-	kfree(isert_conn->login_req_buf);
+out_free_login_desc:
+	isert_free_rx_descriptor(&isert_conn->login_desc, ib_dev);
 	return ret;
 }
 
@@ -516,10 +538,6 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 	isert_init_conn(isert_conn);
 	isert_conn->cm_id = cma_id;
 
-	ret = isert_alloc_login_buf(isert_conn, cma_id->device);
-	if (ret)
-		goto out;
-
 	device = isert_device_get(cma_id);
 	if (IS_ERR(device)) {
 		ret = PTR_ERR(device);
@@ -527,6 +545,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
 	}
 	isert_conn->device = device;
 
+	ret = isert_alloc_login_buf(isert_conn, cma_id->device);
+	if (ret)
+		goto out;
+
 	isert_set_nego_params(isert_conn, &event->param.conn);
 
 	ret = isert_conn_setup_qp(isert_conn, cma_id);
@@ -578,7 +600,7 @@ isert_connect_release(struct isert_conn *isert_conn)
 		ib_destroy_qp(isert_conn->qp);
 	}
 
-	if (isert_conn->login_req_buf)
+	if (isert_conn->login_rsp_buf)
 		isert_free_login_buf(isert_conn);
 
 	isert_device_put(device);
@@ -809,9 +831,10 @@ isert_post_recvm(struct isert_conn *isert_conn, u32 count)
 	for (rx_wr = isert_conn->rx_wr, i = 0; i < count; i++, rx_wr++) {
 		rx_desc = &isert_conn->rx_descs[i];
 
+		rx_desc->rx_cqe.done = isert_recv_done;
 		rx_wr->wr_cqe = &rx_desc->rx_cqe;
-		rx_wr->sg_list = &rx_desc->rx_sg;
-		rx_wr->num_sge = 1;
+		rx_wr->sg_list = rx_desc->rx_sg;
+		rx_wr->num_sge = 2;
 		rx_wr->next = rx_wr + 1;
 		rx_desc->in_use = false;
 	}
@@ -840,9 +863,10 @@ isert_post_recv(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc)
 	}
 
 	rx_desc->in_use = false;
+	rx_desc->rx_cqe.done = isert_recv_done;
 	rx_wr.wr_cqe = &rx_desc->rx_cqe;
-	rx_wr.sg_list = &rx_desc->rx_sg;
-	rx_wr.num_sge = 1;
+	rx_wr.sg_list = rx_desc->rx_sg;
+	rx_wr.num_sge = 2;
 	rx_wr.next = NULL;
 
 	ret = ib_post_recv(isert_conn->qp, &rx_wr, NULL);
@@ -960,23 +984,13 @@ static int
 isert_login_post_recv(struct isert_conn *isert_conn)
 {
 	struct ib_recv_wr rx_wr;
-	struct ib_sge sge;
 	int ret;
 
-	memset(&sge, 0, sizeof(struct ib_sge));
-	sge.addr = isert_conn->login_req_dma;
-	sge.length = ISER_RX_PAYLOAD_SIZE;
-	sge.lkey = isert_conn->device->pd->local_dma_lkey;
-
-	isert_dbg("Setup sge: addr: %llx length: %d 0x%08x\n",
-		sge.addr, sge.length, sge.lkey);
-
-	isert_conn->login_req_buf->rx_cqe.done = isert_login_recv_done;
-
+	isert_conn->login_desc.rx_cqe.done = isert_login_recv_done;
 	memset(&rx_wr, 0, sizeof(struct ib_recv_wr));
-	rx_wr.wr_cqe = &isert_conn->login_req_buf->rx_cqe;
-	rx_wr.sg_list = &sge;
-	rx_wr.num_sge = 1;
+	rx_wr.wr_cqe = &isert_conn->login_desc.rx_cqe;
+	rx_wr.sg_list = isert_conn->login_desc.rx_sg;
+	rx_wr.num_sge = 2;
 
 	ret = ib_post_recv(isert_conn->qp, &rx_wr, NULL);
 	if (ret)
@@ -1051,7 +1065,7 @@ isert_put_login_tx(struct iscsi_conn *conn, struct iscsi_login *login,
 static void
 isert_rx_login_req(struct isert_conn *isert_conn)
 {
-	struct iser_rx_desc *rx_desc = isert_conn->login_req_buf;
+	struct iser_rx_desc *rx_desc = &isert_conn->login_desc;
 	int rx_buflen = isert_conn->login_req_len;
 	struct iscsi_conn *conn = isert_conn->conn;
 	struct iscsi_login *login = conn->conn_login;
@@ -1412,11 +1426,13 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 
 	rx_desc->in_use = true;
 
-	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->rx_sg[0].addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->rx_sg[1].addr,
+			ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
 
-	isert_dbg("DMA: 0x%llx, iSCSI opcode: 0x%02x, ITT: 0x%08x, flags: 0x%02x dlen: %d\n",
-		 rx_desc->dma_addr, hdr->opcode, hdr->itt, hdr->flags,
+	isert_dbg("iSCSI opcode: 0x%02x, ITT: 0x%08x, flags: 0x%02x dlen: %d\n",
+		 hdr->opcode, hdr->itt, hdr->flags,
 		 (int)(wc->byte_len - ISER_HEADERS_LEN));
 
 	switch (iser_ctrl->flags & 0xF0) {
@@ -1447,8 +1463,10 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	isert_rx_opcode(isert_conn, rx_desc,
 			read_stag, read_va, write_stag, write_va);
 
-	ib_dma_sync_single_for_device(ib_dev, rx_desc->dma_addr,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_device(ib_dev, rx_desc->rx_sg[0].addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_device(ib_dev, rx_desc->rx_sg[1].addr,
+			ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
 }
 
 static void
@@ -1456,14 +1474,17 @@ isert_login_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct isert_conn *isert_conn = wc->qp->qp_context;
 	struct ib_device *ib_dev = isert_conn->device->ib_device;
+	struct iser_rx_desc *rx_desc = &isert_conn->login_desc;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		isert_print_wc(wc, "login recv");
 		return;
 	}
 
-	ib_dma_sync_single_for_cpu(ib_dev, isert_conn->login_req_dma,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->rx_sg[0].addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->rx_sg[1].addr,
+			ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
 
 	isert_conn->login_req_len = wc->byte_len - ISER_HEADERS_LEN;
 
@@ -1478,8 +1499,10 @@ isert_login_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	complete(&isert_conn->login_req_comp);
 	mutex_unlock(&isert_conn->mutex);
 
-	ib_dma_sync_single_for_device(ib_dev, isert_conn->login_req_dma,
-				ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_device(ib_dev, rx_desc->rx_sg[0].addr,
+			ISER_HEADERS_LEN, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_device(ib_dev, rx_desc->rx_sg[1].addr,
+			ISCSI_DEF_MAX_RECV_SEG_LEN, DMA_FROM_DEVICE);
 }
 
 static void
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index 3b296bac4f60..407bd6c9eb1f 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -80,14 +80,13 @@ enum iser_conn_state {
 };
 
 struct iser_rx_desc {
-	struct iser_ctrl iser_header;
-	struct iscsi_hdr iscsi_header;
-	char		data[ISCSI_DEF_MAX_RECV_SEG_LEN];
-	u64		dma_addr;
-	struct ib_sge	rx_sg;
-	struct ib_cqe	rx_cqe;
-	bool		in_use;
-	char		pad[ISER_RX_PAD_SIZE];
+	struct iser_ctrl    iser_header;
+	struct iscsi_hdr    iscsi_header;
+	char                *data;
+	u64                 dma_addr[2];
+	struct ib_sge       rx_sg[2];
+	struct ib_cqe       rx_cqe;
+	bool	            in_use;
 } __packed;
 
 static inline struct iser_rx_desc *cqe_to_rx_desc(struct ib_cqe *cqe)
@@ -141,9 +140,8 @@ struct isert_conn {
 	u32			responder_resources;
 	u32			initiator_depth;
 	bool			pi_support;
-	struct iser_rx_desc	*login_req_buf;
+	struct iser_rx_desc	login_desc;
 	char			*login_rsp_buf;
-	u64			login_req_dma;
 	int			login_req_len;
 	u64			login_rsp_dma;
 	struct iser_rx_desc	*rx_descs;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2020-03-25  0:15                                     ` Sagi Grimberg
@ 2020-03-30 17:08                                       ` Stephen Rust
  2020-03-31  1:07                                         ` Sagi Grimberg
  2020-04-01  0:38                                         ` Sagi Grimberg
  0 siblings, 2 replies; 30+ messages in thread
From: Stephen Rust @ 2020-03-30 17:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

Sagi,

> Sorry for the late reply, lost track of this.

No problem!

> Can you try the attached patch and see if it solves your issue?
> WARNING: very lightly tested...

I have run our tests against this patch and it is working well for our
"basic" testing as well. The test case that previously failed, now
passes with this patch. So that's encouraging! Thanks for the quick
response and quick patch.

One question we had is regarding the hard coded header length: What
happens if the initiator sends an extended CDB, like a WRITE32? Are
there any concerns with an additional header segment (AHS)?

Thanks again,
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2020-03-30 17:08                                       ` Stephen Rust
@ 2020-03-31  1:07                                         ` Sagi Grimberg
  2020-04-01  0:38                                         ` Sagi Grimberg
  1 sibling, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-03-31  1:07 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy


>> Can you try the attached patch and see if it solves your issue?
>> WARNING: very lightly tested...
> 
> I have run our tests against this patch and it is working well for our
> "basic" testing as well. The test case that previously failed, now
> passes with this patch. So that's encouraging! Thanks for the quick
> response and quick patch.

Good to know..

> One question we had is regarding the hard coded header length: What
> happens if the initiator sends an extended CDB, like a WRITE32? Are
> there any concerns with an additional header segment (AHS)?

You are absolutely correct! t10-dif is broken with this patch, as a
32-byte cdb would be split across two buffers, which is not expected
by the target core...

I take back this patch. I guess we should keep the contiguous allocation
but set up the recv wr so that the data is aligned for 16-byte cdbs;
for 32-byte cdbs we never support immediate data anyway...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2020-03-30 17:08                                       ` Stephen Rust
  2020-03-31  1:07                                         ` Sagi Grimberg
@ 2020-04-01  0:38                                         ` Sagi Grimberg
  2020-04-02 20:03                                           ` Stephen Rust
  1 sibling, 1 reply; 30+ messages in thread
From: Sagi Grimberg @ 2020-04-01  0:38 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

[-- Attachment #1: Type: text/plain, Size: 498 bytes --]

Hey Stephen,

> I have run our tests against this patch and it is working well for our
> "basic" testing as well. The test case that previously failed, now
> passes with this patch. So that's encouraging! Thanks for the quick
> response and quick patch.
> 
> One question we had is regarding the hard coded header length: What
> happens if the initiator sends an extended CDB, like a WRITE32? Are
> there any concerns with an additional header segment (AHS)?

Does the attached patch work for you?

[-- Attachment #2: 0001-IB-isert-fix-unaligned-immediate-data-handling.patch --]
[-- Type: text/x-patch, Size: 13801 bytes --]

From 219dc38b527445b70b4b839bbb8fc920aad41f7d Mon Sep 17 00:00:00 2001
From: Sagi Grimberg <sagi@grimberg.me>
Date: Tue, 31 Mar 2020 00:39:58 -0700
Subject: [PATCH] IB/isert: fix unaligned immediate-data handling

Currently we allocate rx buffers as a single contiguous buffer
holding the headers (iser and iscsi) and the data trailer. This means
that the data most likely starts at an offset of 76 bytes (the size of
both headers), i.e. not on a 512-byte boundary.

This worked fine for years, but at some point this broke.
To fix this, we should avoid passing unaligned buffers for
I/O.

Instead, we allocate our recv buffers with some extra space so that
the data portion can be aligned to a 512-byte boundary. This also means
that we cannot reference the headers or data through the structure
layout, but rather via accessors (as they may move based on alignment).
Also, get rid of the wrong __packed annotation on iser_rx_desc, as it
only has harmful effects (the structure ends up not aligned to anything).

This affects the rx descriptors for iscsi login and data plane.

Reported-by: Stephen Rust <srust@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/infiniband/ulp/isert/ib_isert.c | 93 +++++++++++++------------
 drivers/infiniband/ulp/isert/ib_isert.h | 41 ++++++++---
 2 files changed, 79 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index a1a035270cab..60315d1a88c0 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -182,15 +182,16 @@ isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 	rx_desc = isert_conn->rx_descs;
 
 	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++, rx_desc++)  {
-		dma_addr = ib_dma_map_single(ib_dev, (void *)rx_desc,
-					ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+		dma_addr = ib_dma_map_single(ib_dev,
+					rx_desc->buf,
+					ISER_RX_SIZE, DMA_FROM_DEVICE);
 		if (ib_dma_mapping_error(ib_dev, dma_addr))
 			goto dma_map_fail;
 
 		rx_desc->dma_addr = dma_addr;
 
 		rx_sg = &rx_desc->rx_sg;
-		rx_sg->addr = rx_desc->dma_addr;
+		rx_sg->addr = rx_desc->dma_addr + isert_get_hdr_offset(rx_desc);
 		rx_sg->length = ISER_RX_PAYLOAD_SIZE;
 		rx_sg->lkey = device->pd->local_dma_lkey;
 		rx_desc->rx_cqe.done = isert_recv_done;
@@ -202,7 +203,7 @@ isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
 	rx_desc = isert_conn->rx_descs;
 	for (j = 0; j < i; j++, rx_desc++) {
 		ib_dma_unmap_single(ib_dev, rx_desc->dma_addr,
-				    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+				    ISER_RX_SIZE, DMA_FROM_DEVICE);
 	}
 	kfree(isert_conn->rx_descs);
 	isert_conn->rx_descs = NULL;
@@ -223,7 +224,7 @@ isert_free_rx_descriptors(struct isert_conn *isert_conn)
 	rx_desc = isert_conn->rx_descs;
 	for (i = 0; i < ISERT_QP_MAX_RECV_DTOS; i++, rx_desc++)  {
 		ib_dma_unmap_single(ib_dev, rx_desc->dma_addr,
-				    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+				    ISER_RX_SIZE, DMA_FROM_DEVICE);
 	}
 
 	kfree(isert_conn->rx_descs);
@@ -408,10 +409,10 @@ isert_free_login_buf(struct isert_conn *isert_conn)
 			    ISER_RX_PAYLOAD_SIZE, DMA_TO_DEVICE);
 	kfree(isert_conn->login_rsp_buf);
 
-	ib_dma_unmap_single(ib_dev, isert_conn->login_req_dma,
-			    ISER_RX_PAYLOAD_SIZE,
+	ib_dma_unmap_single(ib_dev, isert_conn->login_desc->dma_addr,
+			    ISER_RX_SIZE,
 			    DMA_FROM_DEVICE);
-	kfree(isert_conn->login_req_buf);
+	kfree(isert_conn->login_desc);
 }
 
 static int
@@ -420,25 +421,25 @@ isert_alloc_login_buf(struct isert_conn *isert_conn,
 {
 	int ret;
 
-	isert_conn->login_req_buf = kzalloc(sizeof(*isert_conn->login_req_buf),
+	isert_conn->login_desc = kzalloc(sizeof(*isert_conn->login_desc),
 			GFP_KERNEL);
-	if (!isert_conn->login_req_buf)
+	if (!isert_conn->login_desc)
 		return -ENOMEM;
 
-	isert_conn->login_req_dma = ib_dma_map_single(ib_dev,
-				isert_conn->login_req_buf,
-				ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-	ret = ib_dma_mapping_error(ib_dev, isert_conn->login_req_dma);
+	isert_conn->login_desc->dma_addr = ib_dma_map_single(ib_dev,
+				isert_conn->login_desc->buf,
+				ISER_RX_SIZE, DMA_FROM_DEVICE);
+	ret = ib_dma_mapping_error(ib_dev, isert_conn->login_desc->dma_addr);
 	if (ret) {
-		isert_err("login_req_dma mapping error: %d\n", ret);
-		isert_conn->login_req_dma = 0;
-		goto out_free_login_req_buf;
+		isert_err("login_desc dma mapping error: %d\n", ret);
+		isert_conn->login_desc->dma_addr = 0;
+		goto out_free_login_desc;
 	}
 
 	isert_conn->login_rsp_buf = kzalloc(ISER_RX_PAYLOAD_SIZE, GFP_KERNEL);
 	if (!isert_conn->login_rsp_buf) {
 		ret = -ENOMEM;
-		goto out_unmap_login_req_buf;
+		goto out_unmap_login_desc;
 	}
 
 	isert_conn->login_rsp_dma = ib_dma_map_single(ib_dev,
@@ -455,11 +456,11 @@ isert_alloc_login_buf(struct isert_conn *isert_conn,
 
 out_free_login_rsp_buf:
 	kfree(isert_conn->login_rsp_buf);
-out_unmap_login_req_buf:
-	ib_dma_unmap_single(ib_dev, isert_conn->login_req_dma,
-			    ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
-out_free_login_req_buf:
-	kfree(isert_conn->login_req_buf);
+out_unmap_login_desc:
+	ib_dma_unmap_single(ib_dev, isert_conn->login_desc->dma_addr,
+			    ISER_RX_SIZE, DMA_FROM_DEVICE);
+out_free_login_desc:
+	kfree(isert_conn->login_desc);
 	return ret;
 }
 
@@ -578,7 +579,7 @@ isert_connect_release(struct isert_conn *isert_conn)
 		ib_destroy_qp(isert_conn->qp);
 	}
 
-	if (isert_conn->login_req_buf)
+	if (isert_conn->login_desc)
 		isert_free_login_buf(isert_conn);
 
 	isert_device_put(device);
@@ -964,17 +965,18 @@ isert_login_post_recv(struct isert_conn *isert_conn)
 	int ret;
 
 	memset(&sge, 0, sizeof(struct ib_sge));
-	sge.addr = isert_conn->login_req_dma;
+	sge.addr = isert_conn->login_desc->dma_addr +
+		isert_get_hdr_offset(isert_conn->login_desc);
 	sge.length = ISER_RX_PAYLOAD_SIZE;
 	sge.lkey = isert_conn->device->pd->local_dma_lkey;
 
 	isert_dbg("Setup sge: addr: %llx length: %d 0x%08x\n",
 		sge.addr, sge.length, sge.lkey);
 
-	isert_conn->login_req_buf->rx_cqe.done = isert_login_recv_done;
+	isert_conn->login_desc->rx_cqe.done = isert_login_recv_done;
 
 	memset(&rx_wr, 0, sizeof(struct ib_recv_wr));
-	rx_wr.wr_cqe = &isert_conn->login_req_buf->rx_cqe;
+	rx_wr.wr_cqe = &isert_conn->login_desc->rx_cqe;
 	rx_wr.sg_list = &sge;
 	rx_wr.num_sge = 1;
 
@@ -1051,7 +1053,7 @@ isert_put_login_tx(struct iscsi_conn *conn, struct iscsi_login *login,
 static void
 isert_rx_login_req(struct isert_conn *isert_conn)
 {
-	struct iser_rx_desc *rx_desc = isert_conn->login_req_buf;
+	struct iser_rx_desc *rx_desc = isert_conn->login_desc;
 	int rx_buflen = isert_conn->login_req_len;
 	struct iscsi_conn *conn = isert_conn->conn;
 	struct iscsi_login *login = conn->conn_login;
@@ -1063,7 +1065,7 @@ isert_rx_login_req(struct isert_conn *isert_conn)
 
 	if (login->first_request) {
 		struct iscsi_login_req *login_req =
-			(struct iscsi_login_req *)&rx_desc->iscsi_header;
+			(struct iscsi_login_req *)isert_get_iscsi_hdr(rx_desc);
 		/*
 		 * Setup the initial iscsi_login values from the leading
 		 * login request PDU.
@@ -1082,13 +1084,13 @@ isert_rx_login_req(struct isert_conn *isert_conn)
 		login->tsih		= be16_to_cpu(login_req->tsih);
 	}
 
-	memcpy(&login->req[0], (void *)&rx_desc->iscsi_header, ISCSI_HDR_LEN);
+	memcpy(&login->req[0], isert_get_iscsi_hdr(rx_desc), ISCSI_HDR_LEN);
 
 	size = min(rx_buflen, MAX_KEY_VALUE_PAIRS);
 	isert_dbg("Using login payload size: %d, rx_buflen: %d "
 		  "MAX_KEY_VALUE_PAIRS: %d\n", size, rx_buflen,
 		  MAX_KEY_VALUE_PAIRS);
-	memcpy(login->req_buf, &rx_desc->data[0], size);
+	memcpy(login->req_buf, isert_get_data(rx_desc), size);
 
 	if (login->first_request) {
 		complete(&isert_conn->login_comp);
@@ -1153,14 +1155,15 @@ isert_handle_scsi_cmd(struct isert_conn *isert_conn,
 	if (imm_data_len != data_len) {
 		sg_nents = max(1UL, DIV_ROUND_UP(imm_data_len, PAGE_SIZE));
 		sg_copy_from_buffer(cmd->se_cmd.t_data_sg, sg_nents,
-				    &rx_desc->data[0], imm_data_len);
+				    isert_get_data(rx_desc), imm_data_len);
 		isert_dbg("Copy Immediate sg_nents: %u imm_data_len: %d\n",
 			  sg_nents, imm_data_len);
 	} else {
 		sg_init_table(&isert_cmd->sg, 1);
 		cmd->se_cmd.t_data_sg = &isert_cmd->sg;
 		cmd->se_cmd.t_data_nents = 1;
-		sg_set_buf(&isert_cmd->sg, &rx_desc->data[0], imm_data_len);
+		sg_set_buf(&isert_cmd->sg, isert_get_data(rx_desc),
+				imm_data_len);
 		isert_dbg("Transfer Immediate imm_data_len: %d\n",
 			  imm_data_len);
 	}
@@ -1229,9 +1232,9 @@ isert_handle_iscsi_dataout(struct isert_conn *isert_conn,
 	}
 	isert_dbg("Copying DataOut: sg_start: %p, sg_off: %u "
 		  "sg_nents: %u from %p %u\n", sg_start, sg_off,
-		  sg_nents, &rx_desc->data[0], unsol_data_len);
+		  sg_nents, isert_get_data(rx_desc), unsol_data_len);
 
-	sg_copy_from_buffer(sg_start, sg_nents, &rx_desc->data[0],
+	sg_copy_from_buffer(sg_start, sg_nents, isert_get_data(rx_desc),
 			    unsol_data_len);
 
 	rc = iscsit_check_dataout_payload(cmd, hdr, false);
@@ -1290,7 +1293,7 @@ isert_handle_text_cmd(struct isert_conn *isert_conn, struct isert_cmd *isert_cmd
 	}
 	cmd->text_in_ptr = text_in;
 
-	memcpy(cmd->text_in_ptr, &rx_desc->data[0], payload_length);
+	memcpy(cmd->text_in_ptr, isert_get_data(rx_desc), payload_length);
 
 	return iscsit_process_text_cmd(conn, cmd, hdr);
 }
@@ -1300,7 +1303,7 @@ isert_rx_opcode(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc,
 		uint32_t read_stag, uint64_t read_va,
 		uint32_t write_stag, uint64_t write_va)
 {
-	struct iscsi_hdr *hdr = &rx_desc->iscsi_header;
+	struct iscsi_hdr *hdr = isert_get_iscsi_hdr(rx_desc);
 	struct iscsi_conn *conn = isert_conn->conn;
 	struct iscsi_cmd *cmd;
 	struct isert_cmd *isert_cmd;
@@ -1398,8 +1401,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	struct isert_conn *isert_conn = wc->qp->qp_context;
 	struct ib_device *ib_dev = isert_conn->cm_id->device;
 	struct iser_rx_desc *rx_desc = cqe_to_rx_desc(wc->wr_cqe);
-	struct iscsi_hdr *hdr = &rx_desc->iscsi_header;
-	struct iser_ctrl *iser_ctrl = &rx_desc->iser_header;
+	struct iscsi_hdr *hdr = isert_get_iscsi_hdr(rx_desc);
+	struct iser_ctrl *iser_ctrl = isert_get_iser_hdr(rx_desc);
 	uint64_t read_va = 0, write_va = 0;
 	uint32_t read_stag = 0, write_stag = 0;
 
@@ -1413,7 +1416,7 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	rx_desc->in_use = true;
 
 	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+			ISER_RX_SIZE, DMA_FROM_DEVICE);
 
 	isert_dbg("DMA: 0x%llx, iSCSI opcode: 0x%02x, ITT: 0x%08x, flags: 0x%02x dlen: %d\n",
 		 rx_desc->dma_addr, hdr->opcode, hdr->itt, hdr->flags,
@@ -1448,7 +1451,7 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 			read_stag, read_va, write_stag, write_va);
 
 	ib_dma_sync_single_for_device(ib_dev, rx_desc->dma_addr,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+			ISER_RX_SIZE, DMA_FROM_DEVICE);
 }
 
 static void
@@ -1462,8 +1465,8 @@ isert_login_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
-	ib_dma_sync_single_for_cpu(ib_dev, isert_conn->login_req_dma,
-			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_cpu(ib_dev, isert_conn->login_desc->dma_addr,
+			ISER_RX_SIZE, DMA_FROM_DEVICE);
 
 	isert_conn->login_req_len = wc->byte_len - ISER_HEADERS_LEN;
 
@@ -1478,8 +1481,8 @@ isert_login_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 	complete(&isert_conn->login_req_comp);
 	mutex_unlock(&isert_conn->mutex);
 
-	ib_dma_sync_single_for_device(ib_dev, isert_conn->login_req_dma,
-				ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
+	ib_dma_sync_single_for_device(ib_dev, isert_conn->login_desc->dma_addr,
+				ISER_RX_SIZE, DMA_FROM_DEVICE);
 }
 
 static void
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index 3b296bac4f60..d267a6d60d87 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -59,9 +59,11 @@
 				ISERT_MAX_TX_MISC_PDUS	+ \
 				ISERT_MAX_RX_MISC_PDUS)
 
-#define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
-		(ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct ib_sge) + \
-		 sizeof(struct ib_cqe) + sizeof(bool)))
+/*
+ * RX size is default of 8k plus headers, but data needs to align to
+ * 512 boundary, so use 1024 to have the extra space for alignment.
+ */
+#define ISER_RX_SIZE		(ISCSI_DEF_MAX_RECV_SEG_LEN + 1024)
 
 #define ISCSI_ISER_SG_TABLESIZE		256
 
@@ -80,21 +82,41 @@ enum iser_conn_state {
 };
 
 struct iser_rx_desc {
-	struct iser_ctrl iser_header;
-	struct iscsi_hdr iscsi_header;
-	char		data[ISCSI_DEF_MAX_RECV_SEG_LEN];
+	char		buf[ISER_RX_SIZE];
 	u64		dma_addr;
 	struct ib_sge	rx_sg;
 	struct ib_cqe	rx_cqe;
 	bool		in_use;
-	char		pad[ISER_RX_PAD_SIZE];
-} __packed;
+};
 
 static inline struct iser_rx_desc *cqe_to_rx_desc(struct ib_cqe *cqe)
 {
 	return container_of(cqe, struct iser_rx_desc, rx_cqe);
 }
 
+static void *isert_get_iser_hdr(struct iser_rx_desc *desc)
+{
+	return PTR_ALIGN(desc->buf + ISER_HEADERS_LEN, 512) - ISER_HEADERS_LEN;
+}
+
+static size_t isert_get_hdr_offset(struct iser_rx_desc *desc)
+{
+	return isert_get_iser_hdr(desc) - (void *)desc->buf;
+}
+
+static void *isert_get_iscsi_hdr(struct iser_rx_desc *desc)
+{
+	return isert_get_iser_hdr(desc) + sizeof(struct iser_ctrl);
+}
+
+static void *isert_get_data(struct iser_rx_desc *desc)
+{
+	void *data = isert_get_iser_hdr(desc) + ISER_HEADERS_LEN;
+
+	WARN_ON((uintptr_t)data & 511);
+	return data;
+}
+
 struct iser_tx_desc {
 	struct iser_ctrl iser_header;
 	struct iscsi_hdr iscsi_header;
@@ -141,9 +163,8 @@ struct isert_conn {
 	u32			responder_resources;
 	u32			initiator_depth;
 	bool			pi_support;
-	struct iser_rx_desc	*login_req_buf;
+	struct iser_rx_desc	*login_desc;
 	char			*login_rsp_buf;
-	u64			login_req_dma;
 	int			login_req_len;
 	u64			login_rsp_dma;
 	struct iser_rx_desc	*rx_descs;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2020-04-01  0:38                                         ` Sagi Grimberg
@ 2020-04-02 20:03                                           ` Stephen Rust
  2020-04-02 22:16                                             ` Sagi Grimberg
  0 siblings, 1 reply; 30+ messages in thread
From: Stephen Rust @ 2020-04-02 20:03 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy

Hi Sagi,

> Does the attached patch work for you?

This patch _does_ work for our tests. I see all the writes aligned to
512 bytes as, I think, would be expected. Thanks!


Just for your information, using the bpftrace script earlier in this
thread, the output is as follows. The lines with 2 values show the
first two arguments to bio_add_page(); the lines with 4 values show
the first 4 arguments to brd_do_bvec().

For the successful case on the new patch:

# bpftrace isert.bt
Attaching 4 probes...
512 3072
512 3072 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 131064
4096 512
3584 512 1 131064
512 0 1 131071
512 0
512 0 0 131071


And for the failed case on the older unpatched code:

# bpftrace ./isert.bt
Attaching 4 probes...
512 76
512 76 1 131071
4096 0
4096 0 0 0
4096 0
4096 0 0 8
4096 0
4096 0 0 131064
4096 76
4020 76 1 131064
76 0 1 131071
512 0
512 0 0 131071


Thanks again!
Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Data corruption in kernel 5.1+ with iSER attached ramdisk
  2020-04-02 20:03                                           ` Stephen Rust
@ 2020-04-02 22:16                                             ` Sagi Grimberg
  0 siblings, 0 replies; 30+ messages in thread
From: Sagi Grimberg @ 2020-04-02 22:16 UTC (permalink / raw)
  To: Stephen Rust
  Cc: Ming Lei, Rob Townley, Christoph Hellwig, Jens Axboe,
	linux-block, linux-rdma, linux-scsi, martin.petersen,
	target-devel, Doug Ledford, Jason Gunthorpe, Max Gurtovoy


> Hi Sagi,
> 
>> Does the attached patch work for you?
> 
> This patch _does_ work for our tests. I see all the writes aligned to
> 512 bytes as, I think, would be expected. Thanks!

Good to hear. I'll wait for your further testing before submitting
it as a patch, as I don't have the resources to test this thoroughly
at the moment.

Thanks,
Sagi

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-04-02 22:17 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAAFE1bd9wuuobpe4VK7Ty175j7mWT+kRmHCNhVD+6R8MWEAqmw@mail.gmail.com>
2019-11-28  1:57 ` Data corruption in kernel 5.1+ with iSER attached ramdisk Ming Lei
     [not found]   ` <CA+VdTb_-CGaPjKUQteKVFSGqDz-5o-tuRRkJYqt8B9iOQypiwQ@mail.gmail.com>
2019-11-28  2:58     ` Ming Lei
     [not found]       ` <CAAFE1bfsXsKGyw7SU_z4NanT+wmtuJT=XejBYbHHMCDQwm73sw@mail.gmail.com>
2019-11-28  4:25         ` Stephen Rust
2019-11-28  5:51           ` Rob Townley
2019-11-28  9:12         ` Ming Lei
2019-12-02 18:42           ` Stephen Rust
2019-12-03  0:58             ` Ming Lei
2019-12-03  3:04               ` Stephen Rust
2019-12-03  3:14                 ` Ming Lei
2019-12-03  3:26                   ` Stephen Rust
2019-12-03  3:50                     ` Stephen Rust
2019-12-03 12:45                       ` Ming Lei
2019-12-03 19:56                         ` Stephen Rust
2019-12-04  1:05                           ` Ming Lei
2019-12-04 17:23                             ` Stephen Rust
2019-12-04 23:02                               ` Ming Lei
2019-12-05  0:16                                 ` Bart Van Assche
2019-12-05 14:44                                   ` Stephen Rust
2019-12-05  2:28                                 ` Stephen Rust
2019-12-05  3:05                                   ` Ming Lei
2019-12-05  9:17                                 ` Sagi Grimberg
2019-12-05 14:36                                   ` Stephen Rust
     [not found]                                   ` <CAAFE1beqFBQS_zVYEXFTD2qu8PAF9hBSW4j1k9ZD6MhU_gWg5Q@mail.gmail.com>
2020-03-25  0:15                                     ` Sagi Grimberg
2020-03-30 17:08                                       ` Stephen Rust
2020-03-31  1:07                                         ` Sagi Grimberg
2020-04-01  0:38                                         ` Sagi Grimberg
2020-04-02 20:03                                           ` Stephen Rust
2020-04-02 22:16                                             ` Sagi Grimberg
2019-12-04  2:39                           ` Ming Lei
2019-12-03  4:15                     ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).