* [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
@ 2021-11-04  6:46 Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 1/8] " Chaitanya Kulkarni
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni


From: Chaitanya Kulkarni <kch@nvidia.com>

Hi,

One of the responsibilities of the operating system, along with managing
resources, is to provide a unified interface to the user by creating
hardware abstractions. In the Linux kernel storage stack that
abstraction is created by implementing generic request operations
such as REQ_OP_READ/REQ_OP_WRITE or REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES,
which are then mapped to protocol-specific low-level hardware commands,
e.g. SCSI or NVMe.

With that in mind, this RFC patch-series implements a new block layer
operation to offload data verification onto the controller if supported,
or to emulate the operation if not. The main advantage is freeing up the
CPU and reducing host link traffic: for some devices the internal
bandwidth is higher than the host link, so offloading this operation can
improve the performance of proactive error detection applications such
as file-system-level scrubbing.

* Background *
-----------------------------------------------------------------------

The NVMe specification provides a controller-level Verify command [1],
similar to the ATA Verify command [2], where the controller is
responsible for data verification without transferring the data to the
host (offloading LBA verification). This is designed to proactively
discover data corruption while the device is idle, so that applications
can protect sensitive data and take corrective action instead of waiting
for a failure to occur.

The NVMe Verify command was added to provide low-level media scrubbing
and possibly to relocate data that shows correctable media degradation.
It also provides a way to enhance file-system-level scrubbing/checksum
verification and optionally offload this CPU-intensive task to the
kernel (when emulated), over the fabric, or to the controller (when
supported).

This is especially useful when the controller's internal bandwidth is
higher than the host's bandwidth, yielding a sharp increase in
performance due to _no host traffic or host CPU involvement_.
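
To make the mapping concrete, the sketch below shows how a host driver
could translate a REQ_OP_VERIFY request into the protocol-level command.
It is only an illustration modeled on the existing NVMe write-zeroes
setup path: the helper name nvme_setup_verify and the cmnd->verify field
layout are assumptions rather than code from this series (the Verify
opcode 0Ch and the 0's based block count come from the NVMe
specification); see the nvme host patch for the real implementation.

/*
 * Illustrative only: map REQ_OP_VERIFY to an NVMe Verify command.
 * The cmnd->verify sub-structure is assumed to mirror the existing
 * write-zeroes command layout.
 */
static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
		struct request *req, struct nvme_command *cmnd)
{
	cmnd->verify.opcode = nvme_cmd_verify;	/* 0x0c per NVMe 1.4 */
	cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
	/* Starting LBA of the range to verify. */
	cmnd->verify.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
	/* Number of logical blocks, 0's based, derived from the bio size. */
	cmnd->verify.length =
		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
	cmnd->verify.control = 0;
	return BLK_STS_OK;
}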

* Implementation *
-----------------------------------------------------------------------

Right now there is no generic interface that in-kernel components such
as file systems or userspace applications can use to issue a verify
command through a central block layer API (other than passthru commands
or some combination of write/read/compare). This can lead to each
userspace application carrying a protocol-specific IOCTL, which defeats
the purpose of having the OS provide a H/W abstraction.

This patch series introduces a new payloadless block layer request
operation, REQ_OP_VERIFY, that allows in-kernel components and userspace
applications to verify a range of LBAs by offloading checksum
scrubbing/verification to the controller. For direct-attached devices
this decreases host DMA traffic and CPU usage; for fabrics-attached
devices it decreases network traffic and CPU usage on both the host and
the target.
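
For in-kernel users the entry point is the blkdev_issue_verify() helper
added by the first patch (callers that want to chain bios themselves can
use __blkdev_issue_verify()). Below is a minimal sketch of how a
scrubbing loop might use the synchronous variant; the example simply
verifies the whole device and uses bdev_nr_sectors() for brevity.

/*
 * Minimal sketch: synchronously verify every sector of a block device
 * using the helper introduced in this series. If the device does not
 * advertise max_verify_sectors, blk-lib.c transparently falls back to
 * the READ-based emulation.
 */
static int example_scrub_bdev(struct block_device *bdev)
{
	sector_t nr_sects = bdev_nr_sectors(bdev);

	if (!nr_sects)
		return -ENXIO;

	/* Splits internally into max_verify_sectors-sized bios and waits. */
	return blkdev_issue_verify(bdev, 0, nr_sects, GFP_KERNEL);
}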

* Scope *
-----------------------------------------------------------------------

Please note that this series only covers operating-system-level
overhead. Analyzing controller Verify command performance for common
protocols (SCSI/NVMe) is out of scope for REQ_OP_VERIFY.

* Micro Benchmarks *
-----------------------------------------------------------------------

When verifying 500GB of data on NVMeOF with nvme-loop and null_blk as
the target backend block device, the results show roughly a 6x speedup
(about 83% less elapsed time) :-

With native Verify support (REQ_OP_VERIFY reaches null_blk) :-

real	2m3.773s
user	0m0.000s
sys	0m59.553s

With emulation (REQ_OP_VERIFY emulated as REQ_OP_READ on null_blk) :-

real	12m18.964s
user	0m0.002s
sys	1m15.666s


A detailed test log is included at the end of the cover letter.
Each of the following was tested:

1. Direct Attached REQ_OP_VERIFY.
2. Fabrics Attached REQ_OP_VERIFY.
3. Multi-device (md) REQ_OP_VERIFY.
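
The series also exposes the operation to userspace through a new block
device ioctl (see block/ioctl.c and include/uapi/linux/fs.h in the
diffstat). The cover letter does not spell out the ioctl name, so the
sketch below is a hypothetical userspace caller that assumes an ioctl
named BLKVERIFY taking a {start, length} byte range, analogous to
BLKDISCARD:

/*
 * Hypothetical userspace caller: the ioctl name BLKVERIFY and its
 * {start, length} argument layout are assumptions; the real definition
 * is added to include/uapi/linux/fs.h by this series.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	uint64_t range[2] = { 0, 0 };	/* { start, length } in bytes */
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKGETSIZE64, &range[1]) < 0) {
		perror("open/BLKGETSIZE64");
		return 1;
	}

#ifdef BLKVERIFY	/* only present with the patched uapi header */
	if (ioctl(fd, BLKVERIFY, range) < 0) {
		perror("BLKVERIFY");
		return 1;
	}
#endif
	close(fd);
	return 0;
}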

* The complete picture *
-----------------------------------------------------------------------

  For completeness, the whole kernel stack support is divided into
  two phases :-

  Phase I :-

   Add and stabilize support in the block layer & low-level drivers
   such as SCSI, NVMe, MD, and NVMeOF, implement the necessary emulation
   in the block layer where needed, and provide block-level tools such as
   _blkverify_. Also, add appropriate testcases for code coverage.

  Phase II :-

   Add and stabilize support for upper-layer kernel components such as
   file systems and provide userspace tools such as _fsverify_ to route
   requests from file systems through the block layer down to the
   low-level device drivers.


Please note that the blk-lib.c REQ_OP_VERIFY emulation interfaces will
change in the future; the current version was put together only for the
scope of this RFC.

Any comments are welcome.

-ck

References:-

[1] NVMe Verify :-

For pro-actively avoiding unrecoverable read errors, NVMe 1.4 adds
Verify and Get LBA Status commands. The Verify command is simple: it
does everything a normal read command does, except for returning the
data to the host system. If a read command would return an error, a
verify command will return the same error. If a read command would be
successful, a verify command will be as well. This makes it possible to
do a low-level scrub of the stored data without being bottlenecked by
the host interface bandwidth. Some SSDs will react to a fixable ECC
error by moving or re-writing degraded data, and a verify command
should trigger the same behavior. Overall, this should reduce the need
for filesystem-level checksum scrubbing/verification. Each Verify
command is tagged with a bit indicating whether the SSD should fail
fast or try hard to recover data, similar to but overriding the
Read Recovery Level setting.

[2]

http://t13.org/Documents/UploadedDocuments/docs2017/di529r18-ATAATAPI_Command_Set_-_4.pdf


Chaitanya Kulkarni (8):
  block: add support for REQ_OP_VERIFY
  scsi: add REQ_OP_VERIFY support
  nvme: add support for the Verify command
  nvmet: add Verify command support for bdev-ns
  nvmet: add Verify emulation support for bdev-ns
  nvmet: add verify emulation support for file-ns
  null_blk: add REQ_OP_VERIFY support
  md: add support for REQ_OP_VERIFY

 Documentation/ABI/testing/sysfs-block |  14 ++
 block/blk-core.c                      |   5 +
 block/blk-lib.c                       | 192 ++++++++++++++++++++++++++
 block/blk-merge.c                     |  19 +++
 block/blk-settings.c                  |  17 +++
 block/blk-sysfs.c                     |   8 ++
 block/blk-zoned.c                     |   1 +
 block/bounce.c                        |   1 +
 block/ioctl.c                         |  35 +++++
 drivers/block/null_blk/main.c         |  25 +++-
 drivers/block/null_blk/null_blk.h     |   1 +
 drivers/md/dm-core.h                  |   1 +
 drivers/md/dm-io.c                    |   8 +-
 drivers/md/dm-linear.c                |  11 +-
 drivers/md/dm-mpath.c                 |   1 +
 drivers/md/dm-rq.c                    |   3 +
 drivers/md/dm-stripe.c                |   1 +
 drivers/md/dm-table.c                 |  36 +++++
 drivers/md/dm.c                       |  31 +++++
 drivers/md/md-linear.c                |  10 ++
 drivers/md/md-multipath.c             |   1 +
 drivers/md/md.h                       |   7 +
 drivers/md/raid10.c                   |   1 +
 drivers/md/raid5.c                    |   1 +
 drivers/nvme/host/core.c              |  39 ++++++
 drivers/nvme/target/admin-cmd.c       |   3 +-
 drivers/nvme/target/core.c            |  12 +-
 drivers/nvme/target/io-cmd-bdev.c     |  74 ++++++++++
 drivers/nvme/target/io-cmd-file.c     | 151 ++++++++++++++++++++
 drivers/nvme/target/nvmet.h           |   3 +
 drivers/scsi/sd.c                     |  52 +++++++
 drivers/scsi/sd.h                     |   1 +
 include/linux/bio.h                   |  10 +-
 include/linux/blk_types.h             |   2 +
 include/linux/blkdev.h                |  31 +++++
 include/linux/device-mapper.h         |   6 +
 include/linux/nvme.h                  |  19 +++
 include/uapi/linux/fs.h               |   1 +
 38 files changed, 824 insertions(+), 10 deletions(-)


----------------------------------------------------------------------------
1. NVMeOF Block device backend with null_blk verify=1 support :-
----------------------------------------------------------------------------
# ./bdev.sh 1
++ FILE=/dev/nullb0
++ modprobe -r null_blk
++ modprobe null_blk verify=1 gb=50
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ dmesg -c > /dev/null
++ sleep 3
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   └── features
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

5 directories, 1 file
++ mkdir /sys/kernel/config/nvmet/subsystems/fs
++ for i in 1
++ mkdir /sys/kernel/config/nvmet/subsystems/fs/namespaces/1
++ echo -n /dev/nullb0
++ cat /sys/kernel/config/nvmet/subsystems/fs/namespaces/1/device_path
/dev/nullb0
++ echo 1
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/fs /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ echo transport=loop,nqn=fs
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ dmesg -c
[12826.608855] nvmet: adding nsid 1 to subsystem fs
[12827.680593] nvmet: creating controller 1 for subsystem fs for NQN nqn.2014-08.org.nvmexpress:uuid:ec41afad-329f-4453-b10c-628ad87d6558.
[12827.682760] nvme nvme1: creating 12 I/O queues.
[12827.693358] nvme nvme1: new ctrl: "fs"
[12827.695610] nvme1n1: detected capacity change from 0 to 53687091200
# insmod host/test_verify.ko 
# dmesg -c 
[13104.730748] TEST:     REQ_OP_VERIFY sect     0 nr_sect  2048
[13104.730805] nvme:     REQ_OP_VERIFY sect     0 nr_sect  2041 insmod
[13104.731268] nvmet:    REQ_OP_VERIFY sect     0 nr_sect  2041 kworker/2:2
[13104.731384] null_blk: REQ_OP_VERIFY sect     0 nr_sect  2041 kworker/2:1H
[13104.731432] nvme:     REQ_OP_VERIFY sect  2041 nr_sect     7 kworker/2:1H
[13104.731443] nvmet:    REQ_OP_VERIFY sect  2041 nr_sect     7 kworker/2:2
[13104.731458] null_blk: REQ_OP_VERIFY sect  2041 nr_sect     7 kworker/2:1H
[13104.731524] ----------------------------------------------------------
[13104.731526] TEST:     REQ_OP_VERIFY sect  2048 nr_sect  2048
[13104.731542] nvme:     REQ_OP_VERIFY sect  2048 nr_sect  2041 insmod
[13104.731555] nvme:     REQ_OP_VERIFY sect  4089 nr_sect     7 kworker/2:1H
[13104.731564] nvmet:    REQ_OP_VERIFY sect  2048 nr_sect  2041 kworker/2:2
[13104.731577] nvmet:    REQ_OP_VERIFY sect  4089 nr_sect     7 kworker/2:2
[13104.731620] null_blk: REQ_OP_VERIFY sect  2048 nr_sect  2041 kworker/2:1H
[13104.731625] null_blk: REQ_OP_VERIFY sect  4089 nr_sect     7 kworker/2:1H
[13104.731675] ----------------------------------------------------------
[13104.731678] TEST:     REQ_OP_VERIFY sect  4096 nr_sect  2048
[13104.731694] nvme:     REQ_OP_VERIFY sect  4096 nr_sect  2041 insmod
[13104.731709] nvme:     REQ_OP_VERIFY sect  6137 nr_sect     7 kworker/2:1H
[13104.731716] nvmet:    REQ_OP_VERIFY sect  4096 nr_sect  2041 kworker/2:2
[13104.731750] nvmet:    REQ_OP_VERIFY sect  6137 nr_sect     7 kworker/2:2
[13104.731766] null_blk: REQ_OP_VERIFY sect  4096 nr_sect  2041 kworker/2:1H
[13104.731771] null_blk: REQ_OP_VERIFY sect  6137 nr_sect     7 kworker/2:1H
[13104.731816] ----------------------------------------------------------
[13104.731819] TEST:     REQ_OP_VERIFY sect  6144 nr_sect  2048
[13104.731833] nvme:     REQ_OP_VERIFY sect  6144 nr_sect  2041 insmod
[13104.731847] nvme:     REQ_OP_VERIFY sect  8185 nr_sect     7 kworker/2:1H
[13104.731856] nvmet:    REQ_OP_VERIFY sect  6144 nr_sect  2041 kworker/2:2
[13104.731867] nvmet:    REQ_OP_VERIFY sect  8185 nr_sect     7 kworker/2:2
[13104.731881] null_blk: REQ_OP_VERIFY sect  6144 nr_sect  2041 kworker/2:1H
[13104.731886] null_blk: REQ_OP_VERIFY sect  8185 nr_sect     7 kworker/2:1H
[13104.731934] ----------------------------------------------------------
[13104.731936] TEST:     REQ_OP_VERIFY sect  8192 nr_sect  2048
[13104.731951] nvme:     REQ_OP_VERIFY sect  8192 nr_sect  2041 insmod
[13104.731965] nvme:     REQ_OP_VERIFY sect 10233 nr_sect     7 kworker/2:1H
[13104.731972] nvmet:    REQ_OP_VERIFY sect  8192 nr_sect  2041 kworker/2:2
[13104.731984] nvmet:    REQ_OP_VERIFY sect 10233 nr_sect     7 kworker/2:2
[13104.732019] null_blk: REQ_OP_VERIFY sect  8192 nr_sect  2041 kworker/2:1H
[13104.732024] null_blk: REQ_OP_VERIFY sect 10233 nr_sect     7 kworker/2:1H
[13104.732069] ----------------------------------------------------------

----------------------------------------------------------------------------
2. NVMeOF Block device backend with null_blk verify=0 support :-
----------------------------------------------------------------------------
# ./bdev.sh 0 
++ FILE=/dev/nullb0
++ modprobe -r null_blk
++ modprobe null_blk verify=0 gb=50
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ dmesg -c > /dev/null
++ sleep 3
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   └── features
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

5 directories, 1 file
++ mkdir /sys/kernel/config/nvmet/subsystems/fs
++ for i in 1
++ mkdir /sys/kernel/config/nvmet/subsystems/fs/namespaces/1
++ echo -n /dev/nullb0
++ cat /sys/kernel/config/nvmet/subsystems/fs/namespaces/1/device_path
/dev/nullb0
++ echo 1
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/fs /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ echo transport=loop,nqn=fs
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ dmesg -c
[12826.608855] nvmet: adding nsid 1 to subsystem fs
[12827.680593] nvmet: creating controller 1 for subsystem fs for NQN nqn.2014-08.org.nvmexpress:uuid:ec41afad-329f-4453-b10c-628ad87d6558.
[12827.682760] nvme nvme1: creating 12 I/O queues.
[12827.693358] nvme nvme1: new ctrl: "fs"
[12827.695610] nvme1n1: detected capacity change from 0 to 53687091200
# insmod host/test_verify.ko 
# dmesg -c 
[13981.069911] TEST:     REQ_OP_VERIFY sect     0 nr_sect  2048
[13981.069971] nvme:     REQ_OP_VERIFY sect     0 nr_sect  2041 insmod
[13981.070240] nvme:     REQ_OP_VERIFY sect  2041 nr_sect     7 kworker/9:1H
[13981.070350] nvmet:    REQ_OP_VERIFY sect     0 nr_sect  2041 kworker/9:2
[13981.070894] null_blk: REQ_OP_READ   sect     0 nr_sect   255 kworker/9:1H
[13981.070943] null_blk: REQ_OP_READ   sect   255 nr_sect   255 kworker/9:1H
[13981.070978] null_blk: REQ_OP_READ   sect   510 nr_sect   255 kworker/9:1H
[13981.071050] null_blk: REQ_OP_READ   sect   765 nr_sect   255 kworker/9:1H
[13981.071087] null_blk: REQ_OP_READ   sect  1020 nr_sect   255 kworker/9:1H
[13981.071120] null_blk: REQ_OP_READ   sect  1275 nr_sect   255 kworker/9:1H
[13981.071155] null_blk: REQ_OP_READ   sect  1530 nr_sect   255 kworker/9:1H
[13981.071244] null_blk: REQ_OP_READ   sect  1785 nr_sect   255 kworker/9:1H
[13981.071252] null_blk: REQ_OP_READ   sect  2040 nr_sect     1 kworker/9:1H
[13981.071907] nvmet:    REQ_OP_VERIFY sect  2041 nr_sect     7 kworker/9:2
[13981.071937] null_blk: REQ_OP_READ   sect  2041 nr_sect     7 kworker/9:1H
[13981.072264] ----------------------------------------------------------
[13981.072271] TEST:     REQ_OP_VERIFY sect  2048 nr_sect  2048
[13981.072327] nvme:     REQ_OP_VERIFY sect  2048 nr_sect  2041 insmod
[13981.072381] nvme:     REQ_OP_VERIFY sect  4089 nr_sect     7 kworker/2:1H
[13981.072547] nvmet:    REQ_OP_VERIFY sect  2048 nr_sect  2041 kworker/2:1
[13981.072632] null_blk: REQ_OP_READ   sect  2048 nr_sect   255 kworker/2:1H
[13981.072754] null_blk: REQ_OP_READ   sect  2303 nr_sect   255 kworker/2:1H
[13981.072858] null_blk: REQ_OP_READ   sect  2558 nr_sect   255 kworker/2:1H
[13981.072915] null_blk: REQ_OP_READ   sect  2813 nr_sect   255 kworker/2:1H
[13981.072961] null_blk: REQ_OP_READ   sect  3068 nr_sect   255 kworker/2:1H
[13981.073052] null_blk: REQ_OP_READ   sect  3323 nr_sect   255 kworker/2:1H
[13981.073102] null_blk: REQ_OP_READ   sect  3578 nr_sect   255 kworker/2:1H
[13981.073155] null_blk: REQ_OP_READ   sect  3833 nr_sect   255 kworker/2:1H
[13981.073164] null_blk: REQ_OP_READ   sect  4088 nr_sect     1 kworker/2:1H
[13981.073861] nvmet:    REQ_OP_VERIFY sect  4089 nr_sect     7 kworker/2:1
[13981.073924] null_blk: REQ_OP_READ   sect  4089 nr_sect     7 kworker/2:1H
[13981.074754] ----------------------------------------------------------
[13981.074766] TEST:     REQ_OP_VERIFY sect  4096 nr_sect  2048
[13981.074812] nvme:     REQ_OP_VERIFY sect  4096 nr_sect  2041 insmod
[13981.074859] nvme:     REQ_OP_VERIFY sect  6137 nr_sect     7 kworker/0:1H
[13981.075035] nvmet:    REQ_OP_VERIFY sect  4096 nr_sect  2041 kworker/0:2
[13981.075103] null_blk: REQ_OP_READ   sect  4096 nr_sect   255 kworker/0:1H
[13981.075141] null_blk: REQ_OP_READ   sect  4351 nr_sect   255 kworker/0:1H
[13981.075171] null_blk: REQ_OP_READ   sect  4606 nr_sect   255 kworker/0:1H
[13981.075201] null_blk: REQ_OP_READ   sect  4861 nr_sect   255 kworker/0:1H
[13981.075230] null_blk: REQ_OP_READ   sect  5116 nr_sect   255 kworker/0:1H
[13981.075259] null_blk: REQ_OP_READ   sect  5371 nr_sect   255 kworker/0:1H
[13981.075288] null_blk: REQ_OP_READ   sect  5626 nr_sect   255 kworker/0:1H
[13981.075324] null_blk: REQ_OP_READ   sect  5881 nr_sect   255 kworker/0:1H
[13981.075329] null_blk: REQ_OP_READ   sect  6136 nr_sect     1 kworker/0:1H
[13981.076247] nvmet:    REQ_OP_VERIFY sect  6137 nr_sect     7 kworker/0:2
[13981.076294] null_blk: REQ_OP_READ   sect  6137 nr_sect     7 kworker/0:1H
[13981.076640] ----------------------------------------------------------
[13981.076645] TEST:     REQ_OP_VERIFY sect  6144 nr_sect  2048
[13981.076677] nvme:     REQ_OP_VERIFY sect  6144 nr_sect  2041 insmod
[13981.076714] nvme:     REQ_OP_VERIFY sect  8185 nr_sect     7 kworker/0:1H
[13981.076823] nvmet:    REQ_OP_VERIFY sect  6144 nr_sect  2041 kworker/0:2
[13981.076883] null_blk: REQ_OP_READ   sect  6144 nr_sect   255 kworker/0:1H
[13981.076917] null_blk: REQ_OP_READ   sect  6399 nr_sect   255 kworker/0:1H
[13981.076944] null_blk: REQ_OP_READ   sect  6654 nr_sect   255 kworker/0:1H
[13981.076969] null_blk: REQ_OP_READ   sect  6909 nr_sect   255 kworker/0:1H
[13981.077025] null_blk: REQ_OP_READ   sect  7164 nr_sect   255 kworker/0:1H
[13981.077053] null_blk: REQ_OP_READ   sect  7419 nr_sect   255 kworker/0:1H
[13981.077085] null_blk: REQ_OP_READ   sect  7674 nr_sect   255 kworker/0:1H
[13981.077122] null_blk: REQ_OP_READ   sect  7929 nr_sect   255 kworker/0:1H
[13981.077127] null_blk: REQ_OP_READ   sect  8184 nr_sect     1 kworker/0:1H
[13981.077233] nvmet:    REQ_OP_VERIFY sect  8185 nr_sect     7 kworker/0:1
[13981.077293] null_blk: REQ_OP_READ   sect  8185 nr_sect     7 kworker/0:1H
[13981.078282] ----------------------------------------------------------
[13981.078286] TEST:     REQ_OP_VERIFY sect  8192 nr_sect  2048
[13981.078318] nvme:     REQ_OP_VERIFY sect  8192 nr_sect  2041 insmod
[13981.078376] nvme:     REQ_OP_VERIFY sect 10233 nr_sect     7 kworker/0:1H
[13981.078477] nvmet:    REQ_OP_VERIFY sect  8192 nr_sect  2041 kworker/0:2
[13981.078532] null_blk: REQ_OP_READ   sect  8192 nr_sect   255 kworker/0:1H
[13981.078564] null_blk: REQ_OP_READ   sect  8447 nr_sect   255 kworker/0:1H
[13981.078596] null_blk: REQ_OP_READ   sect  8702 nr_sect   255 kworker/0:1H
[13981.078622] null_blk: REQ_OP_READ   sect  8957 nr_sect   255 kworker/0:1H
[13981.078655] null_blk: REQ_OP_READ   sect  9212 nr_sect   255 kworker/0:1H
[13981.078682] null_blk: REQ_OP_READ   sect  9467 nr_sect   255 kworker/0:1H
[13981.078709] null_blk: REQ_OP_READ   sect  9722 nr_sect   255 kworker/0:1H
[13981.078741] null_blk: REQ_OP_READ   sect  9977 nr_sect   255 kworker/0:1H
[13981.078746] null_blk: REQ_OP_READ   sect 10232 nr_sect     1 kworker/0:1H
[13981.078903] nvmet:    REQ_OP_VERIFY sect 10233 nr_sect     7 kworker/0:1
[13981.078940] null_blk: REQ_OP_READ   sect 10233 nr_sect     7 kworker/0:1H
[13981.079561] ----------------------------------------------------------


----------------------------------------------------------------------------
3. NVMeOF with File Backend with Buffered IO :-
----------------------------------------------------------------------------
# ./file.sh 
++ FILE=/mnt/backend/nvme1n1
++ SS=fs
++ SSPATH=/sys/kernel/config/nvmet/subsystems/fs/
++ PORTS=/sys/kernel/config/nvmet/ports
++ main
++ load_modules
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ modprobe nvme-loop
++ sleep 3
++ mount_fs
++ make_nullb
++ local src=drivers/block/
+++ uname -r
++ local dest=/lib/modules/5.6.0-rc7lblk+/kernel/drivers/block
++ modprobe -r null_blk
++ makej M=drivers/block/
  MODPOST 11 modules
++ cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc7lblk+/kernel/drivers/block/
++ modprobe null_blk nr_devices=0
++ sleep 1
++ mkdir config/nullb/nullb0
++ tree config/nullb/nullb0
config/nullb/nullb0
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── mbps
├── memory_backed
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── verify
├── zoned
├── zone_nr_conv
└── zone_size

0 directories, 21 files
++ echo 1
++ echo 512
++ echo 20480
++ echo 1
++ sleep 2
+++ cat config/nullb/nullb0/index
++ IDX=0
++ lsblk
++ grep null0
++ sleep 1
++ mkfs.xfs -f /dev/nullb0
meta-data=/dev/nullb0            isize=512    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
++ mount /dev/nullb0 /mnt/backend/
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ dd if=/dev/zero of=/mnt/backend/nvme1n1 count=2621440 bs=4096
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 33.6448 s, 319 MB/s
++ file /mnt/backend/nvme1n1
/mnt/backend/nvme1n1: data
++ make_target
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   ├── features
│   └── nullb0
│       ├── badblocks
│       ├── blocking
│       ├── blocksize
│       ├── cache_size
│       ├── completion_nsec
│       ├── discard
│       ├── home_node
│       ├── hw_queue_depth
│       ├── index
│       ├── irqmode
│       ├── mbps
│       ├── memory_backed
│       ├── power
│       ├── queue_mode
│       ├── size
│       ├── submit_queues
│       ├── use_per_node_hctx
│       ├── verify
│       ├── zoned
│       ├── zone_nr_conv
│       └── zone_size
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

6 directories, 22 files
++ mkdir /sys/kernel/config/nvmet/subsystems/fs/
++ for i in 1
++ mkdir /sys/kernel/config/nvmet/subsystems/fs//namespaces/1
++ echo -n /mnt/backend/nvme1n1
++ echo 1
++ cat /sys/kernel/config/nvmet/subsystems/fs//namespaces/1/device_path
/mnt/backend/nvme1n1
++ cat /sys/kernel/config/nvmet/subsystems/fs//namespaces/1/buffered_io
1
++ echo 1
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/fs/ /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ connect
++ echo transport=loop,nqn=fs
++ sleep 1
#
# insmod host/test_verify.ko 
# dmesg -c 
[13469.310678] TEST:     REQ_OP_VERIFY sect     0 nr_sect  2048
[13469.310777] nvme:     REQ_OP_VERIFY sect     0 nr_sect  2048 kworker/5:1H
[13469.310978] nvmet:    REQ_OP_VERIFY sect     0 nr_sect  2048 kworker/5:0
[13469.313189] ----------------------------------------------------------
[13469.313193] TEST:     REQ_OP_VERIFY sect  2048 nr_sect  2048
[13469.313230] nvme:     REQ_OP_VERIFY sect  2048 nr_sect  2048 kworker/5:1H
[13469.313272] nvmet:    REQ_OP_VERIFY sect  2048 nr_sect  2048 kworker/5:0
[13469.313916] ----------------------------------------------------------
[13469.313920] TEST:     REQ_OP_VERIFY sect  4096 nr_sect  2048
[13469.313950] nvme:     REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/5:1H
[13469.313986] nvmet:    REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/5:0
[13469.314708] ----------------------------------------------------------
[13469.314713] TEST:     REQ_OP_VERIFY sect  6144 nr_sect  2048
[13469.314746] nvme:     REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/5:1H
[13469.314784] nvmet:    REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/5:0
[13469.315425] ----------------------------------------------------------
[13469.315429] TEST:     REQ_OP_VERIFY sect  8192 nr_sect  2048
[13469.315457] nvme:     REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/5:1H
[13469.315490] nvmet:    REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/5:0
[13469.316005] ----------------------------------------------------------

XFS Trace for buffered_io() :-

     kworker/5:0-19863 [005] ...1 13469.350152: xfs_file_buffered_read: dev 252:0 ino 0x43 size 0x280000000 offset 0x0 count 0x100000
     kworker/5:0-19863 [005] ...1 13469.351960: xfs_file_buffered_read: dev 252:0 ino 0x43 size 0x280000000 offset 0x100000 count 0x100000
     kworker/5:0-19863 [005] ...1 13469.352705: xfs_file_buffered_read: dev 252:0 ino 0x43 size 0x280000000 offset 0x200000 count 0x100000
     kworker/5:0-19863 [005] ...1 13469.353472: xfs_file_buffered_read: dev 252:0 ino 0x43 size 0x280000000 offset 0x300000 count 0x100000
     kworker/5:0-19863 [005] ...1 13469.354177: xfs_file_buffered_read: dev 252:0 ino 0x43 size 0x280000000 offset 0x400000 count 0x100000

----------------------------------------------------------------------------
4. NVMeOF with File Backend with DIRECT IO :-
----------------------------------------------------------------------------
# ./file.sh 
++ FILE=/mnt/backend/nvme1n1
++ SS=fs
++ SSPATH=/sys/kernel/config/nvmet/subsystems/fs/
++ PORTS=/sys/kernel/config/nvmet/ports
++ main
++ load_modules
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ modprobe nvme-loop
++ sleep 3
++ mount_fs
++ make_nullb
++ local src=drivers/block/
+++ uname -r
++ local dest=/lib/modules/5.6.0-rc7lblk+/kernel/drivers/block
++ modprobe -r null_blk
++ makej M=drivers/block/
  MODPOST 11 modules
++ cp drivers/block//null_blk.ko /lib/modules/5.6.0-rc7lblk+/kernel/drivers/block/
++ modprobe null_blk nr_devices=0
++ sleep 1
++ mkdir config/nullb/nullb0
++ tree config/nullb/nullb0
config/nullb/nullb0
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── mbps
├── memory_backed
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── verify
├── zoned
├── zone_nr_conv
└── zone_size

0 directories, 21 files
++ echo 1
++ echo 512
++ echo 20480
++ echo 1
++ sleep 2
+++ cat config/nullb/nullb0/index
++ IDX=0
++ lsblk
++ grep null0
++ sleep 1
++ mkfs.xfs -f /dev/nullb0
meta-data=/dev/nullb0            isize=512    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
++ mount /dev/nullb0 /mnt/backend/
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ dd if=/dev/zero of=/mnt/backend/nvme1n1 count=2621440 bs=4096
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 33.7626 s, 318 MB/s
++ file /mnt/backend/nvme1n1
/mnt/backend/nvme1n1: data
++ make_target
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   ├── features
│   └── nullb0
│       ├── badblocks
│       ├── blocking
│       ├── blocksize
│       ├── cache_size
│       ├── completion_nsec
│       ├── discard
│       ├── home_node
│       ├── hw_queue_depth
│       ├── index
│       ├── irqmode
│       ├── mbps
│       ├── memory_backed
│       ├── power
│       ├── queue_mode
│       ├── size
│       ├── submit_queues
│       ├── use_per_node_hctx
│       ├── verify
│       ├── zoned
│       ├── zone_nr_conv
│       └── zone_size
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

6 directories, 22 files
++ mkdir /sys/kernel/config/nvmet/subsystems/fs/
++ for i in 1
++ mkdir /sys/kernel/config/nvmet/subsystems/fs//namespaces/1
++ echo -n /mnt/backend/nvme1n1
++ echo 0
++ cat /sys/kernel/config/nvmet/subsystems/fs//namespaces/1/device_path
/mnt/backend/nvme1n1
++ cat /sys/kernel/config/nvmet/subsystems/fs//namespaces/1/buffered_io
0
++ echo 1
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/fs/ /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ connect
++ echo transport=loop,nqn=fs
++ sleep 1
# 
# insmod host/test_verify.ko 
# dmesg  -c 
# Since XFS does not have a verify interface, the verify gets emulated as
# read operations to the null_blk device on which XFS is formatted.
[13254.855110] TEST:     REQ_OP_VERIFY sect     0 nr_sect  2048
[13254.855183] nvme:     REQ_OP_VERIFY sect     0 nr_sect  2048 kworker/4:1H
[13254.855318] nvmet:    REQ_OP_VERIFY sect     0 nr_sect  2048 kworker/4:1
[13254.857858] null_blk: REQ_OP_READ   sect   128 nr_sect   255 kworker/4:1H
[13254.857901] null_blk: REQ_OP_READ   sect   383 nr_sect   255 kworker/4:1H
[13254.857965] null_blk: REQ_OP_READ   sect   638 nr_sect   255 kworker/4:1H
[13254.858031] null_blk: REQ_OP_READ   sect   893 nr_sect   255 kworker/4:1H
[13254.858069] null_blk: REQ_OP_READ   sect  1148 nr_sect   255 kworker/4:1H
[13254.858105] null_blk: REQ_OP_READ   sect  1403 nr_sect   255 kworker/4:1H
[13254.858132] null_blk: REQ_OP_READ   sect  1658 nr_sect   255 kworker/4:1H
[13254.858156] null_blk: REQ_OP_READ   sect  1913 nr_sect   255 kworker/4:1H
[13254.858194] null_blk: REQ_OP_READ   sect  2168 nr_sect     8 kworker/4:1H
[13254.859882] ----------------------------------------------------------
[13254.859885] TEST:     REQ_OP_VERIFY sect  2048 nr_sect  2048
[13254.859912] nvme:     REQ_OP_VERIFY sect  2048 nr_sect  2048 kworker/4:1H
[13254.859929] nvmet:    REQ_OP_VERIFY sect  2048 nr_sect  2048 kworker/4:1
[13254.862603] null_blk: REQ_OP_READ   sect  2176 nr_sect   255 kworker/4:1H
[13254.862647] null_blk: REQ_OP_READ   sect  2431 nr_sect   255 kworker/4:1H
[13254.862674] null_blk: REQ_OP_READ   sect  2686 nr_sect   255 kworker/4:1H
[13254.862699] null_blk: REQ_OP_READ   sect  2941 nr_sect   255 kworker/4:1H
[13254.862728] null_blk: REQ_OP_READ   sect  3196 nr_sect   255 kworker/4:1H
[13254.862754] null_blk: REQ_OP_READ   sect  3451 nr_sect   255 kworker/4:1H
[13254.862780] null_blk: REQ_OP_READ   sect  3706 nr_sect   255 kworker/4:1H
[13254.862813] null_blk: REQ_OP_READ   sect  3961 nr_sect   255 kworker/4:1H
[13254.862839] null_blk: REQ_OP_READ   sect  4216 nr_sect     8 kworker/4:1H
[13254.864087] ----------------------------------------------------------
[13254.864090] TEST:     REQ_OP_VERIFY sect  4096 nr_sect  2048
[13254.864115] nvme:     REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/4:1H
[13254.864131] nvmet:    REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/4:1
[13254.866647] null_blk: REQ_OP_READ   sect  4224 nr_sect   255 kworker/4:1H
[13254.866678] null_blk: REQ_OP_READ   sect  4479 nr_sect   255 kworker/4:1H
[13254.866703] null_blk: REQ_OP_READ   sect  4734 nr_sect   255 kworker/4:1H
[13254.866728] null_blk: REQ_OP_READ   sect  4989 nr_sect   255 kworker/4:1H
[13254.866754] null_blk: REQ_OP_READ   sect  5244 nr_sect   255 kworker/4:1H
[13254.866780] null_blk: REQ_OP_READ   sect  5499 nr_sect   255 kworker/4:1H
[13254.866806] null_blk: REQ_OP_READ   sect  5754 nr_sect   255 kworker/4:1H
[13254.866833] null_blk: REQ_OP_READ   sect  6009 nr_sect   255 kworker/4:1H
[13254.866861] null_blk: REQ_OP_READ   sect  6264 nr_sect     8 kworker/4:1H
[13254.868067] ----------------------------------------------------------
[13254.868071] TEST:     REQ_OP_VERIFY sect  6144 nr_sect  2048
[13254.868095] nvme:     REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/4:1H
[13254.868110] nvmet:    REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/4:1
[13254.870599] null_blk: REQ_OP_READ   sect  6272 nr_sect   255 kworker/4:1H
[13254.870631] null_blk: REQ_OP_READ   sect  6527 nr_sect   255 kworker/4:1H
[13254.870657] null_blk: REQ_OP_READ   sect  6782 nr_sect   255 kworker/4:1H
[13254.870681] null_blk: REQ_OP_READ   sect  7037 nr_sect   255 kworker/4:1H
[13254.870706] null_blk: REQ_OP_READ   sect  7292 nr_sect   255 kworker/4:1H
[13254.870732] null_blk: REQ_OP_READ   sect  7547 nr_sect   255 kworker/4:1H
[13254.870758] null_blk: REQ_OP_READ   sect  7802 nr_sect   255 kworker/4:1H
[13254.870782] null_blk: REQ_OP_READ   sect  8057 nr_sect   255 kworker/4:1H
[13254.870837] null_blk: REQ_OP_READ   sect  8312 nr_sect     8 kworker/4:1H
[13254.871956] ----------------------------------------------------------
[13254.871959] TEST:     REQ_OP_VERIFY sect  8192 nr_sect  2048
[13254.871982] nvme:     REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/4:1H
[13254.872045] nvmet:    REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/4:1
[13254.874532] null_blk: REQ_OP_READ   sect  8320 nr_sect   255 kworker/4:1H
[13254.874562] null_blk: REQ_OP_READ   sect  8575 nr_sect   255 kworker/4:1H
[13254.874588] null_blk: REQ_OP_READ   sect  8830 nr_sect   255 kworker/4:1H
[13254.874613] null_blk: REQ_OP_READ   sect  9085 nr_sect   255 kworker/4:1H
[13254.874638] null_blk: REQ_OP_READ   sect  9340 nr_sect   255 kworker/4:1H
[13254.874664] null_blk: REQ_OP_READ   sect  9595 nr_sect   255 kworker/4:1H
[13254.874689] null_blk: REQ_OP_READ   sect  9850 nr_sect   255 kworker/4:1H
[13254.874715] null_blk: REQ_OP_READ   sect 10105 nr_sect   255 kworker/4:1H
[13254.874741] null_blk: REQ_OP_READ   sect 10360 nr_sect     8 kworker/4:1H
[13254.875852] ----------------------------------------------------------

----------------------------------------------------------------------------
5. DM Linear testing:-
----------------------------------------------------------------------------
 # ./nullbtests-md.sh 
+ lvremove /dev/nullbvg/nullblv
  Volume group "nullbvg" not found
  Cannot process volume group nullbvg
+ vgremove nullbvg
  Volume group "nullbvg" not found
  Cannot process volume group nullbvg
+ pvremove /dev/nullb0p1 /dev/nullb1p1
  Device /dev/nullb0p1 not found.
  Device /dev/nullb1p1 not found.
+ rmdir config/nullb/nullb0
rmdir: failed to remove ‘config/nullb/nullb0’: No such file or directory
+ rmdir config/nullb/nullb1
rmdir: failed to remove ‘config/nullb/nullb1’: No such file or directory
+ modprobe -r null_blk
+ modprobe null_blk nr_devices=0 verify=1
+ declare -a arr
+ for i in 0 1
+ NULLB_DIR=config/nullb/nullb0
+ mkdir config/nullb/nullb0
+ tree config/nullb/nullb0
config/nullb/nullb0
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── max_sectors
├── mbps
├── memory_backed
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── verify
├── zone_capacity
├── zoned
├── zone_max_active
├── zone_max_open
├── zone_nr_conv
└── zone_size

0 directories, 25 files
+ sleep 1
+ echo 1
+ echo 512
+ echo 10
+ echo 1
+ echo 1
++ cat config/nullb/nullb0/index
+ IDX=0
+ sleep 1
+ sed -e 's/\s*\([\+0-9a-zA-Z]*\).*/\1/'
+ fdisk /dev/nullb0
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x2e1ae160.

Command (m for help): Building a new DOS disklabel with disk identifier 0x8e36dee5.

Command (m for help): Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): Partition number (1-4, default 1): First sector (2048-20479, default 2048): Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-20479, default 20479): Using default value 20479
Partition 1 of type Linux and of size 9 MiB is set

Command (m for help): Selected partition 1
Hex code (type L to list all codes): Changed type of partition 'Linux' to 'Linux LVM'

Command (m for help): 
Disk /dev/nullb0: 10 MB, 10485760 bytes, 20480 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x8e36dee5

       Device Boot      Start         End      Blocks   Id  System
/dev/nullb0p1            2048       20479        9216   8e  Linux LVM

Command (m for help): The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
+ sleep 1
+ partprobe
+ sleep 1
+ pvcreate /dev/nullb0p1
  Physical volume "/dev/nullb0p1" successfully created.
+ sleep 1
+ for i in 0 1
+ NULLB_DIR=config/nullb/nullb1
+ mkdir config/nullb/nullb1
+ tree config/nullb/nullb1
config/nullb/nullb1
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── max_sectors
├── mbps
├── memory_backed
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── verify
├── zone_capacity
├── zoned
├── zone_max_active
├── zone_max_open
├── zone_nr_conv
└── zone_size

0 directories, 25 files
+ sleep 1
+ echo 1
+ echo 512
+ echo 10
+ echo 1
+ echo 1
++ cat config/nullb/nullb1/index
+ IDX=1
+ sleep 1
+ sed -e 's/\s*\([\+0-9a-zA-Z]*\).*/\1/'
+ fdisk /dev/nullb1
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table
Building a new DOS disklabel with disk identifier 0x3bd84b2f.

Command (m for help): Building a new DOS disklabel with disk identifier 0x0a6925a3.

Command (m for help): Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): Partition number (1-4, default 1): First sector (2048-20479, default 2048): Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-20479, default 20479): Using default value 20479
Partition 1 of type Linux and of size 9 MiB is set

Command (m for help): Selected partition 1
Hex code (type L to list all codes): Changed type of partition 'Linux' to 'Linux LVM'

Command (m for help): 
Disk /dev/nullb1: 10 MB, 10485760 bytes, 20480 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0a6925a3

       Device Boot      Start         End      Blocks   Id  System
/dev/nullb1p1            2048       20479        9216   8e  Linux LVM

Command (m for help): The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
+ sleep 1
+ partprobe
+ sleep 1
+ pvcreate /dev/nullb1p1
  Physical volume "/dev/nullb1p1" successfully created.
+ sleep 1
+ vgcreate nullbvg /dev/nullb0p1 /dev/nullb1p1
  Volume group "nullbvg" successfully created
+ lvcreate -l 4 -n nullblv nullbvg
  Logical volume "nullblv" created.
+ sleep 1
+ dmesg -c
[  120.577643] null_blk: module loaded
[  123.061186]  nullb0: p1
[  125.213856]  nullb0: p1
[  130.441883]  nullb1: p1
[  132.628551]  nullb1: p1
[  132.636132]  nullb0: p1
[  134.858680] REQ_OP_VERIFY configured success for linear id 1
[  134.858686] REQ_OP_VERIFY configured success for linear id 2
[  134.859300] dm-5: detected capacity change from 32768 to 0
[  134.859333] REQ_OP_VERIFY configured success for linear id 1
[  134.859335] REQ_OP_VERIFY configured success for linear id 2
# ll  /dev/dm-5 
brw-------. 1 root root 253, 5 Feb  4 20:58 /dev/dm-5
# lsblk 
NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1             259:0    0    1G  0 disk 
nullb1              252:1    0   10M  0 disk 
└─nullb1p1          259:2    0    9M  0 part 
  └─nullbvg-nullblv 253:5    0   16M  0 lvm  
nullb0              252:0    0   10M  0 disk 
└─nullb0p1          259:1    0    9M  0 part 
  └─nullbvg-nullblv 253:5    0   16M  0 lvm  
# ls /dev/mapper/nullbvg-nullblv 
/dev/mapper/nullbvg-nullblv
# dmsetup info  /dev/dm-5  
Name:              nullbvg-nullblv
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 5
Number of targets: 2
UUID: LVM-mLiPnf25we1YKmZ6MMH3ruL7lThYYuuZRzOOYup42smajuTyV3T1ZAMhHZ7x4e6G

# insmod host/test.ko nblk=16 dev_path="/dev/dm-5"
# dmesg -c 

[  132.965714] TEST: REQ_OP_VERIFY sector 0 nr_sect 2048
[  132.965740] is_abnormal_io 1512
[  132.965763] dmrg: REQ_OP_VERIFY sect  2048 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.965831] null_blk: REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/61:1H
[  132.965936] ----------------------------------------------------------
[  132.965940] TEST: REQ_OP_VERIFY sector 2048 nr_sect 2048
[  132.965949] is_abnormal_io 1512
[  132.965957] dmrg: REQ_OP_VERIFY sect  4096 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.965987] null_blk: REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/61:1H
[  132.966058] ----------------------------------------------------------
[  132.966062] TEST: REQ_OP_VERIFY sector 4096 nr_sect 2048
[  132.966070] is_abnormal_io 1512
[  132.966078] dmrg: REQ_OP_VERIFY sect  6144 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966107] null_blk: REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/61:1H
[  132.966176] ----------------------------------------------------------
[  132.966180] TEST: REQ_OP_VERIFY sector 6144 nr_sect 2048
[  132.966188] is_abnormal_io 1512
[  132.966196] dmrg: REQ_OP_VERIFY sect  8192 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966224] null_blk: REQ_OP_VERIFY sect 10240 nr_sect  2048 kworker/61:1H
[  132.966293] ----------------------------------------------------------
[  132.966297] TEST: REQ_OP_VERIFY sector 8192 nr_sect 2048
[  132.966305] is_abnormal_io 1512
[  132.966313] dmrg: REQ_OP_VERIFY sect 10240 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966340] null_blk: REQ_OP_VERIFY sect 12288 nr_sect  2048 kworker/61:1H
[  132.966408] ----------------------------------------------------------
[  132.966412] TEST: REQ_OP_VERIFY sector 10240 nr_sect 2048
[  132.966420] is_abnormal_io 1512
[  132.966428] dmrg: REQ_OP_VERIFY sect 12288 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966454] null_blk: REQ_OP_VERIFY sect 14336 nr_sect  2048 kworker/61:1H
[  132.966574] ----------------------------------------------------------
[  132.966578] TEST: REQ_OP_VERIFY sector 12288 nr_sect 2048
[  132.966587] is_abnormal_io 1512
[  132.966595] dmrg: REQ_OP_VERIFY sect 14336 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966625] null_blk: REQ_OP_VERIFY sect 16384 nr_sect  2048 kworker/61:1H
[  132.966697] ----------------------------------------------------------
[  132.966701] TEST: REQ_OP_VERIFY sector 14336 nr_sect 2048
[  132.966709] is_abnormal_io 1512
[  132.966717] dmrg: REQ_OP_VERIFY sect 16384 nr_sect  2048 nullb0 insmod linear_map_bio
[  132.966744] null_blk: REQ_OP_VERIFY sect 18432 nr_sect  2048 kworker/61:1H
[  132.966813] ----------------------------------------------------------
[  132.966817] TEST: REQ_OP_VERIFY sector 16384 nr_sect 2048
[  132.966825] is_abnormal_io 1512
[  132.966834] dmrg: REQ_OP_VERIFY sect  2048 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.966869] null_blk: REQ_OP_VERIFY sect  4096 nr_sect  2048 kworker/61:1H
[  132.966938] ----------------------------------------------------------
[  132.966942] TEST: REQ_OP_VERIFY sector 18432 nr_sect 2048
[  132.966950] is_abnormal_io 1512
[  132.966958] dmrg: REQ_OP_VERIFY sect  4096 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.966985] null_blk: REQ_OP_VERIFY sect  6144 nr_sect  2048 kworker/61:1H
[  132.967053] ----------------------------------------------------------
[  132.967057] TEST: REQ_OP_VERIFY sector 20480 nr_sect 2048
[  132.967065] is_abnormal_io 1512
[  132.967073] dmrg: REQ_OP_VERIFY sect  6144 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967100] null_blk: REQ_OP_VERIFY sect  8192 nr_sect  2048 kworker/61:1H
[  132.967169] ----------------------------------------------------------
[  132.967174] TEST: REQ_OP_VERIFY sector 22528 nr_sect 2048
[  132.967181] is_abnormal_io 1512
[  132.967188] dmrg: REQ_OP_VERIFY sect  8192 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967206] null_blk: REQ_OP_VERIFY sect 10240 nr_sect  2048 kworker/61:1H
[  132.967251] ----------------------------------------------------------
[  132.967254] TEST: REQ_OP_VERIFY sector 24576 nr_sect 2048
[  132.967259] is_abnormal_io 1512
[  132.967264] dmrg: REQ_OP_VERIFY sect 10240 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967281] null_blk: REQ_OP_VERIFY sect 12288 nr_sect  2048 kworker/61:1H
[  132.967326] ----------------------------------------------------------
[  132.967329] TEST: REQ_OP_VERIFY sector 26624 nr_sect 2048
[  132.967335] is_abnormal_io 1512
[  132.967340] dmrg: REQ_OP_VERIFY sect 12288 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967360] null_blk: REQ_OP_VERIFY sect 14336 nr_sect  2048 kworker/61:1H
[  132.967404] ----------------------------------------------------------
[  132.967409] TEST: REQ_OP_VERIFY sector 28672 nr_sect 2048
[  132.967413] is_abnormal_io 1512
[  132.967417] dmrg: REQ_OP_VERIFY sect 14336 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967431] null_blk: REQ_OP_VERIFY sect 16384 nr_sect  2048 kworker/61:1H
[  132.967468] ----------------------------------------------------------
[  132.967470] TEST: REQ_OP_VERIFY sector 30720 nr_sect 2048
[  132.967477] is_abnormal_io 1512
[  132.967482] dmrg: REQ_OP_VERIFY sect 16384 nr_sect  2048 nullb1 insmod linear_map_bio
[  132.967496] null_blk: REQ_OP_VERIFY sect 18432 nr_sect  2048 kworker/61:1H
[  132.967567] ----------------------------------------------------------
-- 
2.22.1




* [RFC PATCH 1/8] block: add support for REQ_OP_VERIFY
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04 17:25   ` Darrick J. Wong
  2021-11-04  6:46 ` [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support Chaitanya Kulkarni
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

This adds a new block layer operation to offload verifying a range of
LBAs. This support is needed so that kernel components such as file
systems and fabrics targets can offload LBA verification when it is
supported by the hardware controller; the prominent example of such
hardware support is the NVMe Verify command. When hardware offloading
is not supported, APIs are provided to emulate the operation, which is
still useful when the block device is remotely attached, e.g. using
NVMeOF.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 Documentation/ABI/testing/sysfs-block |  14 ++
 block/blk-core.c                      |   5 +
 block/blk-lib.c                       | 192 ++++++++++++++++++++++++++
 block/blk-merge.c                     |  19 +++
 block/blk-settings.c                  |  17 +++
 block/blk-sysfs.c                     |   8 ++
 block/blk-zoned.c                     |   1 +
 block/bounce.c                        |   1 +
 block/ioctl.c                         |  35 +++++
 include/linux/bio.h                   |  10 +-
 include/linux/blk_types.h             |   2 +
 include/linux/blkdev.h                |  31 +++++
 include/uapi/linux/fs.h               |   1 +
 13 files changed, 332 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index e34cdeeeb9d4..ba97f7a9cbec 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -252,6 +252,20 @@ Description:
 		write_zeroes_max_bytes is 0, write zeroes is not supported
 		by the device.
 
+What:		/sys/block/<disk>/queue/verify_max_bytes
+Date:		Nov 2021
+Contact:	Chaitanya Kulkarni <kch@nvidia.com>
+Description:
+		Devices that support verify operation in which a single
+		request can be issued to verify the range of the contiguous
+		blocks on the storage without any payload in the request.
+		This can be used to offload verification of LBAs on the
+		device without reading the data back. verify_max_bytes
+		indicates how many bytes can be verified in a single verify
+		command. If verify_max_bytes is 0, verify operation is not
+		supported by the device.
+
+
 What:		/sys/block/<disk>/queue/zoned
 Date:		September 2016
 Contact:	Damien Le Moal <damien.lemoal@wdc.com>
diff --git a/block/blk-core.c b/block/blk-core.c
index 5e752840b41a..62160e729e7d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -141,6 +141,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
+	REQ_OP_NAME(VERIFY),
 	REQ_OP_NAME(SCSI_IN),
 	REQ_OP_NAME(SCSI_OUT),
 	REQ_OP_NAME(DRV_IN),
@@ -851,6 +852,10 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_VERIFY:
+		if (!q->limits.max_verify_sectors)
+			goto not_supported;
+		break;
 	case REQ_OP_ZONE_APPEND:
 		status = blk_check_zone_append(q, bio);
 		if (status != BLK_STS_OK)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 752f9c722062..fdbb765b369e 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -439,3 +439,195 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+/**
+ * __blkdev_emulate_verify - emulate number of verify operations
+ * 				asynchronously
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @biop:	pointer to anchor bio
+ * @buf:	data buffer to be mapped onto the bio
+ *
+ * Description:
+ *  Verify a block range by emulating REQ_OP_VERIFY; use this asynchronous
+ *  variant when H/W offloading is not supported. The caller is responsible
+ *  for handling the anchored bio.
+ */
+int __blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, char *buf)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	struct bio *bio = *biop;
+	unsigned int sz;
+	int bi_size;
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	while (nr_sects != 0) {
+		bio = blk_next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
+				gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio_set_op_attrs(bio, REQ_OP_READ, 0);
+
+		while (nr_sects != 0) {
+			bool is_vaddr = is_vmalloc_addr(buf);
+			struct page *p;
+
+			p = is_vaddr ? vmalloc_to_page(buf) : virt_to_page(buf);
+			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
+			bi_size = bio_add_page(bio, p, sz, offset_in_page(buf));
+			nr_sects -= bi_size >> 9;
+			sector += bi_size >> 9;
+			buf += bi_size;
+
+			if (bi_size < sz)
+				break;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+EXPORT_SYMBOL(__blkdev_emulate_verify);
+
+/**
+ * blkdev_emulate_verify - emulate number of verify operations synchronously
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Verify a block range by emulating REQ_OP_VERIFY; use this synchronous
+ *  variant when H/W offloading is not supported.
+ */
+int blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask)
+{
+	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
+	int ret = 0;
+	char *buf;
+
+	/* allows pages in buffer to be == BIO_MAX_VECS */
+	buf = kzalloc(min_io_sect << 9, GFP_KERNEL);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	while (nr_sects > 0) {
+		sector_t curr_sects = min_t(sector_t, nr_sects, min_io_sect);
+		struct bio *bio = NULL;
+
+		ret = __blkdev_emulate_verify(bdev, sector, curr_sects,
+				GFP_KERNEL, &bio, buf);
+
+		if (!(ret == 0 && bio))
+			break;
+
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+
+		nr_sects -= curr_sects;
+		sector += curr_sects;
+	}
+out:
+	kfree(buf);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_emulate_verify);
+
+/**
+ * __blkdev_issue_verify - generate number of verify operations
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @biop:	pointer to anchor bio
+ *
+ * Description:
+ *  Verify a block range using hardware offload.
+ *
+ * The function will emulate the verify operation if no explicit hardware
+ * offloading for verify is provided.
+ */
+int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned int max_verify_sectors;
+	struct bio *bio = *biop;
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	max_verify_sectors = bdev_verify_sectors(bdev);
+
+	if (max_verify_sectors == 0)
+		return blkdev_emulate_verify(bdev, sector, nr_sects, gfp_mask);
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = REQ_OP_VERIFY;
+		if (nr_sects > max_verify_sectors) {
+			bio->bi_iter.bi_size = max_verify_sectors << 9;
+			nr_sects -= max_verify_sectors;
+			sector += max_verify_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+EXPORT_SYMBOL(__blkdev_issue_verify);
+
+/**
+ * blkdev_issue_verify - verify a block range
+ * @bdev:	blockdev to verify
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Verify a block range using hardware offload.
+ */
+int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask)
+{
+	int ret = 0;
+	sector_t bs_mask;
+	struct bio *bio = NULL;
+	struct blk_plug plug;
+
+	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	blk_start_plug(&plug);
+	ret = __blkdev_issue_verify(bdev, sector, nr_sects, gfp_mask, &bio);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_verify);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ffb4aa0ea68b..c28632cb936b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -117,6 +117,20 @@ static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_verify_split(struct request_queue *q,
+		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+	*nsegs = 0;
+
+	if (!q->limits.max_verify_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_verify_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_verify_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
 					    struct bio *bio,
 					    struct bio_set *bs,
@@ -316,6 +330,10 @@ void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
 		break;
+	case REQ_OP_VERIFY:
+		split = blk_bio_verify_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	case REQ_OP_WRITE_SAME:
 		split = blk_bio_write_same_split(q, *bio, &q->bio_split,
 				nr_segs);
@@ -383,6 +401,7 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4c974340f1a9..f34cbd3678b6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_verify_sectors = 0;
 	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
@@ -84,6 +85,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_verify_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -227,6 +229,19 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_verify_sectors - set max sectors for a single verify
+ *
+ * @q:  the request queue for the device
+ * @max_verify_sectors: maximum number of sectors to verify per command
+ **/
+void blk_queue_max_verify_sectors(struct request_queue *q,
+		unsigned int max_verify_sectors)
+{
+	q->limits.max_verify_sectors = max_verify_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_verify_sectors);
+
 /**
  * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
  * @q:  the request queue for the device
@@ -514,6 +529,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_verify_sectors = min(t->max_verify_sectors,
+				    b->max_verify_sectors);
 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
 					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f1683af0..f918c83dd8d4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -108,6 +108,12 @@ queue_ra_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 
+static ssize_t queue_verify_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n",
+		(unsigned long long)q->limits.max_verify_sectors << 9);
+}
+
 static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 {
 	int max_sectors_kb = queue_max_sectors(q) >> 1;
@@ -584,6 +590,7 @@ QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
+QUEUE_RO_ENTRY(queue_verify_max, "verify_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
 
 QUEUE_RO_ENTRY(queue_zoned, "zoned");
@@ -638,6 +645,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_verify_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7a68b6e4300c..c9c51ee22a49 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -73,6 +73,7 @@ bool blk_req_needs_zone_write_lock(struct request *rq)
 
 	switch (req_op(rq)) {
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE:
 		return blk_rq_zone_is_seq(rq);
diff --git a/block/bounce.c b/block/bounce.c
index fc55314aa426..86cdb900b88f 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -259,6 +259,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		break;
 	case REQ_OP_WRITE_SAME:
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
diff --git a/block/ioctl.c b/block/ioctl.c
index d61d652078f4..5e1b3c4660bf 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -168,6 +168,39 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 			BLKDEV_ZERO_NOUNMAP);
 }
 
+static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
+{
+	uint64_t range[2];
+	struct address_space *mapping;
+	uint64_t start, end, len;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+		return -EFAULT;
+
+	start = range[0];
+	len = range[1];
+	end = start + len - 1;
+
+	if (start & 511)
+		return -EINVAL;
+	if (len & 511)
+		return -EINVAL;
+	if (end >= (uint64_t)i_size_read(bdev->bd_inode))
+		return -EINVAL;
+	if (end < start)
+		return -EINVAL;
+
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
+}
+
 static int put_ushort(unsigned short __user *argp, unsigned short val)
 {
 	return put_user(val, argp);
@@ -460,6 +493,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 				BLKDEV_DISCARD_SECURE);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
+	case BLKVERIFY:
+		return blk_ioctl_verify(bdev, mode, arg);
 	case BLKREPORTZONE:
 		return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
 	case BLKRESETZONE:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c74857cf1252..d660c37b7d6c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,7 +63,8 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_VERIFY)
 		return true;
 
 	return false;
@@ -73,8 +74,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
-	       bio_op(bio) == REQ_OP_WRITE_SAME ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_op(bio) == REQ_OP_VERIFY;
 }
 
 static inline bool bio_mergeable(struct bio *bio)
@@ -198,7 +199,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same/write zeroes, because they
+	 * We special case discard/write same/write zeroes/verify, because they
 	 * interpret bi_size differently:
 	 */
 
@@ -206,6 +207,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1bc6f6a01070..8877711c4c56 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -366,6 +366,8 @@ enum req_opf {
 	REQ_OP_SECURE_ERASE	= 5,
 	/* write the same sector many times */
 	REQ_OP_WRITE_SAME	= 7,
+	/* verify the sectors */
+	REQ_OP_VERIFY		= 8,
 	/* write the zero filled sector many times */
 	REQ_OP_WRITE_ZEROES	= 9,
 	/* Open a zone */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0dea268bd61b..99c41d90584b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -334,6 +334,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_verify_sectors;
 	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
@@ -621,6 +622,7 @@ struct request_queue {
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
+#define QUEUE_FLAG_VERIFY	30	/* supports Verify */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP) |		\
@@ -667,6 +669,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_verify(q)	test_bit(QUEUE_FLAG_VERIFY, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
@@ -814,6 +817,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_VERIFY)
+		return false;
+
 	if (req_op(rq) == REQ_OP_ZONE_APPEND)
 		return false;
 
@@ -1072,6 +1078,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
 		return q->limits.max_write_zeroes_sectors;
 
+	if (unlikely(op == REQ_OP_VERIFY))
+		return q->limits.max_verify_sectors;
+
 	return q->limits.max_sectors;
 }
 
@@ -1154,6 +1163,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_verify_sectors(struct request_queue *q,
+		unsigned int max_verify_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
@@ -1348,6 +1359,16 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		unsigned flags);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
+extern int __blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
+		char *buf);
+extern int blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
+extern int __blkdev_issue_verify(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop);
+extern int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
 
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
@@ -1553,6 +1574,16 @@ static inline unsigned int bdev_write_same(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_verify_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_verify_sectors;
+
+	return 0;
+}
+
 static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f44eb0a04afd..5eda16bd2c3d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -184,6 +184,7 @@ struct fsxattr {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+#define BLKVERIFY _IO(0x12,128)
 /*
  * A jump here: 130-131 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.22.1
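
For illustration, here is a minimal userspace sketch that exercises the
new BLKVERIFY ioctl added above. It is not part of the patch and only
assumes the uapi headers of a kernel with this series applied; if
<linux/fs.h> does not yet define BLKVERIFY, the fallback below matches
the value introduced in this patch.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef BLKVERIFY
#define BLKVERIFY _IO(0x12, 128)	/* as defined by this series */
#endif

int main(int argc, char **argv)
{
	/* {start, length} in bytes, both 512-byte aligned */
	uint64_t range[2] = { 0, 1 << 20 };
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	/* blk_ioctl_verify() requires FMODE_WRITE, so open read-write */
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = ioctl(fd, BLKVERIFY, range);
	if (ret < 0)
		perror("BLKVERIFY");

	close(fd);
	return ret ? 1 : 0;
}

The maximum bytes a single verify command may cover is exported in
/sys/block/<disk>/queue/verify_max_bytes (also added above); larger
ranges are split or chunked by the block layer, and the whole range is
emulated when the limit is 0.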


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 1/8] " Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04 12:33   ` Damien Le Moal
  2021-11-04  6:46 ` [RFC PATCH 3/8] nvme: add support for the Verify command Chaitanya Kulkarni
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/scsi/sd.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/sd.h |  1 +
 2 files changed, 53 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a3d2d4bc4a3d..7f2c4eb98cf8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -106,6 +106,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
 
 static void sd_config_discard(struct scsi_disk *, unsigned int);
 static void sd_config_write_same(struct scsi_disk *);
+static void sd_config_verify(struct scsi_disk *sdkp);
 static int  sd_revalidate_disk(struct gendisk *);
 static void sd_unlock_native_capacity(struct gendisk *disk);
 static int  sd_probe(struct device *);
@@ -995,6 +996,41 @@ static blk_status_t sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
 	return sd_setup_write_same10_cmnd(cmd, false);
 }
 
+static void sd_config_verify(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+
+	/* XXX: use same pattern as sd_config_write_same(). */
+	blk_queue_max_verify_sectors(q, UINT_MAX >> 9);
+}
+
+static blk_status_t sd_setup_verify_cmnd(struct scsi_cmnd *cmd)
+{
+	struct request *rq = cmd->request;
+	struct scsi_device *sdp = cmd->device;
+	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
+	u64 lba = sectors_to_logical(sdp, blk_rq_pos(rq));
+	u32 nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
+
+	if (!sdkp->verify_16)
+		return BLK_STS_NOTSUPP;
+
+	cmd->cmd_len = 16;
+	cmd->cmnd[0] = VERIFY_16;
+	/* leave vrprotect / dpo / bytchk clear */
+	cmd->cmnd[1] = 0;
+	put_unaligned_be64(lba, &cmd->cmnd[2]);
+	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
+	cmd->cmnd[14] = 0;
+	cmd->cmnd[15] = 0;
+
+	cmd->allowed = SD_MAX_RETRIES;
+	cmd->sc_data_direction = DMA_NONE;
+	cmd->transfersize = 0;
+
+	return BLK_STS_OK;
+}
+
 static void sd_config_write_same(struct scsi_disk *sdkp)
 {
 	struct request_queue *q = sdkp->disk->queue;
@@ -1345,6 +1381,8 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
 		}
 	case REQ_OP_WRITE_ZEROES:
 		return sd_setup_write_zeroes_cmnd(cmd);
+	case REQ_OP_VERIFY:
+		return sd_setup_verify_cmnd(cmd);
 	case REQ_OP_WRITE_SAME:
 		return sd_setup_write_same_cmnd(cmd);
 	case REQ_OP_FLUSH:
@@ -2029,6 +2067,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
 	switch (req_op(req)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
@@ -3096,6 +3135,17 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 		sdkp->ws10 = 1;
 }
 
+static void sd_read_verify(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+	struct scsi_device *sdev = sdkp->device;
+
+	sd_printk(KERN_INFO, sdkp, "checking for VERIFY(16) support\n");
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, VERIFY_16) == 1) {
+		sd_printk(KERN_INFO, sdkp, "VERIFY(16) is supported\n");
+		sdkp->verify_16 = 1;
+	}
+}
+
 static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
 {
 	struct scsi_device *sdev = sdkp->device;
@@ -3224,6 +3274,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_read_cache_type(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
+		sd_read_verify(sdkp, buffer);
 		sd_read_security(sdkp, buffer);
 	}
 
@@ -3265,6 +3316,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 
 	set_capacity_and_notify(disk, logical_to_sectors(sdp, sdkp->capacity));
 	sd_config_write_same(sdkp);
+	sd_config_verify(sdkp);
 	kfree(buffer);
 
 	/*
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index b59136c4125b..94a86bf6dac4 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -120,6 +120,7 @@ struct scsi_disk {
 	unsigned	lbpvpd : 1;
 	unsigned	ws10 : 1;
 	unsigned	ws16 : 1;
+	unsigned        verify_16 : 1;
 	unsigned	rc_basis: 2;
 	unsigned	zoned: 2;
 	unsigned	urswrz : 1;
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/8] nvme: add support for the Verify command
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 1/8] " Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04 22:44   ` Keith Busch
  2021-11-04  6:46 ` [PATCH 4/8] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Allow verify operations (REQ_OP_VERIFY) on the block device if the
controller reports support for the optional Verify command in ONCS.
Add support for setting up the Verify command, and derive the maximum
number of sectors per Verify command from the maximum hardware sectors
supported by the controller.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/core.c | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/nvme.h     | 19 +++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 546a10407385..250647c3bb7b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -801,6 +801,19 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
+		struct request *req, struct nvme_command *cmnd)
+{
+	cmnd->verify.opcode = nvme_cmd_verify;
+	cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->verify.slba =
+		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
+	cmnd->verify.length =
+		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+	cmnd->verify.control = 0;
+	return BLK_STS_OK;
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -904,6 +917,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 	case REQ_OP_WRITE_ZEROES:
 		ret = nvme_setup_write_zeroes(ns, req, cmd);
 		break;
+	case REQ_OP_VERIFY:
+		ret = nvme_setup_verify(ns, req, cmd);
+		break;
 	case REQ_OP_DISCARD:
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
@@ -1974,6 +1990,28 @@ static void nvme_config_write_zeroes(struct gendisk *disk, struct nvme_ns *ns)
 					   nvme_lba_to_sect(ns, max_blocks));
 }
 
+static void nvme_config_verify(struct gendisk *disk, struct nvme_ns *ns)
+{
+	u64 max_blocks;
+
+	if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_VERIFY))
+		return;
+
+	if (ns->ctrl->max_hw_sectors == UINT_MAX)
+		max_blocks = (u64)USHRT_MAX + 1;
+	else
+		max_blocks = ns->ctrl->max_hw_sectors + 1;
+
+	/* keep same as discard */
+	if (blk_queue_flag_test_and_set(QUEUE_FLAG_VERIFY, disk->queue))
+		return;
+
+	blk_queue_max_verify_sectors(disk->queue,
+				     nvme_lba_to_sect(ns, max_blocks));
+
+}
+
+
 static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
 {
 	return !uuid_is_null(&ids->uuid) ||
@@ -2144,6 +2182,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 
 	nvme_config_discard(disk, ns);
 	nvme_config_write_zeroes(disk, ns);
+	nvme_config_verify(disk, ns);
 
 	set_disk_ro(disk, (id->nsattr & NVME_NS_ATTR_RO) ||
 		test_bit(NVME_NS_FORCE_RO, &ns->flags));
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b08787cd0881..14925602726a 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -318,6 +318,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_UNCORRECTABLE	= 1 << 1,
 	NVME_CTRL_ONCS_DSM			= 1 << 2,
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
+	NVME_CTRL_ONCS_VERIFY			= 1 << 7,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
@@ -890,6 +891,23 @@ struct nvme_write_zeroes_cmd {
 	__le16			appmask;
 };
 
+struct nvme_verify_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			rsvd3;
+	__le32			reftag;
+	__le16			eapptag;
+	__le16			eappmask;
+};
+
 enum nvme_zone_mgmt_action {
 	NVME_ZONE_CLOSE		= 0x1,
 	NVME_ZONE_FINISH	= 0x2,
@@ -1411,6 +1429,7 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_write_zeroes_cmd write_zeroes;
+		struct nvme_verify_cmd verify;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
 		struct nvme_abort_cmd abort;
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/8] nvmet: add Verify command support for bdev-ns
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (2 preceding siblings ...)
  2021-11-04  6:46 ` [RFC PATCH 3/8] nvme: add support for the Verify command Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 5/8] nvmet: add Verify emulation " Chaitanya Kulkarni
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Add support for handling the Verify command on the NVMeOF block device
namespace. Call into __blkdev_issue_verify(), which the block layer
turns into one or more REQ_OP_VERIFY bios covering the requested LBAs.
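
As a worked example of the unit conversion done in
nvmet_bdev_execute_verify(): the Verify command's NLB field is 0's
based, so for a namespace with 4096-byte blocks (blksize_shift = 12)
an slba of 8 and a length of 7 become sector = 8 << (12 - 9) = 64 and
nr_sector = (7 + 1) << (12 - 9) = 64, i.e. 64 512-byte sectors (32k)
passed to __blkdev_issue_verify().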

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/admin-cmd.c   |  3 ++-
 drivers/nvme/target/io-cmd-bdev.c | 39 +++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 613a4d8feac1..87cad64895e6 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -408,7 +408,8 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(ctrl->subsys->max_nsid);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
+			NVME_CTRL_ONCS_WRITE_ZEROES |
+			NVME_CTRL_ONCS_VERIFY);
 
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index ec45e597084b..5a888cdadfea 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -128,6 +128,7 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
 		switch (req->cmd->common.opcode) {
 		case nvme_cmd_dsm:
 		case nvme_cmd_write_zeroes:
+		case nvme_cmd_verify:
 			status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR;
 			break;
 		default:
@@ -153,6 +154,10 @@ static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
 		req->error_slba =
 			le64_to_cpu(req->cmd->write_zeroes.slba);
 		break;
+	case nvme_cmd_verify:
+		req->error_slba =
+			le64_to_cpu(req->cmd->verify.slba);
+		break;
 	default:
 		req->error_slba = 0;
 	}
@@ -428,6 +433,37 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+{
+	struct nvme_verify_cmd *verify = &req->cmd->verify;
+	struct bio *bio = NULL;
+	sector_t nr_sector;
+	sector_t sector;
+	int ret;
+
+	if (!nvmet_check_transfer_len(req, 0))
+		return;
+
+	if (!bdev_verify_sectors(req->ns->bdev)) {
+		nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR);
+		return;
+	}
+
+	sector = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
+	nr_sector = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			(req->ns->blksize_shift - 9));
+
+	ret = __blkdev_issue_verify(req->ns->bdev, sector, nr_sector,
+			GFP_KERNEL, &bio);
+	if (bio) {
+		bio->bi_private = req;
+		bio->bi_end_io = nvmet_bio_done;
+		submit_bio(bio);
+	} else {
+		nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+	}
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	struct nvme_command *cmd = req->cmd;
@@ -448,6 +484,9 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_verify:
+		req->execute = nvmet_bdev_execute_verify;
+		return 0;
 	default:
 		pr_err("unhandled cmd %d on qid %d\n", cmd->common.opcode,
 		       req->sq->qid);
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 5/8] nvmet: add Verify emulation support for bdev-ns
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (3 preceding siblings ...)
  2021-11-04  6:46 ` [PATCH 4/8] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 6/8] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Not all devices support verify requests that can be mapped to a
controller-specific command. This patch adds a way to emulate
REQ_OP_VERIFY for the NVMeOF block device namespace. We add a new
workqueue to offload the emulation, which is carried out with the
help of blkdev_emulate_verify().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/core.c        | 12 +++++++-
 drivers/nvme/target/io-cmd-bdev.c | 51 ++++++++++++++++++++++++++-----
 drivers/nvme/target/nvmet.h       |  3 ++
 3 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 8ce4d59cc9e7..8a17a6479073 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -16,6 +16,7 @@
 #include "nvmet.h"
 
 struct workqueue_struct *buffered_io_wq;
+struct workqueue_struct *verify_wq;
 static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX];
 static DEFINE_IDA(cntlid_ida);
 
@@ -1546,11 +1547,17 @@ static int __init nvmet_init(void)
 
 	nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1;
 
+	verify_wq = alloc_workqueue("nvmet-verify-wq", WQ_MEM_RECLAIM, 0);
+	if (!verify_wq) {
+		error = -ENOMEM;
+		goto out;
+	}
+
 	buffered_io_wq = alloc_workqueue("nvmet-buffered-io-wq",
 			WQ_MEM_RECLAIM, 0);
 	if (!buffered_io_wq) {
 		error = -ENOMEM;
-		goto out;
+		goto out_free_verify_work_queue;
 	}
 
 	error = nvmet_init_discovery();
@@ -1566,6 +1573,8 @@ static int __init nvmet_init(void)
 	nvmet_exit_discovery();
 out_free_work_queue:
 	destroy_workqueue(buffered_io_wq);
+out_free_verify_work_queue:
+	destroy_workqueue(verify_wq);
 out:
 	return error;
 }
@@ -1576,6 +1585,7 @@ static void __exit nvmet_exit(void)
 	nvmet_exit_discovery();
 	ida_destroy(&cntlid_ida);
 	destroy_workqueue(buffered_io_wq);
+	destroy_workqueue(verify_wq);
 
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024);
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024);
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 5a888cdadfea..80b8e7bfd1ae 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -433,25 +433,60 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
-static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+static void __nvmet_req_to_verify_sectors(struct nvmet_req *req,
+		sector_t *sects, sector_t *nr_sects)
 {
 	struct nvme_verify_cmd *verify = &req->cmd->verify;
+
+	*sects = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
+	*nr_sects = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			(req->ns->blksize_shift - 9));
+}
+
+static void nvmet_bdev_emulate_verify_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, b.work);
+	sector_t nr_sector;
+	sector_t sector;
+	int ret = 0;
+
+	__nvmet_req_to_verify_sectors(req, &sector, &nr_sector);
+	if (!nr_sector)
+		goto out;
+
+	ret = blkdev_emulate_verify(req->ns->bdev, sector, nr_sector,
+			GFP_KERNEL);
+out:
+	nvmet_req_complete(req,
+			   blk_to_nvme_status(req, errno_to_blk_status(ret)));
+}
+
+static void nvmet_bdev_submit_emulate_verify(struct nvmet_req *req)
+{
+	INIT_WORK(&req->b.work, nvmet_bdev_emulate_verify_work);
+	queue_work(verify_wq, &req->b.work);
+}
+
+
+static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+{
 	struct bio *bio = NULL;
 	sector_t nr_sector;
 	sector_t sector;
-	int ret;
+	int ret = 0;
 
 	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
+	/* offload emulation */
 	if (!bdev_verify_sectors(req->ns->bdev)) {
-		nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR);
+		nvmet_bdev_submit_emulate_verify(req);
 		return;
 	}
 
-	sector = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
-	nr_sector = (((sector_t)le16_to_cpu(verify->length) + 1) <<
-			(req->ns->blksize_shift - 9));
+	__nvmet_req_to_verify_sectors(req, &sector, &nr_sector);
+	if (!nr_sector)
+		goto out;
 
 	ret = __blkdev_issue_verify(req->ns->bdev, sector, nr_sector,
 			GFP_KERNEL, &bio);
@@ -459,9 +494,9 @@ static void nvmet_bdev_execute_verify(struct nvmet_req *req)
 		bio->bi_private = req;
 		bio->bi_end_io = nvmet_bio_done;
 		submit_bio(bio);
-	} else {
-		nvmet_req_complete(req, errno_to_nvme_status(req, ret));
 	}
+out:
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
 }
 
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 8776dd1a0490..7f3f584b1e7b 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -323,6 +323,8 @@ struct nvmet_req {
 	union {
 		struct {
 			struct bio      inline_bio;
+			/* XXX: should we take work out of union ? */
+			struct work_struct      work;
 		} b;
 		struct {
 			bool			mpool_alloc;
@@ -355,6 +357,7 @@ struct nvmet_req {
 };
 
 extern struct workqueue_struct *buffered_io_wq;
+extern struct workqueue_struct *verify_wq;
 
 static inline void nvmet_set_result(struct nvmet_req *req, u32 result)
 {
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 6/8] nvmet: add verify emulation support for file-ns
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (4 preceding siblings ...)
  2021-11-04  6:46 ` [RFC PATCH 5/8] nvmet: add Verify emulation " Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 7/8] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

For now there is no way to map the verify operation to a VFS layer API.
This patch emulates the verify operation by offloading it to a
workqueue and reading back the data using VFS layer APIs, for both
buffered I/O and direct I/O backed namespaces.
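
A note on the direct I/O path below: the read is issued with
call_read_iter() on a locally built kiocb whose ki_complete is left
NULL, so is_sync_kiocb() is true and the read completes synchronously
in the verify_wq worker context; no completion callback is needed.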

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/io-cmd-file.c | 151 ++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)

diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 0abbefd9925e..2b0291c4164c 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -12,6 +12,7 @@
 
 #define NVMET_MAX_MPOOL_BVEC		16
 #define NVMET_MIN_MPOOL_OBJ		16
+#define NVMET_VERIFY_BUF_LEN		(BIO_MAX_PAGES << PAGE_SHIFT)
 
 int nvmet_file_ns_revalidate(struct nvmet_ns *ns)
 {
@@ -381,6 +382,153 @@ static void nvmet_file_execute_write_zeroes(struct nvmet_req *req)
 	schedule_work(&req->f.work);
 }
 
+static void __nvmet_req_to_verify_offset(struct nvmet_req *req, loff_t *offset,
+		ssize_t *len)
+{
+	struct nvme_verify_cmd *verify = &req->cmd->verify;
+
+	*offset = le64_to_cpu(verify->slba) << req->ns->blksize_shift;
+	*len = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			req->ns->blksize_shift);
+}
+
+static int do_buffered_io_emulate_verify(struct file *f, loff_t offset,
+		ssize_t len)
+{
+	char *buf = NULL;
+	int ret = 0;
+	ssize_t rc;
+
+	buf = kmalloc(NVMET_VERIFY_BUF_LEN, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	while (len > 0) {
+		ssize_t curr_len = min_t(ssize_t, len, NVMET_VERIFY_BUF_LEN);
+
+		rc = kernel_read(f, buf, curr_len, &offset);
+		if (rc != curr_len) {
+			pr_err("kernel_read %lu curr_len %lu\n", rc, curr_len);
+			ret = -EINVAL;
+			break;
+		}
+
+		len -= curr_len;
+		offset += curr_len;
+		cond_resched();
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+static int do_direct_io_emulate_verify(struct file *f, loff_t offset,
+		ssize_t len)
+{
+	struct scatterlist *sgl = NULL;
+	struct bio_vec *bvec = NULL;
+	struct iov_iter iter = { 0 };
+	struct kiocb iocb = { 0 };
+	unsigned int sgl_nents;
+	ssize_t ret = 0;
+	int i;
+
+	while (len > 0) {
+		ssize_t curr_len = min_t(ssize_t, len, NVMET_VERIFY_BUF_LEN);
+		struct scatterlist *sg = NULL;
+		unsigned int bv_len = 0;
+		ssize_t rc;
+
+		sgl = sgl_alloc(curr_len, GFP_KERNEL, &sgl_nents);
+		if (!sgl) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bvec = kmalloc_array(sgl_nents, sizeof(struct bio_vec),
+				GFP_KERNEL);
+		if (!bvec) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		for_each_sg(sgl, sg, sgl_nents, i) {
+			nvmet_file_init_bvec(&bvec[i], sg);
+			bv_len += sg->length;
+		}
+
+		if (bv_len != curr_len) {
+			pr_err("length mismatch sgl & bvec\n");
+			ret = -EINVAL;
+			break;
+		}
+
+		iocb.ki_pos = offset;
+		iocb.ki_filp = f;
+		iocb.ki_complete = NULL; /* Sync I/O */
+		iocb.ki_flags |= IOCB_DIRECT;
+
+		iov_iter_bvec(&iter, READ, bvec, sgl_nents, bv_len);
+
+		rc = call_read_iter(f, &iocb, &iter);
+		if (rc != curr_len) {
+			pr_err("read len mismatch expected %lu got %ld\n",
+					curr_len, rc);
+			ret = -EINVAL;
+			break;
+		}
+
+		cond_resched();
+
+		len -= curr_len;
+		offset += curr_len;
+
+		kfree(bvec);
+		sgl_free(sgl);
+		bvec = NULL;
+		sgl = NULL;
+		memset(&iocb, 0, sizeof(iocb));
+		memset(&iter, 0, sizeof(iter));
+	}
+
+	kfree(bvec);
+	sgl_free(sgl);
+	return ret;
+}
+
+static void nvmet_file_emulate_verify_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	loff_t offset;
+	ssize_t len;
+	int ret = 0;
+
+	__nvmet_req_to_verify_offset(req, &offset, &len);
+	if (!len)
+		goto out;
+
+	if (unlikely(offset + len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	if (req->ns->buffered_io)
+		ret = do_buffered_io_emulate_verify(req->ns->file, offset, len);
+	else
+		ret = do_direct_io_emulate_verify(req->ns->file, offset, len);
+out:
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
+static void nvmet_file_execute_verify(struct nvmet_req *req)
+{
+	if (!nvmet_check_data_len_lte(req, 0))
+		return;
+
+	INIT_WORK(&req->f.work, nvmet_file_emulate_verify_work);
+	queue_work(verify_wq, &req->f.work);
+}
+
 u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 {
 	struct nvme_command *cmd = req->cmd;
@@ -399,6 +547,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_verify:
+		req->execute = nvmet_file_execute_verify;
+		return 0;
 	default:
 		pr_err("unhandled cmd for file ns %d on qid %d\n",
 				cmd->common.opcode, req->sq->qid);
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 7/8] null_blk: add REQ_OP_VERIFY support
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (5 preceding siblings ...)
  2021-11-04  6:46 ` [RFC PATCH 6/8] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-04  6:46 ` [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Add a new module parameter and configfs attribute to configure handling
of REQ_OP_VERIFY. This is needed to test the newly added REQ_OP_VERIFY
block layer operation.
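
A typical test flow (a sketch, not part of this patch) is to load the
driver with the new parameter enabled, e.g. "modprobe null_blk verify=1"
(or set the per-device "verify" configfs attribute), after which
/dev/nullbX advertises QUEUE_FLAG_VERIFY and a non-zero
verify_max_bytes, and REQ_OP_VERIFY can then be driven through the
BLKVERIFY ioctl added earlier in the series.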

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/block/null_blk/main.c     | 25 ++++++++++++++++++++++++-
 drivers/block/null_blk/null_blk.h |  1 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index d6c821d48090..36a5f5343cee 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -216,6 +216,10 @@ static unsigned int g_zone_nr_conv;
 module_param_named(zone_nr_conv, g_zone_nr_conv, uint, 0444);
 MODULE_PARM_DESC(zone_nr_conv, "Number of conventional zones when block device is zoned. Default: 0");
 
+static bool g_verify;
+module_param_named(verify, g_verify, bool, 0444);
+MODULE_PARM_DESC(verify, "Allow REQ_OP_VERIFY processing. Default: false");
+
 static unsigned int g_zone_max_open;
 module_param_named(zone_max_open, g_zone_max_open, uint, 0444);
 MODULE_PARM_DESC(zone_max_open, "Maximum number of open zones when block device is zoned. Default: 0 (no limit)");
@@ -358,6 +362,7 @@ NULLB_DEVICE_ATTR(blocking, bool, NULL);
 NULLB_DEVICE_ATTR(use_per_node_hctx, bool, NULL);
 NULLB_DEVICE_ATTR(memory_backed, bool, NULL);
 NULLB_DEVICE_ATTR(discard, bool, NULL);
+NULLB_DEVICE_ATTR(verify, bool, NULL);
 NULLB_DEVICE_ATTR(mbps, uint, NULL);
 NULLB_DEVICE_ATTR(cache_size, ulong, NULL);
 NULLB_DEVICE_ATTR(zoned, bool, NULL);
@@ -477,6 +482,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_power,
 	&nullb_device_attr_memory_backed,
 	&nullb_device_attr_discard,
+	&nullb_device_attr_verify,
 	&nullb_device_attr_mbps,
 	&nullb_device_attr_cache_size,
 	&nullb_device_attr_badblocks,
@@ -539,7 +545,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
 	return snprintf(page, PAGE_SIZE,
-			"memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size,zone_capacity,zone_nr_conv,zone_max_open,zone_max_active,blocksize,max_sectors\n");
+			"memory_backed,discard,verify,bandwidth,cache,badblocks,zoned,zone_size,zone_capacity,zone_nr_conv,zone_max_open,zone_max_active,blocksize,max_sectors\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -601,6 +607,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->use_per_node_hctx = g_use_per_node_hctx;
 	dev->zoned = g_zoned;
 	dev->zone_size = g_zone_size;
+	dev->verify = g_verify;
 	dev->zone_capacity = g_zone_capacity;
 	dev->zone_nr_conv = g_zone_nr_conv;
 	dev->zone_max_open = g_zone_max_open;
@@ -1165,6 +1172,9 @@ static int null_handle_rq(struct nullb_cmd *cmd)
 	struct req_iterator iter;
 	struct bio_vec bvec;
 
+	if (req_op(rq) == REQ_OP_VERIFY)
+		return 0;
+
 	spin_lock_irq(&nullb->lock);
 	rq_for_each_segment(bvec, rq, iter) {
 		len = bvec.bv_len;
@@ -1192,6 +1202,9 @@ static int null_handle_bio(struct nullb_cmd *cmd)
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 
+	if (bio_op(bio) == REQ_OP_VERIFY)
+		return 0;
+
 	spin_lock_irq(&nullb->lock);
 	bio_for_each_segment(bvec, bio, iter) {
 		len = bvec.bv_len;
@@ -1609,6 +1622,15 @@ static void null_config_discard(struct nullb *nullb)
 	blk_queue_flag_set(QUEUE_FLAG_DISCARD, nullb->q);
 }
 
+static void null_config_verify(struct nullb *nullb)
+{
+	if (nullb->dev->verify == false)
+		return;
+
+	blk_queue_max_verify_sectors(nullb->q, UINT_MAX >> 9);
+	blk_queue_flag_set(QUEUE_FLAG_VERIFY, nullb->q);
+}
+
 static const struct block_device_operations null_bio_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= null_submit_bio,
@@ -1881,6 +1903,7 @@ static int null_add_dev(struct nullb_device *dev)
 	blk_queue_max_hw_sectors(nullb->q, dev->max_sectors);
 
 	null_config_discard(nullb);
+	null_config_verify(nullb);
 
 	sprintf(nullb->disk_name, "nullb%d", nullb->index);
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 83504f3cc9d6..e6913c099e71 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -95,6 +95,7 @@ struct nullb_device {
 	bool power; /* power on/off the device */
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
+	bool verify; /* if support verify */
 	bool zoned; /* if device is zoned */
 };
 
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (6 preceding siblings ...)
  2021-11-04  6:46 ` [RFC PATCH 7/8] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
@ 2021-11-04  6:46 ` Chaitanya Kulkarni
  2021-11-11  8:13   ` Chaitanya Kulkarni
  2021-11-04  7:14 ` [RFC PATCH 0/8] block: " Christoph Hellwig
  2021-11-04 15:16 ` Douglas Gilbert
  9 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  6:46 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

From: Chaitanya Kulkarni <kch@nvidia.com>

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/md/dm-core.h          |  1 +
 drivers/md/dm-io.c            |  8 ++++++--
 drivers/md/dm-linear.c        | 11 ++++++++++-
 drivers/md/dm-mpath.c         |  1 +
 drivers/md/dm-rq.c            |  3 +++
 drivers/md/dm-stripe.c        |  1 +
 drivers/md/dm-table.c         | 36 +++++++++++++++++++++++++++++++++++
 drivers/md/dm.c               | 31 ++++++++++++++++++++++++++++++
 drivers/md/md-linear.c        | 10 ++++++++++
 drivers/md/md-multipath.c     |  1 +
 drivers/md/md.h               |  7 +++++++
 drivers/md/raid10.c           |  1 +
 drivers/md/raid5.c            |  1 +
 include/linux/device-mapper.h |  6 ++++++
 14 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 086d293c2b03..8a07ac9165ec 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -114,6 +114,7 @@ struct mapped_device {
 void disable_discard(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
+void disable_verify(struct mapped_device *md);
 
 static inline sector_t dm_get_size(struct mapped_device *md)
 {
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 4312007d2d34..da09d092e2c1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -317,8 +317,11 @@ static void do_region(int op, int op_flags, unsigned region,
 		special_cmd_max_sectors = q->limits.max_write_zeroes_sectors;
 	else if (op == REQ_OP_WRITE_SAME)
 		special_cmd_max_sectors = q->limits.max_write_same_sectors;
+	else if (op == REQ_OP_VERIFY)
+		special_cmd_max_sectors = q->limits.max_verify_sectors;
 	if ((op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES ||
-	     op == REQ_OP_WRITE_SAME) && special_cmd_max_sectors == 0) {
+	     op == REQ_OP_VERIFY || op == REQ_OP_WRITE_SAME) &&
+	     special_cmd_max_sectors == 0) {
 		atomic_inc(&io->count);
 		dec_count(io, region, BLK_STS_NOTSUPP);
 		return;
@@ -335,6 +338,7 @@ static void do_region(int op, int op_flags, unsigned region,
 		switch (op) {
 		case REQ_OP_DISCARD:
 		case REQ_OP_WRITE_ZEROES:
+		case REQ_OP_VERIFY:
 			num_bvecs = 0;
 			break;
 		case REQ_OP_WRITE_SAME:
@@ -352,7 +356,7 @@ static void do_region(int op, int op_flags, unsigned region,
 		bio_set_op_attrs(bio, op, op_flags);
 		store_io_and_region_in_bio(bio, io, region);
 
-		if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES) {
+		if (op == REQ_OP_DISCARD || op == REQ_OP_WRITE_ZEROES || op == REQ_OP_VERIFY) {
 			num_sectors = min_t(sector_t, special_cmd_max_sectors, remaining);
 			bio->bi_iter.bi_size = num_sectors << SECTOR_SHIFT;
 			remaining -= num_sectors;
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 00774b5d7668..802c9cb917ae 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_secure_erase_bios = 1;
 	ti->num_write_same_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->num_verify_bios = 1;
 	ti->private = lc;
 	return 0;
 
@@ -90,9 +91,17 @@ static void linear_map_bio(struct dm_target *ti, struct bio *bio)
 	struct linear_c *lc = ti->private;
 
 	bio_set_dev(bio, lc->dev->bdev);
-	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio)))
+	if (bio_sectors(bio) || op_is_zone_mgmt(bio_op(bio))) {
 		bio->bi_iter.bi_sector =
 			linear_map_sector(ti, bio->bi_iter.bi_sector);
+		if (bio_op(bio) == REQ_OP_VERIFY)
+			printk(KERN_INFO"dmrg:     REQ_OP_VERIFY sector %10llu nr_sectors "
+					"%10u %s %s\n",
+					bio->bi_iter.bi_sector, bio->bi_iter.bi_size >> 9,
+					bio->bi_bdev->bd_disk->disk_name,
+					current->comm);
+
+	}
 }
 
 static int linear_map(struct dm_target *ti, struct bio *bio)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..d6eb0d287032 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1255,6 +1255,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
 	ti->num_write_zeroes_bios = 1;
+	ti->num_verify_bios = 1;
 	if (m->queue_mode == DM_TYPE_BIO_BASED)
 		ti->per_io_data_size = multipath_per_bio_data_size();
 	else
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 13b4385f4d5a..eaf19f8c9fca 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -224,6 +224,9 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
 		else if (req_op(clone) == REQ_OP_WRITE_ZEROES &&
 			 !clone->q->limits.max_write_zeroes_sectors)
 			disable_write_zeroes(tio->md);
+		else if (req_op(clone) == REQ_OP_VERIFY &&
+			 !clone->q->limits.max_verify_sectors)
+			disable_verify(tio->md);
 	}
 
 	switch (r) {
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index df359d33cda8..199ee57290a2 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -159,6 +159,7 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->num_secure_erase_bios = stripes;
 	ti->num_write_same_bios = stripes;
 	ti->num_write_zeroes_bios = stripes;
+	ti->num_verify_bios = stripes;
 
 	sc->chunk_size = chunk_size;
 	if (chunk_size & (chunk_size - 1))
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4acf2342f7ad..6a55c4c3b77a 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1709,6 +1709,36 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 	return true;
 }
 
+static int device_not_verify_capable(struct dm_target *ti, struct dm_dev *dev,
+					   sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return q && !q->limits.max_verify_sectors;
+}
+
+static bool dm_table_supports_verify(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned i = 0;
+
+	while (i < dm_table_get_num_targets(t)) {
+		ti = dm_table_get_target(t, i++);
+
+		if (!ti->num_verify_bios)
+			return false;
+
+		if (!ti->type->iterate_devices ||
+		    ti->type->iterate_devices(ti, device_not_verify_capable, NULL))
+			return false;
+
+		printk(KERN_INFO"REQ_OP_VERIFY configured success for %s id %d\n",
+				ti->type->name, i);
+	}
+
+	return true;
+}
+
 static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
 				      sector_t start, sector_t len, void *data)
 {
@@ -1830,6 +1860,12 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (dm_table_supports_secure_erase(t))
 		blk_queue_flag_set(QUEUE_FLAG_SECERASE, q);
 
+	if (!dm_table_supports_verify(t)) {
+		blk_queue_flag_clear(QUEUE_FLAG_VERIFY, q);
+		q->limits.max_verify_sectors = 0;
+	} else
+		blk_queue_flag_set(QUEUE_FLAG_VERIFY, q);
+
 	if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) {
 		wc = true;
 		if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_FUA)))
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 479ec5bea09e..f70e387ce020 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -969,6 +969,15 @@ void disable_write_zeroes(struct mapped_device *md)
 	limits->max_write_zeroes_sectors = 0;
 }
 
+void disable_verify(struct mapped_device *md)
+{
+	struct queue_limits *limits = dm_get_queue_limits(md);
+
+	/* device doesn't really support VERIFY, disable it */
+	limits->max_verify_sectors = 0;
+	blk_queue_flag_clear(QUEUE_FLAG_VERIFY, md->queue);
+}
+
 static void clone_endio(struct bio *bio)
 {
 	blk_status_t error = bio->bi_status;
@@ -989,6 +998,9 @@ static void clone_endio(struct bio *bio)
 		else if (bio_op(bio) == REQ_OP_WRITE_ZEROES &&
 			 !q->limits.max_write_zeroes_sectors)
 			disable_write_zeroes(md);
+		else if (bio_op(bio) == REQ_OP_VERIFY &&
+			 !q->limits.max_verify_sectors)
+			disable_verify(md);
 	}
 
 	/*
@@ -1455,6 +1467,12 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	return 0;
 }
 
+static unsigned get_num_verify_bios(struct dm_target *ti)
+{
+	printk(KERN_INFO"%s %d\n", __func__, __LINE__);
+	return ti->num_verify_bios;
+}
+
 static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
 				       unsigned num_bios)
 {
@@ -1480,15 +1498,25 @@ static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *
 	return 0;
 }
 
+static int __send_verify(struct clone_info *ci, struct dm_target *ti)
+{
+	printk(KERN_INFO"%s %d\n", __func__, __LINE__);
+	return __send_changing_extent_only(ci, ti, get_num_verify_bios(ti));
+}
+
 static bool is_abnormal_io(struct bio *bio)
 {
 	bool r = false;
 
+	if (bio_op(bio) == REQ_OP_VERIFY)
+		printk(KERN_INFO"%s %d\n", __func__, __LINE__);
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_SAME:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		r = true;
 		break;
 	}
@@ -1515,6 +1543,9 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
 	case REQ_OP_WRITE_ZEROES:
 		num_bios = ti->num_write_zeroes_bios;
 		break;
+	case REQ_OP_VERIFY:
+		num_bios = ti->num_verify_bios;
+		break;
 	default:
 		return false;
 	}
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 63ed8329a98d..0d8355658f8f 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -65,6 +65,7 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
 	struct md_rdev *rdev;
 	int i, cnt;
 	bool discard_supported = false;
+	bool verify_supported = false;
 
 	conf = kzalloc(struct_size(conf, disks, raid_disks), GFP_KERNEL);
 	if (!conf)
@@ -99,6 +100,8 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
 
 		if (blk_queue_discard(bdev_get_queue(rdev->bdev)))
 			discard_supported = true;
+		if (blk_queue_verify(bdev_get_queue(rdev->bdev)))
+			verify_supported = true;
 	}
 	if (cnt != raid_disks) {
 		pr_warn("md/linear:%s: not enough drives present. Aborting!\n",
@@ -111,6 +114,12 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
 	else
 		blk_queue_flag_set(QUEUE_FLAG_DISCARD, mddev->queue);
 
+	if (!verify_supported)
+		blk_queue_flag_clear(QUEUE_FLAG_VERIFY, mddev->queue);
+	else
+		blk_queue_flag_set(QUEUE_FLAG_VERIFY, mddev->queue);
+
+
 	/*
 	 * Here we calculate the device offsets.
 	 */
@@ -261,6 +270,7 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
 					      bio_sector);
 		mddev_check_writesame(mddev, bio);
 		mddev_check_write_zeroes(mddev, bio);
+		mddev_check_verify(mddev, bio);
 		submit_bio_noacct(bio);
 	}
 	return true;
diff --git a/drivers/md/md-multipath.c b/drivers/md/md-multipath.c
index 776bbe542db5..2856fc80a8a1 100644
--- a/drivers/md/md-multipath.c
+++ b/drivers/md/md-multipath.c
@@ -131,6 +131,7 @@ static bool multipath_make_request(struct mddev *mddev, struct bio * bio)
 	mp_bh->bio.bi_private = mp_bh;
 	mddev_check_writesame(mddev, &mp_bh->bio);
 	mddev_check_write_zeroes(mddev, &mp_bh->bio);
+	mddev_check_verify(mddev, &mp_bh->bio);
 	submit_bio_noacct(&mp_bh->bio);
 	return true;
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bcbba1b5ec4a..f40b5a5bc862 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -802,6 +802,13 @@ static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio
 		mddev->queue->limits.max_write_zeroes_sectors = 0;
 }
 
+static inline void mddev_check_verify(struct mddev *mddev, struct bio *bio)
+{
+	if (bio_op(bio) == REQ_OP_VERIFY &&
+	    !bio->bi_bdev->bd_disk->queue->limits.max_verify_sectors)
+		mddev->queue->limits.max_verify_sectors = 0;
+}
+
 struct mdu_array_info_s;
 struct mdu_disk_info_s;
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e1eefbec15d4..2ba1214bec2e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3756,6 +3756,7 @@ static int raid10_run(struct mddev *mddev)
 					      mddev->chunk_sectors);
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+		blk_queue_max_verify_sectors(mddev->queue, 0);
 		blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9);
 		raid10_set_io_opt(conf);
 	}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a348b2adf2a9..d723dfa2a3cb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7718,6 +7718,7 @@ static int raid5_run(struct mddev *mddev)
 
 		blk_queue_max_write_same_sectors(mddev->queue, 0);
 		blk_queue_max_write_zeroes_sectors(mddev->queue, 0);
+		blk_queue_max_verify_sectors(mddev->queue, 0);
 
 		rdev_for_each(rdev, mddev) {
 			disk_stack_limits(mddev->gendisk, rdev->bdev,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 61a66fb8ebb3..761228e234d9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -302,6 +302,12 @@ struct dm_target {
 	 */
 	unsigned num_write_zeroes_bios;
 
+	/*
+	 * The number of VERIFY bios that will be submitted to the target.
+	 * The bio number can be accessed with dm_bio_get_target_bio_nr.
+	 */
+	unsigned num_verify_bios;
+
 	/*
 	 * The minimum number of extra bytes allocated in each io for the
 	 * target to use.
-- 
2.22.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (7 preceding siblings ...)
  2021-11-04  6:46 ` [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY Chaitanya Kulkarni
@ 2021-11-04  7:14 ` Christoph Hellwig
  2021-11-04  9:27   ` Chaitanya Kulkarni
  2021-11-04 15:16 ` Douglas Gilbert
  9 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2021-11-04  7:14 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, snitzer, song,
	djwong, kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, osandov,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang, Chaitanya Kulkarni

What is the actual use case here?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04  7:14 ` [RFC PATCH 0/8] block: " Christoph Hellwig
@ 2021-11-04  9:27   ` Chaitanya Kulkarni
  2021-11-04 17:32     ` Darrick J. Wong
  0 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-04  9:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, snitzer, song,
	djwong, kbusch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, osandov,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang, Chaitanya Kulkarni

On 11/4/2021 12:14 AM, Christoph Hellwig wrote:
> External email: Use caution opening links or attachments
> 
> 
> What is the actual use case here?
> 

One of the immediate use cases is to use this interface with the XFS 
scrubbing infrastructure [1] (by replacing the SCSI-specific calls, e.g. 
sg_io(), with BLKVERIFY ioctl() calls that map to REQ_OP_VERIFY) and 
eventually allow other file systems to use it for scrubbing as well.

[1] man xfs_scrub :-
-x     Read all file data extents to look for disk errors.
               xfs_scrub will issue O_DIRECT reads to the block device
               directly.  If the block device is a SCSI disk, it will
               instead issue READ VERIFY commands directly to the disk.
               If media errors are found, the error report will include
               the disk offset, in bytes.  If the media errors affect a
               file, the report will also include the inode number and
               file offset, in bytes.  These actions will confirm that
               all file data blocks can be read from storage.
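
To make that concrete, the userspace side in xfs_scrub could be as simple
as the sketch below (untested; it only assumes the BLKVERIFY interface from
patch 1/8, which takes a {start, length} pair in bytes, both 512-byte
aligned, and returns only a status with no data transfer; the wrapper name
is just a placeholder, nothing in xfs_scrub today):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKVERIFY, as added in patch 1/8 */

/* offload media verification of [start, start + len) to the device */
static int verify_range(int bdev_fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };	/* bytes, 512-byte aligned */

	return ioctl(bdev_fd, BLKVERIFY, range);
}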



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support
  2021-11-04  6:46 ` [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support Chaitanya Kulkarni
@ 2021-11-04 12:33   ` Damien Le Moal
  2021-11-11  8:07     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 26+ messages in thread
From: Damien Le Moal @ 2021-11-04 12:33 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-scsi, target-devel, linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

On 2021/11/04 15:46, Chaitanya Kulkarni wrote:
> From: Chaitanya Kulkarni <kch@nvidia.com>
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  drivers/scsi/sd.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/scsi/sd.h |  1 +
>  2 files changed, 53 insertions(+)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index a3d2d4bc4a3d..7f2c4eb98cf8 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -106,6 +106,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
>  
>  static void sd_config_discard(struct scsi_disk *, unsigned int);
>  static void sd_config_write_same(struct scsi_disk *);
> +static void sd_config_verify(struct scsi_disk *sdkp);
>  static int  sd_revalidate_disk(struct gendisk *);
>  static void sd_unlock_native_capacity(struct gendisk *disk);
>  static int  sd_probe(struct device *);
> @@ -995,6 +996,41 @@ static blk_status_t sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
>  	return sd_setup_write_same10_cmnd(cmd, false);
>  }
>  
> +static void sd_config_verify(struct scsi_disk *sdkp)
> +{
> +	struct request_queue *q = sdkp->disk->queue;
> +
> +	/* XXX: use same pattern as sd_config_write_same(). */
> +	blk_queue_max_verify_sectors(q, UINT_MAX >> 9);

VERIFY 10, 12, 16 and 32 commands are optional and may not be implemented by a
device. So setting this unconditionally is wrong.
At the very least you must have an "if (sdkp->verify_16)" here, and call
"blk_queue_max_verify_sectors(q, 0);" if the device does not support verify.

> +}
> +
> +static blk_status_t sd_setup_verify_cmnd(struct scsi_cmnd *cmd)
> +{
> +       struct request *rq = cmd->request;
> +       struct scsi_device *sdp = cmd->device;
> +       struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
> +       u64 lba = sectors_to_logical(sdp, blk_rq_pos(rq));
> +       u32 nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
> +
> +       if (!sdkp->verify_16)
> +	       return BLK_STS_NOTSUPP;

I think this should be "return BLK_STS_TARGET;"

> +
> +       cmd->cmd_len = 16;
> +       cmd->cmnd[0] = VERIFY_16;

And what if the device supports VERIFY 10 or 12 but not VERIFY 16 ?

> +       /* skip veprotect / dpo / bytchk */
> +       cmd->cmnd[1] = 0;
> +       put_unaligned_be64(lba, &cmd->cmnd[2]);
> +       put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
> +       cmd->cmnd[14] = 0;
> +       cmd->cmnd[15] = 0;
> +
> +       cmd->allowed = SD_MAX_RETRIES;
> +       cmd->sc_data_direction = DMA_NONE;
> +       cmd->transfersize = 0;
> +
> +       return BLK_STS_OK;
> +}
> +
>  static void sd_config_write_same(struct scsi_disk *sdkp)
>  {
>  	struct request_queue *q = sdkp->disk->queue;
> @@ -1345,6 +1381,8 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
>  		}
>  	case REQ_OP_WRITE_ZEROES:
>  		return sd_setup_write_zeroes_cmnd(cmd);
> +	case REQ_OP_VERIFY:
> +		return sd_setup_verify_cmnd(cmd);
>  	case REQ_OP_WRITE_SAME:
>  		return sd_setup_write_same_cmnd(cmd);
>  	case REQ_OP_FLUSH:
> @@ -2029,6 +2067,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
>  	switch (req_op(req)) {
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_VERIFY:
>  	case REQ_OP_WRITE_SAME:
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_RESET_ALL:
> @@ -3096,6 +3135,17 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
>  		sdkp->ws10 = 1;
>  }
>  
> +static void sd_read_verify(struct scsi_disk *sdkp, unsigned char *buffer)
> +{
> +       struct scsi_device *sdev = sdkp->device;
> +
> +       sd_printk(KERN_INFO, sdkp, "VERIFY16 check.\n");

Remove this message please.

> +       if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, VERIFY_16) == 1) {
> +	       sd_printk(KERN_INFO, sdkp, " VERIFY16 in ON .\n");

And this one too.

> +               sdkp->verify_16 = 1;

Why not check for VERIFY 10 and 12 if VERIFY 16 is not supported?
Also, why don't you call "blk_queue_max_verify_sectors(q, UINT_MAX >> 9);" here
instead of adding the not-so-useful sd_config_verify() helper?
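
Something along these lines is what I have in mind (completely untested,
just to illustrate the shape, and relying on the blk_queue_max_verify_sectors()
helper added earlier in this series):

static void sd_read_verify(struct scsi_disk *sdkp, unsigned char *buffer)
{
	struct scsi_device *sdev = sdkp->device;
	struct request_queue *q = sdkp->disk->queue;

	/* probe VERIFY(16); the 10/12/32 variants could be checked here too */
	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, VERIFY_16) == 1)
		sdkp->verify_16 = 1;

	if (sdkp->verify_16)
		blk_queue_max_verify_sectors(q, UINT_MAX >> 9);
	else
		blk_queue_max_verify_sectors(q, 0);
}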

> +       }
> +}
> +
>  static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
>  {
>  	struct scsi_device *sdev = sdkp->device;
> @@ -3224,6 +3274,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  		sd_read_cache_type(sdkp, buffer);
>  		sd_read_app_tag_own(sdkp, buffer);
>  		sd_read_write_same(sdkp, buffer);
> +		sd_read_verify(sdkp, buffer);
>  		sd_read_security(sdkp, buffer);
>  	}
>  
> @@ -3265,6 +3316,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  
>  	set_capacity_and_notify(disk, logical_to_sectors(sdp, sdkp->capacity));
>  	sd_config_write_same(sdkp);
> +	sd_config_verify(sdkp);
>  	kfree(buffer);
>  
>  	/*
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index b59136c4125b..94a86bf6dac4 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -120,6 +120,7 @@ struct scsi_disk {
>  	unsigned	lbpvpd : 1;
>  	unsigned	ws10 : 1;
>  	unsigned	ws16 : 1;
> +	unsigned        verify_16 : 1;

See right above this line how write same supports the 10 and 16 variants. I
think you need the same here. And very likely, you also need the 32 version in
case the device has DIF/DIX (type 2 protection).

>  	unsigned	rc_basis: 2;
>  	unsigned	zoned: 2;
>  	unsigned	urswrz : 1;
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (8 preceding siblings ...)
  2021-11-04  7:14 ` [RFC PATCH 0/8] block: " Christoph Hellwig
@ 2021-11-04 15:16 ` Douglas Gilbert
  2021-11-11  8:27   ` Chaitanya Kulkarni
  9 siblings, 1 reply; 26+ messages in thread
From: Douglas Gilbert @ 2021-11-04 15:16 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-scsi, target-devel, linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

On 2021-11-04 2:46 a.m., Chaitanya Kulkarni wrote:
> From: Chaitanya Kulkarni <kch@nvidia.com>
> 
> Hi,
> 
> One of the responsibilities of the Operating System, along with managing
> resources, is to provide a unified interface to the user by creating
> hardware abstractions. In the Linux Kernel storage stack that
> abstraction is created by implementing the generic request operations
> such as REQ_OP_READ/REQ_OP_WRITE or REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES,
> etc that are mapped to the specific low-level hardware protocol commands
> e.g. SCSI or NVMe.
> 
> With that in mind, this RFC patch-series implements a new block layer
> operation to offload the data verification on to the controller if
> supported or emulate the operation if not. The main advantage is to free
> up the CPU and reduce the host link traffic since, for some devices,
> their internal bandwidth is higher than the host link and offloading this
> operation can improve the performance of the proactive error detection
> applications such as file system level scrubbing.
> 
> * Background *
> -----------------------------------------------------------------------
> 
> NVMe Specification provides a controller level Verify command [1] which
> is similar to the ATA Verify [2] command where the controller is
> responsible for data verification without transferring the data to the
> host. (Offloading LBAs verification). This is designed to proactively
> discover any data corruption issues when the device is free so that
> applications can protect sensitive data and take corrective action
> instead of waiting for failure to occur.
> 
> The NVMe Verify command is added in order to provide low level media
> scrubbing and possibly moving the data to the right place in case it has
> correctable media degradation. Also, this provides a way to enhance
> file-system level scrubbing/checksum verification and optinally offload
> this task, which is CPU intensive, to the kernel (when emulated), over
> the fabric, and to the controller (when supported).
> 
> This is useful when the controller's internal bandwidth is higher than
> the host's bandwith showing a sharp increase in the performance due to
> _no host traffic or host CPU involvement_.
> 
> * Implementation *
> -----------------------------------------------------------------------
> 
> Right now there is no generic interface which can be used by the
> in-kernel components such as file-system or userspace application
> (except passthru commands or some combination of write/read/compare) to
> issue verify command with the central block layer API. This can lead to
> each userspace applications having protocol specific IOCTL which
> defeates the purpose of having the OS provide a H/W abstraction.
> 
> This patch series introduces a new block layer payloadless request
> operation REQ_OP_VERIFY that allows in-kernel components & userspace
> applications to verify the range of the LBAs by offloading checksum
> scrubbing/verification to the controller that is directly attached to
> the host. For direct attached devices this leads to decrease in the host
> DMA traffic and CPU usage and for the fabrics attached device over the
> network that leads to a decrease in the network traffic and CPU usage
> for both host & target.
> 
> * Scope *
> -----------------------------------------------------------------------
> 
> Please note this only covers the operating system level overhead.
> Analyzing controller verify command performance for common protocols
> (SCSI/NVMe) is out of scope for REQ_OP_VERIFY.
> 
> * Micro Benchmarks *
> -----------------------------------------------------------------------
> 
> When verifying 500GB of data on NVMeOF with nvme-loop and null_blk as a
> target backend block device, results show almost an 80% performance
> increase :-
> 
> With Verify resulting in REQ_OP_VERIFY to null_blk :-
> 
> real	2m3.773s
> user	0m0.000s
> sys	0m59.553s
> 
> With Emulation resulting in REQ_OP_READ null_blk :-
> 
> real	12m18.964s
> user	0m0.002s
> sys	1m15.666s
> 
> 
> A detailed test log is included at the end of the cover letter.
> Each of the following was tested:
> 
> 1. Direct Attached REQ_OP_VERIFY.
> 2. Fabrics Attached REQ_OP_VERIFY.
> 3. Multi-device (md) REQ_OP_VERIFY.
> 
> * The complete picture *
> -----------------------------------------------------------------------
> 
>    For the completeness the whole kernel stack support is divided into
>    two phases :-
>   
>    Phase I :-
>   
>     Add and stabilize the support for the Block layer & low level drivers
>     such as SCSI, NVMe, MD, and NVMeOF, implement necessary emulations in
>     the block layer if needed and provide block level tools such as
>     _blkverify_. Also, add appropriate testcases for code-coverage.
> 
>    Phase II :-
>   
>     Add and stabilize the support for upper layer kernel components such
>     as file-systems and provide userspace tools such _fsverify_ to route
>     the request from file systems to block layer to Low level device
>     drivers.
> 
> 
> Please note that the interfaces for the blk-lib.c REQ_OP_VERIFY emulation
> will change in the future; I put them together only for the scope of this RFC.
> 
> Any comments are welcome.

Hi,
You may also want to consider higher level support for the NVME COMPARE
and SCSI VERIFY(BYTCHK=1) commands. Since PCIe and SAS transports are
full duplex, replacing two READs (plus a memcmp in host memory) with
one READ and one COMPARE may be a win on a bandwidth constrained
system. It is a safe to assume the data-in transfers on a storage transport
exceed (probably by a significant margin) the data-out transfers. An
offloaded COMPARE switches one of those data-in transfers to a data-out
transfer, so it should improve the bandwidth utilization.

I did some brief benchmarking on an NVMe SSD's COMPARE command (it's optional)
and the results were underwhelming. OTOH using my own dd variants (which
can do compare instead of copy) and a scsi_debug target (i.e. RAM) I have
seen compare times of > 15 GBps while a copy rarely exceeds 9 GBps.


BTW The SCSI VERIFY(BYTCHK=3) command compares one block sent from
the host with a sequence of logical blocks on the media. So, for example,
it would be a quick way of checking that a sequence of blocks contained
zeroed data.

Doug Gilbert

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/8] block: add support for REQ_OP_VERIFY
  2021-11-04  6:46 ` [RFC PATCH 1/8] " Chaitanya Kulkarni
@ 2021-11-04 17:25   ` Darrick J. Wong
  2021-11-11  8:01     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-11-04 17:25 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, snitzer, song,
	kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, osandov,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang, Chaitanya Kulkarni

On Wed, Nov 03, 2021 at 11:46:27PM -0700, Chaitanya Kulkarni wrote:
> From: Chaitanya Kulkarni <kch@nvidia.com>
> 
> This adds a new block layer operation to offload verifying a range of
> LBAs. This support is needed in order to provide file systems and
> fabrics, kernel components to offload LBA verification when it is
> supported by the hardware controller. In case hardware offloading is
> not supported then we provide APIs to emulate the same. The prominent
> example of that is NVMe Verify command. We also provide an emulation of
> the same operation which can be used in case H/W does not support
> verify. This is still useful when block device is remotely attached e.g.
> using NVMeOF.
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-block |  14 ++
>  block/blk-core.c                      |   5 +
>  block/blk-lib.c                       | 192 ++++++++++++++++++++++++++
>  block/blk-merge.c                     |  19 +++
>  block/blk-settings.c                  |  17 +++
>  block/blk-sysfs.c                     |   8 ++
>  block/blk-zoned.c                     |   1 +
>  block/bounce.c                        |   1 +
>  block/ioctl.c                         |  35 +++++
>  include/linux/bio.h                   |  10 +-
>  include/linux/blk_types.h             |   2 +
>  include/linux/blkdev.h                |  31 +++++
>  include/uapi/linux/fs.h               |   1 +
>  13 files changed, 332 insertions(+), 4 deletions(-)
> 

(skipping to the ioctl part; I didn't see anything obviously weird in
the block/ changes)

> diff --git a/block/ioctl.c b/block/ioctl.c
> index d61d652078f4..5e1b3c4660bf 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -168,6 +168,39 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>  			BLKDEV_ZERO_NOUNMAP);
>  }
>  
> +static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
> +		unsigned long arg)
> +{
> +	uint64_t range[2];
> +	struct address_space *mapping;
> +	uint64_t start, end, len;
> +
> +	if (!(mode & FMODE_WRITE))
> +		return -EBADF;

Why does the fd have to be opened writable?  Isn't this a read test?

> +
> +	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
> +		return -EFAULT;
> +
> +	start = range[0];
> +	len = range[1];
> +	end = start + len - 1;
> +
> +	if (start & 511)
> +		return -EINVAL;
> +	if (len & 511)
> +		return -EINVAL;
> +	if (end >= (uint64_t)i_size_read(bdev->bd_inode))
> +		return -EINVAL;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	/* Invalidate the page cache, including dirty pages */
> +	mapping = bdev->bd_inode->i_mapping;
> +	truncate_inode_pages_range(mapping, start, end);

Why do we need to invalidate the page cache to verify media?  Won't that
cause data loss if those pages were dirty and about to be flushed?

--D

> +
> +	return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
> +}
> +
>  static int put_ushort(unsigned short __user *argp, unsigned short val)
>  {
>  	return put_user(val, argp);
> @@ -460,6 +493,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>  				BLKDEV_DISCARD_SECURE);
>  	case BLKZEROOUT:
>  		return blk_ioctl_zeroout(bdev, mode, arg);
> +	case BLKVERIFY:
> +		return blk_ioctl_verify(bdev, mode, arg);
>  	case BLKREPORTZONE:
>  		return blkdev_report_zones_ioctl(bdev, mode, cmd, arg);
>  	case BLKRESETZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c74857cf1252..d660c37b7d6c 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -63,7 +63,8 @@ static inline bool bio_has_data(struct bio *bio)
>  	    bio->bi_iter.bi_size &&
>  	    bio_op(bio) != REQ_OP_DISCARD &&
>  	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
> +	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +	    bio_op(bio) != REQ_OP_VERIFY)
>  		return true;
>  
>  	return false;
> @@ -73,8 +74,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
>  {
>  	return bio_op(bio) == REQ_OP_DISCARD ||
>  	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
> -	       bio_op(bio) == REQ_OP_WRITE_SAME ||
> -	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
> +	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
> +	       bio_op(bio) == REQ_OP_VERIFY;
>  }
>  
>  static inline bool bio_mergeable(struct bio *bio)
> @@ -198,7 +199,7 @@ static inline unsigned bio_segments(struct bio *bio)
>  	struct bvec_iter iter;
>  
>  	/*
> -	 * We special case discard/write same/write zeroes, because they
> +	 * We special case discard/write same/write zeroes/verify, because they
>  	 * interpret bi_size differently:
>  	 */
>  
> @@ -206,6 +207,7 @@ static inline unsigned bio_segments(struct bio *bio)
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_VERIFY:
>  		return 0;
>  	case REQ_OP_WRITE_SAME:
>  		return 1;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 1bc6f6a01070..8877711c4c56 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -366,6 +366,8 @@ enum req_opf {
>  	REQ_OP_SECURE_ERASE	= 5,
>  	/* write the same sector many times */
>  	REQ_OP_WRITE_SAME	= 7,
> +	/* verify the sectors */
> +	REQ_OP_VERIFY		= 8,
>  	/* write the zero filled sector many times */
>  	REQ_OP_WRITE_ZEROES	= 9,
>  	/* Open a zone */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 0dea268bd61b..99c41d90584b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -334,6 +334,7 @@ struct queue_limits {
>  	unsigned int		max_hw_discard_sectors;
>  	unsigned int		max_write_same_sectors;
>  	unsigned int		max_write_zeroes_sectors;
> +	unsigned int		max_verify_sectors;
>  	unsigned int		max_zone_append_sectors;
>  	unsigned int		discard_granularity;
>  	unsigned int		discard_alignment;
> @@ -621,6 +622,7 @@ struct request_queue {
>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
>  #define QUEUE_FLAG_HCTX_ACTIVE	28	/* at least one blk-mq hctx is active */
>  #define QUEUE_FLAG_NOWAIT       29	/* device supports NOWAIT */
> +#define QUEUE_FLAG_VERIFY	30	/* supports Verify */
>  
>  #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>  				 (1 << QUEUE_FLAG_SAME_COMP) |		\
> @@ -667,6 +669,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
>  #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
>  #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
> +#define blk_queue_verify(q)	test_bit(QUEUE_FLAG_VERIFY, &(q)->queue_flags)
>  
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
> @@ -814,6 +817,9 @@ static inline bool rq_mergeable(struct request *rq)
>  	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
>  		return false;
>  
> +	if (req_op(rq) == REQ_OP_VERIFY)
> +		return false;
> +
>  	if (req_op(rq) == REQ_OP_ZONE_APPEND)
>  		return false;
>  
> @@ -1072,6 +1078,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  	if (unlikely(op == REQ_OP_WRITE_ZEROES))
>  		return q->limits.max_write_zeroes_sectors;
>  
> +	if (unlikely(op == REQ_OP_VERIFY))
> +		return q->limits.max_verify_sectors;
> +
>  	return q->limits.max_sectors;
>  }
>  
> @@ -1154,6 +1163,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
>  		unsigned int max_discard_sectors);
>  extern void blk_queue_max_write_same_sectors(struct request_queue *q,
>  		unsigned int max_write_same_sectors);
> +extern void blk_queue_max_verify_sectors(struct request_queue *q,
> +		unsigned int max_verify_sectors);
>  extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
>  		unsigned int max_write_same_sectors);
>  extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
> @@ -1348,6 +1359,16 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  		unsigned flags);
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
> +extern int __blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
> +		char *buf);
> +extern int blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask);
> +extern int __blkdev_issue_verify(struct block_device *bdev,
> +		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> +		struct bio **biop);
> +extern int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask);
>  
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>  		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
> @@ -1553,6 +1574,16 @@ static inline unsigned int bdev_write_same(struct block_device *bdev)
>  	return 0;
>  }
>  
> +static inline unsigned int bdev_verify_sectors(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return q->limits.max_verify_sectors;
> +
> +	return 0;
> +}
> +
>  static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
>  {
>  	struct request_queue *q = bdev_get_queue(bdev);
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..5eda16bd2c3d 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -184,6 +184,7 @@ struct fsxattr {
>  #define BLKSECDISCARD _IO(0x12,125)
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
> +#define BLKVERIFY _IO(0x12,128)
>  /*
>   * A jump here: 130-131 are reserved for zoned block devices
>   * (see uapi/linux/blkzoned.h)
> -- 
> 2.22.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04  9:27   ` Chaitanya Kulkarni
@ 2021-11-04 17:32     ` Darrick J. Wong
  2021-11-04 17:34       ` Christoph Hellwig
  2021-11-11  8:18       ` Chaitanya Kulkarni
  0 siblings, 2 replies; 26+ messages in thread
From: Darrick J. Wong @ 2021-11-04 17:32 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Christoph Hellwig, linux-block, linux-raid, linux-nvme,
	linux-scsi, target-devel, linux-fsdevel, linux-xfs, dm-devel,
	axboe, agk, snitzer, song, kbusch, sagi, jejb, martin.petersen,
	viro, javier, johannes.thumshirn, bvanassche, dongli.zhang,
	ming.lei, osandov, willy, jefflexu, josef, clm, dsterba, jack,
	tytso, adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Nov 04, 2021 at 09:27:50AM +0000, Chaitanya Kulkarni wrote:
> On 11/4/2021 12:14 AM, Christoph Hellwig wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > What is the actual use case here?
> > 
> 
> One of the immediate use-case is to use this interface with XFS 
> scrubbing infrastructure [1] (by replacing any SCSI calls e.g. sg_io() 
> with BLKVERIFY ioctl() calls corresponding to REQ_OP_VERIFY) and 
> eventually allow and extend other file systems to use it for scrubbing.

FWIW it /would/ be a win to have a general blkdev ioctl to do this,
rather than shoving SCSI commands through /dev/sg, which (obviously)
doesn't work when dm and friends are in use.  I hadn't bothered to wire
up xfs_scrub to NVME COMPARE because none of my devices support it and
tbh I was holding out for this kind of interface anyway. ;)

I also wonder if it would be useful (since we're already having a
discussion elsewhere about data integrity syscalls for pmem) to be able
to call this sort of thing against files?  In which case we'd want
another preadv2 flag or something, and then plumb all that through the
vfs/iomap as needed?

--D

> [1] man xfs_scrub :-
> -x     Read all file data extents to look for disk errors.
>                xfs_scrub will issue O_DIRECT reads to the block device
>                directly.  If the block device is a SCSI disk, it will
>                instead issue READ VERIFY commands directly to the disk.
>                If media errors are found, the error report will include
>                the disk offset, in bytes.  If the media errors affect a
>                file, the report will also include the inode number and
>                file offset, in bytes.  These actions will confirm that
>                all file data blocks can be read from storage.
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04 17:32     ` Darrick J. Wong
@ 2021-11-04 17:34       ` Christoph Hellwig
  2021-11-04 22:37         ` Keith Busch
  2021-11-11  8:18       ` Chaitanya Kulkarni
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2021-11-04 17:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Chaitanya Kulkarni, Christoph Hellwig, linux-block, linux-raid,
	linux-nvme, linux-scsi, target-devel, linux-fsdevel, linux-xfs,
	dm-devel, axboe, agk, snitzer, song, kbusch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On Thu, Nov 04, 2021 at 10:32:35AM -0700, Darrick J. Wong wrote:
> I also wonder if it would be useful (since we're already having a
> discussion elsewhere about data integrity syscalls for pmem) to be able
> to call this sort of thing against files?  In which case we'd want
> another preadv2 flag or something, and then plumb all that through the
> vfs/iomap as needed?

IFF we do this (can't answer if there is a need) we should not
overload read with it.  It is an operation that does not return
data but just a status, so let's not get into that mess.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04 17:34       ` Christoph Hellwig
@ 2021-11-04 22:37         ` Keith Busch
  2021-11-05  8:25           ` javier
  0 siblings, 1 reply; 26+ messages in thread
From: Keith Busch @ 2021-11-04 22:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Chaitanya Kulkarni, linux-block, linux-raid,
	linux-nvme, linux-scsi, target-devel, linux-fsdevel, linux-xfs,
	dm-devel, axboe, agk, snitzer, song, sagi, jejb, martin.petersen,
	viro, javier, johannes.thumshirn, bvanassche, dongli.zhang,
	ming.lei, osandov, willy, jefflexu, josef, clm, dsterba, jack,
	tytso, adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Nov 04, 2021 at 06:34:31PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 04, 2021 at 10:32:35AM -0700, Darrick J. Wong wrote:
> > I also wonder if it would be useful (since we're already having a
> > discussion elsewhere about data integrity syscalls for pmem) to be able
> > to call this sort of thing against files?  In which case we'd want
> > another preadv2 flag or something, and then plumb all that through the
> > vfs/iomap as needed?
> 
> IFF we do this (can't answer if there is a need) we should not
> overload read with it.  It is an operation that does not return
> data but just a status, so let's not get into that mess.

If there is a need for this, a new io_uring opcode seems like the
appropriate user-facing interface for it.
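
Purely as an illustration of the shape (no such opcode exists today, so
both the opcode and the helper below are hypothetical), the userspace side
with liburing could look like:

#include <liburing.h>

/* hypothetical opcode, only for illustration -- not defined anywhere */
#define IORING_OP_VERIFY_HYPOTHETICAL	(IORING_OP_LAST + 1)

static int queue_verify(struct io_uring *ring, int bdev_fd,
			__u64 offset, unsigned int len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EAGAIN;
	/* payload-less: the range is described only by (offset, len) */
	io_uring_prep_rw(IORING_OP_VERIFY_HYPOTHETICAL, sqe, bdev_fd,
			 NULL, len, offset);
	return io_uring_submit(ring);
}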

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/8] nvme: add support for the Verify command
  2021-11-04  6:46 ` [RFC PATCH 3/8] nvme: add support for the Verify command Chaitanya Kulkarni
@ 2021-11-04 22:44   ` Keith Busch
  2021-11-11  8:09     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 26+ messages in thread
From: Keith Busch @ 2021-11-04 22:44 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, snitzer, song,
	djwong, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, osandov,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang, Chaitanya Kulkarni

On Wed, Nov 03, 2021 at 11:46:29PM -0700, Chaitanya Kulkarni wrote:
> +static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
> +		struct request *req, struct nvme_command *cmnd)
> +{

Due to recent driver changes, you need a "memset(cmnd, 0, sizeof(*cmnd))"
prior to setting up the rest of the command, or you need to set each
command dword individually.

> +	cmnd->verify.opcode = nvme_cmd_verify;
> +	cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
> +	cmnd->verify.slba =
> +		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> +	cmnd->verify.length =
> +		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
> +	cmnd->verify.control = 0;
> +	return BLK_STS_OK;
> +}

> +static void nvme_config_verify(struct gendisk *disk, struct nvme_ns *ns)
> +{
> +	u64 max_blocks;
> +
> +	if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_VERIFY))
> +		return;
> +
> +	if (ns->ctrl->max_hw_sectors == UINT_MAX)
> +		max_blocks = (u64)USHRT_MAX + 1;
> +	else
> +		max_blocks = ns->ctrl->max_hw_sectors + 1;

If supported by the controller, this maximum is defined in the non-mdts
command limits in NVM command set specific Identify Controller VSL field.

> +
> +	/* keep same as discard */
> +	if (blk_queue_flag_test_and_set(QUEUE_FLAG_VERIFY, disk->queue))
> +		return;
> +
> +	blk_queue_max_verify_sectors(disk->queue,
> +				     nvme_lba_to_sect(ns, max_blocks));
> +
> +}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04 22:37         ` Keith Busch
@ 2021-11-05  8:25           ` javier
  0 siblings, 0 replies; 26+ messages in thread
From: javier @ 2021-11-05  8:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Darrick J. Wong, Chaitanya Kulkarni,
	linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, snitzer, song,
	sagi, jejb, martin.petersen, viro, johannes.thumshirn,
	bvanassche, dongli.zhang, ming.lei, osandov, willy, jefflexu,
	josef, clm, dsterba, jack, tytso, adilger.kernel, jlayton,
	idryomov, danil.kipnis, ebiggers, jinpu.wang

On 04.11.2021 15:37, Keith Busch wrote:
>On Thu, Nov 04, 2021 at 06:34:31PM +0100, Christoph Hellwig wrote:
>> On Thu, Nov 04, 2021 at 10:32:35AM -0700, Darrick J. Wong wrote:
>> > I also wonder if it would be useful (since we're already having a
>> > discussion elsewhere about data integrity syscalls for pmem) to be able
>> > to call this sort of thing against files?  In which case we'd want
>> > another preadv2 flag or something, and then plumb all that through the
>> > vfs/iomap as needed?
>>
>> IFF we do this (can't answer if there is a need) we should not
>> overload read with it.  It is an operation that does not return
>> data but just a status, so let's not get into that mess.
>
>If there is a need for this, a new io_uring opcode seems like the
>appropirate user facing interface for it.

+1 to this. I was looking at the patchset yesterday and this was one of
the questions I had. Any reason not to do it this way, Chaitanya?


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/8] block: add support for REQ_OP_VERIFY
  2021-11-04 17:25   ` Darrick J. Wong
@ 2021-11-11  8:01     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel

On 11/4/2021 10:25 AM, Darrick J. Wong wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Nov 03, 2021 at 11:46:27PM -0700, Chaitanya Kulkarni wrote:
>> From: Chaitanya Kulkarni <kch@nvidia.com>
>>
>> This adds a new block layer operation to offload verifying a range of
>> LBAs. This support is needed in order to provide file systems and
>> fabrics, kernel components to offload LBA verification when it is
>> supported by the hardware controller. In case hardware offloading is
>> not supported then we provide APIs to emulate the same. The prominent
>> example of that is NVMe Verify command. We also provide an emulation of
>> the same operation which can be used in case H/W does not support
>> verify. This is still useful when block device is remotely attached e.g.
>> using NVMeOF.
>>
>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
>> ---
>>   Documentation/ABI/testing/sysfs-block |  14 ++
>>   block/blk-core.c                      |   5 +
>>   block/blk-lib.c                       | 192 ++++++++++++++++++++++++++
>>   block/blk-merge.c                     |  19 +++
>>   block/blk-settings.c                  |  17 +++
>>   block/blk-sysfs.c                     |   8 ++
>>   block/blk-zoned.c                     |   1 +
>>   block/bounce.c                        |   1 +
>>   block/ioctl.c                         |  35 +++++
>>   include/linux/bio.h                   |  10 +-
>>   include/linux/blk_types.h             |   2 +
>>   include/linux/blkdev.h                |  31 +++++
>>   include/uapi/linux/fs.h               |   1 +
>>   13 files changed, 332 insertions(+), 4 deletions(-)
>>
> 
> (skipping to the ioctl part; I didn't see anything obviously weird in
> the block/ changes)
> 

Yes, it is pretty straightforward.

>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index d61d652078f4..5e1b3c4660bf 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -168,6 +168,39 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>>                        BLKDEV_ZERO_NOUNMAP);
>>   }
>>
>> +static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
>> +             unsigned long arg)
>> +{
>> +     uint64_t range[2];
>> +     struct address_space *mapping;
>> +     uint64_t start, end, len;
>> +
>> +     if (!(mode & FMODE_WRITE))
>> +             return -EBADF;
> 
> Why does the fd have to be opened writable?  Isn't this a read test?
> 

Yes, this needs to be removed; I will fix it in V1.

>> +
>> +     if (copy_from_user(range, (void __user *)arg, sizeof(range)))
>> +             return -EFAULT;
>> +
>> +     start = range[0];
>> +     len = range[1];
>> +     end = start + len - 1;
>> +
>> +     if (start & 511)
>> +             return -EINVAL;
>> +     if (len & 511)
>> +             return -EINVAL;
>> +     if (end >= (uint64_t)i_size_read(bdev->bd_inode))
>> +             return -EINVAL;
>> +     if (end < start)
>> +             return -EINVAL;
>> +
>> +     /* Invalidate the page cache, including dirty pages */
>> +     mapping = bdev->bd_inode->i_mapping;
>> +     truncate_inode_pages_range(mapping, start, end);
> 
> Why do we need to invalidate the page cache to verify media?  Won't that
> cause data loss if those pages were dirty and about to be flushed?
> 
> --D
> 

Yes, I will fix it in v1.

>> +
>> +     return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
>> +}
>> +
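
Roughly, what I have in mind for v1 is just dropping the FMODE_WRITE check
and the page-cache truncation while keeping the range validation, something
like this (untested sketch, otherwise unchanged from this patch):

static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
		unsigned long arg)
{
	uint64_t range[2], start, len, end;

	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
		return -EFAULT;

	start = range[0];
	len = range[1];
	end = start + len - 1;

	if ((start | len) & 511)
		return -EINVAL;
	if (end < start || end >= (uint64_t)i_size_read(bdev->bd_inode))
		return -EINVAL;

	return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
}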

Thanks a lot Darrick for your comments, I'll add the fixes in V1.





^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support
  2021-11-04 12:33   ` Damien Le Moal
@ 2021-11-11  8:07     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:07 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-block, linux-raid, linux-nvme, target-devel, linux-xfs,
	dm-devel, linux-fsdevel, linux-scsi

On 11/4/2021 5:33 AM, Damien Le Moal wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 2021/11/04 15:46, Chaitanya Kulkarni wrote:
>> From: Chaitanya Kulkarni <kch@nvidia.com>
>>
>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
>> ---
>>   drivers/scsi/sd.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
>>   drivers/scsi/sd.h |  1 +
>>   2 files changed, 53 insertions(+)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index a3d2d4bc4a3d..7f2c4eb98cf8 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -106,6 +106,7 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_ZBC);
>>
>>   static void sd_config_discard(struct scsi_disk *, unsigned int);
>>   static void sd_config_write_same(struct scsi_disk *);
>> +static void sd_config_verify(struct scsi_disk *sdkp);
>>   static int  sd_revalidate_disk(struct gendisk *);
>>   static void sd_unlock_native_capacity(struct gendisk *disk);
>>   static int  sd_probe(struct device *);
>> @@ -995,6 +996,41 @@ static blk_status_t sd_setup_write_zeroes_cmnd(struct scsi_cmnd *cmd)
>>        return sd_setup_write_same10_cmnd(cmd, false);
>>   }
>>
>> +static void sd_config_verify(struct scsi_disk *sdkp)
>> +{
>> +     struct request_queue *q = sdkp->disk->queue;
>> +
>> +     /* XXX: use same pattern as sd_config_write_same(). */
>> +     blk_queue_max_verify_sectors(q, UINT_MAX >> 9);
> 
> VERIFY 10, 12, 16 and 32 commands are optional and may not be implemented by a
> device. So setting this unconditionally is wrong.
> At the very least you must have an "if (sdkp->verify_16)" here, and call
> "blk_queue_max_verify_sectors(q, 0);" if the device does not support verify.
> 

Yes, I put this together quickly for the RFC; it needs to handle the
case where the device does not support VERIFY, just like what we do for
write zeroes, and fall back to emulation.

>> +}
>> +
>> +static blk_status_t sd_setup_verify_cmnd(struct scsi_cmnd *cmd)
>> +{
>> +       struct request *rq = cmd->request;
>> +       struct scsi_device *sdp = cmd->device;
>> +       struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>> +       u64 lba = sectors_to_logical(sdp, blk_rq_pos(rq));
>> +       u32 nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
>> +
>> +       if (!sdkp->verify_16)
>> +            return BLK_STS_NOTSUPP;
> 
> I think this should be "return BLK_STS_TARGET;"
> 
>> +
>> +       cmd->cmd_len = 16;
>> +       cmd->cmnd[0] = VERIFY_16;
> 
> And what if the device supports VERIFY 10 or 12 but not VERIFY 16 ?

For the first implementation we can support only VERIFY(16); later we can
add cases for the VERIFY(10)/VERIFY(12) variants.

> 
>> +       /* skip veprotect / dpo / bytchk */
>> +       cmd->cmnd[1] = 0;
>> +       put_unaligned_be64(lba, &cmd->cmnd[2]);
>> +       put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
>> +       cmd->cmnd[14] = 0;
>> +       cmd->cmnd[15] = 0;
>> +
>> +       cmd->allowed = SD_MAX_RETRIES;
>> +       cmd->sc_data_direction = DMA_NONE;
>> +       cmd->transfersize = 0;
>> +
>> +       return BLK_STS_OK;
>> +}
>> +
>>   static void sd_config_write_same(struct scsi_disk *sdkp)
>>   {
>>        struct request_queue *q = sdkp->disk->queue;
>> @@ -1345,6 +1381,8 @@ static blk_status_t sd_init_command(struct scsi_cmnd *cmd)
>>                }
>>        case REQ_OP_WRITE_ZEROES:
>>                return sd_setup_write_zeroes_cmnd(cmd);
>> +     case REQ_OP_VERIFY:
>> +             return sd_setup_verify_cmnd(cmd);
>>        case REQ_OP_WRITE_SAME:
>>                return sd_setup_write_same_cmnd(cmd);
>>        case REQ_OP_FLUSH:
>> @@ -2029,6 +2067,7 @@ static int sd_done(struct scsi_cmnd *SCpnt)
>>        switch (req_op(req)) {
>>        case REQ_OP_DISCARD:
>>        case REQ_OP_WRITE_ZEROES:
>> +     case REQ_OP_VERIFY:
>>        case REQ_OP_WRITE_SAME:
>>        case REQ_OP_ZONE_RESET:
>>        case REQ_OP_ZONE_RESET_ALL:
>> @@ -3096,6 +3135,17 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
>>                sdkp->ws10 = 1;
>>   }
>>
>> +static void sd_read_verify(struct scsi_disk *sdkp, unsigned char *buffer)
>> +{
>> +       struct scsi_device *sdev = sdkp->device;
>> +
>> +       sd_printk(KERN_INFO, sdkp, "VERIFY16 check.\n");
> 
> Remove this message please.
> 
>> +       if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, VERIFY_16) == 1) {
>> +            sd_printk(KERN_INFO, sdkp, " VERIFY16 in ON .\n");
> 
> And this one too.
> 
>> +               sdkp->verify_16 = 1;
> 
> Why not checking for VERIFY 10 and 12 if VERIFY 16 is not supported ?
> Also, why don't you call "blk_queue_max_verify_sectors(q, UINT_MAX >> 9);" here
> instead of adding the not so useful sd_config_verify() helper ?
> 

Okay, let me see if I can add that in V1.

>> +       }
>> +}
>> +
>>   static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
>>   {
>>        struct scsi_device *sdev = sdkp->device;
>> @@ -3224,6 +3274,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>>                sd_read_cache_type(sdkp, buffer);
>>                sd_read_app_tag_own(sdkp, buffer);
>>                sd_read_write_same(sdkp, buffer);
>> +             sd_read_verify(sdkp, buffer);
>>                sd_read_security(sdkp, buffer);
>>        }
>>
>> @@ -3265,6 +3316,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>>
>>        set_capacity_and_notify(disk, logical_to_sectors(sdp, sdkp->capacity));
>>        sd_config_write_same(sdkp);
>> +     sd_config_verify(sdkp);
>>        kfree(buffer);
>>
>>        /*
>> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
>> index b59136c4125b..94a86bf6dac4 100644
>> --- a/drivers/scsi/sd.h
>> +++ b/drivers/scsi/sd.h
>> @@ -120,6 +120,7 @@ struct scsi_disk {
>>        unsigned        lbpvpd : 1;
>>        unsigned        ws10 : 1;
>>        unsigned        ws16 : 1;
>> +     unsigned        verify_16 : 1;
> 
> See right above this line how write same supports the 10 and 16 variants. I
> think you need the same here. And very likely, you also need the 32 version in
> case the device has DIF/DIX (type 2 protection).
> 

Agreed, following the write same 10/16 pattern makes sense; let me see if I can add that for V1.

>>        unsigned        rc_basis: 2;
>>        unsigned        zoned: 2;
>>        unsigned        urswrz : 1;
>>
> 
> 
> --
> Damien Le Moal
> Western Digital Research
> 

Thanks for the comments Damien.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/8] nvme: add support for the Verify command
  2021-11-04 22:44   ` Keith Busch
@ 2021-11-11  8:09     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel

On 11/4/2021 3:44 PM, Keith Busch wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Nov 03, 2021 at 11:46:29PM -0700, Chaitanya Kulkarni wrote:
>> +static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
>> +             struct request *req, struct nvme_command *cmnd)
>> +{
> 
> Due to recent driver changes, you need a "memset(cmnd, 0, sizeof(*cmnd))"
> prior to setting up the rest of the command, or you need to set each
> command dword individually.

Agree, will fix that in V1.

> 
>> +     cmnd->verify.opcode = nvme_cmd_verify;
>> +     cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
>> +     cmnd->verify.slba =
>> +             cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>> +     cmnd->verify.length =
>> +             cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>> +     cmnd->verify.control = 0;
>> +     return BLK_STS_OK;
>> +}
> 
>> +static void nvme_config_verify(struct gendisk *disk, struct nvme_ns *ns)
>> +{
>> +     u64 max_blocks;
>> +
>> +     if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_VERIFY))
>> +             return;
>> +
>> +     if (ns->ctrl->max_hw_sectors == UINT_MAX)
>> +             max_blocks = (u64)USHRT_MAX + 1;
>> +     else
>> +             max_blocks = ns->ctrl->max_hw_sectors + 1;
> 
> If supported by the controller, this maximum is defined in the non-mdts
> command limits in NVM command set specific Identify Controller VSL field.
> 

I need to take a closer look at this. I'll fix that in V1.

>> +
>> +     /* keep same as discard */
>> +     if (blk_queue_flag_test_and_set(QUEUE_FLAG_VERIFY, disk->queue))
>> +             return;
>> +
>> +     blk_queue_max_verify_sectors(disk->queue,
>> +                                  nvme_lba_to_sect(ns, max_blocks));
>> +
>> +}

Thanks for the comment Keith.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY
  2021-11-04  6:46 ` [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY Chaitanya Kulkarni
@ 2021-11-11  8:13   ` Chaitanya Kulkarni
  2021-11-12 15:19     ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:13 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel
  Cc: axboe, agk, snitzer, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, osandov, willy, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang, Chaitanya Kulkarni

On 11/3/2021 11:46 PM, Chaitanya Kulkarni wrote:
> From: Chaitanya Kulkarni <kch@nvidia.com>
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>

I want to make sure the new REQ_OP_VERIFY is compatible with the
dm side, since it is a generic interface.

Any comments on the dm side? They will help me respin the series for
V1 of this proposal.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04 17:32     ` Darrick J. Wong
  2021-11-04 17:34       ` Christoph Hellwig
@ 2021-11-11  8:18       ` Chaitanya Kulkarni
  1 sibling, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, linux-block, linux-raid, linux-nvme,
	linux-scsi, target-devel, linux-fsdevel, linux-xfs, dm-devel,
	axboe, agk, snitzer, song, kbusch, sagi, jejb, martin.petersen,
	viro, javier, johannes.thumshirn, bvanassche, dongli.zhang,
	ming.lei, osandov, willy, jefflexu, josef, clm, dsterba, jack,
	tytso, adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On 11/4/2021 10:32 AM, Darrick J. Wong wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, Nov 04, 2021 at 09:27:50AM +0000, Chaitanya Kulkarni wrote:
>> On 11/4/2021 12:14 AM, Christoph Hellwig wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> What is the actual use case here?
>>>
>>
>> One of the immediate use-case is to use this interface with XFS
>> scrubbing infrastructure [1] (by replacing any SCSI calls e.g. sg_io()
>> with BLKVERIFY ioctl() calls corresponding to REQ_OP_VERIFY) and
>> eventually allow and extend other file systems to use it for scrubbing.
> 
> FWIW it /would/ be a win to have a general blkdev ioctl to do this,
> rather than shoving SCSI commands through /dev/sg, which (obviously)
> doesn't work when dm and friends are in use.  I hadn't bothered to wire
> up xfs_scrub to NVME COMPARE because none of my devices support it and
> tbh I was holding out for this kind of interface anyway. ;)
> 

Yes, it is not possible without a new interface, and it is impossible
for dm and friends.

> I also wonder if it would be useful (since we're already having a
> discussion elsewhere about data integrity syscalls for pmem) to be able
> to call this sort of thing against files?  In which case we'd want
> another preadv2 flag or something, and then plumb all that through the
> vfs/iomap as needed?
> 
> --D
> 

As part of the complete picture, once we get the block layer part stable
upstream, how about implementing an fsverify-like utility that works
similarly to fstrim, so users can verify critical files with the
appropriate plumbing through the VFS and iomap?

Or is there another way that is more suitable?

>> [1] man xfs_scrub :-
>> -x     Read all file data extents to look for disk errors.
>>                 xfs_scrub will issue O_DIRECT reads to the block device
>>                 directly.  If the block device is a SCSI disk, it will
>>                 instead issue READ VERIFY commands directly to the disk.
>>                 If media errors are found, the error report will include
>>                 the disk offset, in bytes.  If the media errors affect a
>>                 file, the report will also include the inode number and
>>                 file offset, in bytes.  These actions will confirm that
>>                 all file data blocks can be read from storage.
>>
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY
  2021-11-04 15:16 ` Douglas Gilbert
@ 2021-11-11  8:27   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 26+ messages in thread
From: Chaitanya Kulkarni @ 2021-11-11  8:27 UTC (permalink / raw)
  To: dgilbert
  Cc: linux-block, linux-raid, linux-nvme, target-devel, linux-fsdevel,
	linux-scsi, dm-devel, linux-xfs


>> Please note that the interfaces for the blk-lib.c REQ_OP_VERIFY emulation
>> will change in the future; I put them together only for the scope of this RFC.
>>
>> Any comments are welcome.
> 
> Hi,
> You may also want to consider higher level support for the NVME COMPARE
> and SCSI VERIFY(BYTCHK=1) commands. Since PCIe and SAS transports are
> full duplex, replacing two READs (plus a memcmp in host memory) with
> one READ and one COMPARE may be a win on a bandwidth constrained
> system. It is a safe to assume the data-in transfers on a storage transport
> exceed (probably by a significant margin) the data-out transfers. An
> offloaded COMPARE switches one of those data-in transfers to a data-out
> transfer, so it should improve the bandwidth utilization.
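
To restate that pattern for readers following along, what an offloaded
COMPARE avoids is roughly the host-side loop below; the device paths
and block size are placeholders for illustration only, not something
from this series.

/* Illustration only: the host-side compare an offloaded COMPARE (NVMe)
 * or VERIFY(BYTCHK=1) (SCSI) could avoid. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK_SIZE 4096

int main(void)
{
        char a[BLK_SIZE], b[BLK_SIZE];
        int fd1 = open("/dev/nvme0n1", O_RDONLY);       /* placeholder paths */
        int fd2 = open("/dev/nvme1n1", O_RDONLY);

        if (fd1 < 0 || fd2 < 0) {
                perror("open");
                return 1;
        }
        /* Two data-in transfers over the host link ... */
        if (pread(fd1, a, BLK_SIZE, 0) != BLK_SIZE ||
            pread(fd2, b, BLK_SIZE, 0) != BLK_SIZE) {
                perror("pread");
                return 1;
        }
        /* ... plus a compare in host memory that COMPARE would offload. */
        printf("block 0 %s\n", memcmp(a, b, BLK_SIZE) ? "differs" : "matches");
        close(fd1);
        close(fd2);
        return 0;
}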

I've thought about adding support for compare and friends. But
those commands are optional (correct me if I'm wrong) and I couldn't
find the right use-case(s) to justify the kernel plumbing.

Do you happen to have use-cases or applications which use the compare
command extensively, or perhaps can you point to an application such as
your modified dd version?

> I did some brief benchmarking on an NVMe SSD's COMPARE command (it's
> optional)
> and the results were underwhelming. OTOH using my own dd variants (which
> can do compare instead of copy) and a scsi_debug target (i.e. RAM) I have
> seen compare rates of > 15 GB/s while a copy rarely exceeds 9 GB/s.
> 

This is what I'd expect when it comes to performance, but we need a
strong use-case with an in-kernel user to justify it; with that, I'd be
happy to add the support.

> 
> BTW The SCSI VERIFY(BYTCHK=3) command compares one block sent from
> the host with a sequence of logical blocks on the media. So, for example,
> it would be a quick way of checking that a sequence of blocks contained
> zero-ed data.
> 

I see, thanks for the comments and for sharing the compare-related
experience; I thought about that when I worked on REQ_OP_WRITE_ZEROES
support :).
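
For reference, the host-side check that such a VERIFY(BYTCHK=3) offload
would replace is roughly the loop below; the device path and range are
placeholders for illustration only.

/* Illustration only: the host-side zero check that VERIFY(BYTCHK=3)
 * could offload by sending a single zero block to the target. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK_SIZE  4096
#define NR_BLOCKS 256           /* check 1 MiB */

int main(void)
{
        static const char zero[BLK_SIZE];       /* all zeroes */
        char buf[BLK_SIZE];
        int fd = open("/dev/sdb", O_RDONLY);    /* placeholder path */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (int i = 0; i < NR_BLOCKS; i++) {
                if (pread(fd, buf, BLK_SIZE, (off_t)i * BLK_SIZE) != BLK_SIZE) {
                        perror("pread");
                        return 1;
                }
                if (memcmp(buf, zero, BLK_SIZE)) {
                        printf("block %d is not zeroed\n", i);
                        return 1;
                }
        }
        printf("all %d blocks are zeroed\n", NR_BLOCKS);
        close(fd);
        return 0;
}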

> Doug Gilbert

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY
  2021-11-11  8:13   ` Chaitanya Kulkarni
@ 2021-11-12 15:19     ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-11-12 15:19 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-scsi, target-devel,
	linux-fsdevel, linux-xfs, dm-devel, axboe, agk, song, djwong,
	kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, osandov,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Nov 11 2021 at  3:13P -0500,
Chaitanya Kulkarni <chaitanyak@nvidia.com> wrote:

> On 11/3/2021 11:46 PM, Chaitanya Kulkarni wrote:
> > From: Chaitanya Kulkarni <kch@nvidia.com>
> > 
> > Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> 
> I want to make sure the new REQ_OP_VERIFY is compatible with the
> dm side as it is a generic interface.
> 
> Any comments on the dm side? That will help me to respin the series for
> V1 of this proposal.

I can review, but have you tested your XFS scrub use-case on top of
the various DM devices you modified?

Also, you seem to have missed Keith's suggestion of using io_uring to
expose this capability.  If you happen to go that route: making sure
DM has the required io_uring capabilities would be needed (IIRC there
were/are some lingering patches from Ming Lei to facilitate more
efficient io_uring on DM.. I'll try to find them; could be I'm wrong).

Mike


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-11-12 15:19 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-04  6:46 [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 1/8] " Chaitanya Kulkarni
2021-11-04 17:25   ` Darrick J. Wong
2021-11-11  8:01     ` Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 2/8] scsi: add REQ_OP_VERIFY support Chaitanya Kulkarni
2021-11-04 12:33   ` Damien Le Moal
2021-11-11  8:07     ` Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 3/8] nvme: add support for the Verify command Chaitanya Kulkarni
2021-11-04 22:44   ` Keith Busch
2021-11-11  8:09     ` Chaitanya Kulkarni
2021-11-04  6:46 ` [PATCH 4/8] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 5/8] nvmet: add Verify emulation " Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 6/8] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 7/8] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
2021-11-04  6:46 ` [RFC PATCH 8/8] md: add support for REQ_OP_VERIFY Chaitanya Kulkarni
2021-11-11  8:13   ` Chaitanya Kulkarni
2021-11-12 15:19     ` Mike Snitzer
2021-11-04  7:14 ` [RFC PATCH 0/8] block: " Christoph Hellwig
2021-11-04  9:27   ` Chaitanya Kulkarni
2021-11-04 17:32     ` Darrick J. Wong
2021-11-04 17:34       ` Christoph Hellwig
2021-11-04 22:37         ` Keith Busch
2021-11-05  8:25           ` javier
2021-11-11  8:18       ` Chaitanya Kulkarni
2021-11-04 15:16 ` Douglas Gilbert
2021-11-11  8:27   ` Chaitanya Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).