* [PATCH 0/6] block: add support for REQ_OP_VERIFY
@ 2022-06-30  9:14 Chaitanya Kulkarni
  2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni


Hi,

This series adds support for REQ_OP_VERIFY. In this version we add
support for the block layer, the NVMe host side, the NVMeOF block
device backend, the NVMeOF file device backend, and the null_blk
driver.

This version also adds a new blkverify userspace tool, along with a
testcases patch for util-linux; that patch will follow this series.

Below is a summary of the test log :-

1. NVMeOF bdev-ns null_blk verify=0 (triggering bdev emulation code) :-
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1171.171536] nvmet: nvmet_bdev_emulate_verify_work 467
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1199.322161] nvmet: nvmet_bdev_emulate_verify_work 467

2. NVMeOF bdev-ns null_blk verify=1 :-
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1257.661548] nvmet: nvmet_bdev_execute_verify 506
[ 1257.661558] null_blk: null_process_cmd 1406
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1269.613415] nvmet: nvmet_bdev_execute_verify 506
[ 1269.613425] null_blk: null_process_cmd 1406

3. NVMeOF file-ns :-
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 3452.675959] nvme_setup_verify 844
[ 3452.675969] nvmet: nvmet_file_execute_verify 525
[ 3452.675971] nvmet: nvmet_file_emulate_verify_work 502
[ 3452.675972] nvmet: do_direct_io_emulate_verify 431
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1
NVME Verify Success
linux-block (for-next) # dmesg  -c 
[ 3459.794385] nvmet: nvmet_file_execute_verify 525
[ 3459.794389] nvmet: nvmet_file_emulate_verify_work 502
[ 3459.794391] nvmet: do_direct_io_emulate_verify 431

4. NVMe PCIe device :-
-----------------------------------------------------------------------

linux-block (for-next) # modprobe nvme
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme0n1
linux-block (for-next) # dmesg  -c
[ 2763.432194] nvme nvme0: pci function 0000:00:04.0
[ 2763.473827] nvme nvme0: 48/0/0 default/read/poll queues
[ 2763.478868] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 2766.583923] nvme_setup_verify 844

Here is a link to the complete cover letter with the background, to
save reviewers' time :-
https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/

-ck

Chaitanya Kulkarni (6):
  block: add support for REQ_OP_VERIFY
  nvme: add support for the Verify command
  nvmet: add Verify command support for bdev-ns
  nvmet: add Verify emulation support for bdev-ns
  nvmet: add verify emulation support for file-ns
  null_blk: add REQ_OP_VERIFY support

 Documentation/ABI/stable/sysfs-block |  12 +++
 block/blk-core.c                     |   5 +
 block/blk-lib.c                      | 155 +++++++++++++++++++++++++++
 block/blk-merge.c                    |  18 ++++
 block/blk-settings.c                 |  17 +++
 block/blk-sysfs.c                    |   8 ++
 block/blk.h                          |   4 +
 block/ioctl.c                        |  35 ++++++
 drivers/block/null_blk/main.c        |  20 +++-
 drivers/block/null_blk/null_blk.h    |   1 +
 drivers/nvme/host/core.c             |  33 ++++++
 drivers/nvme/target/admin-cmd.c      |   3 +-
 drivers/nvme/target/core.c           |  14 ++-
 drivers/nvme/target/io-cmd-bdev.c    |  75 +++++++++++++
 drivers/nvme/target/io-cmd-file.c    | 152 ++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h          |   4 +-
 include/linux/bio.h                  |   9 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  22 ++++
 include/linux/nvme.h                 |  19 ++++
 include/uapi/linux/fs.h              |   1 +
 21 files changed, 601 insertions(+), 8 deletions(-)

-- 
2.29.0

######################## NVMeOF bdev-ns null_blk verify=0 ######################

0 directories, 0 files
linux-block (for-next) # ./bdev.sh 1
++ FILE=/dev/nvme0n1
++ NN=1
++ NQN=testnqn
++ let NR_DEVICES=NN+1
++ modprobe -r null_blk
++ modprobe null_blk nr_devices=0 verify=0
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ modprobe nvme-loop
++ dmesg -c
++ sleep 2
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   └── features
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

5 directories, 1 file
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ echo transport=loop,nqn=testnqn
+++ shuf -i 1-1 -n 1
++ for i in `shuf -i  1-$NN -n $NN`
++ mkdir config/nullb/nullb1
++ echo 4096
++ echo 20971520
++ echo 1
+++ cat config/nullb/nullb1/index
++ IDX=0
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
++ let IDX=IDX+1
++ echo ' ####### /dev/nullb1'
 ####### /dev/nullb1
++ echo -n /dev/nullb1
++ cat /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
/dev/nullb1
++ echo 1
++ dmesg -c
[ 1150.985720] nvmet: creating nvm controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:7f5b83f1-b258-4300-9a55-cd1902bea8c2.
[ 1150.985882] nvme nvme1: creating 48 I/O queues.
[ 1150.990654] nvme nvme1: new ctrl: "testnqn"
[ 1150.995681] null_blk: disk nullb1 created
[ 1150.998886] nvmet: adding nsid 1 to subsystem testnqn
[ 1151.000716] nvme nvme1: rescanning namespaces.
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ '[' 1 ']'
+++ wc -l
+++ ls -l /dev/nvme1 /dev/nvme1n1
++ cnt=2
++ echo 2
2
++ '[' 2 -gt 1 ']'
++ break
++ dmesg -c
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1171.171536] nvmet: nvmet_bdev_emulate_verify_work 467
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1199.322161] nvmet: nvmet_bdev_emulate_verify_work 467
linux-block (for-next) # ./delete.sh 1
+ nvme disconnect -n testnqn
NQN:testnqn disconnected 1 controller(s)

real	0m0.343s
user	0m0.002s
sys	0m0.003s
++ shuf -i 1-1 -n 1
+ for i in `shuf -i  1-$NN -n $NN`
+ echo 0
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
+ rmdir config/nullb/nullb1
+ sleep 2
+ rm -fr /sys/kernel/config/nvmet/ports/1/subsystems/testnqn
+ sleep 1
+ rmdir /sys/kernel/config/nvmet/ports/1
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn
+ sleep 1
+ modprobe -r nvme_loop
+ modprobe -r nvme_fabrics
+ modprobe -r nvmet
+ modprobe -r nvme
+ modprobe -r null_blk
+ umount /mnt/nvme0n1
umount: /mnt/nvme0n1: no mount point specified.
+ umount /mnt/backend
umount: /mnt/backend: not mounted.
+ tree /sys/kernel/config
/sys/kernel/config

0 directories, 0 files

######################## NVMeOF bdev-ns null_blk verify=1 #####################

linux-block (for-next) # ./bdev.sh 1
++ FILE=/dev/nvme0n1
++ NN=1
++ NQN=testnqn
++ let NR_DEVICES=NN+1
++ modprobe -r null_blk
++ modprobe null_blk nr_devices=0 verify=1
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ modprobe nvme-loop
++ dmesg -c
++ sleep 2
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   └── features
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

5 directories, 1 file
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ echo transport=loop,nqn=testnqn
+++ shuf -i 1-1 -n 1
++ for i in `shuf -i  1-$NN -n $NN`
++ mkdir config/nullb/nullb1
++ echo 4096
++ echo 20971520
++ echo 1
+++ cat config/nullb/nullb1/index
++ IDX=0
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
++ let IDX=IDX+1
++ echo ' ####### /dev/nullb1'
 ####### /dev/nullb1
++ echo -n /dev/nullb1
++ cat /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
/dev/nullb1
++ echo 1
++ dmesg -c
[ 1250.311632] nvmet: creating nvm controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:328bb18d-3662-48eb-8bc2-ca7d4ad73799.
[ 1250.311853] nvme nvme1: creating 48 I/O queues.
[ 1250.316251] nvme nvme1: new ctrl: "testnqn"
[ 1250.321710] null_blk: disk nullb1 created
[ 1250.324678] nvmet: adding nsid 1 to subsystem testnqn
[ 1250.326546] nvme nvme1: rescanning namespaces.
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ '[' 1 ']'
+++ wc -l
+++ ls -l /dev/nvme1 /dev/nvme1n1
++ cnt=2
++ echo 2
2
++ '[' 2 -gt 1 ']'
++ break
++ dmesg -c
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1257.661548] nvmet: nvmet_bdev_execute_verify 506
[ 1257.661558] null_blk: null_process_cmd 1406
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1269.613415] nvmet: nvmet_bdev_execute_verify 506
[ 1269.613425] null_blk: null_process_cmd 1406
linux-block (for-next) # 
linux-block (for-next) # ./delete.sh 1
+ nvme disconnect -n testnqn
NQN:testnqn disconnected 1 controller(s)

real	0m0.339s
user	0m0.002s
sys	0m0.003s
++ shuf -i 1-1 -n 1
+ for i in `shuf -i  1-$NN -n $NN`
+ echo 0
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
+ rmdir config/nullb/nullb1
+ sleep 2
+ rm -fr /sys/kernel/config/nvmet/ports/1/subsystems/testnqn
+ sleep 1
+ rmdir /sys/kernel/config/nvmet/ports/1
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn
+ sleep 1
+ modprobe -r nvme_loop
+ modprobe -r nvme_fabrics
+ modprobe -r nvmet
+ modprobe -r nvme
+ modprobe -r null_blk
+ umount /mnt/nvme0n1
umount: /mnt/nvme0n1: no mount point specified.
+ umount /mnt/backend
umount: /mnt/backend: not mounted.
+ tree /sys/kernel/config
/sys/kernel/config

0 directories, 0 files

################################### NVMeOF file-ns #############################

linux-block (for-next) # 
linux-block (for-next) # ./file.sh 1
++ FILE=/mnt/backend/nvme1n1
++ SS=testnqn
++ SSPATH=/sys/kernel/config/nvmet/subsystems/testnqn/
++ PORTS=/sys/kernel/config/nvmet/ports
++ main
++ load_modules
++ dmesg -c
++ modprobe nvme
++ modprobe nvme-fabrics
++ modprobe nvmet
++ modprobe nvme-loop
++ sleep 3
++ mount_fs
++ make_nullb
++ ./compile_nullb.sh
+ umount /mnt/nullb0
umount: /mnt/nullb0: not mounted.
+ rmdir 'config/nullb/nullb*'
rmdir: failed to remove 'config/nullb/nullb*': No such file or directory
+ dmesg -c
+ modprobe -r null_blk
+ lsmod
+ grep null_blk
++ nproc
+ make -j 48 M=drivers/block modules
+ HOST=drivers/block/null_blk/
++ uname -r
+ HOST_DEST=/lib/modules/5.18.0blk+/kernel/drivers/block/null_blk/
+ cp drivers/block/null_blk//null_blk.ko /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//
+ ls -lrth /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//null_blk.ko
-rw-r--r--. 1 root root 1.1M Jun 29 17:06 /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//null_blk.ko
+ sleep 1
+ dmesg -c
++ ./compile_nullb.sh
+ umount /mnt/nullb0
umount: /mnt/nullb0: not mounted.
+ rmdir 'config/nullb/nullb*'
rmdir: failed to remove 'config/nullb/nullb*': No such file or directory
+ dmesg -c
+ modprobe -r null_blk
+ lsmod
+ grep null_blk
++ nproc
+ make -j 48 M=drivers/block modules
+ HOST=drivers/block/null_blk/
++ uname -r
+ HOST_DEST=/lib/modules/5.18.0blk+/kernel/drivers/block/null_blk/
+ cp drivers/block/null_blk//null_blk.ko /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//
+ ls -lrth /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//null_blk.ko
-rw-r--r--. 1 root root 1.1M Jun 29 17:06 /lib/modules/5.18.0blk+/kernel/drivers/block/null_blk//null_blk.ko
+ sleep 1
+ dmesg -c
++ modprobe null_blk nr_devices=0
++ sleep 1
++ mkdir config/nullb/nullb0
++ tree config/nullb/nullb0
config/nullb/nullb0
├── badblocks
├── blocking
├── blocksize
├── cache_size
├── completion_nsec
├── discard
├── home_node
├── hw_queue_depth
├── index
├── irqmode
├── max_sectors
├── mbps
├── memory_backed
├── poll_queues
├── power
├── queue_mode
├── size
├── submit_queues
├── use_per_node_hctx
├── verify
├── virt_boundary
├── zone_capacity
├── zoned
├── zone_max_active
├── zone_max_open
├── zone_nr_conv
└── zone_size

0 directories, 27 files
++ echo 1
++ echo 512
++ echo 20480
++ echo 1
++ sleep 2
+++ cat config/nullb/nullb0/index
++ IDX=0
++ lsblk
++ grep null0
++ sleep 1
++ mkfs.xfs -f /dev/nullb0
meta-data=/dev/nullb0            isize=512    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
++ mount /dev/nullb0 /mnt/backend/
++ sleep 1
++ mount
++ column -t
++ grep nvme
++ dd if=/dev/zero of=/mnt/backend/nvme1n1 count=2621440 bs=4096
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 5.01608 s, 2.1 GB/s
++ file /mnt/backend/nvme1n1
/mnt/backend/nvme1n1: data
++ make_target
++ tree /sys/kernel/config
/sys/kernel/config
├── nullb
│   ├── features
│   └── nullb0
│       ├── badblocks
│       ├── blocking
│       ├── blocksize
│       ├── cache_size
│       ├── completion_nsec
│       ├── discard
│       ├── home_node
│       ├── hw_queue_depth
│       ├── index
│       ├── irqmode
│       ├── max_sectors
│       ├── mbps
│       ├── memory_backed
│       ├── poll_queues
│       ├── power
│       ├── queue_mode
│       ├── size
│       ├── submit_queues
│       ├── use_per_node_hctx
│       ├── verify
│       ├── virt_boundary
│       ├── zone_capacity
│       ├── zoned
│       ├── zone_max_active
│       ├── zone_max_open
│       ├── zone_nr_conv
│       └── zone_size
└── nvmet
    ├── hosts
    ├── ports
    └── subsystems

6 directories, 28 files
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn/
++ for i in 1
++ mkdir /sys/kernel/config/nvmet/subsystems/testnqn//namespaces/1
++ echo -n /mnt/backend/nvme1n1
++ cat /sys/kernel/config/nvmet/subsystems/testnqn//namespaces/1/device_path
/mnt/backend/nvme1n1
++ cat /sys/kernel/config/nvmet/subsystems/testnqn//namespaces/1/buffered_io
0
++ echo 1
++ mkdir /sys/kernel/config/nvmet/ports/1/
++ echo -n loop
++ echo -n 1
++ ln -s /sys/kernel/config/nvmet/subsystems/testnqn/ /sys/kernel/config/nvmet/ports/1/subsystems/
++ sleep 1
++ connect
++ echo transport=loop,nqn=testnqn
++ sleep 1
++ dmesg -c
[ 3436.671070] null_blk: module loaded
[ 3437.678812] null_blk: disk nullb0 created
[ 3440.700250] XFS (nullb0): Mounting V5 Filesystem
[ 3440.701686] XFS (nullb0): Ending clean mount
[ 3440.701772] xfs filesystem being mounted at /mnt/backend supports timestamps until 2038 (0x7fffffff)
[ 3446.742777] nvmet: adding nsid 1 to subsystem testnqn
[ 3447.752282] nvmet: creating nvm controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:55bf9c5a-2992-4ffc-ac53-18003029a0b9.
[ 3447.752423] nvme nvme1: creating 48 I/O queues.
[ 3447.758869] nvme nvme1: new ctrl: "testnqn"
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 3452.675959] nvme_setup_verify 844
[ 3452.675969] nvmet: nvmet_file_execute_verify 525
[ 3452.675971] nvmet: nvmet_file_emulate_verify_work 502
[ 3452.675972] nvmet: do_direct_io_emulate_verify 431
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1
NVME Verify Success
linux-block (for-next) # dmesg  -c 
[ 3459.794385] nvmet: nvmet_file_execute_verify 525
[ 3459.794389] nvmet: nvmet_file_emulate_verify_work 502
[ 3459.794391] nvmet: do_direct_io_emulate_verify 431
linux-block (for-next) #  
linux-block (for-next) # 
linux-block (for-next) # ./delete.sh 1
+ nvme disconnect -n testnqn
NQN:testnqn disconnected 1 controller(s)

real	0m0.338s
user	0m0.001s
sys	0m0.004s
++ shuf -i 1-1 -n 1
+ for i in `shuf -i  1-$NN -n $NN`
+ echo 0
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
+ rmdir config/nullb/nullb0
+ sleep 2
+ rm -fr /sys/kernel/config/nvmet/ports/1/subsystems/testnqn
+ sleep 1
+ rmdir /sys/kernel/config/nvmet/ports/1
+ rmdir /sys/kernel/config/nvmet/subsystems/testnqn
+ sleep 1
+ modprobe -r nvme_loop
+ modprobe -r nvme_fabrics
+ modprobe -r nvmet
+ modprobe -r nvme
+ umount /mnt/nvme0n1
umount: /mnt/nvme0n1: no mount point specified.
+ umount /mnt/backend
+ modprobe -r null_blk
+ tree /sys/kernel/config
/sys/kernel/config

0 directories, 0 files

################################# NVMe PCIe device  ###########################
linux-block (for-next) #
linux-block (for-next) #
linux-block (for-next) # modprobe nvme
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme0n1
linux-block (for-next) # dmesg  -c
[ 2763.432194] nvme nvme0: pci function 0000:00:04.0
[ 2763.473827] nvme nvme0: 48/0/0 default/read/poll queues
[ 2763.478868] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 2766.583923] nvme_setup_verify 844


* [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-06-30 16:18   ` Darrick J. Wong
  2022-07-05  8:34   ` Christoph Hellwig
  2022-06-30  9:14 ` [PATCH 2/6] nvme: add support for the Verify command Chaitanya Kulkarni
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

This adds a new block layer operation to offload verifying a range of
LBAs. It allows kernel components such as file systems and fabrics
drivers to offload LBA verification when the hardware controller
supports it; the prominent examples are the SCSI and NVMe Verify
commands. When hardware offload is not supported, we provide an API
that emulates the operation. The emulation is still useful when the
block device is remotely attached, e.g. over NVMeOF.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 Documentation/ABI/stable/sysfs-block |  12 +++
 block/blk-core.c                     |   5 +
 block/blk-lib.c                      | 155 +++++++++++++++++++++++++++
 block/blk-merge.c                    |  18 ++++
 block/blk-settings.c                 |  17 +++
 block/blk-sysfs.c                    |   8 ++
 block/blk.h                          |   4 +
 block/ioctl.c                        |  35 ++++++
 include/linux/bio.h                  |   9 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  22 ++++
 include/uapi/linux/fs.h              |   1 +
 12 files changed, 285 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index e8797cd09aff..a71d9c41cf8b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -657,6 +657,18 @@ Description:
 		in a single write zeroes command. If write_zeroes_max_bytes is
 		0, write zeroes is not supported by the device.
 
+What:		/sys/block/<disk>/queue/verify_max_bytes
+Date:		April 2022
+Contact:	Chaitanya Kulkarni <kch@nvidia.com>
+Description:
+		Devices that support the verify operation allow a single
+		request to verify a range of contiguous blocks on the
+		storage without any payload in the request. This can be
+		used to verify LBAs on the device without reading them,
+		by offloading the work. verify_max_bytes indicates how
+		many bytes can be verified in a single verify command.
+		If verify_max_bytes is 0, the verify operation is not
+		supported by the device.
 
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
diff --git a/block/blk-core.c b/block/blk-core.c
index 06ff5bbfe8f6..9ad52247dcdf 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -123,6 +123,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_FINISH),
 	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_ZEROES),
+	REQ_OP_NAME(VERIFY),
 	REQ_OP_NAME(DRV_IN),
 	REQ_OP_NAME(DRV_OUT),
 };
@@ -842,6 +843,10 @@ void submit_bio_noacct(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_VERIFY:
+		if (!q->limits.max_verify_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 09b7e1200c0f..4624d68bb3cb 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -340,3 +340,158 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_secure_erase);
+
+/**
+ * __blkdev_emulate_verify - emulate number of verify operations
+ * 				asynchronously
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @biop:	pointer to anchor bio
+ * @buf:	data buffer to mapped on bio
+ *
+ * Description:
+ *  Verify a block range by emulating REQ_OP_VERIFY with REQ_OP_READ;
+ *  use this asynchronously when H/W offloading is not supported.
+ *  The caller is responsible for handling the anchored bio.
+ */
+static int __blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, char *buf)
+{
+	struct bio *bio = *biop;
+	unsigned int sz;
+	int bi_size;
+
+	while (nr_sects != 0) {
+		bio = blk_next_bio(bio, bdev,
+				__blkdev_sectors_to_bio_pages(nr_sects),
+				REQ_OP_READ, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+
+		while (nr_sects != 0) {
+			bool is_vaddr = is_vmalloc_addr(buf);
+			struct page *p;
+
+			p = is_vaddr ? vmalloc_to_page(buf) : virt_to_page(buf);
+			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
+
+			bi_size = bio_add_page(bio, p, sz, offset_in_page(buf));
+			if (bi_size < sz)
+				return -EIO;
+
+			nr_sects -= bi_size >> 9;
+			sector += bi_size >> 9;
+			buf += bi_size;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+
+/**
+ * __blkdev_issue_verify - generate number of verify operations
+ * @bdev:	blockdev to issue
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc())
+ * @biop:	pointer to anchor bio
+ *
+ * Description:
+ *  Verify a block range using hardware offload.
+ *
+ * The function will emulate the verify operation if no explicit
+ * hardware offload for verify is provided.
+ */
+int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
+{
+	unsigned int max_verify_sectors = bdev_verify_sectors(bdev);
+	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
+	struct bio *bio = *biop;
+	sector_t curr_sects;
+	char *buf;
+
+	if (!max_verify_sectors) {
+		int ret = 0;
+
+		buf = kzalloc(min_io_sect << 9, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+
+		while (nr_sects > 0) {
+			curr_sects = min_t(sector_t, nr_sects, min_io_sect);
+			ret = __blkdev_emulate_verify(bdev, sector, curr_sects,
+						      gfp_mask, &bio, buf);
+			if (ret)
+				break;
+
+			if (bio) {
+				ret = submit_bio_wait(bio);
+				bio_put(bio);
+				bio = NULL;
+			}
+
+			nr_sects -= curr_sects;
+			sector += curr_sects;
+
+		}
+		/* set the biop to NULL since we have already completed above */
+		*biop = NULL;
+		kfree(buf);
+		return ret;
+	}
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, bdev, 0, REQ_OP_VERIFY, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+
+		if (nr_sects > max_verify_sectors) {
+			bio->bi_iter.bi_size = max_verify_sectors << 9;
+			nr_sects -= max_verify_sectors;
+			sector += max_verify_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+	*biop = bio;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(__blkdev_issue_verify);
+
+/**
+ * blkdev_issue_verify - verify a block range
+ * @bdev:	blockdev to verify
+ * @sector:	start sector
+ * @nr_sects:	number of sectors to verify
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ *  Verify a block range using hardware offload.
+ */
+int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask)
+{
+	sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	struct bio *bio = NULL;
+	struct blk_plug plug;
+	int ret = 0;
+
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	blk_start_plug(&plug);
+	ret = __blkdev_issue_verify(bdev, sector, nr_sects, gfp_mask, &bio);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_verify);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7771dacc99cb..8ff305377b5a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -153,6 +153,20 @@ static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_verify_split(struct request_queue *q,
+		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
+{
+	*nsegs = 0;
+
+	if (!q->limits.max_verify_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_verify_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_verify_sectors, GFP_NOIO, bs);
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -336,6 +350,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
 		break;
+	case REQ_OP_VERIFY:
+		split = blk_bio_verify_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	default:
 		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
 		break;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6ccceb421ed2..c77697290bc5 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = 0;
 	lim->chunk_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_verify_sectors = 0;
 	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
@@ -80,6 +81,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_verify_sectors = UINT_MAX;
 	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -202,6 +204,19 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_verify_sectors - set max sectors for a single verify
+ *
+ * @q:  the request queue for the device
+ * @max_verify_sectors: maximum number of sectors to verify per command
+ **/
+void blk_queue_max_verify_sectors(struct request_queue *q,
+		unsigned int max_verify_sectors)
+{
+	q->limits.max_verify_sectors = max_verify_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_verify_sectors);
+
 /**
  * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
  * @q:  the request queue for the device
@@ -554,6 +569,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_verify_sectors = min(t->max_verify_sectors,
+				    b->max_verify_sectors);
 	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
 					b->max_zone_append_sectors);
 	t->bounce = max(t->bounce, b->bounce);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 88bd41d4cb59..4fb6a731acad 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -113,6 +113,12 @@ queue_ra_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 
+static ssize_t queue_verify_max_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu\n",
+		(unsigned long long)q->limits.max_verify_sectors << 9);
+}
+
 static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
 {
 	int max_sectors_kb = queue_max_sectors(q) >> 1;
@@ -588,6 +594,7 @@ QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
+QUEUE_RO_ENTRY(queue_verify_max, "verify_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -644,6 +651,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_verify_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_nonrot_entry.attr,
diff --git a/block/blk.h b/block/blk.h
index 434017701403..63a0e3aca7e0 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -132,6 +132,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_VERIFY)
+		return false;
+
 	if (req_op(rq) == REQ_OP_ZONE_APPEND)
 		return false;
 
@@ -286,6 +289,7 @@ static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		return true; /* non-trivial splitting decisions */
 	default:
 		break;
diff --git a/block/ioctl.c b/block/ioctl.c
index 46949f1b0dba..60a48e24b82d 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -192,6 +192,39 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
 	return err;
 }
 
+static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
+{
+	uint64_t range[2];
+	struct address_space *mapping;
+	uint64_t start, end, len;
+
+	if (!(mode & FMODE_READ))
+		return -EBADF;
+
+	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+		return -EFAULT;
+
+	start = range[0];
+	len = range[1];
+	end = start + len - 1;
+
+	if (start & 511)
+		return -EINVAL;
+	if (len & 511)
+		return -EINVAL;
+	if (end >= (uint64_t)i_size_read(bdev->bd_inode))
+		return -EINVAL;
+	if (end < start)
+		return -EINVAL;
+
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
+}
+
 static int put_ushort(unsigned short __user *argp, unsigned short val)
 {
 	return put_user(val, argp);
@@ -483,6 +516,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 		return blk_ioctl_secure_erase(bdev, mode, argp);
 	case BLKZEROOUT:
 		return blk_ioctl_zeroout(bdev, mode, arg);
+	case BLKVERIFY:
+		return blk_ioctl_verify(bdev, mode, arg);
 	case BLKGETDISKSEQ:
 		return put_u64(argp, bdev->bd_disk->diskseq);
 	case BLKREPORTZONE:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1cf3738ef1ea..3dfafe1da098 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -55,7 +55,8 @@ static inline bool bio_has_data(struct bio *bio)
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
-	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
+	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
+	    bio_op(bio) != REQ_OP_VERIFY)
 		return true;
 
 	return false;
@@ -65,7 +66,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_op(bio) == REQ_OP_VERIFY;
 }
 
 static inline void *bio_data(struct bio *bio)
@@ -176,7 +178,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	struct bvec_iter iter;
 
 	/*
-	 * We special case discard/write same/write zeroes, because they
+	 * We special case discard/write same/write zeroes/verify, because they
 	 * interpret bi_size differently:
 	 */
 
@@ -184,6 +186,7 @@ static inline unsigned bio_segments(struct bio *bio)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_VERIFY:
 		return 0;
 	default:
 		break;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a24d4078fb21..0d5383fc84ed 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -363,6 +363,8 @@ enum req_opf {
 	REQ_OP_FLUSH		= 2,
 	/* discard sectors */
 	REQ_OP_DISCARD		= 3,
+	/* Verify the sectors */
+	REQ_OP_VERIFY		= 6,
 	/* securely erase sectors */
 	REQ_OP_SECURE_ERASE	= 5,
 	/* write the zero filled sector many times */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 608d577734c2..78fd6c5530d7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -266,6 +266,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_secure_erase_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_verify_sectors;
 	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
@@ -925,6 +926,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
 		return q->limits.max_write_zeroes_sectors;
 
+	if (unlikely(op == REQ_OP_VERIFY))
+		return q->limits.max_verify_sectors;
+
 	return q->limits.max_sectors;
 }
 
@@ -968,6 +972,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
 		unsigned int max_discard_sectors);
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
+extern void blk_queue_max_verify_sectors(struct request_queue *q,
+		unsigned int max_verify_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
 		unsigned int max_zone_append_sectors);
@@ -1119,6 +1125,12 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
 
+extern int __blkdev_issue_verify(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop);
+extern int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask);
+
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
 {
@@ -1293,6 +1305,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_verify_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_verify_sectors;
+
+	return 0;
+}
+
 static inline bool bdev_nonrot(struct block_device *bdev)
 {
 	return blk_queue_nonrot(bdev_get_queue(bdev));
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..ad0e5cb5cac4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -185,6 +185,7 @@ struct fsxattr {
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
 #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+#define BLKVERIFY _IO(0x12,129)
 /*
  * A jump here: 130-136 are reserved for zoned block devices
  * (see uapi/linux/blkzoned.h)
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/6] nvme: add support for the Verify command
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
  2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-06-30 16:24   ` Keith Busch
  2022-06-30  9:14 ` [PATCH 3/6] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

Allow verify operations (REQ_OP_VERIFY) on the block device if the
controller advertises the optional Verify command in the ONCS field.
Add support for setting up the Verify command, and cap the maximum
number of sectors allowed in a single Verify command according to the
maximum hardware sectors supported by the controller.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 include/linux/nvme.h     | 19 +++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24165daee3c8..ef27580886b1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -838,6 +838,19 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
+		struct request *req, struct nvme_command *cmnd)
+{
+	cmnd->verify.opcode = nvme_cmd_verify;
+	cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
+	cmnd->verify.slba =
+		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
+	cmnd->verify.length =
+		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
+	cmnd->verify.control = 0;
+	return BLK_STS_OK;
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -943,6 +956,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 	case REQ_OP_WRITE_ZEROES:
 		ret = nvme_setup_write_zeroes(ns, req, cmd);
 		break;
+	case REQ_OP_VERIFY:
+		ret = nvme_setup_verify(ns, req, cmd);
+		break;
 	case REQ_OP_DISCARD:
 		ret = nvme_setup_discard(ns, req, cmd);
 		break;
@@ -1672,6 +1688,22 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
 		blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
 }
 
+static void nvme_config_verify(struct gendisk *disk, struct nvme_ns *ns)
+{
+	u64 max_blocks;
+
+	if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_VERIFY))
+		return;
+
+	if (ns->ctrl->max_hw_sectors == UINT_MAX)
+		max_blocks = (u64)USHRT_MAX + 1;
+	else
+		max_blocks = ns->ctrl->max_hw_sectors + 1;
+
+	blk_queue_max_verify_sectors(disk->queue,
+				     nvme_lba_to_sect(ns, max_blocks));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
 	return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1871,6 +1903,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	set_capacity_and_notify(disk, capacity);
 
 	nvme_config_discard(disk, ns);
+	nvme_config_verify(disk, ns);
 	blk_queue_max_write_zeroes_sectors(disk->queue,
 					   ns->ctrl->max_zeroes_sectors);
 }
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 29ec3e3481ff..578bb4931665 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -363,6 +363,7 @@ enum {
 	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
 	NVME_CTRL_ONCS_RESERVATIONS		= 1 << 5,
 	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_ONCS_VERIFY			= 1 << 7,
 	NVME_CTRL_VWC_PRESENT			= 1 << 0,
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_NS_MNGT_SUPP		= 1 << 3,
@@ -1001,6 +1002,23 @@ struct nvme_write_zeroes_cmd {
 	__le16			appmask;
 };
 
+struct nvme_verify_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			rsvd3;
+	__le32			reftag;
+	__le16			eapptag;
+	__le16			eappmask;
+};
+
 enum nvme_zone_mgmt_action {
 	NVME_ZONE_CLOSE		= 0x1,
 	NVME_ZONE_FINISH	= 0x2,
@@ -1539,6 +1557,7 @@ struct nvme_command {
 		struct nvme_format_cmd format;
 		struct nvme_dsm_cmd dsm;
 		struct nvme_write_zeroes_cmd write_zeroes;
+		struct nvme_verify_cmd verify;
 		struct nvme_zone_mgmt_send_cmd zms;
 		struct nvme_zone_mgmt_recv_cmd zmr;
 		struct nvme_abort_cmd abort;
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/6] nvmet: add Verify command support for bdev-ns
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
  2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
  2022-06-30  9:14 ` [PATCH 2/6] nvme: add support for the Verify command Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-06-30  9:14 ` [PATCH 4/6] nvmet: add Verify emulation " Chaitanya Kulkarni
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

Add support for handling the Verify command on the target. Call into
__blkdev_issue_verify(), which the block layer turns into one or more
REQ_OP_VERIFY bios covering the requested LBA range.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/admin-cmd.c   |  3 ++-
 drivers/nvme/target/io-cmd-bdev.c | 39 +++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 397daaf51f1b..495c3a31473a 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -431,7 +431,8 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
 	id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
-			NVME_CTRL_ONCS_WRITE_ZEROES);
+			NVME_CTRL_ONCS_WRITE_ZEROES |
+			NVME_CTRL_ONCS_VERIFY);
 
 	/* XXX: don't report vwc if the underlying device is write through */
 	id->vwc = NVME_CTRL_VWC_PRESENT;
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 27a72504d31c..6687e2665e26 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -146,6 +146,7 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
 		switch (req->cmd->common.opcode) {
 		case nvme_cmd_dsm:
 		case nvme_cmd_write_zeroes:
+		case nvme_cmd_verify:
 			status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR;
 			break;
 		default:
@@ -171,6 +172,10 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
 		req->error_slba =
 			le64_to_cpu(req->cmd->write_zeroes.slba);
 		break;
+	case nvme_cmd_verify:
+		req->error_slba =
+			le64_to_cpu(req->cmd->verify.slba);
+		break;
 	default:
 		req->error_slba = 0;
 	}
@@ -442,6 +447,37 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
+static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+{
+	struct nvme_verify_cmd *verify = &req->cmd->verify;
+	struct bio *bio = NULL;
+	sector_t nr_sector;
+	sector_t sector;
+	int ret;
+
+	if (!nvmet_check_transfer_len(req, 0))
+		return;
+
+	if (!bdev_verify_sectors(req->ns->bdev)) {
+		nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR);
+		return;
+	}
+
+	sector = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
+	nr_sector = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			(req->ns->blksize_shift - 9));
+
+	ret = __blkdev_issue_verify(req->ns->bdev, sector, nr_sector,
+			GFP_KERNEL, &bio);
+	if (bio) {
+		bio->bi_private = req;
+		bio->bi_end_io = nvmet_bio_done;
+		submit_bio(bio);
+	} else {
+		nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+	}
+}
+
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -460,6 +496,9 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_bdev_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_verify:
+		req->execute = nvmet_bdev_execute_verify;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/6] nvmet: add Verify emulation support for bdev-ns
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (2 preceding siblings ...)
  2022-06-30  9:14 ` [PATCH 3/6] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-07-05  8:35   ` Christoph Hellwig
  2022-06-30  9:14 ` [PATCH 5/6] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

Not all devices support verify requests that map to a controller-specific
command. This patch adds a way to emulate REQ_OP_VERIFY for an NVMeOF
block device namespace, using a new workqueue to offload the emulation.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/core.c        | 14 ++++++--
 drivers/nvme/target/io-cmd-bdev.c | 56 +++++++++++++++++++++++++------
 drivers/nvme/target/nvmet.h       |  4 ++-
 3 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 90e75324dae0..b701eeaf156a 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -16,6 +16,7 @@
 #include "nvmet.h"
 
 struct workqueue_struct *buffered_io_wq;
+struct workqueue_struct *verify_wq;
 struct workqueue_struct *zbd_wq;
 static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX];
 static DEFINE_IDA(cntlid_ida);
@@ -1611,10 +1612,16 @@ static int __init nvmet_init(void)
 
 	nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1;
 
-	zbd_wq = alloc_workqueue("nvmet-zbd-wq", WQ_MEM_RECLAIM, 0);
-	if (!zbd_wq)
+	verify_wq = alloc_workqueue("nvmet-verify-wq", WQ_MEM_RECLAIM, 0);
+	if (!verify_wq)
 		return -ENOMEM;
 
+	zbd_wq = alloc_workqueue("nvmet-zbd-wq", WQ_MEM_RECLAIM, 0);
+	if (!zbd_wq) {
+		error = -ENOMEM;
+		goto out_free_verify_work_queue;
+	}
+
 	buffered_io_wq = alloc_workqueue("nvmet-buffered-io-wq",
 			WQ_MEM_RECLAIM, 0);
 	if (!buffered_io_wq) {
@@ -1645,6 +1652,8 @@ static int __init nvmet_init(void)
 	destroy_workqueue(buffered_io_wq);
 out_free_zbd_work_queue:
 	destroy_workqueue(zbd_wq);
+out_free_verify_work_queue:
+	destroy_workqueue(verify_wq);
 	return error;
 }
 
@@ -1656,6 +1665,7 @@ static void __exit nvmet_exit(void)
 	destroy_workqueue(nvmet_wq);
 	destroy_workqueue(buffered_io_wq);
 	destroy_workqueue(zbd_wq);
+	destroy_workqueue(verify_wq);
 
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024);
 	BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024);
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 6687e2665e26..721c8571a2da 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -447,35 +447,71 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
 	}
 }
 
-static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+static void __nvmet_req_to_verify_sectors(struct nvmet_req *req,
+		sector_t *sects, sector_t *nr_sects)
 {
 	struct nvme_verify_cmd *verify = &req->cmd->verify;
+
+	*sects = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
+	*nr_sects = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			(req->ns->blksize_shift - 9));
+}
+
+static void nvmet_bdev_emulate_verify_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, b.work);
+	sector_t nr_sector;
+	sector_t sector;
+	int ret = 0;
+
+	__nvmet_req_to_verify_sectors(req, &sector, &nr_sector);
+	if (!nr_sector)
+		goto out;
+
+	/* blkdev_issue_verify() will automatically emulate */
+	ret = blkdev_issue_verify(req->ns->bdev, sector, nr_sector,
+			GFP_KERNEL);
+out:
+	nvmet_req_complete(req,
+		blk_to_nvme_status(req, errno_to_blk_status(ret)));
+}
+
+static void nvmet_bdev_submit_emulate_verify(struct nvmet_req *req)
+{
+	INIT_WORK(&req->b.work, nvmet_bdev_emulate_verify_work);
+	queue_work(verify_wq, &req->b.work);
+}
+
+static void nvmet_bdev_execute_verify(struct nvmet_req *req)
+{
 	struct bio *bio = NULL;
 	sector_t nr_sector;
 	sector_t sector;
-	int ret;
+	int ret = 0;
 
 	if (!nvmet_check_transfer_len(req, 0))
 		return;
 
+	__nvmet_req_to_verify_sectors(req, &sector, &nr_sector);
+	if (!nr_sector)
+		goto out;
+
+	/* offload emulation */
 	if (!bdev_verify_sectors(req->ns->bdev)) {
-		nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR);
+		nvmet_bdev_submit_emulate_verify(req);
 		return;
 	}
 
-	sector = le64_to_cpu(verify->slba) << (req->ns->blksize_shift - 9);
-	nr_sector = (((sector_t)le16_to_cpu(verify->length) + 1) <<
-			(req->ns->blksize_shift - 9));
-
 	ret = __blkdev_issue_verify(req->ns->bdev, sector, nr_sector,
 			GFP_KERNEL, &bio);
-	if (bio) {
+	if (ret == 0 && bio) {
 		bio->bi_private = req;
 		bio->bi_end_io = nvmet_bio_done;
 		submit_bio(bio);
-	} else {
-		nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+		return;
 	}
+out:
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
 }
 
 u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 69818752a33a..96e3f6eb4fef 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -326,7 +326,8 @@ struct nvmet_req {
 	struct bio_vec		inline_bvec[NVMET_MAX_INLINE_BIOVEC];
 	union {
 		struct {
-			struct bio      inline_bio;
+			struct bio		inline_bio;
+			struct work_struct	work;
 		} b;
 		struct {
 			bool			mpool_alloc;
@@ -365,6 +366,7 @@ struct nvmet_req {
 };
 
 extern struct workqueue_struct *buffered_io_wq;
+extern struct workqueue_struct *verify_wq;
 extern struct workqueue_struct *zbd_wq;
 extern struct workqueue_struct *nvmet_wq;
 
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/6] nvmet: add verify emulation support for file-ns
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (3 preceding siblings ...)
  2022-06-30  9:14 ` [PATCH 4/6] nvmet: add Verify emulation " Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-07-05  8:36   ` Christoph Hellwig
  2022-06-30  9:14 ` [PATCH 6/6] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

For now, there is no way to map the verify operation to a VFS-layer API.
This patch emulates the verify operation by offloading it to a workqueue
and reading back the data through VFS-layer APIs, for both buffered-I/O
and direct-I/O mode.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/nvme/target/io-cmd-file.c | 152 ++++++++++++++++++++++++++++++
 1 file changed, 152 insertions(+)

diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index f3d58abf11e0..287187d641ba 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -13,6 +13,7 @@
 
 #define NVMET_MAX_MPOOL_BVEC		16
 #define NVMET_MIN_MPOOL_OBJ		16
+#define NVMET_VERIFY_BUF_LEN		(BIO_MAX_VECS << PAGE_SHIFT)
 
 void nvmet_file_ns_revalidate(struct nvmet_ns *ns)
 {
@@ -376,6 +377,154 @@ static void nvmet_file_execute_write_zeroes(struct nvmet_req *req)
 	queue_work(nvmet_wq, &req->f.work);
 }
 
+static void __nvmet_req_to_verify_offset(struct nvmet_req *req, loff_t *offset,
+		ssize_t *len)
+{
+	struct nvme_verify_cmd *verify = &req->cmd->verify;
+
+	*offset = le64_to_cpu(verify->slba) << req->ns->blksize_shift;
+	*len = (((sector_t)le16_to_cpu(verify->length) + 1) <<
+			req->ns->blksize_shift);
+}
+
+static int do_buffered_io_emulate_verify(struct file *f, loff_t offset,
+		ssize_t len)
+{
+	char *buf = NULL;
+	int ret = 0;
+	ssize_t rc;
+
+	buf = kmalloc(NVMET_VERIFY_BUF_LEN, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	while (len > 0) {
+		ssize_t curr_len = min_t(ssize_t, len, NVMET_VERIFY_BUF_LEN);
+
+		rc = kernel_read(f, buf, curr_len, &offset);
+		if (rc != curr_len) {
+			pr_err("kernel_read %lu curr_len %lu\n", rc, curr_len);
+			ret = -EINVAL;
+			break;
+		}
+
+		len -= curr_len;
+		offset += curr_len;
+		cond_resched();
+	}
+
+	kfree(buf);
+	return ret;
+}
+
+static int do_direct_io_emulate_verify(struct file *f, loff_t offset,
+		ssize_t len)
+{
+	struct scatterlist *sgl = NULL;
+	struct bio_vec *bvec = NULL;
+	struct iov_iter iter = { 0 };
+	struct kiocb iocb = { 0 };
+	unsigned int sgl_nents;
+	ssize_t ret = 0;
+	int i;
+
+	while (len > 0) {
+		ssize_t curr_len = min_t(ssize_t, len, NVMET_VERIFY_BUF_LEN);
+		struct scatterlist *sg = NULL;
+		unsigned int bv_len = 0;
+		ssize_t rc;
+
+		sgl = sgl_alloc(curr_len, GFP_KERNEL, &sgl_nents);
+		if (!sgl) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bvec = kmalloc_array(sgl_nents, sizeof(struct bio_vec),
+				GFP_KERNEL);
+		if (!bvec) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		for_each_sg(sgl, sg, sgl_nents, i) {
+			nvmet_file_init_bvec(&bvec[i], sg);
+			bv_len += sg->length;
+		}
+
+		if (bv_len != curr_len) {
+			pr_err("length mismatch sgl & bvec\n");
+			ret = -EINVAL;
+			break;
+		}
+
+		iocb.ki_pos = offset;
+		iocb.ki_filp = f;
+		iocb.ki_complete = NULL; /* Sync I/O */
+		iocb.ki_flags |= IOCB_DIRECT;
+
+		iov_iter_bvec(&iter, READ, bvec, sgl_nents, bv_len);
+
+		rc = call_read_iter(f, &iocb, &iter);
+		if (rc != curr_len) {
+			pr_err("read len mismatch expected %lu got %ld\n",
+					curr_len, rc);
+			ret = -EINVAL;
+			break;
+		}
+
+		cond_resched();
+
+		len -= curr_len;
+		offset += curr_len;
+
+		kfree(bvec);
+		sgl_free(sgl);
+		bvec = NULL;
+		sgl = NULL;
+		memset(&iocb, 0, sizeof(iocb));
+		memset(&iter, 0, sizeof(iter));
+	}
+
+	kfree(bvec);
+	sgl_free(sgl);
+	return ret;
+}
+
+static void nvmet_file_emulate_verify_work(struct work_struct *w)
+{
+	struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+	loff_t offset;
+	ssize_t len;
+	int ret = 0;
+
+	__nvmet_req_to_verify_offset(req, &offset, &len);
+	if (!len)
+		goto out;
+
+	if (unlikely(offset + len > req->ns->size)) {
+		nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+		return;
+	}
+
+	if (req->ns->buffered_io)
+		ret = do_buffered_io_emulate_verify(req->ns->file, offset, len);
+	else
+		ret = do_direct_io_emulate_verify(req->ns->file, offset, len);
+out:
+	nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
+static void nvmet_file_execute_verify(struct nvmet_req *req)
+{
+	if (!nvmet_check_data_len_lte(req, 0))
+		return;
+
+	INIT_WORK(&req->f.work, nvmet_file_emulate_verify_work);
+	queue_work(verify_wq, &req->f.work);
+}
+
+
 u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 {
 	switch (req->cmd->common.opcode) {
@@ -392,6 +541,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
 	case nvme_cmd_write_zeroes:
 		req->execute = nvmet_file_execute_write_zeroes;
 		return 0;
+	case nvme_cmd_verify:
+		req->execute = nvmet_file_execute_verify;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/6] null_blk: add REQ_OP_VERIFY support
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (4 preceding siblings ...)
  2022-06-30  9:14 ` [PATCH 5/6] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
@ 2022-06-30  9:14 ` Chaitanya Kulkarni
  2022-07-05  8:38 ` [PATCH 0/6] block: add support for REQ_OP_VERIFY Christoph Hellwig
  2022-07-06 17:42 ` Matthew Wilcox
  7 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-30  9:14 UTC (permalink / raw)
  To: linux-block, linux-raid, linux-nvme, linux-fsdevel
  Cc: axboe, agk, song, djwong, kbusch, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang, Chaitanya Kulkarni

Add a new module parameter and configfs attribute to configure handling
of REQ_OP_VERIFY. This is needed for testing the newly added
REQ_OP_VERIFY block layer operation.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 drivers/block/null_blk/main.c     | 20 +++++++++++++++++++-
 drivers/block/null_blk/null_blk.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 6b67088f4ea7..559daac83b94 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -77,6 +77,10 @@ enum {
 	NULL_IRQ_TIMER		= 2,
 };
 
+static bool g_verify = true;
+module_param_named(verify, g_verify, bool, 0444);
+MODULE_PARM_DESC(verify, "Allow REQ_OP_VERIFY processing. Default: true");
+
 static bool g_virt_boundary = false;
 module_param_named(virt_boundary, g_virt_boundary, bool, 0444);
 MODULE_PARM_DESC(virt_boundary, "Require a virtual boundary for the device. Default: False");
@@ -400,6 +404,7 @@ NULLB_DEVICE_ATTR(blocking, bool, NULL);
 NULLB_DEVICE_ATTR(use_per_node_hctx, bool, NULL);
 NULLB_DEVICE_ATTR(memory_backed, bool, NULL);
 NULLB_DEVICE_ATTR(discard, bool, NULL);
+NULLB_DEVICE_ATTR(verify, bool, NULL);
 NULLB_DEVICE_ATTR(mbps, uint, NULL);
 NULLB_DEVICE_ATTR(cache_size, ulong, NULL);
 NULLB_DEVICE_ATTR(zoned, bool, NULL);
@@ -522,6 +527,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_power,
 	&nullb_device_attr_memory_backed,
 	&nullb_device_attr_discard,
+	&nullb_device_attr_verify,
 	&nullb_device_attr_mbps,
 	&nullb_device_attr_cache_size,
 	&nullb_device_attr_badblocks,
@@ -588,7 +594,7 @@ nullb_group_drop_item(struct config_group *group, struct config_item *item)
 static ssize_t memb_group_features_show(struct config_item *item, char *page)
 {
 	return snprintf(page, PAGE_SIZE,
-			"memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size,zone_capacity,zone_nr_conv,zone_max_open,zone_max_active,blocksize,max_sectors,virt_boundary\n");
+			"memory_backed,discard,verify,bandwidth,cache,badblocks,zoned,zone_size,zone_capacity,zone_nr_conv,zone_max_open,zone_max_active,blocksize,max_sectors,virt_boundary\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
@@ -651,6 +657,7 @@ static struct nullb_device *null_alloc_dev(void)
 	dev->hw_queue_depth = g_hw_queue_depth;
 	dev->blocking = g_blocking;
 	dev->use_per_node_hctx = g_use_per_node_hctx;
+	dev->verify = g_verify;
 	dev->zoned = g_zoned;
 	dev->zone_size = g_zone_size;
 	dev->zone_capacity = g_zone_capacity;
@@ -1394,6 +1401,10 @@ blk_status_t null_process_cmd(struct nullb_cmd *cmd,
 			return ret;
 	}
 
+	/* currently implemented as noop */
+	if (op == REQ_OP_VERIFY)
+		return 0;
+
 	if (dev->memory_backed)
 		return null_handle_memory_backed(cmd, op, sector, nr_sectors);
 
@@ -1769,6 +1780,12 @@ static void null_config_discard(struct nullb *nullb)
 	blk_queue_max_discard_sectors(nullb->q, UINT_MAX >> 9);
 }
 
+static void null_config_verify(struct nullb *nullb)
+{
+	blk_queue_max_verify_sectors(nullb->q,
+				     nullb->dev->verify ? UINT_MAX >> 9 : 0);
+}
+
 static const struct block_device_operations null_bio_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= null_submit_bio,
@@ -2059,6 +2076,7 @@ static int null_add_dev(struct nullb_device *dev)
 		blk_queue_virt_boundary(nullb->q, PAGE_SIZE - 1);
 
 	null_config_discard(nullb);
+	null_config_verify(nullb);
 
 	if (config_item_name(&dev->item)) {
 		/* Use configfs dir name as the device name */
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 8359b43842f2..2a1df1bc8165 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -111,6 +111,7 @@ struct nullb_device {
 	bool power; /* power on/off the device */
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
+	bool verify; /* if support verify */
 	bool zoned; /* if device is zoned */
 	bool virt_boundary; /* virtual boundary on/off for the device */
 };
-- 
2.29.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
@ 2022-06-30 16:18   ` Darrick J. Wong
  2022-07-05 16:50     ` Chaitanya Kulkarni
  2022-07-05  8:34   ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2022-06-30 16:18 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On Thu, Jun 30, 2022 at 02:14:01AM -0700, Chaitanya Kulkarni wrote:
> This adds a new block layer operation to offload verifying a range of
> LBAs. This support is needed so that file systems, fabrics, and other
> kernel components can offload LBA verification when the hardware
> controller supports it; the prominent examples are the SCSI and NVMe
> Verify commands. When hardware offloading is not available, we provide
> an API that emulates the operation by reading the range back. The
> emulation is still useful when the block device is remotely attached,
> e.g. using NVMeOF, since the reads then happen on the target side.
> 
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
>  Documentation/ABI/stable/sysfs-block |  12 +++
>  block/blk-core.c                     |   5 +
>  block/blk-lib.c                      | 155 +++++++++++++++++++++++++++
>  block/blk-merge.c                    |  18 ++++
>  block/blk-settings.c                 |  17 +++
>  block/blk-sysfs.c                    |   8 ++
>  block/blk.h                          |   4 +
>  block/ioctl.c                        |  35 ++++++
>  include/linux/bio.h                  |   9 +-
>  include/linux/blk_types.h            |   2 +
>  include/linux/blkdev.h               |  22 ++++
>  include/uapi/linux/fs.h              |   1 +
>  12 files changed, 285 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index e8797cd09aff..a71d9c41cf8b 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -657,6 +657,18 @@ Description:
>  		in a single write zeroes command. If write_zeroes_max_bytes is
>  		0, write zeroes is not supported by the device.
>  
> +What:		/sys/block/<disk>/queue/verify_max_bytes
> +Date:		April 2022
> +Contact:	Chaitanya Kulkarni <kch@nvidia.com>
> +Description:
> +		Devices that support the verify operation allow a single
> +		request to verify a range of contiguous blocks on the
> +		storage without any data payload in the request. This can
> +		be used to offload verifying LBAs without reading them
> +		back to the host. verify_max_bytes indicates how many
> +		bytes can be verified in a single verify command. If
> +		verify_max_bytes is 0, the verify operation is not
> +		supported by the device.
>  
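For reference, the sysfs value is just the queue's max_verify_sectors
limit expressed in bytes. A hedged sketch of the conversion (the
UINT_MAX >> 9 sector count matches what the null_blk patch in this
series advertises when verify=1; real devices report their own limit):

```shell
# verify_max_bytes = max_verify_sectors << 9, where sectors are
# 512-byte units.  UINT_MAX >> 9 is the null_blk verify=1 limit.
max_verify_sectors=$(( 0xFFFFFFFF >> 9 ))
verify_max_bytes=$(( max_verify_sectors << 9 ))
echo "$verify_max_bytes"
```

A value of 0 read from /sys/block/<disk>/queue/verify_max_bytes would
mean the device does not support verify.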
>  What:		/sys/block/<disk>/queue/zone_append_max_bytes
>  Date:		May 2020
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 06ff5bbfe8f6..9ad52247dcdf 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -123,6 +123,7 @@ static const char *const blk_op_name[] = {
>  	REQ_OP_NAME(ZONE_FINISH),
>  	REQ_OP_NAME(ZONE_APPEND),
>  	REQ_OP_NAME(WRITE_ZEROES),
> +	REQ_OP_NAME(VERIFY),
>  	REQ_OP_NAME(DRV_IN),
>  	REQ_OP_NAME(DRV_OUT),
>  };
> @@ -842,6 +843,10 @@ void submit_bio_noacct(struct bio *bio)
>  		if (!q->limits.max_write_zeroes_sectors)
>  			goto not_supported;
>  		break;
> +	case REQ_OP_VERIFY:
> +		if (!q->limits.max_verify_sectors)
> +			goto not_supported;
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 09b7e1200c0f..4624d68bb3cb 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -340,3 +340,158 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
>  	return ret;
>  }
>  EXPORT_SYMBOL(blkdev_issue_secure_erase);
> +
> +/**
> + * __blkdev_emulate_verify - emulate verify operations asynchronously
> + * @bdev:	blockdev to issue
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to verify
> + * @gfp_mask:	memory allocation flags (for bio_alloc)
> + * @biop:	pointer to anchor bio
> + * @buf:	data buffer to be mapped onto the bio
> + *
> + * Description:
> + *  Verify a block range by translating REQ_OP_VERIFY into REQ_OP_READ;
> + *  use this when H/W offloading of verify is not supported. The caller
> + *  is responsible for handling the anchored bio.
> + */
> +static int __blkdev_emulate_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, char *buf)
> +{
> +	struct bio *bio = *biop;
> +	unsigned int sz;
> +	int bi_size;
> +
> +	while (nr_sects != 0) {
> +		bio = blk_next_bio(bio, bdev,
> +				__blkdev_sectors_to_bio_pages(nr_sects),
> +				REQ_OP_READ, gfp_mask);
> +		bio->bi_iter.bi_sector = sector;
> +
> +		while (nr_sects != 0) {
> +			bool is_vaddr = is_vmalloc_addr(buf);
> +			struct page *p;
> +
> +			p = is_vaddr ? vmalloc_to_page(buf) : virt_to_page(buf);
> +			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
> +
> +			bi_size = bio_add_page(bio, p, sz, offset_in_page(buf));
> +			if (bi_size < sz)
> +				return -EIO;
> +
> +			nr_sects -= bi_size >> 9;
> +			sector += bi_size >> 9;
> +			buf += bi_size;
> +		}
> +		cond_resched();
> +	}
> +
> +	*biop = bio;
> +	return 0;
> +}
> +
> +/**
> + * __blkdev_issue_verify - issue verify operations for a block range
> + * @bdev:	blockdev to issue
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to verify
> + * @gfp_mask:	memory allocation flags (for bio_alloc())
> + * @biop:	pointer to anchor bio
> + *
> + * Description:
> + *  Verify a block range using hardware offload.
> + *
> + * The function will emulate the verify operation if no explicit
> + * hardware offload for verifying is provided.
> + */
> +int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> +{
> +	unsigned int max_verify_sectors = bdev_verify_sectors(bdev);
> +	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
> +	struct bio *bio = *biop;
> +	sector_t curr_sects;
> +	char *buf;
> +
> +	if (!max_verify_sectors) {
> +		int ret = 0;
> +
> +		buf = kzalloc(min_io_sect << 9, GFP_KERNEL);

k*z*alloc?  I don't think you need to zero a buffer that we're reading
into, right?

--D

> +		if (!buf)
> +			return -ENOMEM;
> +
> +		while (nr_sects > 0) {
> +			curr_sects = min_t(sector_t, nr_sects, min_io_sect);
> +			ret = __blkdev_emulate_verify(bdev, sector, curr_sects,
> +						      gfp_mask, &bio, buf);
> +			if (ret)
> +				break;
> +
> +			if (bio) {
> +				ret = submit_bio_wait(bio);
> +				bio_put(bio);
> +				bio = NULL;
> +			}
> +
> +			nr_sects -= curr_sects;
> +			sector += curr_sects;
> +
> +		}
> +		/* set *biop to NULL since we have already completed the bios above */
> +		*biop = NULL;
> +		kfree(buf);
> +		return ret;
> +	}
> +
> +	while (nr_sects) {
> +		bio = blk_next_bio(bio, bdev, 0, REQ_OP_VERIFY, gfp_mask);
> +		bio->bi_iter.bi_sector = sector;
> +
> +		if (nr_sects > max_verify_sectors) {
> +			bio->bi_iter.bi_size = max_verify_sectors << 9;
> +			nr_sects -= max_verify_sectors;
> +			sector += max_verify_sectors;
> +		} else {
> +			bio->bi_iter.bi_size = nr_sects << 9;
> +			nr_sects = 0;
> +		}
> +		cond_resched();
> +	}
> +	*biop = bio;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(__blkdev_issue_verify);
> +
> +/**
> + * blkdev_issue_verify - verify a block range
> + * @bdev:	blockdev to verify
> + * @sector:	start sector
> + * @nr_sects:	number of sectors to verify
> + * @gfp_mask:	memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *  Verify a block range using hardware offload.
> + */
> +int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask)
> +{
> +	sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> +	struct bio *bio = NULL;
> +	struct blk_plug plug;
> +	int ret = 0;
> +
> +	if ((sector | nr_sects) & bs_mask)
> +		return -EINVAL;
> +
> +	blk_start_plug(&plug);
> +	ret = __blkdev_issue_verify(bdev, sector, nr_sects, gfp_mask, &bio);
> +	if (ret == 0 && bio) {
> +		ret = submit_bio_wait(bio);
> +		bio_put(bio);
> +	}
> +	blk_finish_plug(&plug);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(blkdev_issue_verify);
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 7771dacc99cb..8ff305377b5a 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -153,6 +153,20 @@ static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
>  	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
>  }
>  
> +static struct bio *blk_bio_verify_split(struct request_queue *q,
> +		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
> +{
> +	*nsegs = 0;
> +
> +	if (!q->limits.max_verify_sectors)
> +		return NULL;
> +
> +	if (bio_sectors(bio) <= q->limits.max_verify_sectors)
> +		return NULL;
> +
> +	return bio_split(bio, q->limits.max_verify_sectors, GFP_NOIO, bs);
> +}
> +
>  /*
>   * Return the maximum number of sectors from the start of a bio that may be
>   * submitted as a single request to a block device. If enough sectors remain,
> @@ -336,6 +350,10 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio,
>  		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
>  				nr_segs);
>  		break;
> +	case REQ_OP_VERIFY:
> +		split = blk_bio_verify_split(q, *bio, &q->bio_split,
> +				nr_segs);
> +		break;
>  	default:
>  		split = blk_bio_segment_split(q, *bio, &q->bio_split, nr_segs);
>  		break;
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 6ccceb421ed2..c77697290bc5 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -43,6 +43,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->max_dev_sectors = 0;
>  	lim->chunk_sectors = 0;
>  	lim->max_write_zeroes_sectors = 0;
> +	lim->max_verify_sectors = 0;
>  	lim->max_zone_append_sectors = 0;
>  	lim->max_discard_sectors = 0;
>  	lim->max_hw_discard_sectors = 0;
> @@ -80,6 +81,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
>  	lim->max_sectors = UINT_MAX;
>  	lim->max_dev_sectors = UINT_MAX;
>  	lim->max_write_zeroes_sectors = UINT_MAX;
> +	lim->max_verify_sectors = UINT_MAX;
>  	lim->max_zone_append_sectors = UINT_MAX;
>  }
>  EXPORT_SYMBOL(blk_set_stacking_limits);
> @@ -202,6 +204,19 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
>  
> +/**
> + * blk_queue_max_verify_sectors - set max sectors for a single verify
> + *
> + * @q:  the request queue for the device
> + * @max_verify_sectors: maximum number of sectors to verify per command
> + **/
> +void blk_queue_max_verify_sectors(struct request_queue *q,
> +		unsigned int max_verify_sectors)
> +{
> +	q->limits.max_verify_sectors = max_verify_sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_max_verify_sectors);
> +
>  /**
>   * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
>   * @q:  the request queue for the device
> @@ -554,6 +569,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
>  	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
>  					b->max_write_zeroes_sectors);
> +	t->max_verify_sectors = min(t->max_verify_sectors,
> +				    b->max_verify_sectors);
>  	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
>  					b->max_zone_append_sectors);
>  	t->bounce = max(t->bounce, b->bounce);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 88bd41d4cb59..4fb6a731acad 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -113,6 +113,12 @@ queue_ra_store(struct request_queue *q, const char *page, size_t count)
>  	return ret;
>  }
>  
> +static ssize_t queue_verify_max_show(struct request_queue *q, char *page)
> +{
> +	return sprintf(page, "%llu\n",
> +		(unsigned long long)q->limits.max_verify_sectors << 9);
> +}
> +
>  static ssize_t queue_max_sectors_show(struct request_queue *q, char *page)
>  {
>  	int max_sectors_kb = queue_max_sectors(q) >> 1;
> @@ -588,6 +594,7 @@ QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
>  
>  QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
>  QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
> +QUEUE_RO_ENTRY(queue_verify_max, "verify_max_bytes");
>  QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
>  QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>  
> @@ -644,6 +651,7 @@ static struct attribute *queue_attrs[] = {
>  	&queue_discard_zeroes_data_entry.attr,
>  	&queue_write_same_max_entry.attr,
>  	&queue_write_zeroes_max_entry.attr,
> +	&queue_verify_max_entry.attr,
>  	&queue_zone_append_max_entry.attr,
>  	&queue_zone_write_granularity_entry.attr,
>  	&queue_nonrot_entry.attr,
> diff --git a/block/blk.h b/block/blk.h
> index 434017701403..63a0e3aca7e0 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -132,6 +132,9 @@ static inline bool rq_mergeable(struct request *rq)
>  	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
>  		return false;
>  
> +	if (req_op(rq) == REQ_OP_VERIFY)
> +		return false;
> +
>  	if (req_op(rq) == REQ_OP_ZONE_APPEND)
>  		return false;
>  
> @@ -286,6 +289,7 @@ static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_VERIFY:
>  		return true; /* non-trivial splitting decisions */
>  	default:
>  		break;
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 46949f1b0dba..60a48e24b82d 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -192,6 +192,39 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>  	return err;
>  }
>  
> +static int blk_ioctl_verify(struct block_device *bdev, fmode_t mode,
> +		unsigned long arg)
> +{
> +	uint64_t range[2];
> +	struct address_space *mapping;
> +	uint64_t start, end, len;
> +
> +	if (!(mode & FMODE_READ))
> +		return -EBADF;
> +
> +	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
> +		return -EFAULT;
> +
> +	start = range[0];
> +	len = range[1];
> +	end = start + len - 1;
> +
> +	if (start & 511)
> +		return -EINVAL;
> +	if (len & 511)
> +		return -EINVAL;
> +	if (end >= (uint64_t)i_size_read(bdev->bd_inode))
> +		return -EINVAL;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	/* Invalidate the page cache, including dirty pages */
> +	mapping = bdev->bd_inode->i_mapping;
> +	truncate_inode_pages_range(mapping, start, end);

You might want to write any dirty pagecache contents to disk before you
invalidate them all...

> +
> +	return blkdev_issue_verify(bdev, start >> 9, len >> 9, GFP_KERNEL);
> +}
> +
>  static int put_ushort(unsigned short __user *argp, unsigned short val)
>  {
>  	return put_user(val, argp);
> @@ -483,6 +516,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>  		return blk_ioctl_secure_erase(bdev, mode, argp);
>  	case BLKZEROOUT:
>  		return blk_ioctl_zeroout(bdev, mode, arg);
> +	case BLKVERIFY:
> +		return blk_ioctl_verify(bdev, mode, arg);
>  	case BLKGETDISKSEQ:
>  		return put_u64(argp, bdev->bd_disk->diskseq);
>  	case BLKREPORTZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1cf3738ef1ea..3dfafe1da098 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -55,7 +55,8 @@ static inline bool bio_has_data(struct bio *bio)
>  	    bio->bi_iter.bi_size &&
>  	    bio_op(bio) != REQ_OP_DISCARD &&
>  	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
> -	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
> +	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
> +	    bio_op(bio) != REQ_OP_VERIFY)
>  		return true;
>  
>  	return false;
> @@ -65,7 +66,8 @@ static inline bool bio_no_advance_iter(const struct bio *bio)
>  {
>  	return bio_op(bio) == REQ_OP_DISCARD ||
>  	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
> -	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
> +	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
> +	       bio_op(bio) == REQ_OP_VERIFY;
>  }
>  
>  static inline void *bio_data(struct bio *bio)
> @@ -176,7 +178,7 @@ static inline unsigned bio_segments(struct bio *bio)
>  	struct bvec_iter iter;
>  
>  	/*
> -	 * We special case discard/write same/write zeroes, because they
> +	 * We special case discard/write same/write zeroes/verify, because they
>  	 * interpret bi_size differently:
>  	 */
>  
> @@ -184,6 +186,7 @@ static inline unsigned bio_segments(struct bio *bio)
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:
>  	case REQ_OP_WRITE_ZEROES:
> +	case REQ_OP_VERIFY:
>  		return 0;
>  	default:
>  		break;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index a24d4078fb21..0d5383fc84ed 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -363,6 +363,8 @@ enum req_opf {
>  	REQ_OP_FLUSH		= 2,
>  	/* discard sectors */
>  	REQ_OP_DISCARD		= 3,
>  	/* securely erase sectors */
>  	REQ_OP_SECURE_ERASE	= 5,
> +	/* verify sectors */
> +	REQ_OP_VERIFY		= 6,
>  	/* write the zero filled sector many times */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 608d577734c2..78fd6c5530d7 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -266,6 +266,7 @@ struct queue_limits {
>  	unsigned int		max_hw_discard_sectors;
>  	unsigned int		max_secure_erase_sectors;
>  	unsigned int		max_write_zeroes_sectors;
> +	unsigned int		max_verify_sectors;
>  	unsigned int		max_zone_append_sectors;
>  	unsigned int		discard_granularity;
>  	unsigned int		discard_alignment;
> @@ -925,6 +926,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
>  	if (unlikely(op == REQ_OP_WRITE_ZEROES))
>  		return q->limits.max_write_zeroes_sectors;
>  
> +	if (unlikely(op == REQ_OP_VERIFY))
> +		return q->limits.max_verify_sectors;
> +
>  	return q->limits.max_sectors;
>  }
>  
> @@ -968,6 +972,8 @@ extern void blk_queue_max_discard_sectors(struct request_queue *q,
>  		unsigned int max_discard_sectors);
>  extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
>  		unsigned int max_write_same_sectors);
> +extern void blk_queue_max_verify_sectors(struct request_queue *q,
> +		unsigned int max_verify_sectors);
>  extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
>  extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
>  		unsigned int max_zone_append_sectors);
> @@ -1119,6 +1125,12 @@ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
>  		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
>  
> +extern int __blkdev_issue_verify(struct block_device *bdev,
> +		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> +		struct bio **biop);
> +extern int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
> +		sector_t nr_sects, gfp_t gfp_mask);
> +
>  static inline int sb_issue_discard(struct super_block *sb, sector_t block,
>  		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
>  {
> @@ -1293,6 +1305,16 @@ static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
>  	return 0;
>  }
>  
> +static inline unsigned int bdev_verify_sectors(struct block_device *bdev)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	if (q)
> +		return q->limits.max_verify_sectors;
> +
> +	return 0;
> +}
> +
>  static inline bool bdev_nonrot(struct block_device *bdev)
>  {
>  	return blk_queue_nonrot(bdev_get_queue(bdev));
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index bdf7b404b3e7..ad0e5cb5cac4 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -185,6 +185,7 @@ struct fsxattr {
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
>  #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
> +#define BLKVERIFY _IO(0x12,129)
>  /*
>   * A jump here: 130-136 are reserved for zoned block devices
>   * (see uapi/linux/blkzoned.h)
> -- 
> 2.29.0
> 

* Re: [PATCH 2/6] nvme: add support for the Verify command
  2022-06-30  9:14 ` [PATCH 2/6] nvme: add support for the Verify command Chaitanya Kulkarni
@ 2022-06-30 16:24   ` Keith Busch
  2022-07-05  8:34     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Keith Busch @ 2022-06-30 16:24 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On Thu, Jun 30, 2022 at 02:14:02AM -0700, Chaitanya Kulkarni wrote:
> Allow verify operations (REQ_OP_VERIFY) on the block device, if the
> device supports optional command bit set for verify. Add support
> to setup verify command. Set maximum possible verify sectors in one
> verify command according to maximum hardware sectors supported by the
> controller.

Shouldn't the limit be determined by Identify Controller NVM CSS's 'VSL' field
instead of its max data transfer?

* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
  2022-06-30 16:18   ` Darrick J. Wong
@ 2022-07-05  8:34   ` Christoph Hellwig
  2022-07-05 23:55     ` Chaitanya Kulkarni
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2022-07-05  8:34 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Jun 30, 2022 at 02:14:01AM -0700, Chaitanya Kulkarni wrote:
> This adds a new block layer operation to offload verifying a range of
> LBAs. This support is needed so that file systems, fabrics, and other
> kernel components can offload LBA verification when the hardware
> controller supports it; the prominent examples are the SCSI and NVMe
> Verify commands. When hardware offloading is not available, we provide
> an API that emulates the operation by reading the range back. The
> emulation is still useful when the block device is remotely attached,
> e.g. using NVMeOF, since the reads then happen on the target side.

What is the point of providing the offload?

* Re: [PATCH 2/6] nvme: add support for the Verify command
  2022-06-30 16:24   ` Keith Busch
@ 2022-07-05  8:34     ` Christoph Hellwig
  2022-07-05 23:56       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2022-07-05  8:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-fsdevel, axboe, agk, song, djwong, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang

On Thu, Jun 30, 2022 at 10:24:14AM -0600, Keith Busch wrote:
> On Thu, Jun 30, 2022 at 02:14:02AM -0700, Chaitanya Kulkarni wrote:
> > Allow verify operations (REQ_OP_VERIFY) on the block device, if the
> > device supports optional command bit set for verify. Add support
> > to setup verify command. Set maximum possible verify sectors in one
> > verify command according to maximum hardware sectors supported by the
> > controller.
> 
> Shouldn't the limit be determined by Identify Controller NVM CSS's 'VSL' field
> instead of its max data transfer?

Yes.

* Re: [PATCH 4/6] nvmet: add Verify emulation support for bdev-ns
  2022-06-30  9:14 ` [PATCH 4/6] nvmet: add Verify emulation " Chaitanya Kulkarni
@ 2022-07-05  8:35   ` Christoph Hellwig
  2022-07-06  0:00     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2022-07-05  8:35 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Jun 30, 2022 at 02:14:04AM -0700, Chaitanya Kulkarni wrote:
> Not all devices can support verify requests which can be mapped to
> the controller specific command. This patch adds a way to emulate
> REQ_OP_VERIFY for NVMeOF block device namespace. We add a new
> workqueue to offload the emulation.

How is this an "emulation"?

Also why do we need the workqueue offloads?  I can't see any good
reason to not just simply submit the bio asynchronously like all the
other bios submitted by the block device backend.

* Re: [PATCH 5/6] nvmet: add verify emulation support for file-ns
  2022-06-30  9:14 ` [PATCH 5/6] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
@ 2022-07-05  8:36   ` Christoph Hellwig
  2022-07-06  0:06     ` Chaitanya Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2022-07-05  8:36 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Jun 30, 2022 at 02:14:05AM -0700, Chaitanya Kulkarni wrote:
> For now, there is no way to map verify operation to the VFS layer API.
> This patch emulates verify operation by offloading it to the workqueue
> and reading the data using vfs layer APIs for both buffered io and
> direct io mode.

What does this buy us?  I'm generally not too happy about adding
actual logic to nvmet; it is supposed to mostly just pass through
operations supported by block devices or files.

* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (5 preceding siblings ...)
  2022-06-30  9:14 ` [PATCH 6/6] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
@ 2022-07-05  8:38 ` Christoph Hellwig
  2022-07-05 16:47   ` Chaitanya Kulkarni
  2022-07-06 17:42 ` Matthew Wilcox
  7 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2022-07-05  8:38 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	willy, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add
> support for block layer. NVMe host side, NVMeOF Block device backend,
> and NVMeOF File device backend and null_blk driver.

Who is "we" in this patch set?

> Here is a link for the complete cover-letter for the background to save
> reviewer's time :-
> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/

Well, the cover letter should be here.

Also I don't see the point of an NVMe-only implementation.  If we
don't also cover SCSI and ATA, there isn't much this API buys a
potential user.

* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-05  8:38 ` [PATCH 0/6] block: add support for REQ_OP_VERIFY Christoph Hellwig
@ 2022-07-05 16:47   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-05 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel,
	Chaitanya Kulkarni, axboe, agk, song, djwong, kbusch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang

On 7/5/22 01:38, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
>> support for block layer. NVMe host side, NVMeOF Block device backend,
>> and NVMeOF File device backend and null_blk driver.
> 
> Who is "we" in this patch set?
> 
>> Here is a link for the complete cover-letter for the background to save
>> reviewer's time :-
>> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/
> 
> Well, the cover letter should be here.

I'll send out the next version with the updated cover-letter and fix
the wording.

> 
> Also I don't see how an NVMe-only implementation.  If we don't also
> cover SCSI and ATA there isn't much this API buys a potential user.

I'll add that in the next version. The device I was using to test the
SCSI code went bad; I think scsi_debug should be okay for testing until
I have a SCSI controller.

-ck



* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-06-30 16:18   ` Darrick J. Wong
@ 2022-07-05 16:50     ` Chaitanya Kulkarni
  2022-07-05 17:57       ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-05 16:50 UTC (permalink / raw)
  To: Darrick J. Wong, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

Darrick,

Thanks for the reply.

>> +
>> +/**
>> + * __blkdev_issue_verify - generate number of verify operations
>> + * @bdev:	blockdev to issue
>> + * @sector:	start sector
>> + * @nr_sects:	number of sectors to verify
>> + * @gfp_mask:	memory allocation flags (for bio_alloc())
>> + * @biop:	pointer to anchor bio
>> + *
>> + * Description:
>> + *  Verify a block range using hardware offload.
>> + *
>> + * The function will emulate verify operation if no explicit hardware
>> + * offloading for verifying is provided.
>> + */
>> +int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
>> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
>> +{
>> +	unsigned int max_verify_sectors = bdev_verify_sectors(bdev);
>> +	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
>> +	struct bio *bio = *biop;
>> +	sector_t curr_sects;
>> +	char *buf;
>> +
>> +	if (!max_verify_sectors) {
>> +		int ret = 0;
>> +
>> +		buf = kzalloc(min_io_sect << 9, GFP_KERNEL);
> 
> k*z*alloc?  I don't think you need to zero a buffer that we're reading
> into, right?
> 
> --D

We don't need to, but I guess it is just a habit to make sure the
allocated buffer is zeroed. Should I remove it for any particular
reason?

-ck



* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-07-05 16:50     ` Chaitanya Kulkarni
@ 2022-07-05 17:57       ` Darrick J. Wong
  2022-07-06  1:32         ` Chaitanya Kulkarni
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2022-07-05 17:57 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On Tue, Jul 05, 2022 at 04:50:33PM +0000, Chaitanya Kulkarni wrote:
> Darrick,
> 
> Thanks for the reply.
> 
> >> +
> >> +/**
> >> + * __blkdev_issue_verify - generate number of verify operations
> >> + * @bdev:	blockdev to issue
> >> + * @sector:	start sector
> >> + * @nr_sects:	number of sectors to verify
> >> + * @gfp_mask:	memory allocation flags (for bio_alloc())
> >> + * @biop:	pointer to anchor bio
> >> + *
> >> + * Description:
> >> + *  Verify a block range using hardware offload.
> >> + *
> >> + * The function will emulate verify operation if no explicit hardware
> >> + * offloading for verifying is provided.
> >> + */
> >> +int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
> >> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
> >> +{
> >> +	unsigned int max_verify_sectors = bdev_verify_sectors(bdev);
> >> +	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
> >> +	struct bio *bio = *biop;
> >> +	sector_t curr_sects;
> >> +	char *buf;
> >> +
> >> +	if (!max_verify_sectors) {
> >> +		int ret = 0;
> >> +
> >> +		buf = kzalloc(min_io_sect << 9, GFP_KERNEL);
> > 
> > k*z*alloc?  I don't think you need to zero a buffer that we're reading
> > into, right?
> > 
> > --D
> 
> we don't need to, but I guess it is just a habit to make sure the
> allocated buffer is zeroed. Should I remove it for any particular reason?

What's the point in wasting CPU time zeroing a buffer if you're just
going to DMA into it?

--D

> -ck
> 
> 


* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-07-05  8:34   ` Christoph Hellwig
@ 2022-07-05 23:55     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-05 23:55 UTC (permalink / raw)
  To: Christoph Hellwig, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On 7/5/22 01:34, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:01AM -0700, Chaitanya Kulkarni wrote:
>> This adds a new block layer operation to offload verifying a range of
>> LBAs. This support is needed in order to let file systems and
>> fabrics kernel components offload LBA verification when it is
>> supported by the hardware controller. The prominent examples of that
>> are the SCSI and NVMe Verify commands. We also provide an emulation
>> of the same operation that can be used in case the H/W does not
>> support verify; this is still useful when the block device is
>> remotely attached, e.g. using NVMeOF.
> 
> What is the point of providing the offload?

Data block verification is done at the time of file scrubbing,
e.g. see [1]; having support to offload the verify command will :-

1. Reduce the DMA transfer at the time of scrubbing :-

In the absence of the verify command the user has to send a read
command that triggers the same behaviour as the verify command, but
with DMA traffic plus the operating system storage stack overhead
of REQ_OP_READ. This overhead gets duplicated for fabrics
controllers, where host and target now both have to issue
REQ_OP_READ, leading to significant DMA transfer as compared to
REQ_OP_VERIFY, for each protocol (SCSI/NVMe etc.).

2. Allow us to unify the interface for applications :-

Currently in Linux there is no unified interface to issue a
verify command, so each application has to duplicate the
code for discovering the controller's protocol type, open-coding
a device passthru ioctl for the protocol-specific verify command,
and issuing read-based verify emulation when it is not supported,
see [1].

3. Allow us to use the controller's internal bandwidth :-

For some controllers offloading the verify command can result in
a decrease in data block verification time, since their internal
bandwidth can be higher than host DMA transfer + OS storage stack
overhead of the read command.

4. Pro-actively avoid unrecoverable read errors :-

The verify command does everything a normal read command does, except
for returning the data to the host system. This makes it possible
to do a low-level scrub of the stored data without being
bottlenecked by the host interface bandwidth.

Please note that analyzing controller verify command performance
for common protocols (SCSI/NVMe) is out of scope for REQ_OP_VERIFY.

-ck

[1] xfs_scrub issuing the verify command :-
xfs-progs/scrub/disk.c 340: disk_read_verify()->disk_scsi_verify()




* Re: [PATCH 2/6] nvme: add support for the Verify command
  2022-07-05  8:34     ` Christoph Hellwig
@ 2022-07-05 23:56       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-05 23:56 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-fsdevel, axboe, agk, song, djwong, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, willy, jefflexu, josef, clm, dsterba,
	jack, tytso, adilger.kernel, jlayton, idryomov, danil.kipnis,
	ebiggers, jinpu.wang

On 7/5/22 01:34, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 10:24:14AM -0600, Keith Busch wrote:
>> On Thu, Jun 30, 2022 at 02:14:02AM -0700, Chaitanya Kulkarni wrote:
>>> Allow verify operations (REQ_OP_VERIFY) on the block device, if the
>>> device supports the optional Verify command bit. Add support
>>> to set up the verify command. Set the maximum possible verify sectors
>>> in one verify command according to the maximum hardware sectors
>>> supported by the controller.
>>
>> Shouldn't the limit be determined by Identify Controller NVM CSS's 'VSL' field
>> instead of its max data transfer?
> 
> Yes.

Okay, will move it to the helper and add the limit.

-ck




* Re: [PATCH 4/6] nvmet: add Verify emulation support for bdev-ns
  2022-07-05  8:35   ` Christoph Hellwig
@ 2022-07-06  0:00     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-06  0:00 UTC (permalink / raw)
  To: Christoph Hellwig, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On 7/5/22 01:35, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:04AM -0700, Chaitanya Kulkarni wrote:
>> Not all devices can support verify requests which can be mapped to
>> the controller specific command. This patch adds a way to emulate
>> REQ_OP_VERIFY for NVMeOF block device namespace. We add a new
>> workqueue to offload the emulation.
> 
> How is this an "emulation"?

since the verify command for NVMe does everything that a read does
except transferring the data, if the controller does not support verify
this implementation falls back to issuing read commands ..

> 
> Also why do we need the workqueue offloads?  I can't see any good
> reason to not just simply submit the bio asynchronously like all the
> other bios submitted by the block device backend.

okay will remove the workqueue ...

-ck




* Re: [PATCH 5/6] nvmet: add verify emulation support for file-ns
  2022-07-05  8:36   ` Christoph Hellwig
@ 2022-07-06  0:06     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-06  0:06 UTC (permalink / raw)
  To: Christoph Hellwig, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On 7/5/22 01:36, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:05AM -0700, Chaitanya Kulkarni wrote:
>> For now, there is no way to map the verify operation to a VFS layer
>> API. This patch emulates the verify operation by offloading it to a
>> workqueue and reading the data using VFS layer APIs in both buffered
>> I/O and direct I/O mode.
> 
> What does this buy us?  I'm generally not too happy about adding
> actualy logic to nvmet, it is supposed to mostly just pass through
> operations support by block device or files.

When using file-based backends for NVMeOF and LIO (the LIO patch is
outstanding) I'm emulating (please let me know if you prefer other
terminology) the verify command with a read operation on the file;
this allows the user to send the verify command and keeps the target
compatible with the block device namespace's verify support. In the
absence of that, file-backed fabrics targets will not support the
verify command ... or do you want a generic file read helper to be
used for this scenario that is consumed by the nvmet/lio targets?

Does that answer your question?

-ck




* Re: [PATCH 1/6] block: add support for REQ_OP_VERIFY
  2022-07-05 17:57       ` Darrick J. Wong
@ 2022-07-06  1:32         ` Chaitanya Kulkarni
  0 siblings, 0 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-06  1:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, kbusch, hch, sagi, jejb, martin.petersen, viro, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, ming.lei, willy,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang


>>>> +int __blkdev_issue_verify(struct block_device *bdev, sector_t sector,
>>>> +		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
>>>> +{
>>>> +	unsigned int max_verify_sectors = bdev_verify_sectors(bdev);
>>>> +	sector_t min_io_sect = (BIO_MAX_VECS << PAGE_SHIFT) >> 9;
>>>> +	struct bio *bio = *biop;
>>>> +	sector_t curr_sects;
>>>> +	char *buf;
>>>> +
>>>> +	if (!max_verify_sectors) {
>>>> +		int ret = 0;
>>>> +
>>>> +		buf = kzalloc(min_io_sect << 9, GFP_KERNEL);
>>>
>>> k*z*alloc?  I don't think you need to zero a buffer that we're reading
>>> into, right?
>>>
>>> --D
>>
>> we don't need to, but I guess it is just a habit to make sure the
>> allocated buffer is zeroed. Should I remove it for any particular reason?
> 
> What's the point in wasting CPU time zeroing a buffer if you're just
> going to DMA into it?
> 
> --D
> 

true, will remove it ...

>> -ck
>>
>>


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
                   ` (6 preceding siblings ...)
  2022-07-05  8:38 ` [PATCH 0/6] block: add support for REQ_OP_VERIFY Christoph Hellwig
@ 2022-07-06 17:42 ` Matthew Wilcox
  2022-07-13  9:14   ` Chaitanya Kulkarni
  2022-12-01 18:12   ` Chaitanya Kulkarni
  7 siblings, 2 replies; 41+ messages in thread
From: Matthew Wilcox @ 2022-07-06 17:42 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add

IMO, VERIFY is a useless command.  The history of storage is full of
devices which simply lie.  Since there's no way for the host to check if
the device did any work, cheap devices may simply implement it as a NOOP.
Even expensive devices where there's an ironclad legal contract between
the vendor and customer may have bugs that result in only some of the
bytes being VERIFYed.  We shouldn't support it.

Now, everything you say about its value (not consuming bus bandwidth)
is true, but the device should provide the host with proof-of-work.
I'd suggest calculating some kind of checksum, even something like a
SHA-1 of the contents would be worth having.  It doesn't need to be
crypto-secure; just something the host can verify the device didn't spoof.


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-06 17:42 ` Matthew Wilcox
@ 2022-07-13  9:14   ` Chaitanya Kulkarni
  2022-07-13  9:36     ` Johannes Thumshirn
  2022-07-13 12:17     ` Matthew Wilcox
  2022-12-01 18:12   ` Chaitanya Kulkarni
  1 sibling, 2 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-07-13  9:14 UTC (permalink / raw)
  To: Matthew Wilcox, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
> 
> IMO, VERIFY is a useless command.  The history of storage is full of
> devices which simply lie.  Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

Thanks for sharing your feedback regarding cheap devices.

This falls outside of the scope of this work, as the scope is not to
analyze different vendors' implementations of the verify command.

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed.  We shouldn't support it.
This is not true of enterprise SSDs; I've been involved with product
qualification of high-end enterprise SSDs since 2012, including good
old non-NVMe devices with e.g. the skd driver on Linux/Windows/VMware.

At product qualification time for large data centers every single
feature gets reviewed in excruciating architectural detail in the
data center environment, and a detailed analysis of the feature,
including running cost and actual impact, is calculated; Service Level
Agreements are confirmed between the vendor and the client. In case the
vendor fails to meet the SLA the product gets disqualified.

What you are describing is a vendor failing to meet the SLA, and I
think we shouldn't consider vendor-specific implementations for a
generic feature.

> 
> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.

Yes, that seems to be missing, but it is not a blocker for this work
since the protocol needs to provide this information.

We can update the respective specification to add a log page which
shows proof of work for the verify command, e.g. a log page consisting
of information such as :-

1. How many LBAs were verified and how long it took.
2. What kind of errors were detected.
3. How many blocks were moved to a safe location.
4. How much data (LBAs) was moved successfully.
5. How much data we lost permanently to uncorrectable errors.
6. What the impact is on the overall size of the storage, in
    case of flash a reduction in the over-provisioning due to
    uncorrectable errors.

but clearly this is outside of the scope of this work;
if you are willing to work on this I'd be happy to draft a TP
and work with you.

> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having.  It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I did not understand exactly what you mean here.

-ck




* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-13  9:14   ` Chaitanya Kulkarni
@ 2022-07-13  9:36     ` Johannes Thumshirn
  2022-07-13 12:17     ` Matthew Wilcox
  1 sibling, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2022-07-13  9:36 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Matthew Wilcox
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, bvanassche, dongli.zhang, ming.lei, jefflexu, josef, clm,
	dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On 13.07.22 11:14, Chaitanya Kulkarni wrote:
>> I'd suggest calculating some kind of checksum, even something like a
>> SHA-1 of the contents would be worth having.  It doesn't need to be
>> crypto-secure; just something the host can verify the device didn't spoof.
> I did not understand exactly what you mean here.

I _think_ what Willy wants to say here is, we need some kind of "out-of-band"
checksums to verify the device is not lying to us.

Something like the checksums for each data block that e.g. btrfs has.
On read, we verify the calculated checksum of the payload against the
saved one and, if they don't match (for whatever reason), return -EIO.

As the device doesn't know the location of the data checksum, it can't spoof it
for the host.

(Excuse me for my btrfs centric view on this topic, but that's what I know and
what I'm used to)


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-13  9:14   ` Chaitanya Kulkarni
  2022-07-13  9:36     ` Johannes Thumshirn
@ 2022-07-13 12:17     ` Matthew Wilcox
  2022-07-13 14:04       ` Keith Busch
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2022-07-13 12:17 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe, agk,
	song, djwong, kbusch, hch, sagi, jejb, martin.petersen, viro,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, ming.lei,
	jefflexu, josef, clm, dsterba, jack, tytso, adilger.kernel,
	jlayton, idryomov, danil.kipnis, ebiggers, jinpu.wang

On Wed, Jul 13, 2022 at 09:14:42AM +0000, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
> > On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> >> This adds support for the REQ_OP_VERIFY. In this version we add
> > 
> > IMO, VERIFY is a useless command.  The history of storage is full of
> > devices which simply lie.  Since there's no way for the host to check if
> > the device did any work, cheap devices may simply implement it as a NOOP.
> 
> Thanks for sharing your feedback regarding cheap devices.
> 
> This falls outside of the scope of this work, as the scope is not to
> analyze different vendors' implementations of the verify command.

The work is pointless.  As a customer, I can't ever use the VERIFY
command because I have no reason for trusting the outcome.  And there's
no way for a vendor to convince me that I should trust the result.

> > Even expensive devices where there's an ironclad legal contract between
> > the vendor and customer may have bugs that result in only some of the
> > bytes being VERIFYed.  We shouldn't support it.
> This is not true of enterprise SSDs; I've been involved with product
> qualification of high-end enterprise SSDs since 2012, including good
> old non-NVMe devices with e.g. the skd driver on Linux/Windows/VMware.

Oh, I'm sure there's good faith at the high end.  But bugs happen in
firmware, and everybody knows it.

> > Now, everything you say about its value (not consuming bus bandwidth)
> > is true, but the device should provide the host with proof-of-work.
> 
> Yes, that seems to be missing, but it is not a blocker for this work
> since the protocol needs to provide this information.

There's no point in providing access to a feature when that feature is
not useful.

> We can update the respective specification to add a log page which
> shows proof of work for the verify command, e.g. a log page consisting
> of information such as :-
> 
> 1. How many LBAs were verified and how long it took.
> 2. What kind of errors were detected.
> 3. How many blocks were moved to a safe location.
> 4. How much data (LBAs) was moved successfully.
> 5. How much data we lost permanently to uncorrectable errors.
> 6. What the impact is on the overall size of the storage, in
>     case of flash a reduction in the over-provisioning due to
>     uncorrectable errors.

That's not proof of work.  That's claim of work.

> > I'd suggest calculating some kind of checksum, even something like a
> > SHA-1 of the contents would be worth having.  It doesn't need to be
> > crypto-secure; just something the host can verify the device didn't spoof.
> 
> I did not understand exactly what you mean here.

The firmware needs to prove to me that it *did something*.  That it
actually read those bytes that it claims to have verified.  The simplest
way to do so is to calculate a hash over the blocks which were read
(maybe the host needs to provide a nonce as part of the VERIFY command
so the drive can't "remember" the checksum).


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-13 12:17     ` Matthew Wilcox
@ 2022-07-13 14:04       ` Keith Busch
  0 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2022-07-13 14:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-fsdevel, axboe, agk, song, djwong, hch, sagi, jejb,
	martin.petersen, viro, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, ming.lei, jefflexu, josef, clm, dsterba, jack,
	tytso, adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Wed, Jul 13, 2022 at 01:17:56PM +0100, Matthew Wilcox wrote:
> The firmware needs to prove to me that it *did something*.  That it
> actually read those bytes that it claims to have verified.  The simplest
> way to do so is to calculate a hash over the blocks which were read
> (maybe the host needs to provide a nonce as part of the VERIFY command
> so the drive can't "remember" the checksum).

From a protocol perspective, NVMe's verify command currently leaves 8 bytes in
the response unused, so this could be a reasonable way to return something like
a crc64 of the verified blocks. The verify command starts at an arbitrary block
and spans an arbitrary number of them, so I think a device "remembering" all
possible combinations as a way to cheat may be harder than just doing the work
:).

But this is just theoretical; it'd be some time before we'd see devices
supporting such a scheme, assuming there's support from the standards
committees.


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-07-06 17:42 ` Matthew Wilcox
  2022-07-13  9:14   ` Chaitanya Kulkarni
@ 2022-12-01 18:12   ` Chaitanya Kulkarni
  2022-12-01 19:39     ` Matthew Wilcox
  2022-12-02  7:13     ` Hannes Reinecke
  1 sibling, 2 replies; 41+ messages in thread
From: Chaitanya Kulkarni @ 2022-12-01 18:12 UTC (permalink / raw)
  To: Matthew Wilcox, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe,
	djwong, kbusch, hch, sagi, jejb, martin.petersen, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
> 
> IMO, VERIFY is a useless command.  The history of storage is full of
> devices which simply lie.  Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

In the past few months at various storage conferences I've talked to
different people to address your comment about devices lying in their
verify implementation or even implementing it as a NOOP.

With all due respect, this is not true.

The Verify command for NVMe has a significant impact when it comes to
doing maintenance work during downtime, work that can get in the way
when the device is under a heavy workload, e.g. a social media website
being active at peak hours or high-performance trading hours. In such a
scenario the controller doing maintenance work can increase the tail
latency of the workload and one can see spikes as the SSD starts aging;
this is also true for file systems and databases when complex queries
generate complex I/O patterns under heavy I/O.

I respect your experience, but calling every single enterprise-grade
SSD vendor a liar is just not true: the enterprise SSD qualification
process is extremely detail-oriented and each feature and command gets
an excruciating design review, including performance/watt.

So nobody can get away with a lie.

This patch series has an option to emulate the verify operation: if
you or someone else don't want to trust the device, one can simply
turn off the verify command and let the kernel emulate Verify, which
will also save a ton of bandwidth, especially in the fabrics case;
if that option is not there I'll make sure it is in the final version.

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed.  We shouldn't support it.
> 

One cannot simply write anything without bugs; see how many quirks we
have for NVMe PCIe SSDs. That doesn't stop us from supporting features
in the kernel: if one can't trust the device, just add a quirk and
emulate.

Bugs don't make this feature useless or the design of
the verify command unusable. There is a significant reason why it
exists in the major device specs on which data-center workloads
are running; just because there are a few cheap vendors not being
authentic we cannot call multiple TWGs' decision to add the verify
command useless and refuse to add support.

In fact all file systems should be sending the verify command as
part of scrubbing, which only XFS does as far as I know, since the
lack of an interface is preventing its use.

> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.
> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having.  It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I'm not sure how SSD vendors will entertain the proof-of-work idea,
since it will open the door to similar questions about discard and
any other commands; one of the blatant answers I got was "if you
don't trust it, don't buy it".

I'm absolutely not discarding your concerns and the idea of proof of
work; I'm willing to work with you in the TWG and submit the proposal
offline, but right now there is no support for this in the existing
specs.

-ck



* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-01 18:12   ` Chaitanya Kulkarni
@ 2022-12-01 19:39     ` Matthew Wilcox
  2022-12-02  7:16       ` Hannes Reinecke
  2022-12-02  7:13     ` Hannes Reinecke
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2022-12-01 19:39 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe,
	djwong, kbusch, hch, sagi, jejb, martin.petersen, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> So nobody can get away with a lie.

And yet devices do exist which lie.  I'm not surprised that vendors
vehemently claim that they don't, or "nobody would get away with it".
But, of course, they do.  And there's no way for us to find out if
they're lying!



* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-01 18:12   ` Chaitanya Kulkarni
  2022-12-01 19:39     ` Matthew Wilcox
@ 2022-12-02  7:13     ` Hannes Reinecke
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2022-12-02  7:13 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Matthew Wilcox
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe,
	djwong, kbusch, hch, sagi, jejb, martin.petersen, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On 12/1/22 19:12, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
>> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>>> This adds support for the REQ_OP_VERIFY. In this version we add
>>
>> IMO, VERIFY is a useless command.  The history of storage is full of
>> devices which simply lie.  Since there's no way for the host to check if
>> the device did any work, cheap devices may simply implement it as a NOOP.
> 
> In the past few months at various storage conferences I've talked to
> different people to address your comment about devices lying in their
> verify implementation or even implementing it as a NOOP.
> 
> With all due respect, this is not true.
> 
[ .. ]

Be careful not to fall into the copy-offload trap.
The arguments given pretty much mirror the discussion we had while
designing and implementing copy offload.

It turns out that the major benefit of any of these 'advanced' commands
is not so much functionality but rather performance.
If the functionality provided by the command is _slower_ than when the
host does it manually, there is hardly a point in even calling that
functionality; we learned that the hard way with copy offload,
and I guess the same is true for 'verify'.
The thing is, none of these commands will _tell_ you how long they will
take; you just have to issue one and wait for it to complete.
(Remember TRIM? And devices which took minutes to complete it?)

For copy offload we only recently added a bit telling the application
whether it's worth calling it, or if we should rather do it the old
fashioned way.

But all of the above discussion has nothing to do with the 
implementation. Why should we disallow an implementation of
functionality which is present in hardware?
How could we figure out the actual usability if we don't test?
Where is the innovation?

We need room for experimenting; if it turns out to be useless we can 
always disable or remove it later. But we shouldn't bar implementations 
because hardware _might_ be faulty.

I do see at least two topics for the next LSF:

- Implementation of REQ_OP_VERIFY
- Innovation rules for the kernel
...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-01 19:39     ` Matthew Wilcox
@ 2022-12-02  7:16       ` Hannes Reinecke
  2022-12-02 14:58         ` Keith Busch
  0 siblings, 1 reply; 41+ messages in thread
From: Hannes Reinecke @ 2022-12-02  7:16 UTC (permalink / raw)
  To: Matthew Wilcox, Chaitanya Kulkarni
  Cc: linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe,
	djwong, kbusch, hch, sagi, jejb, martin.petersen, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

On 12/1/22 20:39, Matthew Wilcox wrote:
> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> So nobody can get away with a lie.
> 
> And yet devices do exist which lie.  I'm not surprised that vendors
> vehemently claim that they don't, or "nobody would get away with it".
> But, of course, they do.  And there's no way for us to find out if
> they're lying!
> 
But we'll never be able to figure that out unless we try.

Once we've tried we will have proof either way.
Which is why I think we should have that implementation.
Only then will we know if it's worth it, both from the hardware and the 
application side.

Without an implementation it'll just degrade to a shouting match.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-02  7:16       ` Hannes Reinecke
@ 2022-12-02 14:58         ` Keith Busch
  2022-12-02 17:33           ` Clay Mayers
  2022-12-03  4:19           ` Javier González
  0 siblings, 2 replies; 41+ messages in thread
From: Keith Busch @ 2022-12-02 14:58 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, Chaitanya Kulkarni, linux-block, linux-raid,
	linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi, jejb,
	martin.petersen, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> On 12/1/22 20:39, Matthew Wilcox wrote:
> > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > So nobody can get away with a lie.
> > 
> > And yet devices do exist which lie.  I'm not surprised that vendors
> > vehemently claim that they don't, or "nobody would get away with it".
> > But, of course, they do.  And there's no way for us to find out if
> > they're lying!
> > 
> But we'll never be able to figure that out unless we try.
> 
> Once we've tried we will have proof either way.

As long as the protocols don't provide proof-of-work, trying this
doesn't really prove anything with respect to this concern.


* RE: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-02 14:58         ` Keith Busch
@ 2022-12-02 17:33           ` Clay Mayers
  2022-12-02 18:58             ` Keith Busch
  2022-12-09  4:52             ` Martin K. Petersen
  2022-12-03  4:19           ` Javier González
  1 sibling, 2 replies; 41+ messages in thread
From: Clay Mayers @ 2022-12-02 17:33 UTC (permalink / raw)
  To: Keith Busch, Hannes Reinecke
  Cc: Matthew Wilcox, Chaitanya Kulkarni, linux-block, linux-raid,
	linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi, jejb,
	martin.petersen, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

> From: Keith Busch
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> > On 12/1/22 20:39, Matthew Wilcox wrote:
> > > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > > So nobody can get away with a lie.
> > >
> > > And yet devices do exist which lie.  I'm not surprised that vendors
> > > vehemently claim that they don't, or "nobody would get away with it".
> > > But, of course, they do.  And there's no way for us to find out if
> > > they're lying!

My guess, if true, is that it's rationalized by the device already
doing media patrols in the background: why verify what has already
been recently patrolled?
 
> > >
> > But we'll never be able to figure that out unless we try.
> >
> > Once we've tried we will have proof either way.
> 
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

I'm out of my depth here, but isn't VERIFY tightly related to PI and
at the heart of detecting SAN bit-rot? The proof of work can be via
end-to-end data protection. VERIFY has to actually read to detect bad
host generated PI guard/tags.  I'm assuming the PI checks can be
disabled for WRITE and enabled for VERIFY as the test.



* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-02 17:33           ` Clay Mayers
@ 2022-12-02 18:58             ` Keith Busch
  2022-12-09  4:52             ` Martin K. Petersen
  1 sibling, 0 replies; 41+ messages in thread
From: Keith Busch @ 2022-12-02 18:58 UTC (permalink / raw)
  To: Clay Mayers
  Cc: Hannes Reinecke, Matthew Wilcox, Chaitanya Kulkarni, linux-block,
	linux-raid, linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi,
	jejb, martin.petersen, javier, johannes.thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Fri, Dec 02, 2022 at 05:33:33PM +0000, Clay Mayers wrote:
> > 
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> I'm out of my depth here, but isn't VERIFY tightly related to PI and
> at the heart of detecting SAN bit-rot? The proof of work can be via
> end-to-end data protection. VERIFY has to actually read to detect bad
> host generated PI guard/tags.  I'm assuming the PI checks can be
> disabled for WRITE and enabled for VERIFY as the test.

I suppose if you happen to have a PI formatted drive, you could WRITE
garbage Guard tags with PRCHK disabled, then see if VERIFY with PRCHK
enabled returns the Guard Check Error; but PI is an optional format
feature orthogonal to the VERIFY command: we can't count on that format
being available in most implementations.
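(To make that trick concrete, here is a toy Python model of the flow;
it is purely illustrative, not nvme-cli or kernel code. 0x8BB7 is the
CRC16 polynomial T10 DIF specifies for the guard tag, and
`device_verify` stands in for a device that honestly re-reads the data
and checks PI.)

```python
def crc16_t10dif(data: bytes) -> int:
    """CRC16 as used for the T10 DIF guard tag (poly 0x8BB7, init 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def device_verify(sector: bytes, guard_tag: int, check_guard: bool) -> bool:
    """Model of a VERIFY that actually reads the media and checks PI."""
    if check_guard and crc16_t10dif(sector) != guard_tag:
        return False  # Guard Check Error
    return True

sector = bytes(512)                  # data as written
good_guard = crc16_t10dif(sector)
bad_guard = good_guard ^ 0xFFFF      # garbage guard, written with PRCHK off

assert device_verify(sector, good_guard, check_guard=True)
assert not device_verify(sector, bad_guard, check_guard=True)   # honest device fails
assert device_verify(sector, bad_guard, check_guard=False)      # unchecked/no-op passes
```

A no-op VERIFY would return success in all three cases, which is
exactly what the test above would catch on a PI-formatted namespace.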


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-02 14:58         ` Keith Busch
  2022-12-02 17:33           ` Clay Mayers
@ 2022-12-03  4:19           ` Javier González
  2022-12-04 20:29             ` Keith Busch
  2022-12-08 20:02             ` Matthew Wilcox
  1 sibling, 2 replies; 41+ messages in thread
From: Javier González @ 2022-12-03  4:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Matthew Wilcox, Chaitanya Kulkarni, linux-block,
	linux-raid, linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi,
	jejb, martin.petersen, Johannes.Thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang


> On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>>> On 12/1/22 20:39, Matthew Wilcox wrote:
>>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>>>> So nobody can get away with a lie.
>>> 
>>> And yet devices do exist which lie.  I'm not surprised that vendors
>>> vehemently claim that they don't, or "nobody would get away with it".
>>> But, of course, they do.  And there's no way for us to find out if
>>> they're lying!
>>> 
>> But we'll never be able to figure that out unless we try.
>> 
>> Once we've tried we will have proof either way.
> 
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 

I will check internally if there is any existing proof-of-work that we are missing. 


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-03  4:19           ` Javier González
@ 2022-12-04 20:29             ` Keith Busch
  2022-12-08 20:06               ` Javier González
  2022-12-08 20:02             ` Matthew Wilcox
  1 sibling, 1 reply; 41+ messages in thread
From: Keith Busch @ 2022-12-04 20:29 UTC (permalink / raw)
  To: Javier González
  Cc: Hannes Reinecke, Matthew Wilcox, Chaitanya Kulkarni, linux-block,
	linux-raid, linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi,
	jejb, martin.petersen, Johannes.Thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> 
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> > 
> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> >>>> So nobody can get away with a lie.
> >>> 
> >>> And yet devices do exist which lie.  I'm not surprised that vendors
> >>> vehemently claim that they don't, or "nobody would get away with it".
> >>> But, of course, they do.  And there's no way for us to find out if
> >>> they're lying!
> >>> 
> >> But we'll never be able to figure that out unless we try.
> >> 
> >> Once we've tried we will have proof either way.
> > 
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 

Yeah, proof for the host appears to require a new feature, so we'd need
to bring this to the TWG. I can draft a TPAR if there's interest and
have ideas on how the feature could be implemented, but I currently
don't have enough skin in this game to sponsor it.


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-03  4:19           ` Javier González
  2022-12-04 20:29             ` Keith Busch
@ 2022-12-08 20:02             ` Matthew Wilcox
  1 sibling, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2022-12-08 20:02 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, Hannes Reinecke, Chaitanya Kulkarni, linux-block,
	linux-raid, linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi,
	jejb, martin.petersen, Johannes.Thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 
> 
> I will check internally if there is any existing proof-of-work that we are missing. 

I think the right thing for NVMe to standardise is a new command, HASH.

It should be quite similar to the READ command, at least for command
Dwords 10-15.  Need to specify the hash algorithm to use somewhere
(maybe it's a per-namespace setting, maybe a per-command setting).
The committee will have to decide which hash algos are permitted /
required (it probably makes sense to permit even some quite weak hashes,
as we're not necessarily looking for cryptographic security).

While there may be ways to make use of this command for its obvious usage
(ie actually producing a hash and communicating that to somebody else),
its real purpose is for the drive to read the data from storage and prove
it's still there without transferring it to the host, as discussed in
this thread.
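(A sketch of the host-side flow such a HASH command would enable;
`device_hash` below is a stand-in for the hypothetical command, and
SHA-256 is just a placeholder algorithm choice.)

```python
import hashlib
import os

def device_hash(stored: bytes) -> bytes:
    """Hypothetical HASH command: the device hashes the data in place and
    returns only the digest, never transferring the data to the host."""
    return hashlib.sha256(stored).digest()

payload = os.urandom(4096)
expected = hashlib.sha256(payload).digest()  # host remembers what it wrote

# Honest device: stored data matches, digests agree.
assert device_hash(payload) == expected

# Bit-rotted (or lying) device: a single flipped bit is detectable.
corrupted = payload[:-1] + bytes([payload[-1] ^ 1])
assert device_hash(corrupted) != expected
```

The point is that the host gets proof-of-work for roughly the cost of a
digest transfer instead of a full data transfer.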


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-04 20:29             ` Keith Busch
@ 2022-12-08 20:06               ` Javier González
  0 siblings, 0 replies; 41+ messages in thread
From: Javier González @ 2022-12-08 20:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Matthew Wilcox, Chaitanya Kulkarni, linux-block,
	linux-raid, linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi,
	jejb, martin.petersen, Johannes.Thumshirn, bvanassche,
	dongli.zhang, jefflexu, josef, clm, dsterba, jack, tytso,
	adilger.kernel, jlayton, idryomov, danil.kipnis, ebiggers,
	jinpu.wang

On 04.12.2022 20:29, Keith Busch wrote:
>On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
>>
>> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
>> >
>> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
>> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> >>>> So nobody can get away with a lie.
>> >>>
>> >>> And yet devices do exist which lie.  I'm not surprised that vendors
>> >>> vehemently claim that they don't, or "nobody would get away with it".
>> >>> But, of course, they do.  And there's no way for us to find out if
>> >>> they're lying!
>> >>>
>> >> But we'll never be able to figure that out unless we try.
>> >>
>> >> Once we've tried we will have proof either way.
>> >
>> > As long as the protocols don't provide proof-of-work, trying this
>> > doesn't really prove anything with respect to this concern.
>>
>> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there.
>
>Yeah, proof for the host appears to require a new feature, so we'd need
>to bring this to the TWG. I can draft a TPAR if there's interest and
>have ideas on how the feature could be implemented, but I currently
>don't have enough skin in this game to sponsor it.

Happy to review the TPAR, but I am in a similar situation.


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-02 17:33           ` Clay Mayers
  2022-12-02 18:58             ` Keith Busch
@ 2022-12-09  4:52             ` Martin K. Petersen
  2022-12-10 13:06               ` Carlos Carvalho
  1 sibling, 1 reply; 41+ messages in thread
From: Martin K. Petersen @ 2022-12-09  4:52 UTC (permalink / raw)
  To: Clay Mayers
  Cc: Keith Busch, Hannes Reinecke, Matthew Wilcox, Chaitanya Kulkarni,
	linux-block, linux-raid, linux-nvme, linux-fsdevel, axboe,
	djwong, hch, sagi, jejb, martin.petersen, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang


> My guess, if true, is that it's rationalized by the device already
> doing media patrols in the background: why verify what has already
> been recently patrolled?

The original SCSI VERIFY operation allowed RAID array firmware to do
background scrubs before disk drives developed anything resembling
sophisticated media management. If verification of a block failed, the
array firmware could go ahead and reconstruct the data from the rest of
the stripe in the background. This substantially reduced the risk of
having to perform block reconstruction in the hot path. And verification
did not have to burden the already slow array CPU/memory/DMA combo with
transferring every block on every attached drive.
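(That scrub-and-reconstruct flow reduces to something like this toy
XOR-parity sketch; purely illustrative, not MD or array-firmware code.)

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-sized blocks together."""
    return bytes(reduce(lambda a, b: [x ^ y for x, y in zip(a, b)], blocks))

# A 3+1 RAID-5-style stripe: parity is the XOR of the data blocks.
data = [bytes([i] * 8) for i in (1, 2, 3)]
parity = xor_blocks(data)

# Background scrub: VERIFY reports data block 1 unreadable while
# redundancy is still intact, so rebuild it from the surviving blocks
# and parity instead of hitting the error later in the hot path.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```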

I suspect that these days it is very hard to find a storage device that
doesn't do media management internally in the background. So from the
perspective of physically exercising the media, VERIFY is probably not
terribly useful anymore.

In that light, having to run VERIFY over the full block range of a
device to identify unreadable blocks seems like a fairly clunky
mechanism. Querying the device for a list of unrecoverable blocks
already identified by the firmware seems like a better interface.

I am not sure I understand this whole "proof that the drive did
something" requirement. If a device lies and implements VERIFY as a noop
it just means you'll get the error during a future READ operation
instead.

No matter what, a successful VERIFY is obviously no guarantee that a
future READ on a given block will be possible. But it doesn't matter
because the useful outcome of a verify operation is the failure, not the
success. It's the verification failure scenario which allows you to take
a corrective action.

If you really want to verify device VERIFY implementation, we do have
WRITE UNCORRECTABLE commands in both SCSI and NVMe which allow you to do
that. But I think device validation is a secondary issue. The more
pertinent question is whether we have use cases in the kernel (MD,
btrfs) which would benefit from being able to preemptively identify
unreadable blocks?
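(A toy model of that validation idea: WRITE UNCORRECTABLE lets the host
plant a known-bad block, so an honest VERIFY can be distinguished from
a no-op one. `ToyNamespace` and its methods are illustrative stand-ins,
not a real NVMe or SCSI interface.)

```python
class ToyNamespace:
    def __init__(self):
        self.uncorrectable = set()  # LBAs marked unrecoverable

    def write_uncorrectable(self, lba):
        """Model of WRITE UNCORRECTABLE: make the LBA unreadable."""
        self.uncorrectable.add(lba)

    def write(self, lba, data):
        """A normal write clears the uncorrectable state."""
        self.uncorrectable.discard(lba)

    def verify(self, lba):
        """Honest VERIFY: fails on an unrecoverable block."""
        return lba not in self.uncorrectable

ns = ToyNamespace()
ns.write_uncorrectable(42)
assert not ns.verify(42)     # an honest implementation must report the error
ns.write(42, b"\0" * 512)
assert ns.verify(42)         # rewriting the block makes it verifiable again
```

A device that implements VERIFY as a no-op would pass the first verify,
which is the tell.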

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-09  4:52             ` Martin K. Petersen
@ 2022-12-10 13:06               ` Carlos Carvalho
  2022-12-12  6:30                 ` hch
  0 siblings, 1 reply; 41+ messages in thread
From: Carlos Carvalho @ 2022-12-10 13:06 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Clay Mayers, Keith Busch, Hannes Reinecke, Matthew Wilcox,
	Chaitanya Kulkarni, linux-block, linux-raid, linux-nvme,
	linux-fsdevel, axboe, djwong, hch, sagi, jejb, javier,
	johannes.thumshirn, bvanassche, dongli.zhang, jefflexu, josef,
	clm, dsterba, jack, tytso, adilger.kernel, jlayton, idryomov,
	danil.kipnis, ebiggers, jinpu.wang

Martin K. Petersen (martin.petersen@oracle.com) wrote on Fri, Dec 09, 2022 at 01:52:01AM -03:
> I suspect that these days it is very hard to find a storage device that
> doesn't do media management internally in the background. So from the
> perspective of physically exercising the media, VERIFY is probably not
> terribly useful anymore.
> 
> In that light, having to run VERIFY over the full block range of a
> device to identify unreadable blocks seems like a fairly clunky
> mechanism. Querying the device for a list of unrecoverable blocks
> already identified by the firmware seems like a better interface.

Sure.

> But I think device validation is a secondary issue. The more
> pertinent question is whether we have use cases in the kernel (MD,
> btrfs) which would benefit from being able to preemptively identify
> unreadable blocks?

Certainly we have. Currently admins have to periodically run full block range
checks in redundant arrays to detect bad blocks and correct them while
redundancy is available. Otherwise, when a disk fails and you try to
rebuild onto the replacement, you hit another bad block on one of the
remaining disks, cannot complete the reconstruction, and suffer data
loss. These checks are a
burden because they have HIGH overhead, significantly reducing bandwidth for
the normal use of the array.

If there were a standard interface for getting the list of bad blocks that
the firmware secretly knows, the kernel could implement the repair
continuously, with logs etc. That would really be a relief for admins and,
especially, users.


* Re: [PATCH 0/6] block: add support for REQ_OP_VERIFY
  2022-12-10 13:06               ` Carlos Carvalho
@ 2022-12-12  6:30                 ` hch
  0 siblings, 0 replies; 41+ messages in thread
From: hch @ 2022-12-12  6:30 UTC (permalink / raw)
  To: Carlos Carvalho
  Cc: Martin K. Petersen, Clay Mayers, Keith Busch, Hannes Reinecke,
	Matthew Wilcox, Chaitanya Kulkarni, linux-block, linux-raid,
	linux-nvme, linux-fsdevel, axboe, djwong, hch, sagi, jejb,
	javier, johannes.thumshirn, bvanassche, dongli.zhang, jefflexu,
	josef, clm, dsterba, jack, tytso, adilger.kernel, jlayton,
	idryomov, danil.kipnis, ebiggers, jinpu.wang

On Sat, Dec 10, 2022 at 10:06:34AM -0300, Carlos Carvalho wrote:
> Certainly we have. Currently admins have to periodically run full block range
> checks in redundant arrays to detect bad blocks and correct them while
> redundancy is available. Otherwise when a disk fails and you try to reconstruct
> the replacement you hit another block in the remaining disks that's bad and you
> cannot complete the reconstruction and have data loss. These checks are a
> burden because they have HIGH overhead, significantly reducing bandwidth for
> the normal use of the array.
> 
> If there was a standard interface for getting the list of bad blocks that the
> firmware secretly knows the kernel could implement the repair continuosly, with
> logs etc. That'd really be a relief for admins and, specially, users.

Both SCSI and NVMe can do this through the GET LBA STATUS command -
in SCSI this was a later addition abusing the command, and in NVMe
only the abuse survived.  NVMe also has a log page an AEN associated
for it, I'd have to spend more time reading SBC to remember if SCSI
also has a notification mechanism of some sort.


end of thread, other threads:[~2022-12-12  6:30 UTC | newest]

Thread overview: 41+ messages
2022-06-30  9:14 [PATCH 0/6] block: add support for REQ_OP_VERIFY Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 1/6] " Chaitanya Kulkarni
2022-06-30 16:18   ` Darrick J. Wong
2022-07-05 16:50     ` Chaitanya Kulkarni
2022-07-05 17:57       ` Darrick J. Wong
2022-07-06  1:32         ` Chaitanya Kulkarni
2022-07-05  8:34   ` Christoph Hellwig
2022-07-05 23:55     ` Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 2/6] nvme: add support for the Verify command Chaitanya Kulkarni
2022-06-30 16:24   ` Keith Busch
2022-07-05  8:34     ` Christoph Hellwig
2022-07-05 23:56       ` Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 3/6] nvmet: add Verify command support for bdev-ns Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 4/6] nvmet: add Verify emulation " Chaitanya Kulkarni
2022-07-05  8:35   ` Christoph Hellwig
2022-07-06  0:00     ` Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 5/6] nvmet: add verify emulation support for file-ns Chaitanya Kulkarni
2022-07-05  8:36   ` Christoph Hellwig
2022-07-06  0:06     ` Chaitanya Kulkarni
2022-06-30  9:14 ` [PATCH 6/6] null_blk: add REQ_OP_VERIFY support Chaitanya Kulkarni
2022-07-05  8:38 ` [PATCH 0/6] block: add support for REQ_OP_VERIFY Christoph Hellwig
2022-07-05 16:47   ` Chaitanya Kulkarni
2022-07-06 17:42 ` Matthew Wilcox
2022-07-13  9:14   ` Chaitanya Kulkarni
2022-07-13  9:36     ` Johannes Thumshirn
2022-07-13 12:17     ` Matthew Wilcox
2022-07-13 14:04       ` Keith Busch
2022-12-01 18:12   ` Chaitanya Kulkarni
2022-12-01 19:39     ` Matthew Wilcox
2022-12-02  7:16       ` Hannes Reinecke
2022-12-02 14:58         ` Keith Busch
2022-12-02 17:33           ` Clay Mayers
2022-12-02 18:58             ` Keith Busch
2022-12-09  4:52             ` Martin K. Petersen
2022-12-10 13:06               ` Carlos Carvalho
2022-12-12  6:30                 ` hch
2022-12-03  4:19           ` Javier González
2022-12-04 20:29             ` Keith Busch
2022-12-08 20:06               ` Javier González
2022-12-08 20:02             ` Matthew Wilcox
2022-12-02  7:13     ` Hannes Reinecke
