* [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
       [not found] <CGME20220308165414eucas1p106df0bd6a901931215cfab81660a4564@eucas1p1.samsung.com>
@ 2022-03-08 16:53 ` Pankaj Raghav
       [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
                     ` (6 more replies)
  0 siblings, 7 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav


#Motivation:
There are ZNS drives in production and deployment today that do not
have a power_of_2 (PO2) zone size. The NVMe ZNS specification does not
mandate a PO2 zone size, but the Linux block layer currently requires
zoned devices to have PO2 zone sizes.

As a result, many in-kernel users such as F2FS and BTRFS, as well as
userspace applications, are designed around the assumption that zone
sizes are PO2.

This patchset aims to support non-power_of_2 zoned devices by adding an
emulation layer for NVMe ZNS devices, without affecting existing
applications and without regressing the current upstream implementation.

#Implementation:
A new callback is added to the block device operations (fops) which is
called when the driver needs special handling for a discovered
non-power_of_2 zoned device. This patchset wires up the callback only
for NVMe ZNS and for the null block driver (the latter to measure
performance). The SCSI ZAC/ZBC implementation is untouched.

Emulation is enabled by statically remapping the zones in the host
only; whenever a request is sent to the device via the block layer, the
sector is transformed to the actual device sector, as sketched below.
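
The remapping arithmetic boils down to the following (a simplified,
self-contained sketch of what patches 4 and 6 do; the struct and
function names are illustrative and kernel-style types are assumed, this
is not the actual driver code):

struct po2_emu {
	u64 zsze;	/* real device zone size in sectors            */
	u64 zsze_po2;	/* emulated (rounded-up) zone size in sectors  */
	u64 zsze_diff;	/* zsze_po2 - zsze, the per-zone gap           */
};

/* host (emulated) sector -> real device sector, applied on submission */
static u64 emu_to_device_sector(struct po2_emu *e, u64 sector)
{
	u64 zone_idx = sector / e->zsze_po2;	/* an ilog2() shift in the driver */

	return sector - zone_idx * e->zsze_diff;
}

/* real device sector -> host sector, applied e.g. to zone append results */
static u64 device_to_emu_sector(struct po2_emu *e, u64 sector)
{
	u64 zone_idx = sector / e->zsze;

	return sector + zone_idx * e->zsze_diff;
}

With the test configuration below (96M real zones emulated as 128M
zones), the start of emulated zone 2 (two 128M zones into the emulated
layout) maps to the start of device zone 2 (two 96M zones into the
device). I/O that falls into the emulated gap between the real zone end
and the power_of_2 boundary is handled separately: reads are zero-filled
and writes are failed (see patch 4).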

#Testing:
Two things need to be tested: that the upstream implementation for PO2
zone sizes does not regress, and the emulation implementation itself.

To do an apples-to-apples comparison, the following device specs were chosen
for testing (both on null_blk and QEMU):
PO2 device:  zone.size=128M zone.cap=96M
NPO2 device: zone.size=96M zone.cap=96M

##Regression:
These tests were run on a **PO2 device**.
PO2 device used:  zone.size=128M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd test groups were run and no regressions were found.

###Performance:
Performance tests were performed on a null_blk device. The following fio
command was used to measure performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G \
    --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k --loops=4

No regressions were found with the patches on a **PO2 device** compared
to the existing upstream implementation.

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  155     |  604     |   6.00    |  426     |  1663    |   8.77    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  157     |  613     |   5.92    |  425     |  1741    |   8.79    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  607     |  2370    |   12.06   |  622     |  2431    |   23.61   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  621     |  2425    |   11.80   |  633     |  2472    |   23.24   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  165     |  643     |   5.72    |  485     |  1896    |   8.03    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  167     |  654     |   5.62    |  483     |  1888    |   8.06    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  696     |  2718    |   11.29   |  692     |  2701    |   22.92   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  696     |  2718    |   11.29   |  730     |  2835    |   21.70   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  159     |  623     |   5.86    |  451     |  1760    |   8.58    |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  163     |  635     |   5.75    |  462     |  1806    |   8.36    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patches |  544     |  2124    |   14.44   |  553     |  2162    |   28.64   |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  554     |  2165    |   14.15   |  556     |  2171    |   28.52   |
x-----------------x---------------------------------x---------------------------------x

##Emulated device
NPO2 device: zone.size=96M zone.cap=96M

###blktests:
Blktests were executed with the following config:

TEST_DEVS=(/dev/nvme0n2)
TIMEOUT=100
RUN_ZONED_TESTS=1

The block and zbd test groups were run and all tests pass.

###Performance:
Performance tests were performed on a null_blk device. The following fio
command was used to measure performance:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G \
    --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k --loops=4

On average, the NPO2 device showed a performance degradation of less than 1%
compared to the PO2 device.

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential write:
x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  155     |  606     |   5.99    |  424     |  1655    |   8.83    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
|     IOdepth     |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  609     |  2378    |   12.04   |  620     |  2421    |   23.75   |
x-----------------x---------------------------------x---------------------------------x

Sequential read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  160     |  623     |   5.91    |  481     |  1878    |   8.11    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  696     |  2720    |   11.28   |  722     |  2819    |   21.96   |
x-----------------x---------------------------------x---------------------------------x

Random read:
x-----------------x---------------------------------x---------------------------------x
| IOdepth         |             1                   |             4                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  155     |  607     |   6.03    |  465     |  1817    |   8.31    |
x-----------------x---------------------------------x---------------------------------x

x-----------------x---------------------------------x---------------------------------x
| IOdepth         |            8                    |            16                   |
x-----------------x---------------------------------x---------------------------------x
|                 |  KIOPS   |BW(MiB/s) | Lat(usec) |  KIOPS   |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
|  With patches   |  552     |  2158    |   14.21   |  561     |  2190    |   28.27   |
x-----------------x---------------------------------x---------------------------------x

#TODO:
- The current implementation works only with the NVMe PCI transport, to
  limit the scope and impact.
  Support for the NVMe target will follow soon.

Pankaj Raghav (6):
  nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  block: Add npo2_zone_setup callback to block device fops
  block: add a bool member to request_queue for power_of_2 emulation
  nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  null_blk: forward the sector value from null_handle_memory_backed
  null_blk: Add support for power_of_2 emulation to the null blk device

 block/blk-zoned.c                 |   3 +
 drivers/block/null_blk/main.c     |  18 +--
 drivers/block/null_blk/null_blk.h |  12 ++
 drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
 drivers/nvme/host/core.c          |  28 +++--
 drivers/nvme/host/nvme.h          | 100 ++++++++++++++-
 drivers/nvme/host/pci.c           |   4 +
 drivers/nvme/host/zns.c           |  86 +++++++++++--
 include/linux/blk-mq.h            |   2 +
 include/linux/blkdev.h            |  25 ++++
 10 files changed, 428 insertions(+), 53 deletions(-)

-- 
2.25.1



* [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
       [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
                         ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

Remove the condition that disallows ZNS drives with a non-power_of_2
zone size from being updated, and use a generic division to calculate
the number of zones instead of relying on a log-and-shift calculation
on the zone size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/zns.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..ad02c61c0b52 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 	}
 
 	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
-	if (!is_power_of_2(ns->zsze)) {
-		dev_warn(ns->ctrl->device,
-			"invalid zone size:%llu for namespace:%u\n",
-			ns->zsze, ns->head->ns_id);
-		status = -ENODEV;
-		goto free_data;
-	}
 
 	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
 	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) >> ilog2(ns->zsze));
+			 get_capacity(ns->disk) / ns->zsze);
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
-- 
2.25.1



* [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
       [not found]   ` <CGME20220308165428eucas1p14ea0a38eef47055c4fa41d695c5a249d@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  3:46       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A new callback is added to the block device operations (fops) which is
used to set up the necessary hooks when a non-power_of_2 zone size is
detected on a zoned device.

This fops is called as part of blk_revalidate_disk_zones.

The primary use case for this callback is zoned devices that do not have
a power_of_2 zone size, such as ZNS drives. For example, the current
NVMe ZNS specification does not require the zone size to be a
power_of_2, but the Linux block layer still expects all zoned devices to
have a power_of_2 zone size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-zoned.c      | 3 +++
 include/linux/blkdev.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 602bef54c813..d3d821797559 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -575,6 +575,9 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
 	if (!get_capacity(disk))
 		return -EIO;
 
+	if (disk->fops->npo2_zone_setup)
+		disk->fops->npo2_zone_setup(disk);
+
 	/*
 	 * Ensure that all memory allocations in this context are done as if
 	 * GFP_NOIO was specified.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a12c031af887..08cf039c1622 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1472,6 +1472,7 @@ struct block_device_operations {
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
+	void (*npo2_zone_setup)(struct gendisk *disk);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
 	/* returns the length of the identifier or a negative errno: */
 	int (*get_unique_id)(struct gendisk *disk, u8 id[16],
-- 
2.25.1



* [PATCH 3/6] block: add a bool member to request_queue for power_of_2 emulation
       [not found]   ` <CGME20220308165432eucas1p18b36a238ef3f5a812ee7f9b0e52599a5@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A new member is added to the request_queue struct to indicate whether
power_of_2 emulation is enabled. Helpers are also added to get and set
that member.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/blkdev.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08cf039c1622..3a5d5ddc779c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -463,6 +463,7 @@ struct request_queue {
 	unsigned long		*seq_zones_wlock;
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
+	bool			po2_zone_emu;
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 	int			node;
@@ -705,6 +706,18 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
 {
 	return q->max_active_zones;
 }
+
+static inline void blk_queue_po2_zone_emu(struct request_queue *q,
+					  bool po2_zone_emu)
+{
+	q->po2_zone_emu = po2_zone_emu;
+}
+
+static inline bool blk_queue_is_po2_zone_emu(struct request_queue *q)
+{
+	return q->po2_zone_emu;
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 {
@@ -728,6 +741,17 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
 {
 	return 0;
 }
+
+static inline bool blk_queue_is_po2_zone_emu(struct request_queue *q)
+{
+	return false;
+}
+
+static inline void blk_queue_po2_zone_emu(struct request_queue *q,
+					  unsigned int po2_zone_emu)
+{
+}
+
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static inline unsigned int blk_queue_depth(struct request_queue *q)
-- 
2.25.1



* [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
       [not found]   ` <CGME20220308165436eucas1p1b76f3cb5b4fa1f7d78b51a3b1b44d160@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  4:04       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

A power_of_2 (PO2) zone size is not a requirement of the NVMe ZNS
specification; however, there are still many PO2 assumptions in place in
the block layer, in filesystems such as F2FS and btrfs, and in userspace
applications.

To keep those assumptions intact, provide an emulation layer for
non-power_of_2 zoned devices that does not create a performance
regression for existing zoned storage devices with PO2 zone sizes.
Callbacks are provided where needed in the hot paths to reduce the
performance impact.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/core.c |  28 +++++++----
 drivers/nvme/host/nvme.h | 100 ++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/pci.c  |   4 ++
 drivers/nvme/host/zns.c  |  79 +++++++++++++++++++++++++++++--
 include/linux/blk-mq.h   |   2 +
 5 files changed, 200 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4720d37cc0..c7180d512b08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -327,14 +327,6 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
 	return RETRY;
 }
 
-static inline void nvme_end_req_zoned(struct request *req)
-{
-	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-	    req_op(req) == REQ_OP_ZONE_APPEND)
-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
-			le64_to_cpu(nvme_req(req)->result.u64));
-}
-
 static inline void nvme_end_req(struct request *req)
 {
 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
@@ -676,6 +668,12 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
 
+blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req)
+{
+	return nvme_zone_handle_po2_emu_violation(req);
+}
+EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
+
 bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 		bool queue_live)
 {
@@ -879,6 +877,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
 	req->special_vec.bv_offset = offset_in_page(range);
 	req->special_vec.bv_len = alloc_size;
 	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+	nvme_verify_sector_value(ns, req);
 
 	return BLK_STS_OK;
 }
@@ -909,6 +908,7 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 			break;
 		}
 	}
+	nvme_verify_sector_value(ns, req);
 
 	return BLK_STS_OK;
 }
@@ -973,6 +973,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	cmnd->rw.control = cpu_to_le16(control);
 	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
+	nvme_verify_sector_value(ns, req);
 	return 0;
 }
 
@@ -2105,8 +2106,14 @@ static int nvme_report_zones(struct gendisk *disk, sector_t sector,
 	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
 			data);
 }
+
+static void nvme_npo2_zone_setup(struct gendisk *disk)
+{
+	nvme_ns_po2_zone_emu_setup(disk->private_data);
+}
 #else
 #define nvme_report_zones	NULL
+#define nvme_npo2_zone_setup	NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
 static const struct block_device_operations nvme_bdev_ops = {
@@ -2116,6 +2123,7 @@ static const struct block_device_operations nvme_bdev_ops = {
 	.release	= nvme_release,
 	.getgeo		= nvme_getgeo,
 	.report_zones	= nvme_report_zones,
+	.npo2_zone_setup = nvme_npo2_zone_setup,
 	.pr_ops		= &nvme_pr_ops,
 };
 
@@ -3844,6 +3852,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 	ns->disk = disk;
 	ns->queue = disk->queue;
 
+#ifdef CONFIG_BLK_DEV_ZONED
+	ns->sect_to_lba = __nvme_sect_to_lba;
+	ns->update_sector_append = nvme_update_sector_append_noop;
+#endif
 	if (ctrl->opts && ctrl->opts->data_digest)
 		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..f584f760e8cc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -457,6 +457,10 @@ struct nvme_ns {
 	u8 pi_type;
 #ifdef CONFIG_BLK_DEV_ZONED
 	u64 zsze;
+	u64 zsze_po2;
+	u32 zsze_diff;
+	u64 (*sect_to_lba)(struct nvme_ns *ns, sector_t sector);
+	sector_t (*update_sector_append)(struct nvme_ns *ns, sector_t sector);
 #endif
 	unsigned long features;
 	unsigned long flags;
@@ -562,12 +566,21 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
 	return ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
 }
 
+static inline u64 __nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
+{
+	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+}
+
 /*
  * Convert a 512B sector number to a device logical block number.
  */
 static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
 {
-	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+#ifdef CONFIG_BLK_DEV_ZONED
+	return ns->sect_to_lba(ns, sector);
+#else
+	return __nvme_sect_to_lba(ns, sector);
+#endif
 }
 
 /*
@@ -578,6 +591,83 @@ static inline sector_t nvme_lba_to_sect(struct nvme_ns *ns, u64 lba)
 	return lba << (ns->lba_shift - SECTOR_SHIFT);
 }
 
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline u64 __nvme_sect_to_lba_po2(struct nvme_ns *ns, sector_t sector)
+{
+	sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
+
+	sector = sector - (zone_idx * ns->zsze_diff);
+
+	return sector >> (ns->lba_shift - SECTOR_SHIFT);
+}
+
+static inline sector_t
+nvme_update_sector_append_po2_zone_emu(struct nvme_ns *ns, sector_t sector)
+{
+	/* The sector value passed by the drive after an append operation is
+	 * based on the actual zone layout in the device, hence, use the actual
+	 * zone_size to calculate the zone number from the sector.
+	 */
+	u32 zone_no = sector / ns->zsze;
+
+	sector += ns->zsze_diff * zone_no;
+	return sector;
+}
+
+static inline sector_t nvme_update_sector_append_noop(struct nvme_ns *ns,
+						      sector_t sector)
+{
+	return sector;
+}
+
+static inline void nvme_end_req_zoned(struct request *req)
+{
+	if (req_op(req) == REQ_OP_ZONE_APPEND) {
+		struct nvme_ns *ns = req->q->queuedata;
+		sector_t sector;
+
+		sector = nvme_lba_to_sect(ns,
+					  le64_to_cpu(nvme_req(req)->result.u64));
+
+		sector = ns->update_sector_append(ns, sector);
+
+		req->__sector = sector;
+	}
+}
+
+static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
+{
+	if (unlikely(blk_queue_is_po2_zone_emu(ns->queue))) {
+		sector_t sector = blk_rq_pos(req);
+		sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
+		sector_t device_sector = sector - (zone_idx * ns->zsze_diff);
+
+		/* Check if the IO is in the emulated area */
+		if (device_sector - (zone_idx * ns->zsze) > ns->zsze)
+			req->rq_flags |= RQF_ZONE_EMU_VIOLATION;
+	}
+}
+
+static inline bool nvme_po2_zone_emu_violation(struct request *req)
+{
+	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
+}
+#else
+static inline void nvme_end_req_zoned(struct request *req)
+{
+}
+
+static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
+{
+}
+
+static inline bool nvme_po2_zone_emu_violation(struct request *req)
+{
+	return false;
+}
+
+#endif
+
 /*
  * Convert byte length to nvme's 0-based num dwords
  */
@@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg);
 int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
+blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
 
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct pr_ops nvme_pr_ops;
@@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 int nvme_revalidate_zones(struct nvme_ns *ns);
 int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data);
+void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
 #ifdef CONFIG_BLK_DEV_ZONED
 int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 				       struct nvme_command *cmnd,
 				       enum nvme_zone_mgmt_action action);
+blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
 #else
 static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
@@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
 	return -EPROTONOSUPPORT;
 }
+
+static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
+{
+	return BLK_STS_OK;
+}
 #endif
 
 static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed680915..fc022df3f98e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 		return nvme_fail_nonready_command(&dev->ctrl, req);
 
 	ret = nvme_prep_rq(dev, req);
+
+	if (unlikely(nvme_po2_zone_emu_violation(req)))
+		return nvme_fail_po2_zone_emu_violation(req);
+
 	if (unlikely(ret))
 		return ret;
 	spin_lock(&nvmeq->sq_lock);
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index ad02c61c0b52..25516a5ae7e2 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -3,7 +3,9 @@
  * Copyright (C) 2020 Western Digital Corporation or its affiliates.
  */
 
+#include <linux/log2.h>
 #include <linux/blkdev.h>
+#include <linux/math.h>
 #include <linux/vmalloc.h>
 #include "nvme.h"
 
@@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
 	return 0;
 }
 
+static sector_t nvme_zone_size(struct nvme_ns *ns)
+{
+	sector_t zone_size;
+
+	if (blk_queue_is_po2_zone_emu(ns->queue))
+		zone_size = ns->zsze_po2;
+	else
+		zone_size = ns->zsze;
+
+	return zone_size;
+}
+
 int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
 {
 	struct nvme_effects_log *log = ns->head->effects;
@@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
 				   sizeof(struct nvme_zone_descriptor);
 
 	nr_zones = min_t(unsigned int, nr_zones,
-			 get_capacity(ns->disk) / ns->zsze);
+			 get_capacity(ns->disk) / nvme_zone_size(ns));
 
 	bufsize = sizeof(struct nvme_zone_report) +
 		nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
 				 void *data)
 {
 	struct blk_zone zone = { };
+	u64 zone_gap = 0;
+	u32 zone_idx;
 
 	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
 		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
@@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
 	zone.len = ns->zsze;
 	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
 	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
+
+	if (blk_queue_is_po2_zone_emu(ns->queue)) {
+		zone_idx = zone.start / zone.len;
+		zone_gap = zone_idx * ns->zsze_diff;
+		zone.start += zone_gap;
+		zone.len = ns->zsze_po2;
+	}
+
 	if (zone.cond == BLK_ZONE_COND_FULL)
 		zone.wp = zone.start + zone.len;
 	else
-		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
+		zone.wp =
+			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
 
 	return cb(&zone, idx, data);
 }
@@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	struct nvme_zone_report *report;
 	struct nvme_command c = { };
 	int ret, zone_idx = 0;
+	u64 zone_size = nvme_zone_size(ns);
 	unsigned int nz, i;
 	size_t buflen;
 
@@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	sector = rounddown(sector, zone_size);
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 
@@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 			zone_idx++;
 		}
 
-		sector += ns->zsze * nz;
+		sector += zone_size * nz;
 	}
 
 	if (zone_idx > 0)
@@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	return ret;
 }
 
+void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
+{
+	u32 nr_zones;
+	sector_t capacity;
+
+	if (is_power_of_2(ns->zsze))
+		return;
+
+	if (!get_capacity(ns->disk))
+		return;
+
+	blk_mq_freeze_queue(ns->disk->queue);
+
+	blk_queue_po2_zone_emu(ns->queue, 1);
+	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
+	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
+
+	nr_zones = get_capacity(ns->disk) / ns->zsze;
+	capacity = nr_zones * ns->zsze_po2;
+	set_capacity_and_notify(ns->disk, capacity);
+	ns->sect_to_lba = __nvme_sect_to_lba_po2;
+	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
+
+	blk_mq_unfreeze_queue(ns->disk->queue);
+}
+
 blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 		struct nvme_command *c, enum nvme_zone_mgmt_action action)
 {
@@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
 	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
 		c->zms.select_all = 1;
 
+	nvme_verify_sector_value(ns, req);
+	return BLK_STS_OK;
+}
+
+blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
+{
+	/*  The spec mentions that read from ZCAP until ZSZE shall behave
+	 *  like a deallocated block. Deallocated block reads are
+	 *  deterministic, hence we fill zero.
+	 * The spec does not clearly define the result for other operations.
+	 */
+	if (req_op(req) == REQ_OP_READ) {
+		zero_fill_bio(req->bio);
+		nvme_req(req)->status = NVME_SC_SUCCESS;
+	} else {
+		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
+	}
+	blk_mq_set_request_complete(req);
+	nvme_complete_rq(req);
 	return BLK_STS_OK;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3a41d50b85d3..9ec59183efcd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 /* queue has elevator attached */
 #define RQF_ELV			((__force req_flags_t)(1 << 22))
+/* request to do IO in the emulated area with po2 zone emulation */
+#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-- 
2.25.1



* [PATCH 5/6] null_blk: forward the sector value from null_handle_memory_backed
       [not found]   ` <CGME20220308165443eucas1p17e61670a5057f21a6c073711b284bfeb@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

This is a preparation patch to add support for power_of_2 emulation in
the null_blk driver.

Currently, the sector value from null_handle_memory_backed is not
forwarded to the lower layer functions null_handle_rq and
null_handle_bio; instead it is fetched again from the request or the
bio, respectively. This behaviour will not work when zone size emulation
is enabled.

Instead of fetching the sector value again from the request or bio, pass
the sector value down from null_handle_memory_backed to
null_handle_rq/bio.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 05b1120e6623..625a06bfa5ad 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1204,13 +1204,12 @@ static int null_transfer(struct nullb *nullb, struct page *page,
 	return err;
 }
 
-static int null_handle_rq(struct nullb_cmd *cmd)
+static int null_handle_rq(struct nullb_cmd *cmd, sector_t sector)
 {
 	struct request *rq = cmd->rq;
 	struct nullb *nullb = cmd->nq->dev->nullb;
 	int err;
 	unsigned int len;
-	sector_t sector = blk_rq_pos(rq);
 	struct req_iterator iter;
 	struct bio_vec bvec;
 
@@ -1231,13 +1230,12 @@ static int null_handle_rq(struct nullb_cmd *cmd)
 	return 0;
 }
 
-static int null_handle_bio(struct nullb_cmd *cmd)
+static int null_handle_bio(struct nullb_cmd *cmd, sector_t sector)
 {
 	struct bio *bio = cmd->bio;
 	struct nullb *nullb = cmd->nq->dev->nullb;
 	int err;
 	unsigned int len;
-	sector_t sector = bio->bi_iter.bi_sector;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 
@@ -1320,9 +1318,9 @@ static inline blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd,
 		return null_handle_discard(dev, sector, nr_sectors);
 
 	if (dev->queue_mode == NULL_Q_BIO)
-		err = null_handle_bio(cmd);
+		err = null_handle_bio(cmd, sector);
 	else
-		err = null_handle_rq(cmd);
+		err = null_handle_rq(cmd, sector);
 
 	return errno_to_blk_status(err);
 }
-- 
2.25.1



* [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
       [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
@ 2022-03-08 16:53     ` Pankaj Raghav
  2022-03-09  4:09       ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 16:53 UTC (permalink / raw)
  To: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme, Pankaj Raghav

power_of_2 (PO2) emulation support is added to the null_blk device to
measure performance.

Callbacks are added in the hot paths that require PO2-emulation-specific
behaviour, to reduce the impact on the existing path.

The power_of_2 emulation support is wired up for both the request and
the bio queue modes, and it is automatically enabled when the given zone
size is not a power_of_2.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/null_blk/main.c     |   8 +-
 drivers/block/null_blk/null_blk.h |  12 ++
 drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
 3 files changed, 196 insertions(+), 27 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 625a06bfa5ad..c926b59f2b17 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -210,7 +210,7 @@ MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Defau
 
 static unsigned long g_zone_size = 256;
 module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
-MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
+MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Default: 256");
 
 static unsigned long g_zone_capacity;
 module_param_named(zone_capacity, g_zone_capacity, ulong, 0444);
@@ -1772,11 +1772,13 @@ static const struct block_device_operations null_bio_ops = {
 	.owner		= THIS_MODULE,
 	.submit_bio	= null_submit_bio,
 	.report_zones	= null_report_zones,
+	.npo2_zone_setup = null_po2_zone_emu_setup,
 };
 
 static const struct block_device_operations null_rq_ops = {
 	.owner		= THIS_MODULE,
 	.report_zones	= null_report_zones,
+	.npo2_zone_setup = null_po2_zone_emu_setup,
 };
 
 static int setup_commands(struct nullb_queue *nq)
@@ -1929,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
 		dev->mbps = 0;
 
 	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	    (!dev->zone_size)) {
+		pr_err("zone_size must not be zero\n");
 		return -EINVAL;
 	}
 
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 78eb56b0ca55..34c1b7b2546b 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -74,6 +74,16 @@ struct nullb_device {
 	unsigned int imp_close_zone_no;
 	struct nullb_zone *zones;
 	sector_t zone_size_sects;
+	sector_t zone_size_po2_sects;
+	sector_t zone_size_diff_sects;
+	/* The callbacks below are used as hook to perform po2 emulation for a
+	 * zoned device.
+	 */
+	unsigned int (*zone_no)(struct nullb_device *dev,
+				sector_t sect);
+	sector_t (*zone_update_sector)(struct nullb_device *dev, sector_t sect);
+	sector_t (*zone_update_sector_append)(struct nullb_device *dev,
+					      sector_t sect);
 	bool need_zone_res_mgmt;
 	spinlock_t zone_res_lock;
 
@@ -137,6 +147,7 @@ int null_register_zoned_dev(struct nullb *nullb);
 void null_free_zoned_dev(struct nullb_device *dev);
 int null_report_zones(struct gendisk *disk, sector_t sector,
 		      unsigned int nr_zones, report_zones_cb cb, void *data);
+void null_po2_zone_emu_setup(struct gendisk *disk);
 blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd,
 				    enum req_opf op, sector_t sector,
 				    sector_t nr_sectors);
@@ -166,5 +177,6 @@ static inline size_t null_zone_valid_read_len(struct nullb *nullb,
 	return len;
 }
 #define null_report_zones	NULL
+#define null_po2_zone_emu_setup	NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 #endif /* __NULL_BLK_H */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dae54dd1aeac..3bb63c170149 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -16,6 +16,44 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
 	return sect >> ilog2(dev->zone_size_sects);
 }
 
+static inline unsigned int null_npo2_zone_no(struct nullb_device *dev,
+					     sector_t sect)
+{
+	return sect / dev->zone_size_sects;
+}
+
+static inline bool null_is_po2_zone_emu(struct nullb_device *dev)
+{
+	return !!dev->zone_size_diff_sects;
+}
+
+static inline sector_t null_zone_update_sector_noop(struct nullb_device *dev,
+						    sector_t sect)
+{
+	return sect;
+}
+
+static inline sector_t null_zone_update_sector_po2_emu(struct nullb_device *dev,
+						       sector_t sect)
+{
+	sector_t zsze_po2 = dev->zone_size_po2_sects;
+	sector_t zsze_diff = dev->zone_size_diff_sects;
+	u32 zone_idx = sect >> ilog2(zsze_po2);
+
+	sect = sect - (zone_idx * zsze_diff);
+	return sect;
+}
+
+static inline sector_t
+null_zone_update_sector_append_po2_emu(struct nullb_device *dev, sector_t sect)
+{
+	/* Need to readjust the sector if po2 emulation is used. */
+	u32 zone_no = dev->zone_no(dev, sect);
+
+	sect += dev->zone_size_diff_sects * zone_no;
+	return sect;
+}
+
 static inline void null_lock_zone_res(struct nullb_device *dev)
 {
 	if (dev->need_zone_res_mgmt)
@@ -62,15 +100,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	sector_t sector = 0;
 	unsigned int i;
 
-	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("zone_size must be power-of-two\n");
-		return -EINVAL;
-	}
 	if (dev->zone_size > dev->size) {
 		pr_err("Zone size larger than device capacity\n");
 		return -EINVAL;
 	}
 
+	if (!is_power_of_2(dev->zone_size))
+		pr_info("zone_size is not power-of-two. power-of-two emulation is enabled");
+
 	if (!dev->zone_capacity)
 		dev->zone_capacity = dev->zone_size;
 
@@ -83,8 +120,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
 	dev_capacity_sects = mb_to_sects(dev->size);
 	dev->zone_size_sects = mb_to_sects(dev->zone_size);
-	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
-		>> ilog2(dev->zone_size_sects);
+
+	dev->nr_zones = roundup(dev_capacity_sects, dev->zone_size_sects) /
+			dev->zone_size_sects;
+
+	dev->zone_no = null_zone_no;
+	dev->zone_update_sector = null_zone_update_sector_noop;
+	dev->zone_update_sector_append = null_zone_update_sector_noop;
+
 
 	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
 				    GFP_KERNEL | __GFP_ZERO);
@@ -166,7 +209,13 @@ int null_register_zoned_dev(struct nullb *nullb)
 		if (ret)
 			return ret;
 	} else {
-		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		nullb->disk->fops->npo2_zone_setup(nullb->disk);
+
+		if (null_is_po2_zone_emu(dev))
+			blk_queue_chunk_sectors(q, dev->zone_size_po2_sects);
+		else
+			blk_queue_chunk_sectors(q, dev->zone_size_sects);
+
 		q->nr_zones = blkdev_nr_zones(nullb->disk);
 	}
 
@@ -183,17 +232,49 @@ void null_free_zoned_dev(struct nullb_device *dev)
 	dev->zones = NULL;
 }
 
+static void null_update_zone_info(struct nullb *nullb, struct blk_zone *blkz,
+				  struct nullb_zone *zone)
+{
+	unsigned int zone_idx;
+	u64 zone_gap;
+	struct nullb_device *dev = nullb->dev;
+
+	if (null_is_po2_zone_emu(dev)) {
+		zone_idx = zone->start / zone->len;
+		zone_gap = zone_idx * dev->zone_size_diff_sects;
+
+		blkz->start = zone->start + zone_gap;
+		blkz->len = dev->zone_size_po2_sects;
+		blkz->wp = zone->wp + zone_gap;
+	} else {
+		blkz->start = zone->start;
+		blkz->len = zone->len;
+		blkz->wp = zone->wp;
+	}
+
+	blkz->type = zone->type;
+	blkz->cond = zone->cond;
+	blkz->capacity = zone->capacity;
+}
+
 int null_report_zones(struct gendisk *disk, sector_t sector,
 		unsigned int nr_zones, report_zones_cb cb, void *data)
 {
 	struct nullb *nullb = disk->private_data;
 	struct nullb_device *dev = nullb->dev;
 	unsigned int first_zone, i;
+	sector_t zone_size;
 	struct nullb_zone *zone;
 	struct blk_zone blkz;
 	int error;
 
-	first_zone = null_zone_no(dev, sector);
+	if (null_is_po2_zone_emu(dev))
+		zone_size = dev->zone_size_po2_sects;
+	else
+		zone_size = dev->zone_size_sects;
+
+	first_zone = sector / zone_size;
+
 	if (first_zone >= dev->nr_zones)
 		return 0;
 
@@ -210,12 +291,7 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
 		 * array.
 		 */
 		null_lock_zone(dev, zone);
-		blkz.start = zone->start;
-		blkz.len = zone->len;
-		blkz.wp = zone->wp;
-		blkz.type = zone->type;
-		blkz.cond = zone->cond;
-		blkz.capacity = zone->capacity;
+		null_update_zone_info(nullb, &blkz, zone);
 		null_unlock_zone(dev, zone);
 
 		error = cb(&blkz, i, data);
@@ -226,6 +302,35 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
 	return nr_zones;
 }
 
+void null_po2_zone_emu_setup(struct gendisk *disk)
+{
+	struct nullb *nullb = disk->private_data;
+	struct nullb_device *dev = nullb->dev;
+	sector_t capacity;
+
+	if (is_power_of_2(dev->zone_size_sects))
+		return;
+
+	if (!get_capacity(disk))
+		return;
+
+	blk_mq_freeze_queue(disk->queue);
+
+	blk_queue_po2_zone_emu(disk->queue, 1);
+	dev->zone_size_po2_sects =
+		1 << get_count_order_long(dev->zone_size_sects);
+	dev->zone_size_diff_sects =
+		dev->zone_size_po2_sects - dev->zone_size_sects;
+	dev->zone_no = null_npo2_zone_no;
+	dev->zone_update_sector = null_zone_update_sector_po2_emu;
+	dev->zone_update_sector_append = null_zone_update_sector_append_po2_emu;
+
+	capacity = dev->nr_zones * dev->zone_size_po2_sects;
+	set_capacity(disk, capacity);
+
+	blk_mq_unfreeze_queue(disk->queue);
+}
+
 /*
  * This is called in the case of memory backing from null_process_cmd()
  * with the target zone already locked.
@@ -234,7 +339,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
 				sector_t sector, unsigned int len)
 {
 	struct nullb_device *dev = nullb->dev;
-	struct nullb_zone *zone = &dev->zones[null_zone_no(dev, sector)];
+	struct nullb_zone *zone = &dev->zones[dev->zone_no(dev, sector)];
 	unsigned int nr_sectors = len >> SECTOR_SHIFT;
 
 	/* Read must be below the write pointer position */
@@ -363,11 +468,24 @@ static blk_status_t null_check_zone_resources(struct nullb_device *dev,
 	}
 }
 
+static void null_update_sector_append(struct nullb_cmd *cmd, sector_t sector)
+{
+	struct nullb_device *dev = cmd->nq->dev;
+
+	sector = dev->zone_update_sector_append(dev, sector);
+
+	if (cmd->bio) {
+		cmd->bio->bi_iter.bi_sector = sector;
+	} else {
+		cmd->rq->__sector = sector;
+	}
+}
+
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 				    unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
-	unsigned int zno = null_zone_no(dev, sector);
+	unsigned int zno = dev->zone_no(dev, sector);
 	struct nullb_zone *zone = &dev->zones[zno];
 	blk_status_t ret;
 
@@ -395,10 +513,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 	 */
 	if (append) {
 		sector = zone->wp;
-		if (cmd->bio)
-			cmd->bio->bi_iter.bi_sector = sector;
-		else
-			cmd->rq->__sector = sector;
+		null_update_sector_append(cmd, sector);
 	} else if (sector != zone->wp) {
 		ret = BLK_STS_IOERR;
 		goto unlock;
@@ -619,7 +734,7 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
 		return BLK_STS_OK;
 	}
 
-	zone_no = null_zone_no(dev, sector);
+	zone_no = dev->zone_no(dev, sector);
 	zone = &dev->zones[zone_no];
 
 	null_lock_zone(dev, zone);
@@ -650,13 +765,54 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
 	return ret;
 }
 
+static blk_status_t null_handle_po2_zone_emu_violation(struct nullb_cmd *cmd,
+						       enum req_opf op)
+{
+	if (op == REQ_OP_READ) {
+		if (cmd->bio)
+			zero_fill_bio(cmd->bio);
+		else
+			zero_fill_bio(cmd->rq->bio);
+
+		return BLK_STS_OK;
+	} else {
+		return BLK_STS_IOERR;
+	}
+}
+
+static bool null_verify_sector_violation(struct nullb_device *dev,
+					 sector_t sector)
+{
+	if (unlikely(null_is_po2_zone_emu(dev))) {
+		/* The zone idx should be calculated based on the emulated
+		 * layout
+		 */
+		u32 zone_idx = sector >> ilog2(dev->zone_size_po2_sects);
+		sector_t zsze = dev->zone_size_sects;
+		sector_t sect = null_zone_update_sector_po2_emu(dev, sector);
+
+		if (sect - (zone_idx * zsze) > zsze)
+			return true;
+	}
+	return false;
+}
+
 blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
 				    sector_t sector, sector_t nr_sectors)
 {
-	struct nullb_device *dev;
+	struct nullb_device *dev = cmd->nq->dev;
 	struct nullb_zone *zone;
 	blk_status_t sts;
 
+	/* Handle when the sector falls in the emulated area */
+	if (unlikely(null_verify_sector_violation(dev, sector)))
+		return null_handle_po2_zone_emu_violation(cmd, op);
+
+	/* The sector value is updated if po2 emulation is enabled, else it
+	 * will have no effect on the value
+	 */
+	sector = dev->zone_update_sector(dev, sector);
+
 	switch (op) {
 	case REQ_OP_WRITE:
 		return null_zone_write(cmd, sector, nr_sectors, false);
@@ -669,8 +825,7 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
 	case REQ_OP_ZONE_FINISH:
 		return null_zone_mgmt(cmd, op, sector);
 	default:
-		dev = cmd->nq->dev;
-		zone = &dev->zones[null_zone_no(dev, sector)];
+		zone = &dev->zones[dev->zone_no(dev, sector)];
 
 		null_lock_zone(dev, zone);
 		sts = null_process_cmd(cmd, op, sector, nr_sectors);
-- 
2.25.1



* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
@ 2022-03-08 17:14       ` Keith Busch
  2022-03-08 17:43         ` Pankaj Raghav
  2022-03-09  3:40       ` Damien Le Moal
  2022-03-09  3:44       ` Damien Le Moal
  2 siblings, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-08 17:14 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 08, 2022 at 05:53:44PM +0100, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);
> -- 

The ZNS report zones code realigns the starting sector using an expected
pow2 value, so I think you need to update that as well with something
like the following:

@@ -197,7 +189,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
 	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
 	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
 
-	sector &= ~(ns->zsze - 1);
+	sector = sector - sector % ns->zsze;
 	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
 		memset(report, 0, buflen);
 


* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 17:14       ` Keith Busch
@ 2022-03-08 17:43         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-08 17:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme



On 2022-03-08 18:14, Keith Busch wrote:
> The zns report zones realigns the starting sector using an expected pow2
> value, so I think you need to update that as well with something like
> the following:
> 
> @@ -197,7 +189,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>  
> -	sector &= ~(ns->zsze - 1);
> +	sector = sector - sector % ns->zsze;
>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>  		memset(report, 0, buflen);
>  

I actually have these changes in Patch 4/6:
-	sector &= ~(ns->zsze - 1);
+	sector = rounddown(sector, zone_size);

But you are right, I should move those changes to this patch, as this patch
removes the po2 assumptions from the NVMe ZNS driver.

I will fix it up in the next revision. Thanks.

-- 
Regards,
Pankaj


* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
@ 2022-03-09  3:40       ` Damien Le Moal
  2022-03-09 13:19         ` Pankaj Raghav
  2022-03-09  3:44       ` Damien Le Moal
  2 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:40 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}
>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);

This will not compile on 32-bit arches. This needs to use div64_u64().
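
(A minimal sketch of what that could look like, assuming div64_u64() from
<linux/math64.h>; the actual fix may of course differ:)

	nr_zones = min_t(unsigned int, nr_zones,
			 div64_u64(get_capacity(ns->disk), ns->zsze));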

>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
  2022-03-08 17:14       ` Keith Busch
  2022-03-09  3:40       ` Damien Le Moal
@ 2022-03-09  3:44       ` Damien Le Moal
  2022-03-09 13:35         ` Pankaj Raghav
  2 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:44 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/zns.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..ad02c61c0b52 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  	}
>  
>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> -	if (!is_power_of_2(ns->zsze)) {
> -		dev_warn(ns->ctrl->device,
> -			"invalid zone size:%llu for namespace:%u\n",
> -			ns->zsze, ns->head->ns_id);
> -		status = -ENODEV;
> -		goto free_data;
> -	}

Doing this will allow a non power of 2 zone sized device to be seen by
the block layer. This will break functions such as blkdev_nr_zones(), but
this patch does not change these functions, nor the others that use bit
shift calculations.
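
(For reference, a rough sketch of the shift-based math in blkdev_nr_zones()
that assumes a power-of-2 zone size; the exact kernel code may differ slightly:)

	/* only valid when zone_sectors is a power of 2 */
	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);

	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);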

>  
>  	blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>  	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
> +			 get_capacity(ns->disk) / ns->zsze);
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
  2022-03-08 16:53     ` [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops Pankaj Raghav
@ 2022-03-09  3:46       ` Damien Le Moal
  2022-03-09 14:02         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  3:46 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> A new fops is added to the block device which will be used to set up the
> necessary hooks when a non-power_of_2 zone size is detected in a zoned
> device.
> 
> This fops will be called as a part of blk_revalidate_disk_zones.

And what does this new hook do ? You are actually not explaining it, nor
why it should be called from blk_revalidate_disk_zones().

Also, blk_revalidate_zone_cb() uses a bit shift, but neither this patch nor
the previous one fixes that.

> 
> The primary use case for this callback is to deal with zoned devices
> that do not have a power_of_2 zone size, such as ZNS drives. For example,
> the current NVMe ZNS specification does not require the zone size to be
> power_of_2, but the linux block layer still expects all zoned devices to
> have a power_of_2 zone size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-zoned.c      | 3 +++
>  include/linux/blkdev.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 602bef54c813..d3d821797559 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -575,6 +575,9 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>  	if (!get_capacity(disk))
>  		return -EIO;
>  
> +	if (disk->fops->npo2_zone_setup)
> +		disk->fops->npo2_zone_setup(disk);
> +
>  	/*
>  	 * Ensure that all memory allocations in this context are done as if
>  	 * GFP_NOIO was specified.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index a12c031af887..08cf039c1622 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1472,6 +1472,7 @@ struct block_device_operations {
>  	void (*swap_slot_free_notify) (struct block_device *, unsigned long);
>  	int (*report_zones)(struct gendisk *, sector_t sector,
>  			unsigned int nr_zones, report_zones_cb cb, void *data);
> +	void (*npo2_zone_setup)(struct gendisk *disk);
>  	char *(*devnode)(struct gendisk *disk, umode_t *mode);
>  	/* returns the length of the identifier or a negative errno: */
>  	int (*get_unique_id)(struct gendisk *disk, u8 id[16],


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-08 16:53     ` [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices Pankaj Raghav
@ 2022-03-09  4:04       ` Damien Le Moal
  2022-03-09 14:33         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  4:04 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> power_of_2(PO2) is not a requirement as per the NVMe ZNS specification,
> however we still have in place a lot of assumptions
> for PO2 in the block layer, in filesystems such as F2FS and btrfs, and in
> userspace applications.
> 
> So in keeping with these requirements, provide an emulation layer to
> non-power_of_2 zone devices and which does not create a performance
> regression to existing zone storage devices which have PO2 zone sizes.
> Callbacks are provided where needed in the hot paths to reduce the
> impact of performance regression.

Contradiction: reducing the impact of performance regression is not the
same as "does not create a performance regression". So which one is it ?
Please add performance numbers to this commit message.

And also please actually explain what the patch is changing. This commit
message is about the why, but nothing on the how.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/nvme/host/core.c |  28 +++++++----
>  drivers/nvme/host/nvme.h | 100 ++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/pci.c  |   4 ++
>  drivers/nvme/host/zns.c  |  79 +++++++++++++++++++++++++++++--
>  include/linux/blk-mq.h   |   2 +
>  5 files changed, 200 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fd4720d37cc0..c7180d512b08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -327,14 +327,6 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
>  	return RETRY;
>  }
>  
> -static inline void nvme_end_req_zoned(struct request *req)
> -{
> -	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> -	    req_op(req) == REQ_OP_ZONE_APPEND)
> -		req->__sector = nvme_lba_to_sect(req->q->queuedata,
> -			le64_to_cpu(nvme_req(req)->result.u64));
> -}
> -
>  static inline void nvme_end_req(struct request *req)
>  {
>  	blk_status_t status = nvme_error_status(nvme_req(req)->status);
> @@ -676,6 +668,12 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
>  }
>  EXPORT_SYMBOL_GPL(nvme_fail_nonready_command);
>  
> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req)
> +{
> +	return nvme_zone_handle_po2_emu_violation(req);
> +}
> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
> +

This should go in zns.c, not in the core.

>  bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>  		bool queue_live)
>  {
> @@ -879,6 +877,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
>  	req->special_vec.bv_offset = offset_in_page(range);
>  	req->special_vec.bv_len = alloc_size;
>  	req->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +	nvme_verify_sector_value(ns, req);
>  
>  	return BLK_STS_OK;
>  }
> @@ -909,6 +908,7 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
>  			break;
>  		}
>  	}
> +	nvme_verify_sector_value(ns, req);
>  
>  	return BLK_STS_OK;
>  }
> @@ -973,6 +973,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  
>  	cmnd->rw.control = cpu_to_le16(control);
>  	cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);
> +	nvme_verify_sector_value(ns, req);
>  	return 0;
>  }
>  
> @@ -2105,8 +2106,14 @@ static int nvme_report_zones(struct gendisk *disk, sector_t sector,
>  	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
>  			data);
>  }
> +
> +static void nvme_npo2_zone_setup(struct gendisk *disk)
> +{
> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
> +}

This helper seems useless.

>  #else
>  #define nvme_report_zones	NULL
> +#define nvme_npo2_zone_setup	NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
>  static const struct block_device_operations nvme_bdev_ops = {
> @@ -2116,6 +2123,7 @@ static const struct block_device_operations nvme_bdev_ops = {
>  	.release	= nvme_release,
>  	.getgeo		= nvme_getgeo,
>  	.report_zones	= nvme_report_zones,
> +	.npo2_zone_setup = nvme_npo2_zone_setup,
>  	.pr_ops		= &nvme_pr_ops,
>  };
>  
> @@ -3844,6 +3852,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>  	ns->disk = disk;
>  	ns->queue = disk->queue;
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	ns->sect_to_lba = __nvme_sect_to_lba;
> +	ns->update_sector_append = nvme_update_sector_append_noop;
> +#endif
>  	if (ctrl->opts && ctrl->opts->data_digest)
>  		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>  
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index a162f6c6da6e..f584f760e8cc 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -457,6 +457,10 @@ struct nvme_ns {
>  	u8 pi_type;
>  #ifdef CONFIG_BLK_DEV_ZONED
>  	u64 zsze;
> +	u64 zsze_po2;
> +	u32 zsze_diff;
> +	u64 (*sect_to_lba)(struct nvme_ns *ns, sector_t sector);
> +	sector_t (*update_sector_append)(struct nvme_ns *ns, sector_t sector);
>  #endif
>  	unsigned long features;
>  	unsigned long flags;
> @@ -562,12 +566,21 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
>  	return ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
>  }
>  
> +static inline u64 __nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
> +{
> +	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +}
> +
>  /*
>   * Convert a 512B sector number to a device logical block number.
>   */
>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>  {
> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	return ns->sect_to_lba(ns, sector);

So for a power of 2 zone sized device, you are forcing an indirect call,
always. Not acceptable. What is the point of that po2_zone_emu boolean
you added to the queue ?

> +#else
> +	return __nvme_sect_to_lba(ns, sector);

This helper is useless.

> +#endif
>  }
>  
>  /*
> @@ -578,6 +591,83 @@ static inline sector_t nvme_lba_to_sect(struct nvme_ns *ns, u64 lba)
>  	return lba << (ns->lba_shift - SECTOR_SHIFT);
>  }
>  
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline u64 __nvme_sect_to_lba_po2(struct nvme_ns *ns, sector_t sector)
> +{
> +	sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
> +
> +	sector = sector - (zone_idx * ns->zsze_diff);
> +
> +	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> +}
> +
> +static inline sector_t
> +nvme_update_sector_append_po2_zone_emu(struct nvme_ns *ns, sector_t sector)
> +{
> +	/* The sector value passed by the drive after a append operation is the
> +	 * based on the actual zone layout in the device, hence, use the actual
> +	 * zone_size to calculate the zone number from the sector.
> +	 */
> +	u32 zone_no = sector / ns->zsze;
> +
> +	sector += ns->zsze_diff * zone_no;
> +	return sector;
> +}
> +
> +static inline sector_t nvme_update_sector_append_noop(struct nvme_ns *ns,
> +						      sector_t sector)
> +{
> +	return sector;
> +}
> +
> +static inline void nvme_end_req_zoned(struct request *req)
> +{
> +	if (req_op(req) == REQ_OP_ZONE_APPEND) {
> +		struct nvme_ns *ns = req->q->queuedata;
> +		sector_t sector;
> +
> +		sector = nvme_lba_to_sect(ns,
> +					  le64_to_cpu(nvme_req(req)->result.u64));
> +
> +		sector = ns->update_sector_append(ns, sector);

Why not assign that to req->__sector directly ?
And again here, you are forcing the indirect function call for *all* zns
devices, even those that have a power of 2 zone size.

> +
> +		req->__sector = sector;
> +	}
> +}
> +
> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
> +{
> +	if (unlikely(blk_queue_is_po2_zone_emu(ns->queue))) {
> +		sector_t sector = blk_rq_pos(req);
> +		sector_t zone_idx = sector >> ilog2(ns->zsze_po2);
> +		sector_t device_sector = sector - (zone_idx * ns->zsze_diff);
> +
> +		/* Check if the IO is in the emulated area */
> +		if (device_sector - (zone_idx * ns->zsze) > ns->zsze)
> +			req->rq_flags |= RQF_ZONE_EMU_VIOLATION;
> +	}
> +}
> +
> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
> +{
> +	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
> +}

This helper makes the code unreadable in my opinion.

> +#else
> +static inline void nvme_end_req_zoned(struct request *req)
> +{
> +}
> +
> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
> +{
> +}
> +
> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
> +{
> +	return false;
> +}
> +
> +#endif
> +
>  /*
>   * Convert byte length to nvme's 0-based num dwords
>   */
> @@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>  long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>  		unsigned long arg);
>  int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
>  
>  extern const struct attribute_group *nvme_ns_id_attr_groups[];
>  extern const struct pr_ops nvme_pr_ops;
> @@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>  int nvme_revalidate_zones(struct nvme_ns *ns);
>  int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  		unsigned int nr_zones, report_zones_cb cb, void *data);
> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
>  #ifdef CONFIG_BLK_DEV_ZONED
>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  				       struct nvme_command *cmnd,
>  				       enum nvme_zone_mgmt_action action);
> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
>  #else
>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>  		struct request *req, struct nvme_command *cmnd,
> @@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>  	return -EPROTONOSUPPORT;
>  }
> +
> +static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
> +{
> +	return BLK_STS_OK;
> +}
>  #endif
>  
>  static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6a99ed680915..fc022df3f98e 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		return nvme_fail_nonready_command(&dev->ctrl, req);
>  
>  	ret = nvme_prep_rq(dev, req);
> +
> +	if (unlikely(nvme_po2_zone_emu_violation(req)))
> +		return nvme_fail_po2_zone_emu_violation(req);
> +
>  	if (unlikely(ret))
>  		return ret;
>  	spin_lock(&nvmeq->sq_lock);
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index ad02c61c0b52..25516a5ae7e2 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -3,7 +3,9 @@
>   * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>   */
>  
> +#include <linux/log2.h>
>  #include <linux/blkdev.h>
> +#include <linux/math.h>
>  #include <linux/vmalloc.h>
>  #include "nvme.h"
>  
> @@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>  	return 0;
>  }
>  
> +static sector_t nvme_zone_size(struct nvme_ns *ns)
> +{
> +	sector_t zone_size;
> +
> +	if (blk_queue_is_po2_zone_emu(ns->queue))
> +		zone_size = ns->zsze_po2;
> +	else
> +		zone_size = ns->zsze;
> +
> +	return zone_size;
> +}
> +
>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>  {
>  	struct nvme_effects_log *log = ns->head->effects;
> @@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>  				   sizeof(struct nvme_zone_descriptor);
>  
>  	nr_zones = min_t(unsigned int, nr_zones,
> -			 get_capacity(ns->disk) / ns->zsze);
> +			 get_capacity(ns->disk) / nvme_zone_size(ns));
>  
>  	bufsize = sizeof(struct nvme_zone_report) +
>  		nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>  				 void *data)
>  {
>  	struct blk_zone zone = { };
> +	u64 zone_gap = 0;
> +	u32 zone_idx;
>  
>  	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>  		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
> @@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>  	zone.len = ns->zsze;
>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
> +
> +	if (blk_queue_is_po2_zone_emu(ns->queue)) {
> +		zone_idx = zone.start / zone.len;
> +		zone_gap = zone_idx * ns->zsze_diff;
> +		zone.start += zone_gap;
> +		zone.len = ns->zsze_po2;
> +	}
> +
>  	if (zone.cond == BLK_ZONE_COND_FULL)
>  		zone.wp = zone.start + zone.len;
>  	else
> -		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
> +		zone.wp =
> +			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
>  
>  	return cb(&zone, idx, data);
>  }
> @@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	struct nvme_zone_report *report;
>  	struct nvme_command c = { };
>  	int ret, zone_idx = 0;
> +	u64 zone_size = nvme_zone_size(ns);
>  	unsigned int nz, i;
>  	size_t buflen;
>  
> @@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>  
> -	sector &= ~(ns->zsze - 1);
> +	sector = rounddown(sector, zone_size);
>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>  		memset(report, 0, buflen);
>  
> @@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  			zone_idx++;
>  		}
>  
> -		sector += ns->zsze * nz;
> +		sector += zone_size * nz;
>  	}
>  
>  	if (zone_idx > 0)
> @@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>  	return ret;
>  }
>  
> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
> +{
> +	u32 nr_zones;
> +	sector_t capacity;
> +
> +	if (is_power_of_2(ns->zsze))
> +		return;
> +
> +	if (!get_capacity(ns->disk))
> +		return;
> +
> +	blk_mq_freeze_queue(ns->disk->queue);
> +
> +	blk_queue_po2_zone_emu(ns->queue, 1);
> +	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
> +	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
> +
> +	nr_zones = get_capacity(ns->disk) / ns->zsze;
> +	capacity = nr_zones * ns->zsze_po2;
> +	set_capacity_and_notify(ns->disk, capacity);
> +	ns->sect_to_lba = __nvme_sect_to_lba_po2;
> +	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
> +
> +	blk_mq_unfreeze_queue(ns->disk->queue);
> +}
> +
>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>  {
> @@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>  	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>  		c->zms.select_all = 1;
>  
> +	nvme_verify_sector_value(ns, req);
> +	return BLK_STS_OK;
> +}
> +
> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
> +{
> +	/*  The spec mentions that read from ZCAP until ZSZE shall behave
> +	 *  like a deallocated block. Deallocated block reads are
> +	 *  deterministic, hence we fill zero.
> +	 * The spec does not clearly define the result for other opreation.
> +	 */

Comment style and indentation is weird.
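
(For reference, the conventional kernel multi-line comment layout would be
along the lines of:)

	/*
	 * The spec says that reads from ZCAP up to ZSZE behave like reads of
	 * deallocated blocks, which are deterministic, hence the zero fill.
	 * The spec does not clearly define the result of other operations.
	 */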

> +	if (req_op(req) == REQ_OP_READ) {
> +		zero_fill_bio(req->bio);
> +		nvme_req(req)->status = NVME_SC_SUCCESS;
> +	} else {
> +		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
> +	}

What about requests that straddle the zone capacity ? They need to be
partially zeroed too, otherwise data from the next zone may be exposed.
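
(A rough sketch of the boundary arithmetic that would be needed, using the
fields added by this patch; only the check is shown, the partial zeroing
itself is left out:)

	/* start of this emulated zone and of its emulated gap, in 512B sectors */
	sector_t zone_start = zone_idx * ns->zsze_po2;
	sector_t gap_start  = zone_start + ns->zsze;
	sector_t req_start  = blk_rq_pos(req);
	sector_t req_end    = req_start + blk_rq_sectors(req);

	if (req_start < gap_start && req_end > gap_start) {
		/* the request straddles the gap: the part of the read in
		 * [gap_start, req_end) must be zeroed instead of being
		 * passed through unmodified */
	}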

> +	blk_mq_set_request_complete(req);
> +	nvme_complete_rq(req);
>  	return BLK_STS_OK;
>  }
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 3a41d50b85d3..9ec59183efcd 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
>  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
>  /* queue has elevator attached */
>  #define RQF_ELV			((__force req_flags_t)(1 << 22))
> +/* request to do IO in the emulated area with po2 zone emulation */
> +#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
>  
>  /* flags that prevent us from merging requests: */
>  #define RQF_NOMERGE_FLAGS \


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
  2022-03-08 16:53     ` [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device Pankaj Raghav
@ 2022-03-09  4:09       ` Damien Le Moal
  2022-03-09 14:42         ` Pankaj Raghav
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09  4:09 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 01:53, Pankaj Raghav wrote:
> power_of_2(PO2) emulation support is added to the null blk device to
> measure performance.
> 
> Callbacks are added in the hotpaths that require PO2 emulation specific
> behaviour to reduce the impact on the existing path.
> 
> The power_of_2 emulation support is wired up for both the request and
> the bio queue mode and it is automatically enabled when the given zone
> size is non-power_of_2.

This does not make any sense. Why would you want to add power of 2 zone
size emulation to nullblk ? Just set the zone size to be a power of 2...

If this is for test purpose, then use QEMU. These changes make no sense
to me here.

A change that would make sense in the context of this series is to allow
for setting a zoned null_blk device zone size to a non power of 2 size.
But this series does not actually deal with that. So do not touch this
driver please.

> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  drivers/block/null_blk/main.c     |   8 +-
>  drivers/block/null_blk/null_blk.h |  12 ++
>  drivers/block/null_blk/zoned.c    | 203 ++++++++++++++++++++++++++----
>  3 files changed, 196 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index 625a06bfa5ad..c926b59f2b17 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -210,7 +210,7 @@ MODULE_PARM_DESC(zoned, "Make device as a host-managed zoned block device. Defau
>  
>  static unsigned long g_zone_size = 256;
>  module_param_named(zone_size, g_zone_size, ulong, S_IRUGO);
> -MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Must be power-of-two: Default: 256");
> +MODULE_PARM_DESC(zone_size, "Zone size in MB when block device is zoned. Default: 256");
>  
>  static unsigned long g_zone_capacity;
>  module_param_named(zone_capacity, g_zone_capacity, ulong, 0444);
> @@ -1772,11 +1772,13 @@ static const struct block_device_operations null_bio_ops = {
>  	.owner		= THIS_MODULE,
>  	.submit_bio	= null_submit_bio,
>  	.report_zones	= null_report_zones,
> +	.npo2_zone_setup = null_po2_zone_emu_setup,
>  };
>  
>  static const struct block_device_operations null_rq_ops = {
>  	.owner		= THIS_MODULE,
>  	.report_zones	= null_report_zones,
> +	.npo2_zone_setup = null_po2_zone_emu_setup,
>  };
>  
>  static int setup_commands(struct nullb_queue *nq)
> @@ -1929,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
>  		dev->mbps = 0;
>  
>  	if (dev->zoned &&
> -	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> -		pr_err("zone_size must be power-of-two\n");
> +	    (!dev->zone_size)) {
> +		pr_err("zone_size must be zero\n");
>  		return -EINVAL;
>  	}
>  
> diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
> index 78eb56b0ca55..34c1b7b2546b 100644
> --- a/drivers/block/null_blk/null_blk.h
> +++ b/drivers/block/null_blk/null_blk.h
> @@ -74,6 +74,16 @@ struct nullb_device {
>  	unsigned int imp_close_zone_no;
>  	struct nullb_zone *zones;
>  	sector_t zone_size_sects;
> +	sector_t zone_size_po2_sects;
> +	sector_t zone_size_diff_sects;
> +	/* The callbacks below are used as hook to perform po2 emulation for a
> +	 * zoned device.
> +	 */
> +	unsigned int (*zone_no)(struct nullb_device *dev,
> +				sector_t sect);
> +	sector_t (*zone_update_sector)(struct nullb_device *dev, sector_t sect);
> +	sector_t (*zone_update_sector_append)(struct nullb_device *dev,
> +					      sector_t sect);
>  	bool need_zone_res_mgmt;
>  	spinlock_t zone_res_lock;
>  
> @@ -137,6 +147,7 @@ int null_register_zoned_dev(struct nullb *nullb);
>  void null_free_zoned_dev(struct nullb_device *dev);
>  int null_report_zones(struct gendisk *disk, sector_t sector,
>  		      unsigned int nr_zones, report_zones_cb cb, void *data);
> +void null_po2_zone_emu_setup(struct gendisk *disk);
>  blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd,
>  				    enum req_opf op, sector_t sector,
>  				    sector_t nr_sectors);
> @@ -166,5 +177,6 @@ static inline size_t null_zone_valid_read_len(struct nullb *nullb,
>  	return len;
>  }
>  #define null_report_zones	NULL
> +#define null_po2_zone_emu_setup	NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  #endif /* __NULL_BLK_H */
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index dae54dd1aeac..3bb63c170149 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -16,6 +16,44 @@ static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
>  	return sect >> ilog2(dev->zone_size_sects);
>  }
>  
> +static inline unsigned int null_npo2_zone_no(struct nullb_device *dev,
> +					     sector_t sect)
> +{
> +	return sect / dev->zone_size_sects;
> +}
> +
> +static inline bool null_is_po2_zone_emu(struct nullb_device *dev)
> +{
> +	return !!dev->zone_size_diff_sects;
> +}
> +
> +static inline sector_t null_zone_update_sector_noop(struct nullb_device *dev,
> +						    sector_t sect)
> +{
> +	return sect;
> +}
> +
> +static inline sector_t null_zone_update_sector_po2_emu(struct nullb_device *dev,
> +						       sector_t sect)
> +{
> +	sector_t zsze_po2 = dev->zone_size_po2_sects;
> +	sector_t zsze_diff = dev->zone_size_diff_sects;
> +	u32 zone_idx = sect >> ilog2(zsze_po2);
> +
> +	sect = sect - (zone_idx * zsze_diff);
> +	return sect;
> +}
> +
> +static inline sector_t
> +null_zone_update_sector_append_po2_emu(struct nullb_device *dev, sector_t sect)
> +{
> +	/* Need to readjust the sector if po2 emulation is used. */
> +	u32 zone_no = dev->zone_no(dev, sect);
> +
> +	sect += dev->zone_size_diff_sects * zone_no;
> +	return sect;
> +}
> +
>  static inline void null_lock_zone_res(struct nullb_device *dev)
>  {
>  	if (dev->need_zone_res_mgmt)
> @@ -62,15 +100,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	sector_t sector = 0;
>  	unsigned int i;
>  
> -	if (!is_power_of_2(dev->zone_size)) {
> -		pr_err("zone_size must be power-of-two\n");
> -		return -EINVAL;
> -	}
>  	if (dev->zone_size > dev->size) {
>  		pr_err("Zone size larger than device capacity\n");
>  		return -EINVAL;
>  	}
>  
> +	if (!is_power_of_2(dev->zone_size))
> +		pr_info("zone_size is not power-of-two. power-of-two emulation is enabled");
> +
>  	if (!dev->zone_capacity)
>  		dev->zone_capacity = dev->zone_size;
>  
> @@ -83,8 +120,14 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
>  	zone_capacity_sects = mb_to_sects(dev->zone_capacity);
>  	dev_capacity_sects = mb_to_sects(dev->size);
>  	dev->zone_size_sects = mb_to_sects(dev->zone_size);
> -	dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> -		>> ilog2(dev->zone_size_sects);
> +
> +	dev->nr_zones = roundup(dev_capacity_sects, dev->zone_size_sects) /
> +			dev->zone_size_sects;
> +
> +	dev->zone_no = null_zone_no;
> +	dev->zone_update_sector = null_zone_update_sector_noop;
> +	dev->zone_update_sector_append = null_zone_update_sector_noop;
> +
>  
>  	dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
>  				    GFP_KERNEL | __GFP_ZERO);
> @@ -166,7 +209,13 @@ int null_register_zoned_dev(struct nullb *nullb)
>  		if (ret)
>  			return ret;
>  	} else {
> -		blk_queue_chunk_sectors(q, dev->zone_size_sects);
> +		nullb->disk->fops->npo2_zone_setup(nullb->disk);
> +
> +		if (null_is_po2_zone_emu(dev))
> +			blk_queue_chunk_sectors(q, dev->zone_size_po2_sects);
> +		else
> +			blk_queue_chunk_sectors(q, dev->zone_size_sects);
> +
>  		q->nr_zones = blkdev_nr_zones(nullb->disk);
>  	}
>  
> @@ -183,17 +232,49 @@ void null_free_zoned_dev(struct nullb_device *dev)
>  	dev->zones = NULL;
>  }
>  
> +static void null_update_zone_info(struct nullb *nullb, struct blk_zone *blkz,
> +				  struct nullb_zone *zone)
> +{
> +	unsigned int zone_idx;
> +	u64 zone_gap;
> +	struct nullb_device *dev = nullb->dev;
> +
> +	if (null_is_po2_zone_emu(dev)) {
> +		zone_idx = zone->start / zone->len;
> +		zone_gap = zone_idx * dev->zone_size_diff_sects;
> +
> +		blkz->start = zone->start + zone_gap;
> +		blkz->len = dev->zone_size_po2_sects;
> +		blkz->wp = zone->wp + zone_gap;
> +	} else {
> +		blkz->start = zone->start;
> +		blkz->len = zone->len;
> +		blkz->wp = zone->wp;
> +	}
> +
> +	blkz->type = zone->type;
> +	blkz->cond = zone->cond;
> +	blkz->capacity = zone->capacity;
> +}
> +
>  int null_report_zones(struct gendisk *disk, sector_t sector,
>  		unsigned int nr_zones, report_zones_cb cb, void *data)
>  {
>  	struct nullb *nullb = disk->private_data;
>  	struct nullb_device *dev = nullb->dev;
>  	unsigned int first_zone, i;
> +	sector_t zone_size;
>  	struct nullb_zone *zone;
>  	struct blk_zone blkz;
>  	int error;
>  
> -	first_zone = null_zone_no(dev, sector);
> +	if (null_is_po2_zone_emu(dev))
> +		zone_size = dev->zone_size_po2_sects;
> +	else
> +		zone_size = dev->zone_size_sects;
> +
> +	first_zone = sector / zone_size;
> +
>  	if (first_zone >= dev->nr_zones)
>  		return 0;
>  
> @@ -210,12 +291,7 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
>  		 * array.
>  		 */
>  		null_lock_zone(dev, zone);
> -		blkz.start = zone->start;
> -		blkz.len = zone->len;
> -		blkz.wp = zone->wp;
> -		blkz.type = zone->type;
> -		blkz.cond = zone->cond;
> -		blkz.capacity = zone->capacity;
> +		null_update_zone_info(nullb, &blkz, zone);
>  		null_unlock_zone(dev, zone);
>  
>  		error = cb(&blkz, i, data);
> @@ -226,6 +302,35 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
>  	return nr_zones;
>  }
>  
> +void null_po2_zone_emu_setup(struct gendisk *disk)
> +{
> +	struct nullb *nullb = disk->private_data;
> +	struct nullb_device *dev = nullb->dev;
> +	sector_t capacity;
> +
> +	if (is_power_of_2(dev->zone_size_sects))
> +		return;
> +
> +	if (!get_capacity(disk))
> +		return;
> +
> +	blk_mq_freeze_queue(disk->queue);
> +
> +	blk_queue_po2_zone_emu(disk->queue, 1);
> +	dev->zone_size_po2_sects =
> +		1 << get_count_order_long(dev->zone_size_sects);
> +	dev->zone_size_diff_sects =
> +		dev->zone_size_po2_sects - dev->zone_size_sects;
> +	dev->zone_no = null_npo2_zone_no;
> +	dev->zone_update_sector = null_zone_update_sector_po2_emu;
> +	dev->zone_update_sector_append = null_zone_update_sector_append_po2_emu;
> +
> +	capacity = dev->nr_zones * dev->zone_size_po2_sects;
> +	set_capacity(disk, capacity);
> +
> +	blk_mq_unfreeze_queue(disk->queue);
> +}
> +
>  /*
>   * This is called in the case of memory backing from null_process_cmd()
>   * with the target zone already locked.
> @@ -234,7 +339,7 @@ size_t null_zone_valid_read_len(struct nullb *nullb,
>  				sector_t sector, unsigned int len)
>  {
>  	struct nullb_device *dev = nullb->dev;
> -	struct nullb_zone *zone = &dev->zones[null_zone_no(dev, sector)];
> +	struct nullb_zone *zone = &dev->zones[dev->zone_no(dev, sector)];
>  	unsigned int nr_sectors = len >> SECTOR_SHIFT;
>  
>  	/* Read must be below the write pointer position */
> @@ -363,11 +468,24 @@ static blk_status_t null_check_zone_resources(struct nullb_device *dev,
>  	}
>  }
>  
> +static void null_update_sector_append(struct nullb_cmd *cmd, sector_t sector)
> +{
> +	struct nullb_device *dev = cmd->nq->dev;
> +
> +	sector = dev->zone_update_sector_append(dev, sector);
> +
> +	if (cmd->bio) {
> +		cmd->bio->bi_iter.bi_sector = sector;
> +	} else {
> +		cmd->rq->__sector = sector;
> +	}
> +}
> +
>  static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
>  				    unsigned int nr_sectors, bool append)
>  {
>  	struct nullb_device *dev = cmd->nq->dev;
> -	unsigned int zno = null_zone_no(dev, sector);
> +	unsigned int zno = dev->zone_no(dev, sector);
>  	struct nullb_zone *zone = &dev->zones[zno];
>  	blk_status_t ret;
>  
> @@ -395,10 +513,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
>  	 */
>  	if (append) {
>  		sector = zone->wp;
> -		if (cmd->bio)
> -			cmd->bio->bi_iter.bi_sector = sector;
> -		else
> -			cmd->rq->__sector = sector;
> +		null_update_sector_append(cmd, sector);
>  	} else if (sector != zone->wp) {
>  		ret = BLK_STS_IOERR;
>  		goto unlock;
> @@ -619,7 +734,7 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
>  		return BLK_STS_OK;
>  	}
>  
> -	zone_no = null_zone_no(dev, sector);
> +	zone_no = dev->zone_no(dev, sector);
>  	zone = &dev->zones[zone_no];
>  
>  	null_lock_zone(dev, zone);
> @@ -650,13 +765,54 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_opf op,
>  	return ret;
>  }
>  
> +static blk_status_t null_handle_po2_zone_emu_violation(struct nullb_cmd *cmd,
> +						       enum req_opf op)
> +{
> +	if (op == REQ_OP_READ) {
> +		if (cmd->bio)
> +			zero_fill_bio(cmd->bio);
> +		else
> +			zero_fill_bio(cmd->rq->bio);
> +
> +		return BLK_STS_OK;
> +	} else {
> +		return BLK_STS_IOERR;
> +	}
> +}
> +
> +static bool null_verify_sector_violation(struct nullb_device *dev,
> +					 sector_t sector)
> +{
> +	if (unlikely(null_is_po2_zone_emu(dev))) {
> +		/* The zone idx should be calculated based on the emulated
> +		 * layout
> +		 */
> +		u32 zone_idx = sector >> ilog2(dev->zone_size_po2_sects);
> +		sector_t zsze = dev->zone_size_sects;
> +		sector_t sect = null_zone_update_sector_po2_emu(dev, sector);
> +
> +		if (sect - (zone_idx * zsze) > zsze)
> +			return true;
> +	}
> +	return false;
> +}
> +
>  blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
>  				    sector_t sector, sector_t nr_sectors)
>  {
> -	struct nullb_device *dev;
> +	struct nullb_device *dev = cmd->nq->dev;
>  	struct nullb_zone *zone;
>  	blk_status_t sts;
>  
> +	/* Handle when the sector falls in the emulated area */
> +	if (unlikely(null_verify_sector_violation(dev, sector)))
> +		return null_handle_po2_zone_emu_violation(cmd, op);
> +
> +	/* The sector value is updated if po2 emulation is enabled, else it
> +	 * will have no effect on the value
> +	 */
> +	sector = dev->zone_update_sector(dev, sector);
> +
>  	switch (op) {
>  	case REQ_OP_WRITE:
>  		return null_zone_write(cmd, sector, nr_sectors, false);
> @@ -669,8 +825,7 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_opf op,
>  	case REQ_OP_ZONE_FINISH:
>  		return null_zone_mgmt(cmd, op, sector);
>  	default:
> -		dev = cmd->nq->dev;
> -		zone = &dev->zones[null_zone_no(dev, sector)];
> +		zone = &dev->zones[dev->zone_no(dev, sector)];
>  
>  		null_lock_zone(dev, zone);
>  		sts = null_process_cmd(cmd, op, sector, nr_sectors);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-09  3:40       ` Damien Le Moal
@ 2022-03-09 13:19         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 13:19 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:40, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>>  	nr_zones = min_t(unsigned int, nr_zones,
>> -			 get_capacity(ns->disk) >> ilog2(ns->zsze));
>> +			 get_capacity(ns->disk) / ns->zsze);
> 
> This will not compile on 32-bits arch. This needs to use div64_u64().
> 
Oops. I will fix that up in the next revision and also in other places
that do not use div64_u64. Thanks.
> 
> 

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
  2022-03-09  3:44       ` Damien Le Moal
@ 2022-03-09 13:35         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 13:35 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:44, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>>  
>>  	ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
>> -	if (!is_power_of_2(ns->zsze)) {
>> -		dev_warn(ns->ctrl->device,
>> -			"invalid zone size:%llu for namespace:%u\n",
>> -			ns->zsze, ns->head->ns_id);
>> -		status = -ENODEV;
>> -		goto free_data;
>> -	}
> 
> Doing this will allow a non power of 2 zone sized device to be seen by
> the block layer. This will break functions such as blkdev_nr_zones(), but
> this patch does not change these functions, nor the others that use bit
> shift calculations.

The goal of this patchset was to emulate a po2 zone size for a npo2 device to the
block layer. If you see the `npo2_zone_setup` callback in the NVMe driver (patch 4/6),
we do the following:
```
+   ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
+   capacity = nr_zones * ns->zsze_po2;
+   set_capacity_and_notify(ns->disk, capacity);
```
So we adapt the capacity of the disk based on the po2 zone size. The chunk sectors
are also set to this new po2 zone size. Therefore, all the block layer functions will
continue to work as the block layer sees the zone size of the device to be ns->zsze_po2 and
not the actual device zone size which is ns->zsze.
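
(A small worked example with hypothetical numbers, just to illustrate the
remapping:)

	/* device zone size 96M, emulated zone size 128M, in 512B sectors */
	ns->zsze      = 196608;   /*  96M */
	ns->zsze_po2  = 262144;   /* 128M */
	ns->zsze_diff =  65536;   /*  32M */

	/* a block layer sector in zone 3, e.g. 3 * 262144 + 100, maps to
	 * device sector 3 * 262144 + 100 - 3 * 65536 = 3 * 196608 + 100 */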

Changing functions such as blkdev_nr_zones that use po2 calculations will/should be dealt
with separately if we decide to relax the po2 constraint in the block layer.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops
  2022-03-09  3:46       ` Damien Le Moal
@ 2022-03-09 14:02         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:02 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 04:46, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>> A new fops is added to the block device which will be used to set up the
>> necessary hooks when a non-power_of_2 zone size is detected in a zoned
>> device.
>>
>> This fops will be called as a part of blk_revalidate_disk_zones.
> 
> And what does this new hook do ? You are actually not explaining it, nor
> why it should be called from blk_revalidate_disk_zones().
I should have elaborated the "why" and "how" a bit more in my commit log.
I will fix it in my next revision.

The main reason it was added and is called as part of blk_revalidate_disk_zones
is this: as the block layer expects zone sizes to be po2, this fops can be used
by the driver to configure a npo2 device to be presented as a po2 device before
parameters such as the zone bitmaps and chunk sectors are set.

> Also, blk_revalidate_zone_cb() uses a bit shift, but neither this patch nor
> the previous one fixes that.
> 
The answer I gave to the blkdev_nr_zones question for patch 1/6 applies here as well.
The zone size used by blk_revalidate_zone_cb will be the emulated po2 zone size
and not the actual device zone size which is npo2. So the math currently used in
blk_revalidate_zone_cb is still applicable.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09  4:04       ` Damien Le Moal
@ 2022-03-09 14:33         ` Pankaj Raghav
  2022-03-09 21:43           ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:33 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 05:04, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
 
> Contradiction: reducing the impact of performance regression is not the
> same as "does not create a performance regression". So which one is it ?
> Please add performance numbers to this commit message.

> And also please actually explain what the patch is changing. This commit
> message is about the why, but nothing on the how.
>
I will reword and add a bit more context to the commit log with perf numbers
in the next revision.
>> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
>> +
> 
> This should go in zns.c, not in the core.
> 
Ok.

>> +
>> +static void nvme_npo2_zone_setup(struct gendisk *disk)
>> +{
>> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
>> +}
> 
> This helper seems useless.
>
I tried to retain the pattern with report_zones which is currently like this:
static int nvme_report_zones(struct gendisk *disk, sector_t sector,
		unsigned int nr_zones, report_zones_cb cb, void *data)
{
	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
			data);
}

But I can just remove this helper and use nvme_ns_po2_zone_emu_setup cb directly
in nvme_bdev_fops.

>> +
>>  /*
>>   * Convert a 512B sector number to a device logical block number.
>>   */
>>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>>  {
>> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +	return ns->sect_to_lba(ns, sector);
> 
> So for a power of 2 zone sized device, you are forcing an indirect call,
> always. Not acceptable. What is the point of that po2_zone_emu boolean
> you added to the queue ?
This is a good point and we had a discussion about this internally.
Initially I had something like this:
if (!blk_queue_is_po2_zone_emu(disk))
	return sector >> (ns->lba_shift - SECTOR_SHIFT);
else
	return __nvme_sect_to_lba_po2(ns, sec);

But @Luis indicated that it was better to set an op, which comes at the cost of an indirection,
instead of having a runtime check with an if/else in the **hot path**. The code also looks
clearer with an op.

The performance analysis that we performed did not show any regression while using the indirection
for po2 zone sizes, at least on the x86_64 architecture.
So maybe we could use this opportunity to discuss which approach could be used here.
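
(For comparison, a minimal sketch of the branch-based variant with a hint,
assuming the emulation case is the rare one; this is not what the patch
currently does:)

	static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
	{
		if (unlikely(blk_queue_is_po2_zone_emu(ns->queue)))
			return __nvme_sect_to_lba_po2(ns, sector);

		return sector >> (ns->lba_shift - SECTOR_SHIFT);
	}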


>> +
>> +		sector = nvme_lba_to_sect(ns,
>> +					  le64_to_cpu(nvme_req(req)->result.u64));
>> +
>> +		sector = ns->update_sector_append(ns, sector);
> 
> Why not assign that to req->__sector directly ?
> And again here, you are forcing the indirect function call for *all* zns
> devices, even those that have a power of 2 zone size.
>
Same answer as above. I will adapt them based on the outcome of our
discussions.
>> +
>> +		req->__sector = sector;
>> +	}
>> +
>> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
>> +{
>> +	return req->rq_flags & RQF_ZONE_EMU_VIOLATION;
>> +}
> 
> This helper makes the code unreadable in my opinion.
>
I will open code it then.
>> +#else
>> +static inline void nvme_end_req_zoned(struct request *req)
>> +{
>> +}
>> +
>> +static inline void nvme_verify_sector_value(struct nvme_ns *ns, struct request *req)
>> +{
>> +}
>> +
>> +static inline bool nvme_po2_zone_emu_violation(struct request *req)
>> +{
>> +	return false;
>> +}
>> +
>> +#endif
>> +
>>  /*
>>   * Convert byte length to nvme's 0-based num dwords
>>   */
>> @@ -752,6 +842,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>  long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>>  		unsigned long arg);
>>  int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
>> +blk_status_t nvme_fail_po2_zone_emu_violation(struct request *req);
>>  
>>  extern const struct attribute_group *nvme_ns_id_attr_groups[];
>>  extern const struct pr_ops nvme_pr_ops;
>> @@ -873,11 +964,13 @@ static inline void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>  int nvme_revalidate_zones(struct nvme_ns *ns);
>>  int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  		unsigned int nr_zones, report_zones_cb cb, void *data);
>> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns);
>>  #ifdef CONFIG_BLK_DEV_ZONED
>>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf);
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  				       struct nvme_command *cmnd,
>>  				       enum nvme_zone_mgmt_action action);
>> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req);
>>  #else
>>  static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>>  		struct request *req, struct nvme_command *cmnd,
>> @@ -892,6 +985,11 @@ static inline int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>>  		 "Please enable CONFIG_BLK_DEV_ZONED to support ZNS devices\n");
>>  	return -EPROTONOSUPPORT;
>>  }
>> +
>> +static inline blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
>> +{
>> +	return BLK_STS_OK;
>> +}
>>  #endif
>>  
>>  static inline struct nvme_ns *nvme_get_ns_from_dev(struct device *dev)
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 6a99ed680915..fc022df3f98e 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -960,6 +960,10 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  		return nvme_fail_nonready_command(&dev->ctrl, req);
>>  
>>  	ret = nvme_prep_rq(dev, req);
>> +
>> +	if (unlikely(nvme_po2_zone_emu_violation(req)))
>> +		return nvme_fail_po2_zone_emu_violation(req);
>> +
>>  	if (unlikely(ret))
>>  		return ret;
>>  	spin_lock(&nvmeq->sq_lock);
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index ad02c61c0b52..25516a5ae7e2 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -3,7 +3,9 @@
>>   * Copyright (C) 2020 Western Digital Corporation or its affiliates.
>>   */
>>  
>> +#include <linux/log2.h>
>>  #include <linux/blkdev.h>
>> +#include <linux/math.h>
>>  #include <linux/vmalloc.h>
>>  #include "nvme.h"
>>  
>> @@ -46,6 +48,18 @@ static int nvme_set_max_append(struct nvme_ctrl *ctrl)
>>  	return 0;
>>  }
>>  
>> +static sector_t nvme_zone_size(struct nvme_ns *ns)
>> +{
>> +	sector_t zone_size;
>> +
>> +	if (blk_queue_is_po2_zone_emu(ns->queue))
>> +		zone_size = ns->zsze_po2;
>> +	else
>> +		zone_size = ns->zsze;
>> +
>> +	return zone_size;
>> +}
>> +
>>  int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
>>  {
>>  	struct nvme_effects_log *log = ns->head->effects;
>> @@ -122,7 +136,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
>>  				   sizeof(struct nvme_zone_descriptor);
>>  
>>  	nr_zones = min_t(unsigned int, nr_zones,
>> -			 get_capacity(ns->disk) / ns->zsze);
>> +			 get_capacity(ns->disk) / nvme_zone_size(ns));
>>  
>>  	bufsize = sizeof(struct nvme_zone_report) +
>>  		nr_zones * sizeof(struct nvme_zone_descriptor);
>> @@ -147,6 +161,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  				 void *data)
>>  {
>>  	struct blk_zone zone = { };
>> +	u64 zone_gap = 0;
>> +	u32 zone_idx;
>>  
>>  	if ((entry->zt & 0xf) != NVME_ZONE_TYPE_SEQWRITE_REQ) {
>>  		dev_err(ns->ctrl->device, "invalid zone type %#x\n",
>> @@ -159,10 +175,19 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
>>  	zone.len = ns->zsze;
>>  	zone.capacity = nvme_lba_to_sect(ns, le64_to_cpu(entry->zcap));
>>  	zone.start = nvme_lba_to_sect(ns, le64_to_cpu(entry->zslba));
>> +
>> +	if (blk_queue_is_po2_zone_emu(ns->queue)) {
>> +		zone_idx = zone.start / zone.len;
>> +		zone_gap = zone_idx * ns->zsze_diff;
>> +		zone.start += zone_gap;
>> +		zone.len = ns->zsze_po2;
>> +	}
>> +
>>  	if (zone.cond == BLK_ZONE_COND_FULL)
>>  		zone.wp = zone.start + zone.len;
>>  	else
>> -		zone.wp = nvme_lba_to_sect(ns, le64_to_cpu(entry->wp));
>> +		zone.wp =
>> +			nvme_lba_to_sect(ns, le64_to_cpu(entry->wp)) + zone_gap;
>>  
>>  	return cb(&zone, idx, data);
>>  }
>> @@ -173,6 +198,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	struct nvme_zone_report *report;
>>  	struct nvme_command c = { };
>>  	int ret, zone_idx = 0;
>> +	u64 zone_size = nvme_zone_size(ns);
>>  	unsigned int nz, i;
>>  	size_t buflen;
>>  
>> @@ -190,7 +216,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>  	c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>  
>> -	sector &= ~(ns->zsze - 1);
>> +	sector = rounddown(sector, zone_size);
>>  	while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>  		memset(report, 0, buflen);
>>  
>> @@ -214,7 +240,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  			zone_idx++;
>>  		}
>>  
>> -		sector += ns->zsze * nz;
>> +		sector += zone_size * nz;
>>  	}
>>  
>>  	if (zone_idx > 0)
>> @@ -226,6 +252,32 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>  	return ret;
>>  }
>>  
>> +void nvme_ns_po2_zone_emu_setup(struct nvme_ns *ns)
>> +{
>> +	u32 nr_zones;
>> +	sector_t capacity;
>> +
>> +	if (is_power_of_2(ns->zsze))
>> +		return;
>> +
>> +	if (!get_capacity(ns->disk))
>> +		return;
>> +
>> +	blk_mq_freeze_queue(ns->disk->queue);
>> +
>> +	blk_queue_po2_zone_emu(ns->queue, 1);
>> +	ns->zsze_po2 = 1 << get_count_order_long(ns->zsze);
>> +	ns->zsze_diff = ns->zsze_po2 - ns->zsze;
>> +
>> +	nr_zones = get_capacity(ns->disk) / ns->zsze;
>> +	capacity = nr_zones * ns->zsze_po2;
>> +	set_capacity_and_notify(ns->disk, capacity);
>> +	ns->sect_to_lba = __nvme_sect_to_lba_po2;
>> +	ns->update_sector_append = nvme_update_sector_append_po2_zone_emu;
>> +
>> +	blk_mq_unfreeze_queue(ns->disk->queue);
>> +}
>> +
>>  blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  		struct nvme_command *c, enum nvme_zone_mgmt_action action)
>>  {
>> @@ -239,5 +291,24 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>  	if (req_op(req) == REQ_OP_ZONE_RESET_ALL)
>>  		c->zms.select_all = 1;
>>  
>> +	nvme_verify_sector_value(ns, req);
>> +	return BLK_STS_OK;
>> +}
>> +
>> +blk_status_t nvme_zone_handle_po2_emu_violation(struct request *req)
>> +{
>> +	/*  The spec mentions that read from ZCAP until ZSZE shall behave
>> +	 *  like a deallocated block. Deallocated block reads are
>> +	 *  deterministic, hence we fill zero.
>> +	 * The spec does not clearly define the result for other opreation.
>> +	 */
> 
> Comment style and indentation is weird.
>
Ack.
>> +	if (req_op(req) == REQ_OP_READ) {
>> +		zero_fill_bio(req->bio);
>> +		nvme_req(req)->status = NVME_SC_SUCCESS;
>> +	} else {
>> +		nvme_req(req)->status = NVME_SC_WRITE_FAULT;
>> +	}
> 
> What about requests that straddle the zone capacity ? They need to be
> partially zeroed too, otherwise data from the next zone may be exposed.
>
Good point. I will add this support in the next revision. Thanks.
>> +	blk_mq_set_request_complete(req);
>> +	nvme_complete_rq(req);
>>  	return BLK_STS_OK;
>>  }
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 3a41d50b85d3..9ec59183efcd 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -57,6 +57,8 @@ typedef __u32 __bitwise req_flags_t;
>>  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
>>  /* queue has elevator attached */
>>  #define RQF_ELV			((__force req_flags_t)(1 << 22))
>> +/* request to do IO in the emulated area with po2 zone emulation */
>> +#define RQF_ZONE_EMU_VIOLATION	((__force req_flags_t)(1 << 23))
>>  
>>  /* flags that prevent us from merging requests: */
>>  #define RQF_NOMERGE_FLAGS \
> 
> 

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device
  2022-03-09  4:09       ` Damien Le Moal
@ 2022-03-09 14:42         ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-09 14:42 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme



On 2022-03-09 05:09, Damien Le Moal wrote:
> On 3/9/22 01:53, Pankaj Raghav wrote:
>> power_of_2(PO2) emulation support is added to the null blk device to
>> measure performance.
>>
>> Callbacks are added in the hotpaths that require PO2 emulation specific
>> behaviour to reduce the impact on the existing path.
>>
>> The power_of_2 emulation support is wired up for both the request and
>> the bio queue mode and it is automatically enabled when the given zone
>> size is non-power_of_2.
> 
> This does not make any sense. Why would you want to add power of 2 zone
> size emulation to nullblk ? Just set the zone size to be a power of 2...
> 
> If this is for test purpose, then use QEMU. These changes make no sense
> to me here.
> 
I see your point but this was mainly added to measure the performance impact.

I ran the conformance test with different configurations in QEMU, but I don't
think QEMU would be the preferred option to measure performance, especially if we
want to account for the changes we made to the hot path with an indirection.

As ZNS drives are not available in retail stores, this patch also provides a way
for the community to reproduce the performance analysis that we did without needing
a real device.

> A change that would make sense in the context of this series is to allow
> for setting a zoned null_blk device zone size to a non power of 2 size.
This is not possible as long as the block layer expects zone sizes to be po2.
A null blk device with non po2 zone size will only work with the emulation that is
added as a part of this patch.

As I said before, once we relax the block layer requirement, then we could allow
non po2 zone sizes without a lot of changes to the null blk device.
> But this series does not actually deal with that. So do not touch this
> driver please.
> 
If you really think it doesn't belong here, then I can take it out in the next
revision.
-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09 14:33         ` Pankaj Raghav
@ 2022-03-09 21:43           ` Damien Le Moal
  2022-03-10 20:35             ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-09 21:43 UTC (permalink / raw)
  To: Pankaj Raghav, Luis Chamberlain, Adam Manzanares,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Matias Bjørling,
	jiangbo.365
  Cc: Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On 3/9/22 23:33, Pankaj Raghav wrote:
> 
> 
> On 2022-03-09 05:04, Damien Le Moal wrote:
>> On 3/9/22 01:53, Pankaj Raghav wrote:
>  
>> Contradiction: reducing the impact of performance regression is not the
>> same as "does not create a performance regression". So which one is it ?
>> Please add performance numbers to this commit message.
> 
>> And also please actually explain what the patch is changing. This commit
>> message is about the why, but nothing on the how.
>>
> I will reword and add a bit more context to the commit log with perf numbers
> in the next revision
>>> +EXPORT_SYMBOL_GPL(nvme_fail_po2_zone_emu_violation);
>>> +
>>
>> This should go in zns.c, not in the core.
>>
> Ok.
> 
>>> +
>>> +static void nvme_npo2_zone_setup(struct gendisk *disk)
>>> +{
>>> +	nvme_ns_po2_zone_emu_setup(disk->private_data);
>>> +}
>>
>> This helper seems useless.
>>
> I tried to retain the pattern with report_zones which is currently like this:
> static int nvme_report_zones(struct gendisk *disk, sector_t sector,
> 		unsigned int nr_zones, report_zones_cb cb, void *data)
> {
> 	return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
> 			data);
> }
> 
> But I can just remove this helper and use nvme_ns_po2_zone_emu_setup cb directly
> in nvme_bdev_fops.
> 
>>> +
>>>  /*
>>>   * Convert a 512B sector number to a device logical block number.
>>>   */
>>>  static inline u64 nvme_sect_to_lba(struct nvme_ns *ns, sector_t sector)
>>>  {
>>> -	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> +	return ns->sect_to_lba(ns, sector);
>>
>> So for a power of 2 zone sized device, you are forcing an indirect call,
>> always. Not acceptable. What is the point of that po2_zone_emu boolean
>> you added to the queue ?
> This is a good point and we had a discussion about this internally.
> Initially I had something like this:
> if (!blk_queue_is_po2_zone_emu(disk))
> 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> else
> 	return __nvme_sect_to_lba_po2(ns, sec);

No need for the else.
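I.e. simply (keeping the names from your snippet):

	if (!blk_queue_is_po2_zone_emu(disk))
		return sector >> (ns->lba_shift - SECTOR_SHIFT);

	return __nvme_sect_to_lba_po2(ns, sector);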

> 
> But @Luis indicated that it was better to set an op which comes at a cost of indirection
> instead of having a runtime check with an if/else in the **hot path**. The code also looks
> more clear with having an op.

The indirect call using a function pointer makes the code obscure. And
the cost of that call is far greater and always present compared to the
CPU branch prediction, which will most of the time avoid taking
the wrong branch of an if.

> 
> The performance analysis that we performed did not show any regression while using the indirection
> for po2 zone sizes, at least on the x86_64 architecture.
> So maybe we could use this opportunity to discuss which approach could be used here.

The easiest one that makes the code clear and easy to understand.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-08 16:53 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Pankaj Raghav
                     ` (5 preceding siblings ...)
       [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
@ 2022-03-10  9:47   ` Christoph Hellwig
  2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 17:38     ` Adam Manzanares
  6 siblings, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-10  9:47 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

This is complete bonkers.  IFF we have a good reason to support non
power of two zones size (and I'd like to see evidence for that) we'll
need to go through all the layers to support it.  But doing this emulation
is just idiotic and will add tons of code just to completely confuse users.

On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
> 
> #Motivation:
> There are currently ZNS drives that are produced and deployed that do
> not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> specify the PO2 requirement but the linux block layer currently checks
> for zoned devices to have power_of_2 zone sizes.

Well, apparently whoever produces these drives never cared about supporting
Linux as the power of two requirement goes back to SMR HDDs, which also
don't have that requirement in the spec (and even allow non-uniform zone
size), but Linux decided that we want this for sanity.

Do these drives even support Zone Append?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
@ 2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 13:07       ` Matias Bjørling
  2022-03-10 14:44       ` Christoph Hellwig
  2022-03-10 17:38     ` Adam Manzanares
  1 sibling, 2 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-10 12:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal



On 2022-03-10 10:47, Christoph Hellwig wrote:
> This is complete bonkers.  IFF we have a good reason to support non
> power of two zones size (and I'd like to see evidence for that) we'll

non power of 2 support is important to the users, and that is why we
started this effort. I have also CCed Bo from Bytedance based
on their request.

> need to go through all the layers to support it.  But doing this emulation
> is just idiotic and will at tons of code just to completely confuse users.
> 

I agree with your point to create the non power of 2 support through all
the layers but this is the first step.

One of the early feedback that we got from Damien is to not break the
existing kernel and userspace applications that are written with the po2
assumption.

The following are the steps we have in the pipeline:
- Remove the constraint in the block layer
- Start migrating the kernel applications such as btrfs so that they also
work on non power of 2 devices.

Of course, we wanted to post RFCs to the steps mentioned above so that
there could be a public discussion about the issues.

> Well, apparently whoever produces these drives never cared about supporting
> Linux as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> size), but Linux decided that we want this for sanity.
> 
> Do these drives even support Zone Append?

Yes, these drives are intended for Linux users that would use the zoned
block device. Append is supported but holes in the LBA space (due to
diff in zone cap and zone size) are still a problem for these users.
-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 12:57     ` Pankaj Raghav
@ 2022-03-10 13:07       ` Matias Bjørling
  2022-03-10 13:14         ` Javier González
  2022-03-10 14:44       ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-10 13:07 UTC (permalink / raw)
  To: Pankaj Raghav, Christoph Hellwig
  Cc: Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme, Damien Le Moal

> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported but holes in the LBA space (due to diff in
> zone cap and zone size) is still a problem for these users.

With respect to the specific users, what does it break specifically? What key features are they missing when there are holes?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 13:07       ` Matias Bjørling
@ 2022-03-10 13:14         ` Javier González
  2022-03-10 14:58           ` Matias Bjørling
  0 siblings, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-10 13:14 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal



> On 10 Mar 2022, at 14.07, Matias Bjørling <matias.bjorling@wdc.com> wrote:
> 
> 
>> 
>> Yes, these drives are intended for Linux users that would use the zoned
>> block device. Append is supported but holes in the LBA space (due to diff in
>> zone cap and zone size) is still a problem for these users.
> 
> With respect to the specific users, what does it break specifically? What are key features are they missing when there's holes? 

What we hear is that it breaks existing mappings in applications, where the address space is seen as contiguous; with holes they need to account for the unmapped space. This affects performance and CPU due to unnecessary splits. This is for both reads and writes.

For more details, I guess they will have to jump in and share the parts that they consider is proper to share in the mailing list. 

I guess we will have more conversations around this as we push the block layer changes after this series. 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 12:57     ` Pankaj Raghav
  2022-03-10 13:07       ` Matias Bjørling
@ 2022-03-10 14:44       ` Christoph Hellwig
  2022-03-11 20:19         ` Luis Chamberlain
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-10 14:44 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Christoph Hellwig, Luis Chamberlain, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Keith Busch, Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> Yes, these drives are intended for Linux users that would use the zoned
> block device. Append is supported but holes in the LBA space (due to
> diff in zone cap and zone size) is still a problem for these users.

I'd really like to hear from the users.  Because really, either they
should use a proper file system abstraction (including zonefs if that is
all they need), or raw nvme passthrough which will alredy work for this
case.  But adding a whole bunch of crap because people want to use the
block device special file for something it is not designed for just
does not make any sense.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 13:14         ` Javier González
@ 2022-03-10 14:58           ` Matias Bjørling
  2022-03-10 15:07             ` Keith Busch
  2022-03-10 15:13             ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-10 14:58 UTC (permalink / raw)
  To: Javier González
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

 >> Yes, these drives are intended for Linux users that would use the
> >> zoned block device. Append is supported but holes in the LBA space
> >> (due to diff in zone cap and zone size) is still a problem for these users.
> >
> > With respect to the specific users, what does it break specifically? What are
> key features are they missing when there's holes?
> 
> What we hear is that it breaks existing mapping in applications, where the
> address space is seen as contiguous; with holes it needs to account for the
> unmapped space. This affects performance and and CPU due to unnecessary
> splits. This is for both reads and writes.
> 
> For more details, I guess they will have to jump in and share the parts that
> they consider is proper to share in the mailing list.
> 
> I guess we will have more conversations around this as we push the block
> layer changes after this series.

Ok, so I hear that one issue is I/O splits - if I assume that reads are sequential and zone cap/size is between 100MiB and 1GiB, then my gut feeling would tell me it's less CPU intensive to split every 100MiB to 1GiB of reads than it would be to not have power of 2 zones due to the extra per-IO calculations.

Do I have a faulty assumption about the above, or is there more to it?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:58           ` Matias Bjørling
@ 2022-03-10 15:07             ` Keith Busch
  2022-03-10 15:16               ` Javier González
  2022-03-10 15:13             ` Javier González
  1 sibling, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-10 15:07 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>  >> Yes, these drives are intended for Linux users that would use the
> > >> zoned block device. Append is supported but holes in the LBA space
> > >> (due to diff in zone cap and zone size) is still a problem for these users.
> > >
> > > With respect to the specific users, what does it break specifically? What are
> > key features are they missing when there's holes?
> > 
> > What we hear is that it breaks existing mapping in applications, where the
> > address space is seen as contiguous; with holes it needs to account for the
> > unmapped space. This affects performance and and CPU due to unnecessary
> > splits. This is for both reads and writes.
> > 
> > For more details, I guess they will have to jump in and share the parts that
> > they consider is proper to share in the mailing list.
> > 
> > I guess we will have more conversations around this as we push the block
> > layer changes after this series.
> 
> Ok, so I hear that one issue is I/O splits - If I assume that reads
> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
> feeling would tell me its less CPU intensive to split every 100MiB to
> 1GiB of reads, than it would be to not have power of 2 zones due to
> the extra per io calculations. 

Don't you need to split anyway when spanning two zones to avoid the zone
boundary error?

Maybe this is a silly idea, but it would be a trivial device-mapper
to remap the gaps out of the lba range.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:58           ` Matias Bjørling
  2022-03-10 15:07             ` Keith Busch
@ 2022-03-10 15:13             ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-10 15:13 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Pankaj Raghav, Christoph Hellwig, Luis Chamberlain,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Keith Busch, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, Damien Le Moal

On 10.03.2022 14:58, Matias Bjørling wrote:
> >> Yes, these drives are intended for Linux users that would use the
>> >> zoned block device. Append is supported but holes in the LBA space
>> >> (due to diff in zone cap and zone size) is still a problem for these users.
>> >
>> > With respect to the specific users, what does it break specifically? What are
>> key features are they missing when there's holes?
>>
>> What we hear is that it breaks existing mapping in applications, where the
>> address space is seen as contiguous; with holes it needs to account for the
>> unmapped space. This affects performance and and CPU due to unnecessary
>> splits. This is for both reads and writes.
>>
>> For more details, I guess they will have to jump in and share the parts that
>> they consider is proper to share in the mailing list.
>>
>> I guess we will have more conversations around this as we push the block
>> layer changes after this series.
>
>Ok, so I hear that one issue is I/O splits - If I assume that reads are sequential, zone cap/size between 100MiB and 1GiB, then my gut feeling would tell me its less CPU intensive to split every 100MiB to 1GiB of reads, than it would be to not have power of 2 zones due to the extra per io calculations.
>
>Do I have a faulty assumption about the above, or is there more to it?

I do not have numbers on the number of splits. I can only say that it is
an issue. Then there is the whole management, which apparently also costs
some DRAM for the extra mapping, instead of simply doing +1.

The goal for these customers is not having the emulation, so the cost of
the !PO2 path would be 0.

For the existing applications that require a PO2, we have the emulation.
In this case, the cost will only be paid on the devices that implement
!PO2 zones.

Hope this answer the question.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 15:07             ` Keith Busch
@ 2022-03-10 15:16               ` Javier González
  2022-03-10 23:44                 ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-10 15:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matias Bjørling, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, Damien Le Moal

On 10.03.2022 07:07, Keith Busch wrote:
>On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>>  >> Yes, these drives are intended for Linux users that would use the
>> > >> zoned block device. Append is supported but holes in the LBA space
>> > >> (due to diff in zone cap and zone size) is still a problem for these users.
>> > >
>> > > With respect to the specific users, what does it break specifically? What are
>> > key features are they missing when there's holes?
>> >
>> > What we hear is that it breaks existing mapping in applications, where the
>> > address space is seen as contiguous; with holes it needs to account for the
>> > unmapped space. This affects performance and and CPU due to unnecessary
>> > splits. This is for both reads and writes.
>> >
>> > For more details, I guess they will have to jump in and share the parts that
>> > they consider is proper to share in the mailing list.
>> >
>> > I guess we will have more conversations around this as we push the block
>> > layer changes after this series.
>>
>> Ok, so I hear that one issue is I/O splits - If I assume that reads
>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
>> feeling would tell me its less CPU intensive to split every 100MiB to
>> 1GiB of reads, than it would be to not have power of 2 zones due to
>> the extra per io calculations.
>
>Don't you need to split anyway when spanning two zones to avoid the zone
>boundary error?

If you have size = capacity then you can do a cross-zone read. This is
only a problem when we have gaps.

>Maybe this is a silly idea, but it would be a trivial device-mapper
>to remap the gaps out of the lba range.

One thing we have considered is that, as we remove the PO2 constraint
from the block layer, devices exposing PO2 zone sizes are able to
do the emulation the other way around to support things like this.

A device mapper is also a fine place to put this, but it seems like a
very simple task. Is it worth all the boilerplate code for the device
mapper only for this?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
  2022-03-10 12:57     ` Pankaj Raghav
@ 2022-03-10 17:38     ` Adam Manzanares
  2022-03-14  7:36       ` Christoph Hellwig
  1 sibling, 1 reply; 83+ messages in thread
From: Adam Manzanares @ 2022-03-10 17:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, Luis Chamberlain, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Sagi Grimberg,
	Damien Le Moal, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 10:47:25AM +0100, Christoph Hellwig wrote:
> This is complete bonkers.  IFF we have a good reason to support non
> power of two zones size (and I'd like to see evidence for that) we'll
> need to go through all the layers to support it.  But doing this emulation
> is just idiotic and will at tons of code just to completely confuse users.
> 
> On Tue, Mar 08, 2022 at 05:53:43PM +0100, Pankaj Raghav wrote:
> > 
> > #Motivation:
> > There are currently ZNS drives that are produced and deployed that do
> > not have power_of_2(PO2) zone size. The NVMe spec for ZNS does not
> > specify the PO2 requirement but the linux block layer currently checks
> > for zoned devices to have power_of_2 zone sizes.
> 
> Well, apparently whoever produces these drives never cared about supporting
> Linux as the power of two requirement goes back to SMR HDDs, which also
> don't have that requirement in the spec (and even allow non-uniform zone
> size), but Linux decided that we want this for sanity.

Non uniform zone size definitely seems like a mess. Fixed zone sizes that
are non po2 don't seem insane to me given that chunk sectors is no longer
assumed to be po2. We have looked at removing po2 and the only hot path 
optimization for po2 is for appends.

> 
> Do these drives even support Zone Append?

Should it matter if the drives support append? SMR drives do not support append
and they are considered zoned block devices. Append seems to be an optimization
for users that want higher concurrency per zone. One can also build concurrency
by leveraging multiple zones simultaneously as well.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-09 21:43           ` Damien Le Moal
@ 2022-03-10 20:35             ` Luis Chamberlain
  2022-03-10 23:50               ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-10 20:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
> On 3/9/22 23:33, Pankaj Raghav wrote:
> > On 2022-03-09 05:04, Damien Le Moal wrote:
> >> So for a power of 2 zone sized device, you are forcing an indirect call,
> >> always. Not acceptable. What is the point of that po2_zone_emu boolean
> >> you added to the queue ?
> > This is a good point and we had a discussion about this internally.
> > Initially I had something like this:
> > if (!blk_queue_is_po2_zone_emu(disk))
> > 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
> > else
> > 	return __nvme_sect_to_lba_po2(ns, sec);
> 
> No need for the else.

If true then great.

> > But @Luis indicated that it was better to set an op which comes at a cost of indirection
> > instead of having a runtime check with a if/else in the **hot path**. The code also looks
> > more clear with having an op.
> 
> The indirect call using a function pointer makes the code obscure. And
> the cost of that call is far greater and always present compared to the
> CPU branch prediction which will luckily avoid most of the time taking
> the wrong branch of an if.

The goal was to ensure no performance impact, and given a hot path
was involved and we simply cannot microbench append as there is no
way / API to do that, we can't be sure. But if you are certain that
there is no perf impact, it would be wonderful to live without it.

Thanks for the suggestion and push!

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 15:16               ` Javier González
@ 2022-03-10 23:44                 ` Damien Le Moal
  0 siblings, 0 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-10 23:44 UTC (permalink / raw)
  To: Javier González, Keith Busch
  Cc: Matias Bjørling, Pankaj Raghav, Christoph Hellwig,
	Luis Chamberlain, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 3/11/22 00:16, Javier González wrote:
> On 10.03.2022 07:07, Keith Busch wrote:
>> On Thu, Mar 10, 2022 at 02:58:07PM +0000, Matias Bjørling wrote:
>>>  >> Yes, these drives are intended for Linux users that would use the
>>>>>> zoned block device. Append is supported but holes in the LBA space
>>>>>> (due to diff in zone cap and zone size) is still a problem for these users.
>>>>>
>>>>> With respect to the specific users, what does it break specifically? What are
>>>> key features are they missing when there's holes?
>>>>
>>>> What we hear is that it breaks existing mapping in applications, where the
>>>> address space is seen as contiguous; with holes it needs to account for the
>>>> unmapped space. This affects performance and and CPU due to unnecessary
>>>> splits. This is for both reads and writes.
>>>>
>>>> For more details, I guess they will have to jump in and share the parts that
>>>> they consider is proper to share in the mailing list.
>>>>
>>>> I guess we will have more conversations around this as we push the block
>>>> layer changes after this series.
>>>
>>> Ok, so I hear that one issue is I/O splits - If I assume that reads
>>> are sequential, zone cap/size between 100MiB and 1GiB, then my gut
>>> feeling would tell me its less CPU intensive to split every 100MiB to
>>> 1GiB of reads, than it would be to not have power of 2 zones due to
>>> the extra per io calculations.
>>
>> Don't you need to split anyway when spanning two zones to avoid the zone
>> boundary error?
> 
> If you have size = capacity then you can do a cross-zone read. This is
> only a problem when we have gaps.
> 
>> Maybe this is a silly idea, but it would be a trivial device-mapper
>> to remap the gaps out of the lba range.
> 
> One thing we have considered is that as we remove the PO2 constraint
> from the block layer is that devices exposing PO2 zone sizes are able to
> do the emulation the other way around to support things like this.
> 
> A device mapper is also a fine place to put this, but it seems like a
> very simple task. Is it worth all the boilerplate code for the device
> mapper only for this?

Boiler plate ? DM already supports zoned devices. Writing a "dm-unhole"
target would be extremely simple as it would essentially be a variation
of dm-linear. There should be no DM core changes needed.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-10 20:35             ` Luis Chamberlain
@ 2022-03-10 23:50               ` Damien Le Moal
  2022-03-11  0:56                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-10 23:50 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/11/22 05:35, Luis Chamberlain wrote:
> On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
>> On 3/9/22 23:33, Pankaj Raghav wrote:
>>> On 2022-03-09 05:04, Damien Le Moal wrote:
>>>> So for a power of 2 zone sized device, you are forcing an indirect call,
>>>> always. Not acceptable. What is the point of that po2_zone_emu boolean
>>>> you added to the queue ?
>>> This is a good point and we had a discussion about this internally.
>>> Initially I had something like this:
>>> if (!blk_queue_is_po2_zone_emu(disk))
>>> 	return sector >> (ns->lba_shift - SECTOR_SHIFT);
>>> else
>>> 	return __nvme_sect_to_lba_po2(ns, sec);
>>
>> No need for the else.
> 
> If true then great.

If true ? The else is clearly not needed here. One less line of code.

> 
>>> But @Luis indicated that it was better to set an op which comes at a cost of indirection
>>> instead of having a runtime check with a if/else in the **hot path**. The code also looks
>>> more clear with having an op.
>>
>> The indirect call using a function pointer makes the code obscure. And
>> the cost of that call is far greater and always present compared to the
>> CPU branch prediction which will luckily avoid most of the time taking
>> the wrong branch of an if.
> 
> The goal was to ensure no performance impact, and given a hot path
> was involved and we simply cannot microbench append as there is no
> way / API to do that, we can't be sure. But if you are certain that
> there is no perf impact, it would be wonderful to live without it.

Use zonefs. It uses zone append.
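A minimal recipe (assuming zonefs-tools is installed and the ZNS namespace
shows up as /dev/nvme0n2):

  # mkzonefs /dev/nvme0n2
  # mount -t zonefs /dev/nvme0n2 /mnt
  # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct

Synchronous direct writes to the sequential zone files under /mnt/seq then
exercise the zone append path.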

> 
> Thanks for the suggestion and push!
> 
>   Luis


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices
  2022-03-10 23:50               ` Damien Le Moal
@ 2022-03-11  0:56                 ` Luis Chamberlain
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11  0:56 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	kanchan Joshi, Jens Axboe, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Matias Bjørling, jiangbo.365, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Fri, Mar 11, 2022 at 08:50:47AM +0900, Damien Le Moal wrote:
> On 3/11/22 05:35, Luis Chamberlain wrote:
> > On Thu, Mar 10, 2022 at 06:43:47AM +0900, Damien Le Moal wrote:
> >> The indirect call using a function pointer makes the code obscure. And
> >> the cost of that call is far greater and always present compared to the
> >> CPU branch prediction which will luckily avoid most of the time taking
> >> the wrong branch of an if.
> > 
> > The goal was to ensure no performance impact, and given a hot path
> > was involved and we simply cannot microbench append as there is no
> > way / API to do that, we can't be sure. But if you are certain that
> > there is no perf impact, it would be wonderful to live without it.
> 
> Use zonefs. It uses zone append.

That'd be a microbench on zonefs append, not raw append, and so I
suspect a simple branch cannot make a difference and would get lost
as noise. So I don't think a perf test is required here. Please
let me know.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 14:44       ` Christoph Hellwig
@ 2022-03-11 20:19         ` Luis Chamberlain
  2022-03-11 20:51           ` Keith Busch
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 20:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav, Adam Manzanares, Javier González,
	jiangbo.365, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Thu, Mar 10, 2022 at 03:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 10, 2022 at 01:57:58PM +0100, Pankaj Raghav wrote:
> > Yes, these drives are intended for Linux users that would use the zoned
> > block device. Append is supported but holes in the LBA space (due to
> > diff in zone cap and zone size) is still a problem for these users.
> 
> I'd really like to hear from the users.  Because really, either they
> should use a proper file system abstraction (including zonefs if that is
> all they need),

That requires access to at least the block device and without PO2
emulation that is not possible. Using zonefs is not possible today
for !PO2 devices.

> or raw nvme passthrough which will alredy work for this
> case. 

This effort is not upstream yet; however, if and when it does land
upstream, it means something other than zonefs must be used, since
!PO2 devices are not supported by zonefs. So although the goal with
zonefs was to provide a unified interface for raw access for
applications, the PO2 requirement will essentially create
fragmentation.

> But adding a whole bunch of crap because people want to use the
> block device special file for something it is not designed for just
> does not make any sense.

Using Linux requires PO2. And so, per Damien's request, the
logical thing to do was to keep that requirement and to avoid any
performance regressions. That "crap" was done to slowly pave the way
forward to later removing the PO2 requirement.

I think we'll all acknowledge that doing emulation just means adding more
software for something that is not a NAND requirement, but a requirement
imposed by the inheritance of zoned software designed for SMR HDDs. I
think we may also all acknowledge now that keeping this emulation code
*forever* seems like complete insanity.

Since the PO2 requirement imposed on Linux today seems to be now
sending us down a dubious effort we'd need to support, let me
then try to get folks who have been saying that we must keep this
requirement to answer the following question:

Are you 100% sure your ZNS hardware team and firmware team will always
be happy you have baked in a PO2 requirement for ZNS drives on Linux
and are you ready to deal with those consequences on Linux forever? Really?

NAND has no PO2 requirement. The emulation effort was only done to help
add support for !PO2 devices because there is no alternative. If we
however are ready instead to go down the avenue of removing those
restrictions, well, let's go there instead. If that's not even
something we are willing to consider, I'd really like folks who stand
behind the PO2 requirement to stick their necks out and clearly say that
their hw/fw teams are happy to deal with this requirement forever on ZNS.

From what I am seeing this is a legacy requirement which we should be
able to remove. Keeping the requirement will only do harm to ZNS
adoption on Linux and it will also create *more* fragmentation.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:19         ` Luis Chamberlain
@ 2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
                               ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-11 20:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> NAND has no PO2 requirement. The emulation effort was only done to help
> add support for !PO2 devices because there is no alternative. If we
> however are ready instead to go down the avenue of removing those
> restrictions well let's go there then instead. If that's not even
> something we are willing to consider I'd really like folks who stand
> behind the PO2 requirement to stick their necks out and clearly say that
> their hw/fw teams are happy to deal with this requirement forever on ZNS.

Regardless of the merits of the current OS requirement, it's a trivial
matter for firmware to round up their reported zone size to the next
power of 2. This does not create a significant burden on their part, as
far as I know.

And po2 does not even seem to be the real problem here. The holes seem
to be what's causing a concern, which you have even without po2 zones.
I'm starting to like the previous idea of creating an unholey
device-mapper for such users...

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
@ 2022-03-11 21:04             ` Luis Chamberlain
  2022-03-11 21:31               ` Keith Busch
  2022-03-11 22:23             ` Adam Manzanares
  2022-03-21 16:21             ` Jonathan Derrick
  2 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 21:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > NAND has no PO2 requirement. The emulation effort was only done to help
> > add support for !PO2 devices because there is no alternative. If we
> > however are ready instead to go down the avenue of removing those
> > restrictions well let's go there then instead. If that's not even
> > something we are willing to consider I'd really like folks who stand
> > behind the PO2 requirement to stick their necks out and clearly say that
> > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

Sure sure.. fw can do crap like that too...

> And po2 does not even seem to be the real problem here. The holes seem
> to be what's causing a concern, which you have even without po2 zones.

Exactly.

> I'm starting to like the previous idea of creating an unholey
> device-mapper for such users...

Won't that restrict nvme with chunk size crap? For instance, later if we
want much larger block sizes.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 21:04             ` Luis Chamberlain
@ 2022-03-11 21:31               ` Keith Busch
  2022-03-11 22:24                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Keith Busch @ 2022-03-11 21:31 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> 
> > I'm starting to like the previous idea of creating an unholey
> > device-mapper for such users...
> 
> Won't that restrict nvme with chunk size crap. For instance later if we
> want much larger block sizes.

I'm not sure I understand. The chunk_size has nothing to do with the
block size. And while nvme is a user of this in some circumstances, it
can't be used concurrently with ZNS because the block layer appropriates
the field for the zone size.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
@ 2022-03-11 22:23             ` Adam Manzanares
  2022-03-11 22:30               ` Keith Busch
  2022-03-21 16:21             ` Jonathan Derrick
  2 siblings, 1 reply; 83+ messages in thread
From: Adam Manzanares @ 2022-03-11 22:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > NAND has no PO2 requirement. The emulation effort was only done to help
> > add support for !PO2 devices because there is no alternative. If we
> > however are ready instead to go down the avenue of removing those
> > restrictions well let's go there then instead. If that's not even
> > something we are willing to consider I'd really like folks who stand
> > behind the PO2 requirement to stick their necks out and clearly say that
> > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

I can't comment on FW burdens but adding po2 zone size creates holes for the 
FW to deal with as well.

> 
> And po2 does not even seem to be the real problem here. The holes seem
> to be what's causing a concern, which you have even without po2 zones.
> I'm starting to like the previous idea of creating an unholey
> device-mapper for such users...

I see holes as being caused by having to make zone size po2 when capacity is 
not po2. The holes should be tied to po2, unless I am missing something. BTW, if
we go down the dm route, can we start calling it dm-unholy?

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 21:31               ` Keith Busch
@ 2022-03-11 22:24                 ` Luis Chamberlain
  2022-03-12  7:58                   ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-11 22:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
> > On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> > 
> > > I'm starting to like the previous idea of creating an unholey
> > > device-mapper for such users...
> > 
> > Won't that restrict nvme with chunk size crap. For instance later if we
> > want much larger block sizes.
> 
> I'm not sure I understand. The chunk_size has nothing to do with the
> block size. And while nvme is a user of this in some circumstances, it
> can't be used concurrently with ZNS because the block layer appropriates
> the field for the zone size.

Many device mapper targets split I/O into chunks (see max_io_len());
wouldn't this create an overhead?

Using a device mapper target also creates a divergence in strategy
for ZNS. Some will use the block device, others the dm target. The
goal should be to create a unified path.

And all this, just because SMR. Is that worth it? Are we sure?

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 22:23             ` Adam Manzanares
@ 2022-03-11 22:30               ` Keith Busch
  0 siblings, 0 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-11 22:30 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal

On Fri, Mar 11, 2022 at 10:23:33PM +0000, Adam Manzanares wrote:
> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
> > And po2 does not even seem to be the real problem here. The holes seem
> > to be what's causing a concern, which you have even without po2 zones.
> > I'm starting to like the previous idea of creating an unholey
> > device-mapper for such users...
> 
> I see holes as being caused by having to make zone size po2 when capacity is 
> not po2. po2 should be tied to the holes, unless I am missing something. 

Practically speaking, you're probably not missing anything. The spec,
however, doesn't constrain the existence of holes to any particular zone
size.

> BTW if we go down the dm route can we start calling it dm-unholy.

I was thinking "dm-evil" but unholy works too. :)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 22:24                 ` Luis Chamberlain
@ 2022-03-12  7:58                   ` Damien Le Moal
  2022-03-14  7:35                     ` Christoph Hellwig
  2022-03-14  8:36                     ` Matias Bjørling
  0 siblings, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-12  7:58 UTC (permalink / raw)
  To: Luis Chamberlain, Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/12/22 07:24, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
>> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
>>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
>>>
>>>> I'm starting to like the previous idea of creating an unholey
>>>> device-mapper for such users...
>>>
>>> Won't that restrict nvme with chunk size crap. For instance later if we
>>> want much larger block sizes.
>>
>> I'm not sure I understand. The chunk_size has nothing to do with the
>> block size. And while nvme is a user of this in some circumstances, it
>> can't be used concurrently with ZNS because the block layer appropriates
>> the field for the zone size.
> 
> Many device mapper targets split I/O into chunks, see max_io_len(),
> wouldn't this create an overhead?

Apart from the bio clone, the overhead should not be higher than what
the block layer already has. IOs that are too large or that are
straddling zones are split by the block layer, and DM splitting leads
generally to no split in the block layer for the underlying device IO.
DM essentially follows the same pattern: max_io_len() depends on the
target design limits, which in turn depend on the underlying device. For
a dm-unhole target, the IO size limit would typically be the same as
that of the underlying device.

> Using a device mapper target also creates a divergence in strategy
> for ZNS. Some will use the block device, others the dm target. The
> goal should be to create a unified path.

If we allow non power of 2 zone sized devices, the path will *never* be
unified because we will get fragmentation on what can run on these
devices as opposed to power of 2 sized ones. E.g. f2fs will not work for
the former but will for the latter. That is really not an ideal situation.

> 
> And all this, just because SMR. Is that worth it? Are we sure?

No. This is *not* because of SMR. Never has been. The first prototype
SMR drives I received in my lab 10 years ago did not have a power of 2
sized zone size because zones where naturally aligned to tracks, which
like NAND erase blocks, are not necessarily power of 2 sized. And all
zones were not even the same size. That was not usable.

The reason for the power of 2 requirement is 2 fold:
1) At the time we added zone support for SMR, chunk_sectors had to be a
power of 2 number of sectors.
2) SMR users did request power of 2 zone sizes and that all zones have
the same size as that simplified software design. There was even a
de-facto agreement that 256MB zone size is a good compromise between
usability and overhead of zone reclaim/GC. But that particular number is
for HDD due to their performance characteristics.

Hence the current Linux requirements which have been serving us well so
far. DM needed chunk_sectors to be changed to allow non power of 2
values. So the chunk_sectors requirement was lifted recently (can't
remember which version added this). Allowing non power of 2 zone size
would thus be more easily feasible now. Allowing devices with a non
power of 2 zone size is not technically difficult.

But...

The problem being raised is all about the fact that the power of 2 zone
size requirement creates a hole of unusable sectors in every zone when
the device implementation has a zone capacity lower than the zone size.

I have been arguing all along that I think this problem is a
non-problem, simply because a well designed application should *always*
use zones as storage containers without ever hoping that the next zone
in sequence can be used as well. The application should *never* consider
the entire LBA space of the device capacity without this zone split. The
zone based management of capacity is necessary for any good design to
deal correctly with write error recovery and active/open zone resources
management. And as Keith said, there is always a "hole" anyway for any
non-full zone, between the zone write pointer and the last usable sector
in the zone. Reads there are nonsensical and writes can only go to one
place.
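
(To put a number on it: with, say, a 128M zone size and a 96M zone
capacity, every zone has a 32M hole, so a quarter of the namespace LBA
space is never readable or writable no matter how the zones are used.)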

Now, in the spirit of trying to facilitate software development for
zoned devices, we can try finding solutions to remove that hole. zonefs
is an obvious solution. But back to the previous point: with one zone ==
one file, there is no continuity in the storage address space that the
application can use. The application has to be designed to use
individual files representing a zone. And with such design, an
equivalent design directly using the block device file would have no
difficulties due to the the sector hole between zone capacity and zone
size. I have a prototype LevelDB implementation that can use both zonefs
and block device file on ZNS with only a few different lines of code to
prove this point.

The other solution would be adding a dm-unhole target to remap sectors
to remove the holes from the device address space. Such a target would be
easy to write, but in my opinion, this would still not change the fact
that applications still have to deal with error recovery and active/open
zone resources. So they still have to be zone aware and operate per zone.

Furthermore, adding such a DM target would create a non power of 2 zone
size zoned device which will need support from the block layer. So some
block layer functions will need to change. In the end, this may not be
different than enabling non power of 2 zone sized devices for ZNS.

And for this decision, I maintain some of my requirements:
1) The added overhead from multiplication & divisions should be
acceptable and not degrade performance. Otherwise, this would be a
disservice to the zone ecosystem.
2) Nothing that works today on available devices should break
3) Zone size requirements will still exist. E.g. btrfs 64K alignment
requirement

But even with all these properly addressed, f2fs will not work anymore,
some in-kernel users will still need some zone size requirements (btrfs)
and *all* applications using a zoned block device file will now have to
be designed based on non power of 2 zone size so that they can work on
all devices. Meaning that this is also potentially forcing changes on
existing applications to use newer zoned devices that may not have a
power of 2 zone size.

This entire discussion is about the problem that power of 2 zone size
creates (which again I think is a non-problem). However, based on the
arguments above, allowing non power of 2 zone sized devices is not
exactly problem free either.

My answer to your last question ("Are we sure?") is thus: No. I am not
sure this is a good idea. But as always, I would be happy to be proven
wrong. So far, I have not seen any argument doing that.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-12  7:58                   ` Damien Le Moal
@ 2022-03-14  7:35                     ` Christoph Hellwig
  2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  8:36                     ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:35 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Luis Chamberlain, Keith Busch, Christoph Hellwig, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
> The reason for the power of 2 requirement is 2 fold:
> 1) At the time we added zone support for SMR, chunk_sectors had to be a
> power of 2 number of sectors.
> 2) SMR users did request power of 2 zone sizes and that all zones have
> the same size as that simplified software design. There was even a
> de-facto agreement that 256MB zone size is a good compromise between
> usability and overhead of zone reclaim/GC. But that particular number is
> for HDD due to their performance characteristics.

Also for NVMe we initially went down the road to try to support
non power of two sizes.  But there was another major early host that
really wanted the power of two zone sizes to support hardware based
hosts that can cheaply do shifts but not divisions.  The variable
zone capacity feature (something that Linux does not currently support)
is a feature requested by NVMe members on the host and device side and
also can only be supported with the zone size / zone capacity split.
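To illustrate the shift-vs-division point (this sketch is not from the
patchset or the spec; the zone sizes are just assumed example values,
e.g. a 128MiB PO2 zone vs. a 96MiB NPO2 zone with 512-byte sectors):

  #include <stdint.h>

  /* PO2 zone size: the zone index reduces to a shift (and the in-zone
   * offset to a mask), which is cheap even in hardware. */
  static inline uint64_t zone_no_po2(uint64_t sector, unsigned int zone_shift)
  {
          return sector >> zone_shift;      /* zone_shift = 18 for 128MiB zones */
  }

  /* NPO2 zone size: the same lookup needs a 64-bit division. */
  static inline uint64_t zone_no_npo2(uint64_t sector, uint64_t zone_sectors)
  {
          return sector / zone_sectors;     /* zone_sectors = 196608 for 96MiB zones */
  }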

> The other solution would be adding a dm-unhole target to remap sectors
> to remove the holes from the device address space. Such target would be
> easy to write, but in my opinion, this would still not change the fact
> that applications still have to deal with error recovery and active/open
> zone resources. So they still have to be zone aware and operate per zone.

I don't think we even need a new target for it.  I think you can do
this with a table using multiple dm-linear sections already if you
want.
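As a rough sketch of what such a table could look like (the device name
and the 128MiB zone size / 96MiB zone capacity are assumptions here; as
Damien points out below, device-mapper currently rejects this for zoned
devices because the target zone size must match the device zone size):

  # 128MiB zone = 262144 sectors, 96MiB capacity = 196608 sectors.
  # Map only the writable part of each zone, skipping the 32MiB hole.
  # <logical start> <length> linear <device> <device offset>
  0      196608 linear /dev/nvme0n2 0
  196608 196608 linear /dev/nvme0n2 262144
  393216 196608 linear /dev/nvme0n2 524288
  ...
  # dmsetup create zns-nohole < table.txt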

> My answer to your last question ("Are we sure?") is thus: No. I am not
> sure this is a good idea. But as always, I would be happy to be proven
> wrong. So far, I have not seen any argument doing that.

Agreed. Supporting non-power of two sizes in the block layer is fairly
easy as shown by some of the patches seen in this series.  Supporting
them properly in the whole ecosystem is not trivial and will create a
long-term burden.  We could do that, but we'd rather have a really good
reason for it, and right now I don't see that.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-10 17:38     ` Adam Manzanares
@ 2022-03-14  7:36       ` Christoph Hellwig
  0 siblings, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:36 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Christoph Hellwig, Pankaj Raghav, Luis Chamberlain,
	Javier González, kanchan Joshi, Jens Axboe, Keith Busch,
	Sagi Grimberg, Damien Le Moal, Matias Bjørling, jiangbo.365,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Thu, Mar 10, 2022 at 05:38:35PM +0000, Adam Manzanares wrote:
> > Do these drives even support Zone Append?
> 
> Should it matter if the drives support append? SMR drives do not support append
> and they are considered zone block devices. Append seems to be an optimization
> for users that want higher concurrency per zone. One can also build concurrency
> by leveraging multiple zones simultaneously as well.

Not supporting it natively for SMR is a major pain.  Due to hard drives
being relatively slow the emulation is somewhat workable, but on SSDs
the serialization would completely kill performance.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:35                     ` Christoph Hellwig
@ 2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  7:58                         ` Christoph Hellwig
  2022-03-14 10:49                         ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-14  7:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/14/22 16:35, Christoph Hellwig wrote:
> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>> The reason for the power of 2 requirement is 2 fold:
>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>> power of 2 number of sectors.
>> 2) SMR users did request power of 2 zone sizes and that all zones have
>> the same size as that simplified software design. There was even a
>> de-facto agreement that 256MB zone size is a good compromise between
>> usability and overhead of zone reclaim/GC. But that particular number is
>> for HDD due to their performance characteristics.
> 
> Also for NVMe we initially went down the road to try to support
> non power of two sizes.  But there was another major early host that
> really wanted the power of two zone sizes to support hardware based
> hosts that can cheaply do shifts but not divisions.  The variable
> zone capacity feature (something that Linux does not currently support)
> is a feature requested by NVMe members on the host and device side
> also can only be supported with the zone size / zone capacity split.
> 
>> The other solution would be adding a dm-unhole target to remap sectors
>> to remove the holes from the device address space. Such target would be
>> easy to write, but in my opinion, this would still not change the fact
>> that applications still have to deal with error recovery and active/open
>> zone resources. So they still have to be zone aware and operate per zone.
> 
> I don't think we even need a new target for it.  I think you can do
> this with a table using multiple dm-linear sections already if you
> want.

Nope, this is currently not possible: DM requires the target zone size
to be the same as the underlying device zone size. So that would not work.

> 
>> My answer to your last question ("Are we sure?") is thus: No. I am not
>> sure this is a good idea. But as always, I would be happy to be proven
>> wrong. So far, I have not seen any argument doing that.
> 
> Agreed. Supporting non-power of two sizes in the block layer is fairly
> easy as shown by some of the patches seen in this series.  Supporting
> them properly in the whole ecosystem is not trivial and will create a
> long-term burden.  We could do that, but we'd rather have a really good
> reason for it, and right now I don't see that.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:45                       ` Damien Le Moal
@ 2022-03-14  7:58                         ` Christoph Hellwig
  2022-03-14 10:49                         ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-14  7:58 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 04:45:12PM +0900, Damien Le Moal wrote:
> Nope, this is currently not possible: DM requires the target zone size
> to be the same as the underlying device zone size. So that would not work.

Indeed.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-12  7:58                   ` Damien Le Moal
  2022-03-14  7:35                     ` Christoph Hellwig
@ 2022-03-14  8:36                     ` Matias Bjørling
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14  8:36 UTC (permalink / raw)
  To: Damien Le Moal, Luis Chamberlain, Keith Busch
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> 
> Furthermore, adding such DM target would create a non power of 2 zone size
> zoned device which will need support from the block layer. So some block layer
> functions will need to change. In the end, this may not be different than
> enabling non power of 2 zone sized devices for ZNS.
> 
> And for this decision, I maintain some of my requirements:
> 1) The added overhead from multiplication & divisions should be acceptable
> and not degrade performance. Otherwise, this would be a disservice to the
> zone ecosystem.
> 2) Nothing that works today on available devices should break
> 3) Zone size requirements will still exist. E.g. btrfs 64K alignment requirement
> 

Adding to the existing points that have been made.

I believe it hasn't been mentioned that for non-power of 2 zone sizes, holes are still allowed due to zones being/becoming offline. The offline zone state supports neither writes nor reads, and applications must be aware and work around such holes in the address space. 

Furthermore, the specification doesn't allow writes to cross zones - so while a read may cross a zone boundary, writes must always be split at zone boundaries.

As a result, applications must work with zones independently and can't assume that they can write into the adjacent zone or across two zones.

Best, Matias




* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14  7:45                       ` Damien Le Moal
  2022-03-14  7:58                         ` Christoph Hellwig
@ 2022-03-14 10:49                         ` Javier González
  2022-03-14 14:16                           ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-14 10:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 14.03.2022 16:45, Damien Le Moal wrote:
>On 3/14/22 16:35, Christoph Hellwig wrote:
>> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>>> The reason for the power of 2 requirement is 2 fold:
>>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>>> power of 2 number of sectors.
>>> 2) SMR users did request power of 2 zone sizes and that all zones have
>>> the same size as that simplified software design. There was even a
>>> de-facto agreement that 256MB zone size is a good compromise between
>>> usability and overhead of zone reclaim/GC. But that particular number is
>>> for HDD due to their performance characteristics.
>>
>> Also for NVMe we initially went down the road to try to support
>> non power of two sizes.  But there was another major early host that
>> really wanted the power of two zone sizes to support hardware based
>> hosts that can cheaply do shifts but not divisions.  The variable
>> zone capacity feature (something that Linux does not currently support)
>> is a feature requested by NVMe members on the host and device side
>> also can only be supported with the zone size / zone capacity split.
>>
>>> The other solution would be adding a dm-unhole target to remap sectors
>>> to remove the holes from the device address space. Such target would be
>>> easy to write, but in my opinion, this would still not change the fact
>>> that applications still have to deal with error recovery and active/open
>>> zone resources. So they still have to be zone aware and operate per zone.
>>
>> I don't think we even need a new target for it.  I think you can do
>> this with a table using multiple dm-linear sections already if you
>> want.
>
>Nope, this is currently not possible: DM requires the target zone size
>to be the same as the underlying device zone size. So that would not work.
>
>>
>>> My answer to your last question ("Are we sure?") is thus: No. I am not
>>> sure this is a good idea. But as always, I would be happy to be proven
>>> wrong. So far, I have not seen any argument doing that.
>>
>> Agreed. Supporting non-power of two sizes in the block layer is fairly
>> easy as shown by some of the patches seen in this series.  Supporting
>> them properly in the whole ecosystem is not trivial and will create a
>> long-term burden.  We could do that, but we'd rather have a really good
>> reason for it, and right now I don't see that.

I think that Bo's use-case is an example of a major upstream Linux host
that is struggling with unmapped LBAs. Can we focus on this use-case
and the parts that we are missing to support Bytedance?

If you agree to this, I believe we can add support for ZoneFS pretty
easily. We also have a POC in btrfs that we will follow up on. For the time
being, F2FS would fail at mkfs time if zone size is not a PO2.

What do you think?


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 10:49                         ` Javier González
@ 2022-03-14 14:16                           ` Matias Bjørling
  2022-03-14 16:23                             ` Luis Chamberlain
  2022-03-14 19:55                             ` Javier González
  0 siblings, 2 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14 14:16 UTC (permalink / raw)
  To: Javier González, Damien Le Moal
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> >> Agreed. Supporting non-power of two sizes in the block layer is
> >> fairly easy as shown by some of the patches seen in this series.
> >> Supporting them properly in the whole ecosystem is not trivial and
> >> will create a long-term burden.  We could do that, but we'd rather
> >> have a really good reason for it, and right now I don't see that.
> 
> I think that Bo's use-case is an example of a major upstream Linux host that is
> struggling with unmapped LBAs. Can we focus on this use-case and the parts
> that we are missing to support Bytedance?

Any application that uses zoned storage devices would have to manage unmapped LBAs due to the potential of zones being/becoming offline (no reads/writes allowed). Eliminating the difference between zone cap and zone size will not remove this requirement, and holes will continue to exist. Furthermore, writing to LBAs across zones is not allowed by the specification and must also be managed.

Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement.
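For what it's worth, a minimal user-space sketch of pulling those per-zone
attributes and deriving each zone's writable range (this is an illustration,
not code from this thread; it assumes a kernel that reports zone capacity
via BLK_ZONE_REP_CAPACITY):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/blkzoned.h>

  int main(int argc, char **argv)
  {
          unsigned int i, nr = 16;
          struct blk_zone_report *rep;
          int fd;

          if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                  return 1;

          /* Report the first 16 zones starting at sector 0. */
          rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
          rep->sector = 0;
          rep->nr_zones = nr;
          if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
                  perror("BLKREPORTZONE");
                  return 1;
          }

          for (i = 0; i < rep->nr_zones; i++) {
                  struct blk_zone *z = &rep->zones[i];
                  unsigned long long cap = (rep->flags & BLK_ZONE_REP_CAPACITY) ?
                                           z->capacity : z->len;

                  /* Read-only/offline zones are holes for writes, as is
                   * the len - capacity gap at the tail of a zone. */
                  if (z->cond == BLK_ZONE_COND_READONLY ||
                      z->cond == BLK_ZONE_COND_OFFLINE)
                          cap = 0;

                  printf("zone %u: sectors [%llu, %llu), writable up to %llu\n",
                         i, (unsigned long long)z->start,
                         (unsigned long long)(z->start + z->len),
                         (unsigned long long)(z->start + cap));
          }
          return 0;
  }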

For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But those implementors did that knowing that the Linux kernel didn't support it.

I want to turn the argument around to see it from the kernel developers' point of view. They have communicated the PO2 requirement clearly, there's good precedent for working with PO2 zone sizes, and lastly, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developers take on the long-term maintenance burden of NPO2 zone sizes?


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 14:16                           ` Matias Bjørling
@ 2022-03-14 16:23                             ` Luis Chamberlain
  2022-03-14 19:30                               ` Matias Bjørling
  2022-03-14 19:55                             ` Javier González
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-14 16:23 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> I want to turn the argument around to see it from the kernel
> developer's point of view. They have communicated the PO2 requirement
> clearly,

Such a requirement is based on history and on the effort put in place
assuming PO2 zone sizes for zoned storage, and clearly it is not an
inherent one. And clearly even vendors who have embraced PO2 don't know
for sure they'll always be able to stick to PO2...

> there's good precedence working with PO2 zone sizes, and at
> last, holes can't be avoided and are part of the overall design of
> zoned storage devices. So why should the kernel developer's take on
> the long-term maintenance burden of NPO2 zone sizes?

I think the better question to address here is:

Do we *not* want to support NPO2 zone sizes in Linux out of principle?

If we *are* open to support NPO2 zone sizes, what path should we take to
incur the least pain and fragmentation?

Emulation was a path being considered, and I think at this point the
answer to evaluating that path is: this is cumbersome, probably not.

The next question then is: are we open to evaluate what it looks like
to slowly shave off the PO2 requirement in different layers, with a
goal to avoid further fragmentation? There is effort on evaluating that
path and it doesn't seem to be that bad.

So I'd advise to evaluate that, there is nothing to lose other than
awareness of what that path might look like.

Unless of course we already have a clear path forward for NPO2 we can
all agree on.

  Luis


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 16:23                             ` Luis Chamberlain
@ 2022-03-14 19:30                               ` Matias Bjørling
  2022-03-14 19:51                                 ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-14 19:30 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

> -----Original Message-----
> From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis
> Chamberlain
> Sent: Monday, 14 March 2022 17.24
> To: Matias Bjørling <Matias.Bjorling@wdc.com>
> Cc: Javier González <javier@javigon.com>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>;
> Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>;
> Adam Manzanares <a.manzanares@samsung.com>;
> jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens
> Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj
> Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux-
> block@vger.kernel.org; linux-nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > I want to turn the argument around to see it from the kernel
> > developer's point of view. They have communicated the PO2 requirement
> > clearly,
> 
> Such requirement is based on history and effort put in place to assume a PO2
> requirement for zone storage, and clearly it is not. And clearly even vendors
> who have embraced PO2 don't know for sure they'll always be able to stick to
> PO2...

Sure - It'll be naïve to give a carte blanche promise.

However, you're skipping the next two elements, which state that there is both good precedent for working with PO2 zone sizes and that holes/unmapped LBAs can't be avoided. That makes an argument for why NPO2 zone sizes may not bring what one is looking for. It's a lot of work for little practical change, if any.

> 
> > there's good precedence working with PO2 zone sizes, and at last,
> > holes can't be avoided and are part of the overall design of zoned
> > storage devices. So why should the kernel developer's take on the
> > long-term maintenance burden of NPO2 zone sizes?
> 
> I think the better question to address here is:
> 
> Do we *not* want to support NPO2 zone sizes in Linux out of principle?
> 
> If we *are* open to support NPO2 zone sizes, what path should we take to
> incur the least pain and fragmentation?
> 
> Emulation was a path being considered, and I think at this point the answer to
> evaluating that path is: this is cumbersome, probably not.
> 
> The next question then is: are we open to evaluate what it looks like to slowly
> shave off the PO2 requirement in different layers, with a goal to avoid further
> fragmentation? There is effort on evaluating that path and it doesn't seem to
> be that bad.
> 
> So I'd advise to evaluate that, there is nothing to lose other than awareness of
> what that path might look like.
> 
> Unless of course we already have a clear path forward for NPO2 we can all
> agree on.

It looks like there isn't currently one that can be agreed upon.

If evaluating different approaches, it would be helpful to the reviewers if the interfaces and all of their kernel users were converted in a single patchset. This would also help to avoid users getting hit by what is and isn't supported by a particular device implementation, and it would allow the full set of changes required to add the support to be reviewed as a whole.



* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:30                               ` Matias Bjørling
@ 2022-03-14 19:51                                 ` Luis Chamberlain
  2022-03-15 10:45                                   ` Matias Bjørling
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-14 19:51 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Mon, Mar 14, 2022 at 07:30:25PM +0000, Matias Bjørling wrote:
> > -----Original Message-----
> > From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis
> > Chamberlain
> > Sent: Monday, 14 March 2022 17.24
> > To: Matias Bjørling <Matias.Bjorling@wdc.com>
> > Cc: Javier González <javier@javigon.com>; Damien Le Moal
> > <damien.lemoal@opensource.wdc.com>; Christoph Hellwig <hch@lst.de>;
> > Keith Busch <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>;
> > Adam Manzanares <a.manzanares@samsung.com>;
> > jiangbo.365@bytedance.com; kanchan Joshi <joshi.k@samsung.com>; Jens
> > Axboe <axboe@kernel.dk>; Sagi Grimberg <sagi@grimberg.me>; Pankaj
> > Raghav <pankydev8@gmail.com>; Kanchan Joshi <joshiiitr@gmail.com>; linux-
> > block@vger.kernel.org; linux-nvme@lists.infradead.org
> > Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> > 
> > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > > I want to turn the argument around to see it from the kernel
> > > developer's point of view. They have communicated the PO2 requirement
> > > clearly,
> > 
> > Such requirement is based on history and effort put in place to assume a PO2
> > requirement for zone storage, and clearly it is not. And clearly even vendors
> > who have embraced PO2 don't know for sure they'll always be able to stick to
> > PO2...
> 
> Sure - It'll be naïve to give a carte blanche promise.

Exactly. So taking a position to not support NPO2 I think seems counter
productive to the future of ZNS; the question should be *how* to best
do this in light of what we need to support / avoid performance
regressions / strive towards avoiding fragmentation.

> However, you're skipping the next two elements, which state that there
> are both good precedence working with PO2 zone sizes and that
> holes/unmapped LBAs can't be avoided.

I'm not, but I admit that it's a good point that the possibility of
zones being taken offline also implies holes. I also think it was a
good exercise to discuss and evaluate emulation given I don't think
this point you made would have been made clear otherwise. This is why
I treat ZNS as an evolving effort, and I can't seriously take any position
stating all answers are known.

> Making an argument for why NPO2
> zone sizes may not bring what one is looking for. It's a lot of work
> for little practical change, if any. 

NAND does not incur a PO2 requirement; that should be enough to
imply that NPO2 zones *can* be expected. If no vendor is willing
to take the position that they know for a fact they'll never adopt
NPO2 zones, that should be enough to keep an open mind and consider *how*
to support them.

> > > there's good precedence working with PO2 zone sizes, and at last,
> > > holes can't be avoided and are part of the overall design of zoned
> > > storage devices. So why should the kernel developer's take on the
> > > long-term maintenance burden of NPO2 zone sizes?
> > 
> > I think the better question to address here is:
> > 
> > Do we *not* want to support NPO2 zone sizes in Linux out of principle?
> > 
> > If we *are* open to support NPO2 zone sizes, what path should we take to
> > incur the least pain and fragmentation?
> > 
> > Emulation was a path being considered, and I think at this point the answer to
> > evaluating that path is: this is cumbersome, probably not.
> > 
> > The next question then is: are we open to evaluate what it looks like to slowly
> > shave off the PO2 requirement in different layers, with a goal to avoid further
> > fragmentation? There is effort on evaluating that path and it doesn't seem to
> > be that bad.
> > 
> > So I'd advise to evaluate that, there is nothing to lose other than awareness of
> > what that path might look like.
> > 
> > Unless of course we already have a clear path forward for NPO2 we can all
> > agree on.
> 
> It looks like there isn't currently one that can be agreed upon.

I'm not quite sure that is the case. To reach consensus one has
to accept that the right answer may not be known yet and to evaluate
all prospects. It is not clear to me that we've done that yet, which
is why I think a venue such as LSFMM may be a good place to
review these things.

> If evaluating different approaches, it would be helpful to the
> reviewers if interfaces and all of its kernel users are converted in a
> single patchset. This would also help to avoid users getting hit by
> what is supported, and what isn't supported by a particular device
> implementation and allow better to review the full set of changes
> required to add the support.

Sorry I didn't understand the suggestion here, can you clarify what it
is you are suggesting?

Thanks!

  Luis


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 14:16                           ` Matias Bjørling
  2022-03-14 16:23                             ` Luis Chamberlain
@ 2022-03-14 19:55                             ` Javier González
  2022-03-15 12:32                               ` Matias Bjørling
  1 sibling, 1 reply; 83+ messages in thread
From: Javier González @ 2022-03-14 19:55 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 14.03.2022 14:16, Matias Bjørling wrote:
>> >> Agreed. Supporting non-power of two sizes in the block layer is
>> >> fairly easy as shown by some of the patches seen in this series.
>> >> Supporting them properly in the whole ecosystem is not trivial and
>> >> will create a long-term burden.  We could do that, but we'd rather
>> >> have a really good reason for it, and right now I don't see that.
>>
>> I think that Bo's use-case is an example of a major upstream Linux host that is
>> struggling with unmapped LBAs. Can we focus on this use-case and the parts
>> that we are missing to support Bytedance?
>
>Any application that uses zoned storage devices would have to manage
>unmapped LBAs due to the potential of zones being/becoming offline (no
>reads/writes allowed). Eliminating the difference between zone cap and
>zone size will not remove this requirement, and holes will continue to
>exist. Furthermore, writing to LBAs across zones is not allowed by the
>specification and must also be managed.
>
>Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without having per-zone knowledge. An application would have to know about zones and their writeable capacity. To decide where and how data is written, an application must manage writing across zones, specific offline zones, and (currently) its writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement.

Supporting offline zones is optional in the ZNS spec? We are not
considering supporting this in the host. This will be handled by the
device, precisely to keep the SW stack simpler.
>
>For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But its implementors knowingly did that while knowing that the Linux kernel didn't support it.
>
>I want to turn the argument around to see it from the kernel developer's point of view. They have communicated the PO2 requirement clearly, there's good precedence working with PO2 zone sizes, and at last, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developer's take on the long-term maintenance burden of NPO2 zone sizes?

You have a good point, and that is the question we need to help answer.
As I see it, requirements evolve and the kernel changes with it as long
as there are active upstream users for it.

The main constraint for PO2 is removed in the block layer, we have Linux
hosts stating that unmapped LBAs are a problem, and we have HW
supporting size=capacity.

I would be happy to hear what else you would like to see for this to be
of use to the kernel community.



* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:51                                 ` Luis Chamberlain
@ 2022-03-15 10:45                                   ` Matias Bjørling
  0 siblings, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 10:45 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Javier González, Damien Le Moal, Christoph Hellwig,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

> > > On Mon, Mar 14, 2022 at 02:16:36PM +0000, Matias Bjørling wrote:
> > > > I want to turn the argument around to see it from the kernel
> > > > developer's point of view. They have communicated the PO2
> > > > requirement clearly,
> > >
> > > Such requirement is based on history and effort put in place to
> > > assume a PO2 requirement for zone storage, and clearly it is not.
> > > And clearly even vendors who have embraced PO2 don't know for sure
> > > they'll always be able to stick to PO2...
> >
> > Sure - It'll be naïve to give a carte blanche promise.
> 
> Exactly. So taking a position to not support NPO2 I think seems counter
> productive to the future of ZNS, the question should be, *how* to best do this
> in light of what we need to support / avoid performance regressions / strive
> towards avoiding fragmentation.

Having non-power of two zone sizes is a deviation from the existing devices being used in full production today. That there is a wish to introduce support for such drives is interesting, but consider the background and development of zoned devices: Damien mentioned that SMR HDDs didn't start off with PO2 zone sizes - that is what became the norm due to its overall benefits. I.e., drives with NPO2 zone sizes are the odd ones and, in some views, the ones creating fragmentation.

That there is a wish to revisit that design decision is fair, and it sounds like there is willingness to explore such options. But please be advised that the Linux community has communicated the specific requirement for a long time to avoid this particular issue. Thus, the community has been trying to help the vendors make the appropriate design decisions, such that they could take advantage of the Linux kernel stack from day one.

> > However, you're skipping the next two elements, which state that there
> > are both good precedence working with PO2 zone sizes and that
> > holes/unmapped LBAs can't be avoided.
> 
> I'm not, but I admit that it's a good point of having the possibility of zones being
> taken offline also implicates holes. I also think it was a good excercise to
> discuss and evaluate emulation given I don't think this point you made would
> have been made clear otherwise. This is why I treat ZNS as evolving effort, and
> I can't seriously take any position stating all answers are known.

That's good to hear. I would note that some members in this thread have been doing zoned storage for close to a decade, and have a very thorough understanding of the zoned storage model - so it might be a stretch for them to hear that you consider everything to be up in the air and at an early stage. This stack is already being used by a large percentage of the bits being shipped in the world. Thus, there is an interest in maintaining these things and making sure that things don't regress, and so on.

> 
> > Making an argument for why NPO2
> > zone sizes may not bring what one is looking for. It's a lot of work
> > for little practical change, if any.
> 
> NAND does not incur a PO2 requirement; that should be enough to imply
> that NPO2 zones *can* be expected. If no vendor is willing to take the position that
> they know for a fact they'll never adopt
> NPO2 zones, that should be enough to keep an open mind and consider *how* to
> support them.

As long as it doesn't also imply that support *has* to be added to the kernel, then that's okay.

<snip>
> 
> > If evaluating different approaches, it would be helpful to the
> > reviewers if interfaces and all of its kernel users are converted in a
> > single patchset. This would also help to avoid users getting hit by
> > what is supported, and what isn't supported by a particular device
> > implementation and allow better to review the full set of changes
> > required to add the support.
> 
> Sorry I didn't understand the suggestion here, can you clarify what it is you are
> suggesting?

It would help reviewers if a potential patchset converted all users (e.g., f2fs, btrfs, device mappers, io schedulers, etc.), such that the full effect can be evaluated, with the added benefit that end-users would not have to think about what is and what isn't supported.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-14 19:55                             ` Javier González
@ 2022-03-15 12:32                               ` Matias Bjørling
  2022-03-15 13:05                                 ` Javier González
  0 siblings, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 12:32 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> >Given the above, applications have to be conscious of zones in general and
> work within their boundaries. I don't understand how applications can work
> without having per-zone knowledge. An application would have to know about
> zones and their writeable capacity. To decide where and how data is written,
> an application must manage writing across zones, specific offline zones, and
> (currently) its writeable capacity. I.e., knowledge about zones and holes is
> required for writing to zoned devices and isn't eliminated by removing the PO2
> zone size requirement.
> 
> Supporting offlines zones is optional in the ZNS spec? We are not considering
> supporting this in the host. This will be handled by the device for exactly
> maintaining the SW stack simpler.

It isn't optional. The spec allows any zone to go to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but it must nevertheless be managed by the host software.

Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information to make those decisions is in a zone's attributes, and since an application would pull those anyway, it would also know the writeable capacity of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design.
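As a rough illustration of the "writes must not span zones" point (an
assumed host-side helper, not something from this thread), the clamp is
the same for PO2 and NPO2 zone sizes; only the cost of locating the
boundary differs (mask vs. division):

  #include <stdint.h>

  /* Trim an I/O of nr_sectors starting at 'sector' so that it does not
   * cross the next zone boundary. */
  static uint64_t sectors_to_boundary(uint64_t sector, uint64_t nr_sectors,
                                      uint64_t zone_sectors)
  {
          uint64_t zone_end = (sector / zone_sectors + 1) * zone_sectors;

          if (sector + nr_sectors > zone_end)
                  nr_sectors = zone_end - sector;
          return nr_sectors;
  }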

> >
> >For years, the PO2 requirement has been known in the Linux community and
> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
> PO2 zone sizes, which is a perfectly valid decision. But its implementors
> knowingly did that while knowing that the Linux kernel didn't support it.
> >
> >I want to turn the argument around to see it from the kernel developer's point
> of view. They have communicated the PO2 requirement clearly, there's good
> precedence working with PO2 zone sizes, and at last, holes can't be avoided
> and are part of the overall design of zoned storage devices. So why should the
> kernel developer's take on the long-term maintenance burden of NPO2 zone
> sizes?
> 
> You have a good point, and that is the question we need to help answer.
> As I see it, requirements evolve and the kernel changes with it as long as there
> are active upstream users for it.

True. There are also active users of custom SSDs (e.g., requiring writes larger than 4KiB) - but those aren't supported by the Linux kernel and aren't actively being worked on to my knowledge. Which is fine, as those customers use them in their own way anyway, and don't need the Linux kernel support.
 
> 
> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts
> stating that unmapped LBAs are a problem, and we have (3) HW supporting
> size=capacity.
> 
> I would be happy to hear what else you would like to see for this to be of use to
> the kernel community.

(Added numbers to your paragraph above)

1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work.
2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes, and fixing this would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.
3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation. 

All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users.




* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 12:32                               ` Matias Bjørling
@ 2022-03-15 13:05                                 ` Javier González
  2022-03-15 13:14                                   ` Matias Bjørling
  2022-03-16  0:00                                   ` Damien Le Moal
  0 siblings, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:05 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 15.03.2022 12:32, Matias Bjørling wrote:
>> >Given the above, applications have to be conscious of zones in general and
>> work within their boundaries. I don't understand how applications can work
>> without having per-zone knowledge. An application would have to know about
>> zones and their writeable capacity. To decide where and how data is written,
>> an application must manage writing across zones, specific offline zones, and
>> (currently) its writeable capacity. I.e., knowledge about zones and holes is
>> required for writing to zoned devices and isn't eliminated by removing the PO2
>> zone size requirement.
>>
>> Supporting offlines zones is optional in the ZNS spec? We are not considering
>> supporting this in the host. This will be handled by the device for exactly
>> maintaining the SW stack simpler.
>
>It isn't optional. The spec allows any zones to go to Read Only or Offline state at any point in time. A specific implementation might give some guarantees to when such transitions happens, but it must nevertheless must be managed by the host software.
>
>Given that, and the need to not issue writes that spans zones, an application would have to aware of such behaviors. The information to make those decisions are in a zone's attributes, and thus applications would pull those, it would also know the writeable capability of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design.

Thanks for the clarification. I can attest that we are giving that
guarantee in order to simplify the host stack. I believe we are making many
assumptions in Linux too to simplify ZNS support.

This said, I understand your point. I am not developing application
support. I will refer again to Bo's response on the use case where
holes are problematic.


>
>> >
>> >For years, the PO2 requirement has been known in the Linux community and
>> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
>> PO2 zone sizes, which is a perfectly valid decision. But its implementors
>> knowingly did that while knowing that the Linux kernel didn't support it.
>> >
>> >I want to turn the argument around to see it from the kernel developer's point
>> of view. They have communicated the PO2 requirement clearly, there's good
>> precedence working with PO2 zone sizes, and at last, holes can't be avoided
>> and are part of the overall design of zoned storage devices. So why should the
>> kernel developer's take on the long-term maintenance burden of NPO2 zone
>> sizes?
>>
>> You have a good point, and that is the question we need to help answer.
>> As I see it, requirements evolve and the kernel changes with it as long as there
>> are active upstream users for it.
>
>True. There's also active users for SSDs which are custom (e.g., larger than 4KiB writes required) - but they aren't supported by the Linux kernel and isn't actively being worked on to my knowledge. Which is fine, as the customers anyway uses this in their own way, and don't need the Linux kernel support.

As things become stable some might choose to push support for certain
features into the kernel. In this case, the changes are not big in the
block layer. I believe it is a process and the features should be chosen
to maximize benefit and minimize maintenance cost.

>
>>
>> The main constraint for (1) PO2 is removed in the block layer, we have (2) Linux hosts
>> stating that unmapped LBAs are a problem, and we have (3) HW supporting
>> size=capacity.
>>
>> I would be happy to hear what else you would like to see for this to be of use to
>> the kernel community.
>
>(Added numbers to your paragraph above)
>
>1. The sysfs chunksize attribute was "misused" to also represent zone size. What has changed is that RAID controllers now can use a NPO2 chunk size. This wasn't meant to naturally extend to zones, which as shown in the current posted patchset, is a lot more work.

True. But this was the main constraint for PO2.

>2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes. Thus, fixing this, would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.

I will let Bo respond to this himself.

>3. I'm happy to hear that. However, I'll like to reiterate the point that the PO2 requirement have been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors to not support the Linux kernel given its current implementation.

Zoned devices have been supported for years with SMR, and this is a strong
argument. However, ZNS is still very new and customers have several
requirements. I do not believe that an HDD stack should have such an
impact on NVMe.

Also, we will see new interfaces adding support for zoned devices in the
future.

We should think about the future and not the past.


>
>All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc. then there isn't anything saying support can't be added - but it does seem like it’s a lot of work, for little overall benefits to applications and the host users.

Exactly.

Patches in the block layer are trivial. This is running in production
loads without issues. I have tried to highlight the benefits in previous
replies and I believe you understand them.

Support for ZoneFS seems easy too. We have an early POC for btrfs and it
seems it can be done. We sign up for these 2.

As for F2FS and dm-zoned, I do not think these are targets at the
moment. If this is the path we follow, these will bail out at mkfs time.

If we can agree on the above, I believe we can start with the code that
enables the existing customers and build support for btrfs and ZoneFS
in the next few months.

What do you think?



* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:05                                 ` Javier González
@ 2022-03-15 13:14                                   ` Matias Bjørling
  2022-03-15 13:26                                     ` Javier González
  2022-03-16  0:00                                   ` Damien Le Moal
  1 sibling, 1 reply; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 13:14 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> >
> >All that said - if there are people willing to do the work and it doesn't have a
> negative impact on performance, code quality, maintenance complexity, etc.
> then there isn't anything saying support can't be added - but it does seem like
> it’s a lot of work, for little overall benefits to applications and the host users.
> 
> Exactly.
> 
> Patches in the block layer are trivial. This is running in production loads without
> issues. I have tried to highlight the benefits in previous benefits and I believe
> you understand them.
> 
> Support for ZoneFS seems easy too. We have an early POC for btrfs and it
> seems it can be done. We sign up for these 2.
> 
> As for F2FS and dm-zoned, I do not think these are targets at the moment. If
> this is the path we follow, these will bail out at mkfs time.
> 
> If we can agree on the above, I believe we can start with the code that enables
> the existing customers and build support for butrfs and ZoneFS in the next few
> months.
> 
> What do you think?

I would suggest doing it in a single shot, i.e., a single patchset which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference between PO2/NPO2 zones and it'll help reduce the burden of long-term maintenance.


* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:14                                   ` Matias Bjørling
@ 2022-03-15 13:26                                     ` Javier González
  2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:39                                       ` Matias Bjørling
  0 siblings, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:26 UTC (permalink / raw)
  To: Matias Bjørling
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 15.03.2022 13:14, Matias Bjørling wrote:
>> >
>> >All that said - if there are people willing to do the work and it doesn't have a
>> negative impact on performance, code quality, maintenance complexity, etc.
>> then there isn't anything saying support can't be added - but it does seem like
>> it’s a lot of work, for little overall benefits to applications and the host users.
>>
>> Exactly.
>>
>> Patches in the block layer are trivial. This is running in production loads without
>> issues. I have tried to highlight the benefits in previous benefits and I believe
>> you understand them.
>>
>> Support for ZoneFS seems easy too. We have an early POC for btrfs and it
>> seems it can be done. We sign up for these 2.
>>
>> As for F2FS and dm-zoned, I do not think these are targets at the moment. If
>> this is the path we follow, these will bail out at mkfs time.
>>
>> If we can agree on the above, I believe we can start with the code that enables
>> the existing customers and build support for butrfs and ZoneFS in the next few
>> months.
>>
>> What do you think?
>
>I would suggest to do it in a single shot, i.e., a single patchset, which enables all the internal users in the kernel (including f2fs and others). That way end-users do not have to worry about the difference of PO2/NPO2 zones and it'll help reduce the burden on long-term maintenance.

Thanks for the suggestion Matias. Happy to see that you are open to
supporting this. I understand why a patch series fixing everything is attractive,
but we do not see a use case for ZNS in F2FS, as it is a mobile
file-system. As other interfaces arrive, this work will become natural.

ZoneFS and btrfs are good targets for ZNS and these we can do. I would
still do the work in phases to make sure we have enough early feedback
from the community.

Since this thread has been very active, I will wait some time for
Christoph and others to catch up before we start sending code.



* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:26                                     ` Javier González
@ 2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:52                                         ` Javier González
  2022-03-15 17:00                                         ` Luis Chamberlain
  2022-03-15 13:39                                       ` Matias Bjørling
  1 sibling, 2 replies; 83+ messages in thread
From: Christoph Hellwig @ 2022-03-15 13:30 UTC (permalink / raw)
  To: Javier González
  Cc: Matias Bjørling, Damien Le Moal, Christoph Hellwig,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> but we do not see a usage for ZNS in F2FS, as it is a mobile
> file-system. As other interfaces arrive, this work will become natural.
>
> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> still do the work in phases to make sure we have enough early feedback
> from the community.
>
> Since this thread has been very active, I will wait some time for
> Christoph and others to catch up before we start sending code.

Can someone summarize where we stand?  Between the lack of quoting
from hell and overly long lines from corporate mail clients I've
mostly stopped reading this thread because it takes too much effort
to actually extract the information.


* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:26                                     ` Javier González
  2022-03-15 13:30                                       ` Christoph Hellwig
@ 2022-03-15 13:39                                       ` Matias Bjørling
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 13:39 UTC (permalink / raw)
  To: Javier González
  Cc: Damien Le Moal, Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

> -----Original Message-----
> From: Javier González <javier@javigon.com>
> Sent: Tuesday, 15 March 2022 14.26
> To: Matias Bjørling <Matias.Bjorling@wdc.com>
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>; Christoph
> Hellwig <hch@lst.de>; Luis Chamberlain <mcgrof@kernel.org>; Keith Busch
> <kbusch@kernel.org>; Pankaj Raghav <p.raghav@samsung.com>; Adam
> Manzanares <a.manzanares@samsung.com>; jiangbo.365@bytedance.com;
> kanchan Joshi <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi
> Grimberg <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>;
> Kanchan Joshi <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux-
> nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On 15.03.2022 13:14, Matias Bjørling wrote:
> >> >
> >> >All that said - if there are people willing to do the work and it
> >> >doesn't have a
> >> negative impact on performance, code quality, maintenance complexity,
> etc.
> >> then there isn't anything saying support can't be added - but it does
> >> seem like it’s a lot of work, for little overall benefits to applications and the
> host users.
> >>
> >> Exactly.
> >>
> >> Patches in the block layer are trivial. This is running in production
> >> loads without issues. I have tried to highlight the benefits in
> >> previous benefits and I believe you understand them.
> >>
> >> Support for ZoneFS seems easy too. We have an early POC for btrfs and
> >> it seems it can be done. We sign up for these 2.
> >>
> >> As for F2FS and dm-zoned, I do not think these are targets at the
> >> moment. If this is the path we follow, these will bail out at mkfs time.
> >>
> >> If we can agree on the above, I believe we can start with the code
> >> that enables the existing customers and build support for btrfs and
> >> ZoneFS in the next few months.
> >>
> >> What do you think?
> >
> >I would suggest to do it in a single shot, i.e., a single patchset, which enables
> all the internal users in the kernel (including f2fs and others). That way end-
> users do not have to worry about the difference of PO2/NPO2 zones and it'll
> help reduce the burden on long-term maintenance.
> 
> Thanks for the suggestion Matias. Happy to see that you are open to support
> this. I understand why a patchseries fixing all is attractive, but we do not see a
> usage for ZNS in F2FS, as it is a mobile file-system. As other interfaces arrive,
> this work will become natural.

We've seen uptake of ZNS with f2fs, so I would argue that it's important to have support there as well.

> 
> ZoneFS and btrfs are good targets for ZNS and these we can do. I would still do
> the work in phases to make sure we have enough early feedback from the
> community.

Sure, continuous review is good. But not having support for all the kernel users creates fragmentation. Doing a full switch is greatly preferred, as it avoids this fragmentation and will also lower the overall maintenance burden, which was also raised as a concern.





* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:30                                       ` Christoph Hellwig
@ 2022-03-15 13:52                                         ` Javier González
  2022-03-15 14:03                                           ` Matias Bjørling
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 17:00                                         ` Luis Chamberlain
  1 sibling, 2 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 13:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 15.03.2022 14:30, Christoph Hellwig wrote:
>On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>> file-system. As other interfaces arrive, this work will become natural.
>>
>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>> still do the work in phases to make sure we have enough early feedback
>> from the community.
>>
>> Since this thread has been very active, I will wait some time for
>> Christoph and others to catch up before we start sending code.
>
>Can someone summarize where we stand?  Between the lack of quoting
>from hell and overly long lines from corporate mail clients I've
>mostly stopped reading this thread because it takes too much effort
>actually extract the information.

Let me give it a try:

  - PO2 emulation in NVMe is a no-go. Drop this.

  - The arguments against supporting NPO2 zone sizes are:
      - It makes ZNS depart from the SMR assumption of PO2 zone sizes.
        This can create confusion for users of both SMR and ZNS

      - Existing applications assume PO2 zone sizes, and probably do
        optimizations for these. These applications, if they want to use
        ZNS, will have to change their calculations

      - There is a fear of performance regressions.

      - It adds more work for you and other maintainers

  - The arguments in favour of supporting NPO2 zone sizes are:
      - Unmapped LBAs create holes that applications need to deal with.
        This affects mapping and performance due to splits. Bo explained
        this in a thread from Bytedance's perspective.  I explained in an
        answer to Matias how we are not letting zones transition to
        offline in order to simplify the host stack. Not sure if this is
        something we want to bring to NVMe.

      - As ZNS adds more features and other protocols add support for
        zoned devices, we will have more use-cases for the zoned block
        device. We will have to deal with this fragmentation at some
        point.

      - This is used in production workloads on Linux hosts. I would
        advocate for this not being off-tree, as it will be a headache
        for all in the future.

  - If you agree that removing PO2 is an option, we can do the following:
      - Remove the constraint in the block layer and add ZoneFS support
        in a first patch.

      - Add btrfs support in a later patch

      - Make changes to tools once merged

Hope I have collected all points of view in such a short format.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* RE: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:52                                         ` Javier González
@ 2022-03-15 14:03                                           ` Matias Bjørling
  2022-03-15 14:14                                           ` Johannes Thumshirn
  1 sibling, 0 replies; 83+ messages in thread
From: Matias Bjørling @ 2022-03-15 14:03 UTC (permalink / raw)
  To: Javier González, Christoph Hellwig
  Cc: Damien Le Moal, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

> -----Original Message-----
> From: Javier González <javier@javigon.com>
> Sent: Tuesday, 15 March 2022 14.53
> To: Christoph Hellwig <hch@lst.de>
> Cc: Matias Bjørling <Matias.Bjorling@wdc.com>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Luis Chamberlain
> <mcgrof@kernel.org>; Keith Busch <kbusch@kernel.org>; Pankaj Raghav
> <p.raghav@samsung.com>; Adam Manzanares
> <a.manzanares@samsung.com>; jiangbo.365@bytedance.com; kanchan Joshi
> <joshi.k@samsung.com>; Jens Axboe <axboe@kernel.dk>; Sagi Grimberg
> <sagi@grimberg.me>; Pankaj Raghav <pankydev8@gmail.com>; Kanchan Joshi
> <joshiiitr@gmail.com>; linux-block@vger.kernel.org; linux-
> nvme@lists.infradead.org
> Subject: Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
> 
> On 15.03.2022 14:30, Christoph Hellwig wrote:
> >On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >> file-system. As other interfaces arrive, this work will become natural.
> >>
> >> ZoneFS and butrfs are good targets for ZNS and these we can do. I
> >> would still do the work in phases to make sure we have enough early
> >> feedback from the community.
> >>
> >> Since this thread has been very active, I will wait some time for
> >> Christoph and others to catch up before we start sending code.
> >
> >Can someone summarize where we stand?  Between the lack of quoting from
> >hell and overly long lines from corporate mail clients I've mostly
> >stopped reading this thread because it takes too much effort actually
> >extract the information.
> 
> Let me give it a try:
> 
>   - PO2 emulation in NVMe is a no-go. Drop this.
> 
>   - The arguments against supporting PO2 are:
>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>         can create confusion for users of both SMR and ZNS
> 
>       - Existing applications assume PO2 zone sizes, and probably do
>         optimizations for these. These applications, if wanting to use
>         ZNS will have to change the calculations
> 
>       - There is a fear for performance regressions.
> 
>       - It adds more work to you and other maintainers
> 
>   - The arguments in favour of PO2 are:
>       - Unmapped LBAs create holes that applications need to deal with.
>         This affects mapping and performance due to splits. Bo explained
>         this in a thread from Bytedance's perspective.  I explained in an
>         answer to Matias how we are not letting zones transition to
>         offline in order to simplify the host stack. Not sure if this is
>         something we want to bring to NVMe.
> 
>       - As ZNS adds more features and other protocols add support for
>         zoned devices we will have more use-cases for the zoned block
>         device. We will have to deal with these fragmentation at some
>         point.
> 
>       - This is used in production workloads in Linux hosts. I would
>         advocate for this not being off-tree as it will be a headache for
>         all in the future.
> 
>   - If you agree that removing PO2 is an option, we can do the following:
>       - Remove the constraint in the block layer and add ZoneFS support
>         in a first patch.
> 
>       - Add btrfs support in a later patch
> 
>       - Make changes to tools once merged
> 
> Hope I have collected all points of view in such a short format.

+ Suggestion to enable all users in the kernel in order to limit fragmentation and maintainer burden.
+ Possibly not a big issue, as users have already added the necessary support and must already manage offline zones and avoid writing across zones.
+ Re: Bo's email, it sounds like this only affects a single vendor, which knowingly made the decision to use NPO2 zone sizes. From Bo: "(What we discussed here has a precondition that is, we cannot determine if the SSD provider could change the FW to make it PO2 or not)").

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:52                                         ` Javier González
  2022-03-15 14:03                                           ` Matias Bjørling
@ 2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
                                                               ` (2 more replies)
  1 sibling, 3 replies; 83+ messages in thread
From: Johannes Thumshirn @ 2022-03-15 14:14 UTC (permalink / raw)
  To: Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

On 15/03/2022 14:52, Javier González wrote:
> On 15.03.2022 14:30, Christoph Hellwig wrote:
>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>> file-system. As other interfaces arrive, this work will become natural.
>>>
>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>> still do the work in phases to make sure we have enough early feedback
>>> from the community.
>>>
>>> Since this thread has been very active, I will wait some time for
>>> Christoph and others to catch up before we start sending code.
>>
>> Can someone summarize where we stand?  Between the lack of quoting
>>from hell and overly long lines from corporate mail clients I've
>> mostly stopped reading this thread because it takes too much effort
>> actually extract the information.
> 
> Let me give it a try:
> 
>   - PO2 emulation in NVMe is a no-go. Drop this.
> 
>   - The arguments against supporting PO2 are:
>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>         can create confusion for users of both SMR and ZNS
> 
>       - Existing applications assume PO2 zone sizes, and probably do
>         optimizations for these. These applications, if wanting to use
>         ZNS will have to change the calculations
> 
>       - There is a fear for performance regressions.
> 
>       - It adds more work to you and other maintainers
> 
>   - The arguments in favour of PO2 are:
>       - Unmapped LBAs create holes that applications need to deal with.
>         This affects mapping and performance due to splits. Bo explained
>         this in a thread from Bytedance's perspective.  I explained in an
>         answer to Matias how we are not letting zones transition to
>         offline in order to simplify the host stack. Not sure if this is
>         something we want to bring to NVMe.
> 
>       - As ZNS adds more features and other protocols add support for
>         zoned devices we will have more use-cases for the zoned block
>         device. We will have to deal with these fragmentation at some
>         point.
> 
>       - This is used in production workloads in Linux hosts. I would
>         advocate for this not being off-tree as it will be a headache for
>         all in the future.
> 
>   - If you agree that removing PO2 is an option, we can do the following:
>       - Remove the constraint in the block layer and add ZoneFS support
>         in a first patch.
> 
>       - Add btrfs support in a later patch

(+ linux-btrfs )

Please also make sure to support btrfs and not only throw some patches
over the fence. Zoned device support in btrfs is complex enough and has
quite some special casing vs regular btrfs, which we're working on getting
rid of. So having non-power-of-2 zone sizes would also mean having NPO2
block-groups (and thus block-groups not aligned to the stripe size).

Just thinking of this and knowing I need to support it gives me a 
headache.

Also please consult the rest of the btrfs developers for thoughts on this.
After all btrfs has full zoned support (including ZNS, not saying it's 
perfect) and is also the default FS for at least two Linux distributions.

Thanks a lot,
	Johannes

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
@ 2022-03-15 14:27                                             ` David Sterba
  2022-03-15 19:56                                               ` Pankaj Raghav
  2022-03-15 15:11                                             ` Javier González
  2022-03-15 18:51                                             ` Pankaj Raghav
  2 siblings, 1 reply; 83+ messages in thread
From: David Sterba @ 2022-03-15 14:27 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Javier González, Christoph Hellwig, Matias Bjørling,
	Damien Le Moal, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme, linux-btrfs @ vger . kernel . org

On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote:
> On 15/03/2022 14:52, Javier González wrote:
> > On 15.03.2022 14:30, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand?  Between the lack of quoting
> >> from hell and overly long lines from corporate mail clients I've
> >> mostly stopped reading this thread because it takes too much effort
> >> actually extract the information.
> > 
> > Let me give it a try:
> > 
> >   - PO2 emulation in NVMe is a no-go. Drop this.
> > 
> >   - The arguments against supporting PO2 are:
> >       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
> >         can create confusion for users of both SMR and ZNS
> > 
> >       - Existing applications assume PO2 zone sizes, and probably do
> >         optimizations for these. These applications, if wanting to use
> >         ZNS will have to change the calculations
> > 
> >       - There is a fear for performance regressions.
> > 
> >       - It adds more work to you and other maintainers
> > 
> >   - The arguments in favour of PO2 are:
> >       - Unmapped LBAs create holes that applications need to deal with.
> >         This affects mapping and performance due to splits. Bo explained
> >         this in a thread from Bytedance's perspective.  I explained in an
> >         answer to Matias how we are not letting zones transition to
> >         offline in order to simplify the host stack. Not sure if this is
> >         something we want to bring to NVMe.
> > 
> >       - As ZNS adds more features and other protocols add support for
> >         zoned devices we will have more use-cases for the zoned block
> >         device. We will have to deal with these fragmentation at some
> >         point.
> > 
> >       - This is used in production workloads in Linux hosts. I would
> >         advocate for this not being off-tree as it will be a headache for
> >         all in the future.
> > 
> >   - If you agree that removing PO2 is an option, we can do the following:
> >       - Remove the constraint in the block layer and add ZoneFS support
> >         in a first patch.
> > 
> >       - Add btrfs support in a later patch
> 
> (+ linux-btrfs )
> 
> Please also make sure to support btrfs and not only throw some patches 
> over the fence. Zoned device support in btrfs is complex enough and has 
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
> 
> Just thinking of this and knowing I need to support it gives me a 
> headache.

PO2 is really easy to work with, and I guess allocation on the physical
device could also benefit from that; I'm still puzzled why NPO2 is
even proposed.

We can possibly hide the calculations behind some API, so I hope in the
end it should be bearable. The size of block groups is flexible; we only
want some reasonable alignment.

> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's 
> perfect) and is also the default FS for at least two Linux distributions.

I haven't read the whole thread yet, but my impression is that some
hardware is deliberately breaking existing assumptions about zoned devices
and in turn breaking btrfs support. I hope I'm wrong on that, or at least
that it's possible to work around it.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
@ 2022-03-15 15:11                                             ` Javier González
  2022-03-15 18:51                                             ` Pankaj Raghav
  2 siblings, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-15 15:11 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Matias Bjørling, Damien Le Moal,
	Luis Chamberlain, Keith Busch, Pankaj Raghav, Adam Manzanares,
	jiangbo.365, kanchan Joshi, Jens Axboe, Sagi Grimberg,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

On 15.03.2022 14:14, Johannes Thumshirn wrote:
>On 15/03/2022 14:52, Javier González wrote:
>> On 15.03.2022 14:30, Christoph Hellwig wrote:
>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>>> file-system. As other interfaces arrive, this work will become natural.
>>>>
>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>>> still do the work in phases to make sure we have enough early feedback
>>>> from the community.
>>>>
>>>> Since this thread has been very active, I will wait some time for
>>>> Christoph and others to catch up before we start sending code.
>>>
>>> Can someone summarize where we stand?  Between the lack of quoting
>>>from hell and overly long lines from corporate mail clients I've
>>> mostly stopped reading this thread because it takes too much effort
>>> actually extract the information.
>>
>> Let me give it a try:
>>
>>   - PO2 emulation in NVMe is a no-go. Drop this.
>>
>>   - The arguments against supporting PO2 are:
>>       - It makes ZNS depart from a SMR assumption of PO2 zone sizes. This
>>         can create confusion for users of both SMR and ZNS
>>
>>       - Existing applications assume PO2 zone sizes, and probably do
>>         optimizations for these. These applications, if wanting to use
>>         ZNS will have to change the calculations
>>
>>       - There is a fear for performance regressions.
>>
>>       - It adds more work to you and other maintainers
>>
>>   - The arguments in favour of PO2 are:
>>       - Unmapped LBAs create holes that applications need to deal with.
>>         This affects mapping and performance due to splits. Bo explained
>>         this in a thread from Bytedance's perspective.  I explained in an
>>         answer to Matias how we are not letting zones transition to
>>         offline in order to simplify the host stack. Not sure if this is
>>         something we want to bring to NVMe.
>>
>>       - As ZNS adds more features and other protocols add support for
>>         zoned devices we will have more use-cases for the zoned block
>>         device. We will have to deal with these fragmentation at some
>>         point.
>>
>>       - This is used in production workloads in Linux hosts. I would
>>         advocate for this not being off-tree as it will be a headache for
>>         all in the future.
>>
>>   - If you agree that removing PO2 is an option, we can do the following:
>>       - Remove the constraint in the block layer and add ZoneFS support
>>         in a first patch.
>>
>>       - Add btrfs support in a later patch
>
>(+ linux-btrfs )
>
>Please also make sure to support btrfs and not only throw some patches
>over the fence. Zoned device support in btrfs is complex enough and has
>quite some special casing vs regular btrfs, which we're working on getting
>rid of. So having non-power-of-2 zone size, would also mean having NPO2
>block-groups (and thus block-groups not aligned to the stripe size).

Thanks for mentioning this, Johannes. If we say we will work with you on
supporting btrfs properly, we will.

I believe you have already seen a couple of patches fixing things for
zoned support in btrfs in the last few weeks.

>
>Just thinking of this and knowing I need to support it gives me a
>headache.

I hope we can help you with that. btrfs has no native alignment to PO2,
so I am confident we can find a good solution.

>
>Also please consult the rest of the btrfs developers for thoughts on this.
>After all btrfs has full zoned support (including ZNS, not saying it's
>perfect) and is also the default FS for at least two Linux distributions.

Of course. We will work with you and the other btrfs developers. Luis is
helping to make sure that we have good tests for linux-next. This is in
part how we found the problems with Append, which should be fixed
now.

>
>Thanks a lot,
>	Johannes


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:30                                       ` Christoph Hellwig
  2022-03-15 13:52                                         ` Javier González
@ 2022-03-15 17:00                                         ` Luis Chamberlain
  2022-03-16  0:07                                           ` Damien Le Moal
  1 sibling, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-15 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Matias Bjørling, Damien Le Moal,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> > but we do not see a usage for ZNS in F2FS, as it is a mobile
> > file-system. As other interfaces arrive, this work will become natural.
> >
> > ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> > still do the work in phases to make sure we have enough early feedback
> > from the community.
> >
> > Since this thread has been very active, I will wait some time for
> > Christoph and others to catch up before we start sending code.
> 
> Can someone summarize where we stand?

RFCs should be posted to help review and evaluate direct NPO2 support
(not emulation), given that we have no vendor willing to take a position
that NPO2 will *never* be supported on ZNS, and it's not clear yet how
many vendors other than Samsung actually require NPO2 support. The other
reason is that existing NPO2 customers currently bake hacks into Linux to
support NPO2, so fragmentation already exists. To help address this, it's
best to evaluate what the world of NPO2 support would look like, put in
the effort to do that work, and review it.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:14                                           ` Johannes Thumshirn
  2022-03-15 14:27                                             ` David Sterba
  2022-03-15 15:11                                             ` Javier González
@ 2022-03-15 18:51                                             ` Pankaj Raghav
  2022-03-16  8:37                                               ` Johannes Thumshirn
  2 siblings, 1 reply; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-15 18:51 UTC (permalink / raw)
  To: Johannes Thumshirn, Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, linux-btrfs @ vger . kernel . org

Hi Johannes,

On 2022-03-15 15:14, Johannes Thumshirn wrote:
> Please also make sure to support btrfs and not only throw some patches 
> over the fence. Zoned device support in btrfs is complex enough and has 
> quite some special casing vs regular btrfs, which we're working on getting
> rid of. So having non-power-of-2 zone size, would also mean having NPO2

I already made a simple btrfs NPO2 POC, and it mostly involved changing
the PO2 calculations to generic ones. I understand that changing the
calculations from log & shifts to divisions will incur some performance
penalty, but I think we can wrap them in helpers to minimize that impact.
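
For illustration only, here is a minimal userspace sketch of what such
wrappers could look like (the struct and helper names are hypothetical,
not an existing btrfs or block layer API): PO2 zone sizes keep using
shifts and masks, while NPO2 zone sizes fall back to division.

  /* zone_calc.c - hypothetical helpers hiding PO2 vs NPO2 arithmetic */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct zone_geom {
          uint64_t zone_sectors;  /* zone size in 512B sectors */
          unsigned int shift;     /* log2(zone_sectors), valid only if po2 */
          bool po2;
  };

  static struct zone_geom zone_geom_init(uint64_t zone_sectors)
  {
          struct zone_geom g = { .zone_sectors = zone_sectors };

          g.po2 = (zone_sectors & (zone_sectors - 1)) == 0;
          for (uint64_t s = zone_sectors; g.po2 && s > 1; s >>= 1)
                  g.shift++;
          return g;
  }

  /* index of the zone containing a given sector */
  static uint64_t zone_number(const struct zone_geom *g, uint64_t sector)
  {
          return g->po2 ? sector >> g->shift : sector / g->zone_sectors;
  }

  /* offset of a sector from the start of its zone */
  static uint64_t zone_offset(const struct zone_geom *g, uint64_t sector)
  {
          return g->po2 ? sector & (g->zone_sectors - 1)
                        : sector % g->zone_sectors;
  }

  int main(void)
  {
          /* 96 MiB zone = 196608 sectors of 512B, not a power of 2 */
          struct zone_geom g = zone_geom_init(196608);

          printf("zone %llu, offset %llu\n",
                 (unsigned long long)zone_number(&g, 400000),
                 (unsigned long long)zone_offset(&g, 400000));
          return 0;
  }

Callers never see which variant is taken, so the shift-based fast path for
existing PO2 devices is preserved while NPO2 devices pay for the division.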

> So having non-power-of-2 zone size, would also mean having NPO2
> block-groups (and thus block-groups not aligned to the stripe size).
>

I agree with your point that we risk not aligning to the stripe size when
we move to NPO2 zone sizes; I believe the minimum stripe size is 64K
(please correct me if I am wrong). As David Sterba mentioned in his email,
we could agree on some reasonable alignment, which I believe would be the
minimum stripe size of 64K, to avoid adding complexity to the existing
btrfs zoned support. And it is a much milder constraint that most devices
can naturally adhere to, compared to the PO2 zone size requirement.

> Just thinking of this and knowing I need to support it gives me a 
> headache.
> 
This is definitely not some one-off patch that we want to get upstream
and then disappear. As Javier already pointed out, we would be more than
happy to help you out here.
> Also please consult the rest of the btrfs developers for thoughts on this.
> After all btrfs has full zoned support (including ZNS, not saying it's 
> perfect) and is also the default FS for at least two Linux distributions.
> 
> Thanks a lot,
> 	Johannes

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 14:27                                             ` David Sterba
@ 2022-03-15 19:56                                               ` Pankaj Raghav
  0 siblings, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-15 19:56 UTC (permalink / raw)
  To: dsterba, Johannes Thumshirn, Javier González,
	Christoph Hellwig, Matias Bjørling, Damien Le Moal,
	Luis Chamberlain, Keith Busch, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme,
	linux-btrfs @ vger . kernel . org

Hi David,

On 2022-03-15 15:27, David Sterba wrote:
> 
> PO2 is really easy to work with and I guess allocation on the physical
> device could also benefit from that, I'm still puzzled why the NPO2 is
> even proposed.
> 
Quick recap:
Hardware NAND cannot naturally align to PO2 zone sizes, which led to
having both a zone capacity and a zone size, where the zone capacity is
the actual storage available in a zone. The main proposal is to remove
the PO2 constraint to get rid of these LBA holes (generally speaking).
That is why this whole effort was started.
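
To make the hole concrete with a small worked example (the exact numbers
are only an illustration; any zone capacity smaller than the zone size
behaves the same way):

  zone size     = 128 MiB (power of 2)
  zone capacity =  96 MiB
  hole per zone = 128 MiB - 96 MiB = 32 MiB, i.e. 25% of the LBA space

Zone n starts at offset n * 128 MiB, but only the first 96 MiB of it are
writable; the remaining 32 MiB are unmapped LBAs that every application
has to skip and that I/Os must not straddle. With a 96 MiB zone size
(NPO2, size == capacity) the hole disappears entirely.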

> We can possibly hide the calculations behind some API so I hope in the
> end it should be bearable. The size of block groups is flexible we only
> want some reasonable alignment.
>
I agree. I already replied to Johannes on what it might look like.
To reiterate, the reasonable alignment I had in mind while doing a POC
for btrfs with NPO2 zone sizes is the minimum stripe size required by
btrfs (64K), to reduce the impact of this change on the existing zoned
support in btrfs.

> I haven't read the whole thread yet, my impression is that some hardware
> is deliberately breaking existing assumptions about zoned devices and in
> turn breaking btrfs support. I hope I'm wrong on that or at least that
> it's possible to work around it.
Based on the POC we did internally, it is definitely possible to support
it in btrfs, and making this change will not break the existing btrfs
support for zoned devices. A naive approach will have some performance
impact, as we would be changing the PO2 calculations from log & shifts to
divisions and multiplications. I definitely think we can optimize it to
minimize the impact on existing deployments.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 13:05                                 ` Javier González
  2022-03-15 13:14                                   ` Matias Bjørling
@ 2022-03-16  0:00                                   ` Damien Le Moal
  2022-03-16  8:57                                     ` Javier González
  2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 2 replies; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:00 UTC (permalink / raw)
  To: Javier González, Matias Bjørling
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On 3/15/22 22:05, Javier González wrote:
>>> The main constraint for (1) PO2 is removed in the block layer, we
>>> have (2) Linux hosts stating that unmapped LBAs are a problem,
>>> and we have (3) HW supporting size=capacity.
>>> 
>>> I would be happy to hear what else you would like to see for this
>>> to be of use to the kernel community.
>> 
>> (Added numbers to your paragraph above)
>> 
>> 1. The sysfs chunksize attribute was "misused" to also represent
>> zone size. What has changed is that RAID controllers now can use a
>> NPO2 chunk size. This wasn't meant to naturally extend to zones,
>> which as shown in the current posted patchset, is a lot more work.
> 
> True. But this was the main constraint for PO2.

And as I said, users asked for it.

>> 2. Bo mentioned that the software already manages holes. It took a
>> bit of time to get right, but now it works. Thus, the software in
>> question is already capable of working with holes. Thus, fixing
>> this, would present itself as a minor optimization overall. I'm not
>> convinced the work to do this in the kernel is proportional to the
>> change it'll make to the applications.
> 
> I will let Bo response himself to this.
> 
>> 3. I'm happy to hear that. However, I'll like to reiterate the
>> point that the PO2 requirement have been known for years. That
>> there's a drive doing NPO2 zones is great, but a decision was made
>> by the SSD implementors to not support the Linux kernel given its
>> current implementation.
> 
> Zone devices has been supported for years in SMR, and I this is a
> strong argument. However, ZNS is still very new and customers have
> several requirements. I do not believe that a HDD stack should have
> such an impact in NVMe.
> 
> Also, we will see new interfaces adding support for zoned devices in
> the future.
> 
> We should think about the future and not the past.

Backward compatibility ? We must not break userspace...

>> 
>> All that said - if there are people willing to do the work and it
>> doesn't have a negative impact on performance, code quality,
>> maintenance complexity, etc. then there isn't anything saying
>> support can't be added - but it does seem like it’s a lot of work,
>> for little overall benefits to applications and the host users.
> 
> Exactly.
> 
> Patches in the block layer are trivial. This is running in
> production loads without issues. I have tried to highlight the
> benefits in previous benefits and I believe you understand them.

The block layer is not the issue here. We all understand that one is easy.

> Support for ZoneFS seems easy too. We have an early POC for btrfs and
> it seems it can be done. We sign up for these 2.

zonefs can trivially support non-power-of-2 zone sizes, but as zonefs
creates a discrete view of the device capacity with its one-file-per-zone
interface, an application's accesses to a zone are forcibly limited to
that zone, as they should be. With zonefs, PO2 and NPO2 devices will
show the *same* interface to the application. Non-power-of-2 zone sizes
then have absolutely no benefit at all.
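
To illustrate the point, here is a minimal sketch of a zonefs user (the
mount point and zone index are made up for the example, and error
handling is trimmed): the application appends to the per-zone file and
never computes a zone start LBA itself, so the device's zone size, PO2
or not, never enters its arithmetic.

  /* append 4 KiB to the first sequential zone exposed by zonefs */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
          const char *zfile = "/mnt/zonefs/seq/0"; /* one file per zone */
          struct stat st;
          void *buf;
          int fd;

          fd = open(zfile, O_WRONLY | O_DIRECT); /* seq files want direct I/O */
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          fstat(fd, &st);        /* st_size tracks the zone write pointer */

          posix_memalign(&buf, 4096, 4096);
          memset(buf, 0, 4096);

          /* sequential zone files are append-only: write at the current size */
          if (pwrite(fd, buf, 4096, st.st_size) < 0)
                  perror("pwrite");

          free(buf);
          close(fd);
          return 0;
  }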

> As for F2FS and dm-zoned, I do not think these are targets at the 
> moment. If this is the path we follow, these will bail out at mkfs
> time.

And what makes you think that this is acceptable ? What guarantees do
you have that this will not be a problem for users out there ?



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 17:00                                         ` Luis Chamberlain
@ 2022-03-16  0:07                                           ` Damien Le Moal
  2022-03-16  0:23                                             ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:07 UTC (permalink / raw)
  To: Luis Chamberlain, Christoph Hellwig
  Cc: Javier González, Matias Bjørling, Keith Busch,
	Pankaj Raghav, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme

On 3/16/22 02:00, Luis Chamberlain wrote:
> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>> file-system. As other interfaces arrive, this work will become natural.
>>>
>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>> still do the work in phases to make sure we have enough early feedback
>>> from the community.
>>>
>>> Since this thread has been very active, I will wait some time for
>>> Christoph and others to catch up before we start sending code.
>>
>> Can someone summarize where we stand?
> 
> RFCs should be posted to help review and evaluate direct NPO2 support
> (not emulation) given we have no vendor willing to take a position that
> NPO2 will *never* be supported on ZNS, and its not clear yet how many
> vendors other than Samsung actually require NPO2 support. The other
> reason is existing NPO2 customers currently cake in hacks to Linux to
> supoport NPO2 support, and so a fragmentation already exists. To help
> address this it's best to evaluate what the world of NPO2 support would
> look like and put the effort to do the work for that and review that.

And again, no mention of all the applications supporting zones that
assume a power-of-2 zone size and that will break. Seriously. Please stop
considering only the kernel. If this were only about the kernel, we
would all be working on patches already.

Allowing non-power-of-2 zone sizes may prevent applications running today
from running properly on these non-power-of-2 zone size devices. *Not*
nice. I have yet to see any convincing argument proving that this is not
an issue.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:07                                           ` Damien Le Moal
@ 2022-03-16  0:23                                             ` Luis Chamberlain
  2022-03-16  0:46                                               ` Damien Le Moal
  2022-03-16  2:27                                               ` Martin K. Petersen
  0 siblings, 2 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  0:23 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote:
> On 3/16/22 02:00, Luis Chamberlain wrote:
> > On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a usage for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand?
> > 
> > RFCs should be posted to help review and evaluate direct NPO2 support
> > (not emulation) given we have no vendor willing to take a position that
> > NPO2 will *never* be supported on ZNS, and its not clear yet how many
> > vendors other than Samsung actually require NPO2 support. The other
> > reason is existing NPO2 customers currently cake in hacks to Linux to
> > supoport NPO2 support, and so a fragmentation already exists. To help
> > address this it's best to evaluate what the world of NPO2 support would
> > look like and put the effort to do the work for that and review that.
> 
> And again no mentions of all the applications supporting zones assuming
> a power of 2 zone size that will break.

What applications? ZNS does not impose a PO2 requirement. So I really
want to know which applications make this assumption and would break if,
all of a sudden, NPO2 were supported.

Why would that break those ZNS applications?

> Allowing non power of 2 zone size may prevent applications running today
> to run properly on these non power of 2 zone size devices. *not* nice.

Applications which want to support ZNS have to take into consideration
that NPO2 is possible, and that there are existing users of it in the
world today.

You cannot negate their existence.

> I have yet to see any convincing argument proving that this is not an issue.

You are just saying things can break, but not clarifying exactly what.
And you have not taken a position to say that WD will never support NPO2
on ZNS. And so, you cannot rule out that path for support as a
possibility, even if it means work for the ecosystem today.

 Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:23                                             ` Luis Chamberlain
@ 2022-03-16  0:46                                               ` Damien Le Moal
  2022-03-16  1:24                                                 ` Luis Chamberlain
  2022-03-16  2:27                                               ` Martin K. Petersen
  1 sibling, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  0:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/16/22 09:23, Luis Chamberlain wrote:
> On Wed, Mar 16, 2022 at 09:07:18AM +0900, Damien Le Moal wrote:
>> On 3/16/22 02:00, Luis Chamberlain wrote:
>>> On Tue, Mar 15, 2022 at 02:30:52PM +0100, Christoph Hellwig wrote:
>>>> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
>>>>> but we do not see a usage for ZNS in F2FS, as it is a mobile
>>>>> file-system. As other interfaces arrive, this work will become natural.
>>>>>
>>>>> ZoneFS and butrfs are good targets for ZNS and these we can do. I would
>>>>> still do the work in phases to make sure we have enough early feedback
>>>>> from the community.
>>>>>
>>>>> Since this thread has been very active, I will wait some time for
>>>>> Christoph and others to catch up before we start sending code.
>>>>
>>>> Can someone summarize where we stand?
>>>
>>> RFCs should be posted to help review and evaluate direct NPO2 support
>>> (not emulation) given we have no vendor willing to take a position that
>>> NPO2 will *never* be supported on ZNS, and its not clear yet how many
>>> vendors other than Samsung actually require NPO2 support. The other
>>> reason is existing NPO2 customers currently cake in hacks to Linux to
>>> supoport NPO2 support, and so a fragmentation already exists. To help
>>> address this it's best to evaluate what the world of NPO2 support would
>>> look like and put the effort to do the work for that and review that.
>>
>> And again no mentions of all the applications supporting zones assuming
>> a power of 2 zone size that will break.
> 
> What applications? ZNS does not incur a PO2 requirement. So I really
> want to know what applications make this assumption and would break
> because all of a sudden say NPO2 is supported.

Exactly. What applications ? For ZNS, I cannot say as devices have not
been available for long. But neither can you.

> Why would that break those ZNS applications?

Please keep in mind that there are power-of-2 zone sized ZNS devices out
there. Applications designed for these devices and optimized to do bit
shift arithmetic using the power-of-2 size property will break. What is
the plan for that case ? How will you address these users' complaints ?

>> Allowing non power of 2 zone size may prevent applications running today
>> to run properly on these non power of 2 zone size devices. *not* nice.
> 
> Applications which want to support ZNS have to take into consideration
> that NPO2 is posisble and there existing users of that world today.

Which is really an ugly approach. The kernel zone user interface is
common to all zoned devices: SMR, ZNS, null_blk, DM (dm-crypt,
dm-linear). They all have one point in common: zone size is a power of
2. Zone capacity may differ, but hey, we also unified that by reporting
a zone capacity for *ALL* of them.

Applications correctly designed for SMR can thus run on ZNS too.
With this in mind, the spectrum of applications that would break on non
power of 2 ZNS devices is suddenly much larger.

This has always been my concern from the start: allowing non power of 2
zone size fragments userspace support and has the potential to
complicate things for application developers.

> 
> You cannot negate their existance.
> 
>> I have yet to see any convincing argument proving that this is not an issue.
> 
> You are just saying things can break but not clarifying exactly what.
> And you have not taken a position to say WD will not ever support NPO2
> on ZNS. And so, you can't negate the prospect of that implied path for
> support as a possibility, even if it means work towards the ecosystem
> today.

Please do not bring corporate strategy aspects into this discussion.
This is a technical discussion; I am not talking as a representative of
my employer, nor should we ever discuss business plans on a public
mailing list. I am a kernel developer and maintainer. Keep it technical,
please.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:46                                               ` Damien Le Moal
@ 2022-03-16  1:24                                                 ` Luis Chamberlain
  2022-03-16  1:44                                                   ` Damien Le Moal
  0 siblings, 1 reply; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  1:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote:
> On 3/16/22 09:23, Luis Chamberlain wrote:
> > What applications? ZNS does not incur a PO2 requirement. So I really
> > want to know what applications make this assumption and would break
> > because all of a sudden say NPO2 is supported.
> 
> Exactly. What applications ? For ZNS, I cannot say as devices have not
> been available for long. But neither can you.

I can tell you that there is an existing NPO2 ZNS customer who chimed in
on the discussion and described having to carry a delta to support
NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to
break when NPO2 support is added, then your original point suggesting
that there would be breakage is not valid.

> > Why would that break those ZNS applications?
> 
> Please keep in mind that there are power of 2 zone sized ZNS devices out
> there.

No one is saying otherwise.

> Applications designed for these devices and optimized to do bit
> shift arithmetic using the power of 2 size property will break.

They must not be ZNS. So they can continue to chug on.

> What the
> plan for that case ? How will you address these users complaints ?

They are not ZNS, so they don't have to worry about ZNS.

ZNS applications must be aware of the fact that NPO2 can exist.
ZNS applications must be aware of the fact that any vendor may one day
sell NPO2 devices.

> >> Allowing non power of 2 zone size may prevent applications running today
> >> to run properly on these non power of 2 zone size devices. *not* nice.
> > 
> > Applications which want to support ZNS have to take into consideration
> > that NPO2 is posisble and there existing users of that world today.
> 
> Which is really an ugly approach.

Ugly is relative and subjective. NAND does not force PO2.

> The kernel

<etc> And back you go to kernel talk. I thought you wanted to
focus on applications.

> Applications correctly designed for SMR can thus also run on ZNS too.

That seems to be an incorrect assumption given ZNS drives exist
with NPO2. So you can probably say that some SMR applications can work
with PO2 ZNS drives. That is a more correct statement.

> With this in mind, the spectrum of applications that would break on non
> power of 2 ZNS devices is suddenly much larger.

We already determined you cannot identify any ZNS specific application
which would break.

SMR != ZNS

If you really want to use SMR applications for ZNS, that seems to be
a bit beyond the scope of this discussion, but it seems to me that those
SMR applications should simply learn that if a device is ZNS, NPO2 can
be expected.

As technologies evolve so do specifications.

> This has always been my concern from the start: allowing non power of 2
> zone size fragments userspace support and has the potential to
> complicate things for application developers.

It's a reality though. Devices exist, and so do users. And they're
carrying their own delta to support NPO2 ZNS today on Linux.

> > You cannot negate their existance.
> > 
> >> I have yet to see any convincing argument proving that this is not an issue.
> > 
> > You are just saying things can break but not clarifying exactly what.
> > And you have not taken a position to say WD will not ever support NPO2
> > on ZNS. And so, you can't negate the prospect of that implied path for
> > support as a possibility, even if it means work towards the ecosystem
> > today.
> 
> Please do not bring in corporate strategy aspects in this discussion.
> This is a technical discussion and I am not talking as a representative
> of my employer nor should we ever dicsuss business plans on a public
> mailing list. I am a kernel developer and maintainer. Keep it technical
> please.

This conversation is about the reality that NPO2 ZNS devices exist and how
best to support them. You seem to want to negate that reality and that
support on Linux without even considering what the changes to support
ZNS NPO2 would look like.

As a maintainer, I think we need to *evaluate* supporting users as best
as possible. Not deny their existence. Even if it pains us.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  1:24                                                 ` Luis Chamberlain
@ 2022-03-16  1:44                                                   ` Damien Le Moal
  2022-03-16  2:13                                                     ` Luis Chamberlain
  0 siblings, 1 reply; 83+ messages in thread
From: Damien Le Moal @ 2022-03-16  1:44 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 3/16/22 10:24, Luis Chamberlain wrote:
> On Wed, Mar 16, 2022 at 09:46:44AM +0900, Damien Le Moal wrote:
>> On 3/16/22 09:23, Luis Chamberlain wrote:
>>> What applications? ZNS does not incur a PO2 requirement. So I really
>>> want to know what applications make this assumption and would break
>>> because all of a sudden say NPO2 is supported.
>>
>> Exactly. What applications ? For ZNS, I cannot say as devices have not
>> been available for long. But neither can you.
> 
> I can tell you we there is an existing NPO2 ZNS customer which chimed on
> the discussion and they described having to carry a delta to support
> NPO2 ZNS. So if you cannot tell me of a ZNS application which is going to
> break to add NPO2 support then your original point is not valid of
> suggesting that there would be a break.
> 
>>> Why would that break those ZNS applications?
>>
>> Please keep in mind that there are power of 2 zone sized ZNS devices out
>> there.
> 
> No one is saying otherwise.
> 
>> Applications designed for these devices and optimized to do bit
>> shift arithmetic using the power of 2 size property will break.
> 
> They must not be ZNS. So they can continue to chug on.
> 
>> What the
>> plan for that case ? How will you address these users complaints ?
> 
> They are not ZNS so they don't have to worry about ZNS.
> 
> ZNS applications must be aware of that fact that NPO2 can exist.
> ZNS applications must be aware of that fact that any vendor may one day
> sell NPO2 devices.
> 
>>>> Allowing non power of 2 zone size may prevent applications running today
>>>> to run properly on these non power of 2 zone size devices. *not* nice.
>>>
>>> Applications which want to support ZNS have to take into consideration
>>> that NPO2 is posisble and there existing users of that world today.
>>
>> Which is really an ugly approach.
> 
> Ugly is relative and subjective. NAND does not force PO2.
> 
>> The kernel
> 
> <etc> And back you go to kernel talk. I thought you wanted to
> focus on applications.
> 
>> Applications correctly designed for SMR can thus also run on ZNS too.
> 
> That seems to be an incorrect assumption given ZNS drives exist
> with NPO2. So you can probably say that some SMR applications can work
> with PO2 ZNS drives. That is a more correct statement.
> 
>> With this in mind, the spectrum of applications that would break on non
>> power of 2 ZNS devices is suddenly much larger.
> 
> We already determined you cannot identify any ZNS specific application
> which would break.
> 
> SMR != ZNS

Not for the block layer nor for any in-kernel users above it today. We
should not drive toward differentiating device types but unify them
under a common interface that works for everything, including
applications. That is why we have zone append emulation in the scsi disk
driver.

Considering the zone size requirement problem in the context of ZNS only
is thus far from ideal in my opinion, to say the least.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  1:44                                                   ` Damien Le Moal
@ 2022-03-16  2:13                                                     ` Luis Chamberlain
  0 siblings, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  2:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Javier González, Matias Bjørling,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On Wed, Mar 16, 2022 at 10:44:56AM +0900, Damien Le Moal wrote:
> On 3/16/22 10:24, Luis Chamberlain wrote:
> > SMR != ZNS
> 
> Not for the block layer nor for any in-kernel <etc>

Back to kernel talk; I thought you wanted to focus on applications.

> Considering the zone size requirement problem in the context of ZNS only
> is thus far from ideal in my opinion, to say the least.

It's the reality for ZNS though.

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:23                                             ` Luis Chamberlain
  2022-03-16  0:46                                               ` Damien Le Moal
@ 2022-03-16  2:27                                               ` Martin K. Petersen
  2022-03-16  2:41                                                 ` Luis Chamberlain
  2022-03-16  8:44                                                 ` Javier González
  1 sibling, 2 replies; 83+ messages in thread
From: Martin K. Petersen @ 2022-03-16  2:27 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Damien Le Moal, Christoph Hellwig, Javier González,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme


Luis,

> Applications which want to support ZNS have to take into consideration
> that NPO2 is posisble and there existing users of that world today.

Every time a new technology comes along vendors inevitably introduce
first gen devices that are implemented with little consideration for the
OS stacks they need to work with. This has happened for pretty much
every technology I have been involved with over the years. So the fact
that NPO2 devices exist is no argument. There are tons of devices out
there that Linux does not support and never will.

In early engagements SSD drive vendors proposed all sorts of weird NPO2
block sizes and alignments that it was argued were *incontestable*
requirements for building NAND devices. And yet a generation or two
later every SSD transparently handled 512-byte or 4096-byte logical
blocks just fine. Imagine if we had re-engineered the entire I/O stack
to accommodate these awful designs?

Similarly, many proponents suggested oddball NPO2 sizes for SMR
zones. And yet the market very quickly settled on PO2 once things
started shipping in volume.

Simplicity and long term maintainability of the kernel should always
take precedence as far as I'm concerned.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  2:27                                               ` Martin K. Petersen
@ 2022-03-16  2:41                                                 ` Luis Chamberlain
  2022-03-16  8:44                                                 ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Luis Chamberlain @ 2022-03-16  2:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Damien Le Moal, Christoph Hellwig, Javier González,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On Tue, Mar 15, 2022 at 10:27:32PM -0400, Martin K. Petersen wrote:
> Simplicity and long term maintainability of the kernel should always
> take precedence as far as I'm concerned.

No one is arguing against that. It is not even clear what all the changes are.
So to argue that the sky will fall seems a bit too early without seeing
patches, don't you think?

  Luis

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-15 18:51                                             ` Pankaj Raghav
@ 2022-03-16  8:37                                               ` Johannes Thumshirn
  0 siblings, 0 replies; 83+ messages in thread
From: Johannes Thumshirn @ 2022-03-16  8:37 UTC (permalink / raw)
  To: Pankaj Raghav, Javier González, Christoph Hellwig
  Cc: Matias Bjørling, Damien Le Moal, Luis Chamberlain,
	Keith Busch, Adam Manzanares, jiangbo.365, kanchan Joshi,
	Jens Axboe, Sagi Grimberg, Pankaj Raghav, Kanchan Joshi,
	linux-block, linux-nvme, linux-btrfs @ vger . kernel . org

On 15/03/2022 19:51, Pankaj Raghav wrote:
>> ck-groups (and thus block-groups not aligned to the stripe size).
>>
> I agree with your point that we risk not aligning to the stripe size
> when we move to npo2 zone sizes; I believe the minimum stripe size is
> 64K (please correct me if I am wrong). As David Sterba mentioned in his
> email, we could agree on some reasonable alignment, which I believe
> would be the minimum stripe size of 64k, to avoid adding complexity to
> the existing btrfs zoned support. And that is a much milder constraint,
> one that most devices can naturally adhere to, compared to the po2 zone
> size requirement.
> 

What could be done is rounding a zone down to the next po2 (64k aligned),
but then we need to explicitly finish the zones.
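
Just to illustrate the rounding being discussed here, below is a small
standalone sketch (not code from this patch set; the helper name and the
96M example zone are made up) of rounding a zone size down to the largest
power of two, which is trivially 64k aligned:

#include <stdint.h>
#include <stdio.h>

/* Largest power of two less than or equal to n (n > 0). */
static uint64_t rounddown_pow_of_two(uint64_t n)
{
	uint64_t p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	uint64_t zone_size = 96ULL * 2048;	/* 96M zone in 512B sectors */
	uint64_t usable = rounddown_pow_of_two(zone_size);

	/*
	 * 96M rounds down to 64M. The remaining 32M per zone would go
	 * unused, and the zone would have to be explicitly finished once
	 * the usable part is full.
	 */
	printf("usable sectors per zone: %llu\n", (unsigned long long)usable);
	return 0;
}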

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  2:27                                               ` Martin K. Petersen
  2022-03-16  2:41                                                 ` Luis Chamberlain
@ 2022-03-16  8:44                                                 ` Javier González
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-16  8:44 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Luis Chamberlain, Damien Le Moal, Christoph Hellwig,
	Matias Bjørling, Keith Busch, Pankaj Raghav,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme

On 15.03.2022 22:27, Martin K. Petersen wrote:
>
>Luis,
>
>> Applications which want to support ZNS have to take into consideration
>> that NPO2 is possible and that there are existing users of it in the
>> world today.
>
>Every time a new technology comes along vendors inevitably introduce
>first gen devices that are implemented with little consideration for the
>OS stacks they need to work with. This has happened for pretty much
>every technology I have been involved with over the years. So the fact
>that NPO2 devices exist is no argument. There are tons of devices out
>there that Linux does not support and never will.
>
>In early engagements SSD drive vendors proposed all sorts of weird NPO2
>block sizes and alignments that it was argued were *incontestable*
>requirements for building NAND devices. And yet a generation or two
>later every SSD transparently handled 512-byte or 4096-byte logical
>blocks just fine. Imagine if we had re-engineered the entire I/O stack
>to accommodate these awful designs?
>
>Similarly, many proponents suggested oddball NPO2 sizes for SMR
>zones. And yet the market very quickly settled on PO2 once things
>started shipping in volume.
>
>Simplicity and long term maintainability of the kernel should always
>take precedence as far as I'm concerned.

Martin, you are absolutely right.

The argument is not that there is available HW. The argument is that as
we tried to retrofit ZNS into the zoned block device, the gap between
zone size and capacity has brought adoption issues for some customers.

I would still like to give it some time to get feedback on the plan I
proposed yesterday before we post patches. At this point, I would very
much like to hear your opinion on how the changes would incur a
maintainability problem. Nobody wants that.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:00                                   ` Damien Le Moal
@ 2022-03-16  8:57                                     ` Javier González
  2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 0 replies; 83+ messages in thread
From: Javier González @ 2022-03-16  8:57 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Matias Bjørling, Christoph Hellwig, Luis Chamberlain,
	Keith Busch, Pankaj Raghav, Adam Manzanares, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme

On 16.03.2022 09:00, Damien Le Moal wrote:
>On 3/15/22 22:05, Javier González wrote:
>>>> The main constraint for (1) PO2 is removed in the block layer, we
>>>> have (2) Linux hosts stating that unmapped LBAs are a problem,
>>>> and we have (3) HW supporting size=capacity.
>>>>
>>>> I would be happy to hear what else you would like to see for this
>>>> to be of use to the kernel community.
>>>
>>> (Added numbers to your paragraph above)
>>>
>>> 1. The sysfs chunksize attribute was "misused" to also represent
>>> zone size. What has changed is that RAID controllers can now use an
>>> NPO2 chunk size. This wasn't meant to naturally extend to zones,
>>> which, as shown in the currently posted patchset, is a lot more work.
>>
>> True. But this was the main constraint for PO2.
>
>And as I said, users asked for it.

Now users are asking for arbitrary zone sizes.

[...]

>>> 3. I'm happy to hear that. However, I'd like to reiterate the
>>> point that the PO2 requirement has been known for years. That
>>> there's a drive doing NPO2 zones is great, but a decision was made
>>> by the SSD implementors not to support the Linux kernel given its
>>> current implementation.
>>
>> Zoned devices have been supported for years with SMR, and I agree this
>> is a strong argument. However, ZNS is still very new and customers have
>> several requirements. I do not believe that an HDD stack should have
>> such an impact on NVMe.
>>
>> Also, we will see new interfaces adding support for zoned devices in
>> the future.
>>
>> We should think about the future and not the past.
>
>Backward compatibility ? We must not break userspace...

This is not a user API change. If making changes to applications to
adopt new features and technologies is breaking user-space, then the
zoned block device already broke that when we introduced zone capacity.
Any existing zoned application _will have to_ make changes to support
ZNS.
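
To make the zone size vs. capacity point concrete, here is a trivial
standalone sketch of the per-zone LBA gap that applications already have
to skip on a PO2 device whose capacity is smaller than its zone size (the
128M/96M geometry is only an example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Example geometry only: PO2 zone size 128M, zone capacity 96M,
	 * both expressed in 512-byte sectors. */
	uint64_t zone_size = 128ULL * 2048;
	uint64_t zone_cap  = 96ULL * 2048;
	uint64_t gap = zone_size - zone_cap;	/* unmapped LBAs per zone */

	/*
	 * A zoned application must stop writing at zone_start + zone_cap
	 * and jump to zone_start + zone_size for the next zone, i.e. it
	 * already has to handle this gap today.
	 */
	printf("gap per zone: %llu sectors (%llu MiB)\n",
	       (unsigned long long)gap, (unsigned long long)(gap / 2048));
	return 0;
}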

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-16  0:00                                   ` Damien Le Moal
  2022-03-16  8:57                                     ` Javier González
@ 2022-03-16 16:18                                     ` Pankaj Raghav
  1 sibling, 0 replies; 83+ messages in thread
From: Pankaj Raghav @ 2022-03-16 16:18 UTC (permalink / raw)
  To: Damien Le Moal, Javier González, Matias Bjørling
  Cc: Christoph Hellwig, Luis Chamberlain, Keith Busch,
	Adam Manzanares, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Pankaj Raghav, Kanchan Joshi, linux-block,
	linux-nvme


Hi Damien,

On 2022-03-16 01:00, Damien Le Moal wrote:
>> As for F2FS and dm-zoned, I do not think these are targets at the 
>> moment. If this is the path we follow, these will bail out at mkfs
>> time.
> 
> And what makes you think that this is acceptable ? What guarantees do
> you have that this will not be a problem for users out there ?
> 
As you know, the architecture of F2FS ATM requires PO2 segments;
therefore, it might not be possible to support non-PO2 ZNS drives.

So we could continue supporting PO2 ZNS drives for F2FS and bail out at
mkfs time if it is a non-PO2 ZNS drive (this is the current behavior as
well). This way we are not really breaking anything for ZNS drives that
have already been deployed for F2FS users.

-- 
Regards,
Pankaj

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-11 20:51           ` Keith Busch
  2022-03-11 21:04             ` Luis Chamberlain
  2022-03-11 22:23             ` Adam Manzanares
@ 2022-03-21 16:21             ` Jonathan Derrick
  2022-03-21 16:44               ` Keith Busch
  2 siblings, 1 reply; 83+ messages in thread
From: Jonathan Derrick @ 2022-03-21 16:21 UTC (permalink / raw)
  To: Keith Busch, Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Adam Manzanares,
	Javier González, jiangbo.365, kanchan Joshi, Jens Axboe,
	Sagi Grimberg, Matias Bjørling, Pankaj Raghav,
	Kanchan Joshi, linux-block, linux-nvme, Damien Le Moal



On 3/11/2022 1:51 PM, Keith Busch wrote:
> On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
>> NAND has no PO2 requirement. The emulation effort was only done to help
>> add support for !PO2 devices because there is no alternative. If we
>> however are ready instead to go down the avenue of removing those
>> restrictions well let's go there then instead. If that's not even
>> something we are willing to consider I'd really like folks who stand
>> behind the PO2 requirement to stick their necks out and clearly say that
>> their hw/fw teams are happy to deal with this requirement forever on ZNS.
> 
> Regardless of the merits of the current OS requirement, it's a trivial
> matter for firmware to round up their reported zone size to the next
> power of 2. This does not create a significant burden on their part, as
> far as I know.

I sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as
you claim. I actually find the hubris of the Linux community wrt the whole PO2
requirement pretty exhausting.

Consider that some SSD manufacturers are having to deal with a NAND shortage and
existing ASIC architecture limitations that may define the sizes of their erase
blocks and write units. A !PO2 implementation in the Linux kernel would give
consumers more options to choose from in the marketplace for their Linux ZNS
applications.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
  2022-03-21 16:21             ` Jonathan Derrick
@ 2022-03-21 16:44               ` Keith Busch
  0 siblings, 0 replies; 83+ messages in thread
From: Keith Busch @ 2022-03-21 16:44 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: Luis Chamberlain, Christoph Hellwig, Pankaj Raghav,
	Adam Manzanares, Javier González, jiangbo.365,
	kanchan Joshi, Jens Axboe, Sagi Grimberg, Matias Bjørling,
	Pankaj Raghav, Kanchan Joshi, linux-block, linux-nvme,
	Damien Le Moal

On Mon, Mar 21, 2022 at 10:21:36AM -0600, Jonathan Derrick wrote:
> 
> 
> On 3/11/2022 1:51 PM, Keith Busch wrote:
> > On Fri, Mar 11, 2022 at 12:19:38PM -0800, Luis Chamberlain wrote:
> > > NAND has no PO2 requirement. The emulation effort was only done to help
> > > add support for !PO2 devices because there is no alternative. If we
> > > however are ready instead to go down the avenue of removing those
> > > restrictions well let's go there then instead. If that's not even
> > > something we are willing to consider I'd really like folks who stand
> > > behind the PO2 requirement to stick their necks out and clearly say that
> > > their hw/fw teams are happy to deal with this requirement forever on ZNS.
> > 
> > Regardless of the merits of the current OS requirement, it's a trivial
> > matter for firmware to round up their reported zone size to the next
> > power of 2. This does not create a significant burden on their part, as
> > far as I know.
> 
> Sure wonder why !PO2 keeps coming up if it's so trivial to fix in firmware as you claim.

The triviality to adjust alignment in firmware has nothing to do with
some users' desire to not see gaps in LBA space.

> I actually find the hubris of the Linux community wrt the whole PO2 requirement
> pretty exhausting.
>
> Consider that some SSD manufacturers are having to rely on a NAND shortage and
> existing ASIC architecture limitations that may define the sizes of their erase blocks
> and write units. A !PO2 implementation in the Linux kernel would enable consumers
> to be able to choose more options in the marketplace for their Linux ZNS application.

All zoned block devices in the Linux kernel go through a common abstraction
interface. Users expect that you can swap out one zoned device for another and
all the features they previously used will continue to work. That does not
necessarily hold if the long-standing zone alignment is relaxed. Fragmenting
use cases harms adoption, so this discussion seems appropriate.

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2022-03-21 16:44 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220308165414eucas1p106df0bd6a901931215cfab81660a4564@eucas1p1.samsung.com>
2022-03-08 16:53 ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Pankaj Raghav
     [not found]   ` <CGME20220308165421eucas1p20575444f59702cd5478cb35fce8b72cd@eucas1p2.samsung.com>
2022-03-08 16:53     ` [PATCH 1/6] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size Pankaj Raghav
2022-03-08 17:14       ` Keith Busch
2022-03-08 17:43         ` Pankaj Raghav
2022-03-09  3:40       ` Damien Le Moal
2022-03-09 13:19         ` Pankaj Raghav
2022-03-09  3:44       ` Damien Le Moal
2022-03-09 13:35         ` Pankaj Raghav
     [not found]   ` <CGME20220308165428eucas1p14ea0a38eef47055c4fa41d695c5a249d@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 2/6] block: Add npo2_zone_setup callback to block device fops Pankaj Raghav
2022-03-09  3:46       ` Damien Le Moal
2022-03-09 14:02         ` Pankaj Raghav
     [not found]   ` <CGME20220308165432eucas1p18b36a238ef3f5a812ee7f9b0e52599a5@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 3/6] block: add a bool member to request_queue for power_of_2 emulation Pankaj Raghav
     [not found]   ` <CGME20220308165436eucas1p1b76f3cb5b4fa1f7d78b51a3b1b44d160@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 4/6] nvme: zns: Add support for power_of_2 emulation to NVMe ZNS devices Pankaj Raghav
2022-03-09  4:04       ` Damien Le Moal
2022-03-09 14:33         ` Pankaj Raghav
2022-03-09 21:43           ` Damien Le Moal
2022-03-10 20:35             ` Luis Chamberlain
2022-03-10 23:50               ` Damien Le Moal
2022-03-11  0:56                 ` Luis Chamberlain
     [not found]   ` <CGME20220308165443eucas1p17e61670a5057f21a6c073711b284bfeb@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 5/6] null_blk: forward the sector value from null_handle_memory_backend Pankaj Raghav
     [not found]   ` <CGME20220308165448eucas1p12c7c302a4b239db64b49d54cc3c1f0ac@eucas1p1.samsung.com>
2022-03-08 16:53     ` [PATCH 6/6] null_blk: Add support for power_of_2 emulation to the null blk device Pankaj Raghav
2022-03-09  4:09       ` Damien Le Moal
2022-03-09 14:42         ` Pankaj Raghav
2022-03-10  9:47   ` [PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices Christoph Hellwig
2022-03-10 12:57     ` Pankaj Raghav
2022-03-10 13:07       ` Matias Bjørling
2022-03-10 13:14         ` Javier González
2022-03-10 14:58           ` Matias Bjørling
2022-03-10 15:07             ` Keith Busch
2022-03-10 15:16               ` Javier González
2022-03-10 23:44                 ` Damien Le Moal
2022-03-10 15:13             ` Javier González
2022-03-10 14:44       ` Christoph Hellwig
2022-03-11 20:19         ` Luis Chamberlain
2022-03-11 20:51           ` Keith Busch
2022-03-11 21:04             ` Luis Chamberlain
2022-03-11 21:31               ` Keith Busch
2022-03-11 22:24                 ` Luis Chamberlain
2022-03-12  7:58                   ` Damien Le Moal
2022-03-14  7:35                     ` Christoph Hellwig
2022-03-14  7:45                       ` Damien Le Moal
2022-03-14  7:58                         ` Christoph Hellwig
2022-03-14 10:49                         ` Javier González
2022-03-14 14:16                           ` Matias Bjørling
2022-03-14 16:23                             ` Luis Chamberlain
2022-03-14 19:30                               ` Matias Bjørling
2022-03-14 19:51                                 ` Luis Chamberlain
2022-03-15 10:45                                   ` Matias Bjørling
2022-03-14 19:55                             ` Javier González
2022-03-15 12:32                               ` Matias Bjørling
2022-03-15 13:05                                 ` Javier González
2022-03-15 13:14                                   ` Matias Bjørling
2022-03-15 13:26                                     ` Javier González
2022-03-15 13:30                                       ` Christoph Hellwig
2022-03-15 13:52                                         ` Javier González
2022-03-15 14:03                                           ` Matias Bjørling
2022-03-15 14:14                                           ` Johannes Thumshirn
2022-03-15 14:27                                             ` David Sterba
2022-03-15 19:56                                               ` Pankaj Raghav
2022-03-15 15:11                                             ` Javier González
2022-03-15 18:51                                             ` Pankaj Raghav
2022-03-16  8:37                                               ` Johannes Thumshirn
2022-03-15 17:00                                         ` Luis Chamberlain
2022-03-16  0:07                                           ` Damien Le Moal
2022-03-16  0:23                                             ` Luis Chamberlain
2022-03-16  0:46                                               ` Damien Le Moal
2022-03-16  1:24                                                 ` Luis Chamberlain
2022-03-16  1:44                                                   ` Damien Le Moal
2022-03-16  2:13                                                     ` Luis Chamberlain
2022-03-16  2:27                                               ` Martin K. Petersen
2022-03-16  2:41                                                 ` Luis Chamberlain
2022-03-16  8:44                                                 ` Javier González
2022-03-15 13:39                                       ` Matias Bjørling
2022-03-16  0:00                                   ` Damien Le Moal
2022-03-16  8:57                                     ` Javier González
2022-03-16 16:18                                     ` Pankaj Raghav
2022-03-14  8:36                     ` Matias Bjørling
2022-03-11 22:23             ` Adam Manzanares
2022-03-11 22:30               ` Keith Busch
2022-03-21 16:21             ` Jonathan Derrick
2022-03-21 16:44               ` Keith Busch
2022-03-10 17:38     ` Adam Manzanares
2022-03-14  7:36       ` Christoph Hellwig
