From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-block <linux-block@vger.kernel.org>,
	Damien Le Moal <Damien.LeMoal@wdc.com>,
	Keith Busch <kbusch@kernel.org>,
	"linux-scsi @ vger . kernel . org" <linux-scsi@vger.kernel.org>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	"linux-fsdevel @ vger . kernel . org"
	<linux-fsdevel@vger.kernel.org>,
	Johannes Thumshirn <johannes.thumshirn@wdc.com>
Subject: [PATCH v4 00/10] Introduce Zone Append for writing to zoned block devices
Date: Fri,  3 Apr 2020 19:12:40 +0900
Message-ID: <20200403101250.33245-1-johannes.thumshirn@wdc.com>

The upcoming NVMe ZNS Specification will define a new type of write
command for zoned block devices, zone append.

When writing to a zoned block device using zone append, the start
sector of the write points at the start LBA of the zone to write to.
Upon completion, the block device responds with the position at which the
data has been placed in the zone. From a high-level perspective this can be
seen as similar to a file system's block allocator, where the user writes
to a file and the file system takes care of the data placement on the
device.
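
The semantics described above can be modelled in a few lines of userspace
C. This is only an illustrative sketch, not code from the series: the
struct zone_model type and the zone_append() helper are hypothetical names
introduced here, and all positions are plain sector numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory model of one sequential zone (units: sectors). */
struct zone_model {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* absolute write pointer position */
};

/*
 * Model of a zone append: the caller passes only the zone start sector
 * and the number of sectors to write. The "device" picks the location
 * (its current write pointer) and reports it back through *written.
 * Returns 0 on success, -1 if the request cannot be served.
 */
static int zone_append(struct zone_model *z, uint64_t sector,
		       uint64_t nr_sectors, uint64_t *written)
{
	assert(z && written);

	if (sector != z->start)
		return -1;	/* an append must target the zone start */
	if (nr_sectors > z->start + z->len - z->wp)
		return -1;	/* not enough room left in the zone */

	*written = z->wp;	/* position chosen by the device */
	z->wp += nr_sectors;
	return 0;
}
```

Two back-to-back appends to the same zone both pass the zone start as
their sector, and each learns its real location only from the completion.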

In order to fully exploit the new zone append command in file-systems and
other interfaces above the block layer, we choose to emulate zone append
in SCSI and null_blk. This way we can have a single write path for both
file-systems and other interfaces above the block-layer, like io_uring on
zoned block devices, without having to care too much about the underlying
characteristics of the device itself.

The emulation works by providing a cache of each zone's write pointer, so
a zone append issued to the disk can be translated to a write with a
starting LBA of the write pointer. The LBA of the zone append request (the
start of the zone) is used to derive the zone number for the lookup in the
zone write pointer offset cache, and the cached offset is then added to
the LBA to get the actual position to write the data. In SCSI we then turn
the REQ_OP_ZONE_APPEND request into a WRITE(16) command. Upon successful
completion of the WRITE(16), the cache is updated to the new write pointer
location and the written sector is noted in the request. On error the
cache entry is marked as invalid, and on the next write an update of the
write pointer is scheduled before the actual write is issued.
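
As a rough userspace sketch of the emulation described above (not the
sd_zbc code itself): every name below (struct wp_cache, WP_OFST_INVALID
and the wp_cache_*() helpers) is hypothetical, all zones are assumed to
have the same size, and wp_cache_refresh() merely stands in for whatever
mechanism re-reads the write pointer from the device.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WP_OFST_INVALID UINT32_MAX	/* hypothetical "needs refresh" marker */

struct wp_cache {
	uint64_t zone_sectors;	/* zone size in sectors (same for all zones) */
	uint32_t nr_zones;
	uint32_t *wp_ofst;	/* per-zone write pointer offset from zone start */
};

static int wp_cache_init(struct wp_cache *c, uint32_t nr_zones,
			 uint64_t zone_sectors)
{
	assert(c && nr_zones > 0 && zone_sectors > 0);
	c->wp_ofst = calloc(nr_zones, sizeof(*c->wp_ofst)); /* all zones empty */
	if (!c->wp_ofst)
		return -1;
	c->nr_zones = nr_zones;
	c->zone_sectors = zone_sectors;
	return 0;
}

/*
 * Translate an append aimed at zone_start into the LBA of a regular write.
 * Returns 0 and sets *lba on success, 1 if the cache entry is invalid and
 * must be refreshed first, -1 for a bad request or a full zone.
 */
static int wp_cache_prepare(const struct wp_cache *c, uint64_t zone_start,
			    uint64_t nr_sectors, uint64_t *lba)
{
	uint64_t zno = zone_start / c->zone_sectors;
	uint32_t ofst;

	if (zno >= c->nr_zones || zone_start % c->zone_sectors)
		return -1;
	ofst = c->wp_ofst[zno];
	if (ofst == WP_OFST_INVALID)
		return 1;
	if (nr_sectors > c->zone_sectors - ofst)
		return -1;
	*lba = zone_start + ofst;
	return 0;
}

/* Completion: advance the cached offset, or invalidate it on error. */
static void wp_cache_complete(struct wp_cache *c, uint64_t zone_start,
			      uint64_t nr_sectors, int error)
{
	uint64_t zno = zone_start / c->zone_sectors;

	if (error)
		c->wp_ofst[zno] = WP_OFST_INVALID;
	else
		c->wp_ofst[zno] += (uint32_t)nr_sectors;
}

/* Stand-in for re-reading the write pointer offset from the device. */
static void wp_cache_refresh(struct wp_cache *c, uint64_t zone_start,
			     uint32_t wp_ofst)
{
	c->wp_ofst[zone_start / c->zone_sectors] = wp_ofst;
}
```

In this model a failed write leaves the zone unusable for appends until
wp_cache_refresh() has supplied a fresh offset, mirroring the "mark
invalid, update before the next write" behaviour described above.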

In order to reduce memory consumption, the only cached item is the offset
of the write pointer from the start of the zone; everything else can be
calculated. On an example drive with 52156 zones, the additional memory
consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51 pages of
4 KiB. The performance impact is negligible for a spinning drive.
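
The arithmetic above can be checked with a small helper. This is only a
sketch: the function names are hypothetical, and it assumes one 32-bit
offset per zone (the 4 bytes used in the calculation) and rounds up to
whole pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a cache holding one 32-bit offset per zone. */
static size_t wp_cache_bytes(size_t nr_zones)
{
	return nr_zones * sizeof(uint32_t);
}

/* Number of pages needed to hold the given byte count, rounded up. */
static size_t bytes_to_pages(size_t bytes, size_t page_size)
{
	assert(page_size > 0);
	return (bytes + page_size - 1) / page_size;
}
```

For the example drive, wp_cache_bytes(52156) gives 208624 and
bytes_to_pages(208624, 4096) gives 51.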

For null_blk the emulation is way simpler, as null_blk's zoned block
device emulation support already caches the write pointer position, so we
only need to report the position back to the upper layers. Additional
caching is not needed here.

Furthermore, we have converted zonefs to use ZONE_APPEND for synchronous
direct I/Os. Asynchronous I/O still uses the normal path via iomap.

The series is based on v5.6 final, but it should be trivial to rebase onto
Jens' for-next branch once it has re-opened.

Changes since v3:
- Remove impact of zone-append from bio_full() and bio_add_page()
  fast-path (Christoph)
- All of the zone write pointer offset caching is handled in SCSI now
  (Christoph)
- Drop null_blk patches that Damien sent separately (Christoph)
- Use EXPORT_SYMBOL_GPL for new exports (Christoph)

Changes since v2:
- Remove iomap implementation and directly issue zone-appends from within
  zonefs (Christoph)
- Drop already merged patch
- Rebase onto new for-next branch

Changes since v1:
- Too much to mention, treat as a completely new series.

Damien Le Moal (2):
  block: Modify revalidate zones
  null_blk: Support REQ_OP_ZONE_APPEND

Johannes Thumshirn (7):
  block: provide fallbacks for blk_queue_zone_is_seq and
    blk_queue_zone_no
  block: introduce blk_req_zone_write_trylock
  scsi: sd_zbc: factor out sanity checks for zoned commands
  scsi: export scsi_mq_uninit_cmnd
  scsi: sd_zbc: emulate ZONE_APPEND commands
  block: export bio_release_pages and bio_iov_iter_get_pages
  zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Keith Busch (1):
  block: Introduce REQ_OP_ZONE_APPEND

 block/bio.c                    |  59 ++++-
 block/blk-core.c               |  52 +++++
 block/blk-mq.c                 |  27 +++
 block/blk-settings.c           |  23 ++
 block/blk-sysfs.c              |  13 ++
 block/blk-zoned.c              |  52 ++++-
 drivers/block/null_blk_zoned.c |  39 +++-
 drivers/scsi/scsi_lib.c        |  10 +-
 drivers/scsi/sd.c              |  26 ++-
 drivers/scsi/sd.h              |  38 +++-
 drivers/scsi/sd_zbc.c          | 403 +++++++++++++++++++++++++++++++--
 fs/zonefs/super.c              |  80 ++++++-
 include/linux/blk_types.h      |  14 ++
 include/linux/blkdev.h         |  33 ++-
 include/scsi/scsi_cmnd.h       |   1 +
 15 files changed, 809 insertions(+), 61 deletions(-)

-- 
2.24.1


