Linux-BTRFS Archive on lore.kernel.org
* [PATCH v3 00/27] btrfs zoned block device support
@ 2019-08-08  9:30 Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
                   ` (26 more replies)
  0 siblings, 27 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

This series adds zoned block device support to btrfs.

* Summary of changes from v2

The most significant change from v2 is the serialization of sequential
block allocation and submit_bio using a per-block-group mutex instead of
waiting and sorting BIOs in a buffer. This mutex is now locked before
allocation and released after all BIO submissions finish. The same
method is used for both data and metadata IOs.

Because we use a mutex instead of a submit buffer, we must disable
EXTENT_PREALLOC entirely in HMZONED mode to prevent deadlocks. As a
result, INODE_MAP_CACHE and MIXED_BG are disabled in HMZONED mode, and
the relocation inode is reworked to use btrfs_wait_ordered_range() after
each relocation instead of relying on a preallocated file region.

Furthermore, asynchronous checksumming is disabled and checksums are
computed inline with the serialized block allocation and BIO
submission. This preserves the sequential write IO order without
introducing any new functionality such as submit buffers. Async submit
will be removed once the cgroup writeback support patch series is
merged.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional and accepting random writes or sequential and requiring
that writes be issued in LBA order from each zone write pointer
position. This patch series ensures that the sequential write
constraint of sequential zones is respected while fundamentally not
changing btrfs block and I/O management for blocks stored in
conventional zones.
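
As a rough illustration of the two zone types described above, here is
a toy userspace model in Python. It is not kernel code; the class names
and the block-based units are assumptions made for this sketch only.

```python
class SequentialZone:
    """Toy model of one sequential-write-required zone (units: blocks)."""

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.write_pointer = start  # next LBA that may be written

    def write(self, lba, nblocks):
        # A sequential zone only accepts a write that begins exactly at
        # the write pointer and stays inside the zone.
        if lba != self.write_pointer:
            raise OSError(f"unaligned write: lba={lba}, wp={self.write_pointer}")
        if lba + nblocks > self.start + self.size:
            raise OSError("write crosses the zone boundary")
        self.write_pointer += nblocks


class ConventionalZone:
    """Toy model of a conventional zone: any in-range write is accepted."""

    def __init__(self, start, size):
        self.start = start
        self.size = size

    def write(self, lba, nblocks):
        if lba < self.start or lba + nblocks > self.start + self.size:
            raise OSError("write outside the zone")
```

A write to the sequential zone at any LBA other than its write pointer
raises an error in this model, while the conventional zone accepts
writes in any order.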

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation
of blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunks. To do so, an allocation
pointer is added to block groups and used as the allocation hint.  The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.
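
The allocation pointer idea can be sketched with a small Python model.
This is only an illustration under assumed names (BlockGroup,
alloc_offset, used are hypothetical here), not the code from the
patches.

```python
class BlockGroup:
    """Toy block group with an allocation pointer (units: bytes)."""

    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.alloc_offset = 0  # allocation pointer, relative to start
        self.used = 0          # bytes currently holding live data

    def allocate(self, nbytes):
        # Always allocate at the allocation pointer; never look for
        # holes left by earlier frees.
        if self.alloc_offset + nbytes > self.length:
            return None  # full; the caller would try another block group
        addr = self.start + self.alloc_offset
        self.alloc_offset += nbytes
        self.used += nbytes
        return addr

    def free(self, nbytes):
        # Freed space below the allocation pointer is accounted for but
        # is not handed out again by allocate().
        self.used -= nbytes
```

In this model, freeing an extent lowers `used` but does not move
`alloc_offset`, so later allocations stay sequential even when the
block group is mostly empty.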

While the introduction of the allocation pointer ensures that blocks
will be allocated sequentially, I/Os to write out newly allocated
blocks may be issued out of order, causing errors when writing to
sequential zones.  To preserve the ordering, this patch series adds
mutexes around allocation and submit_bio to serialize them. Also,
this series disables async checksum and async submit to avoid mixing
the BIOs.
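
The effect of holding one lock across both allocation and submission
can be modeled in userspace Python. The sketch below is an assumption
based illustration: threading.Lock stands in for the per-block-group
mutex, and the class and function names are hypothetical.

```python
import threading


class ZonedBlockGroup:
    """Toy block group whose writes must reach the zone in order."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the block group mutex
        self.alloc_offset = 0         # allocation pointer (blocks)
        self.write_pointer = 0        # zone write pointer on the device
        self.errors = 0               # writes rejected as out of order

    def _allocate(self, nblocks):
        addr = self.alloc_offset
        self.alloc_offset += nblocks
        return addr

    def _submit(self, addr, nblocks):
        # The device rejects any write that is not at the write pointer.
        if addr != self.write_pointer:
            self.errors += 1
            return
        self.write_pointer += nblocks

    def write(self, nblocks):
        # The lock covers allocation and submission together, so no
        # other writer can submit between the two steps.
        with self.lock:
            addr = self._allocate(nblocks)
            self._submit(addr, nblocks)


def run_writers(bg, nthreads=4, writes_per_thread=100, nblocks=2):
    """Issue writes to bg from several threads and wait for them."""
    def worker():
        for _ in range(writes_per_thread):
            bg.write(nblocks)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each writer allocates and submits under the same lock, every
submitted address equals the current write pointer in this model,
whatever the thread scheduling is.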

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.
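
A minimal Python sketch of that reset rule, assuming one zone per block
group and hypothetical names (ZoneBackedGroup, reclaim_unused); the
real patch works on btrfs block groups and device zones.

```python
class ZoneBackedGroup:
    """Toy block group backed by one sequential zone (units: bytes)."""

    def __init__(self, zone_size):
        self.zone_size = zone_size
        self.alloc_offset = 0  # mirrors the zone write pointer
        self.used = 0          # bytes still holding live data

    def allocate(self, nbytes):
        if self.alloc_offset + nbytes > self.zone_size:
            return None
        offset = self.alloc_offset
        self.alloc_offset += nbytes
        self.used += nbytes
        return offset

    def free(self, nbytes):
        self.used -= nbytes


def reclaim_unused(groups):
    """Reset the zone of every group that was written but holds no data."""
    reclaimed = []
    for bg in groups:
        if bg.used == 0 and bg.alloc_offset > 0:
            bg.alloc_offset = 0  # models a zone reset: pointer rewinds
            reclaimed.append(bg)
    return reclaimed
```

A partially freed group is left alone here; only once `used` drops to
zero is the pointer rewound so the zone can be written again from its
start.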

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.
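
The zone size restriction amounts to a simple check at volume assembly
time. The following Python sketch shows the idea only; the function
name and the list-of-sizes input are assumptions for illustration.

```python
def common_zone_size(zone_sizes):
    """Return the zone size shared by all zoned disks of a volume.

    zone_sizes: list with the zone size in bytes of each zoned disk.
    Raises ValueError if the list is empty or the sizes differ.
    """
    if not zone_sizes:
        raise ValueError("no zoned disks given")
    first = zone_sizes[0]
    for size in zone_sizes[1:]:
        if size != first:
            raise ValueError(f"zone size mismatch: {size} != {first}")
    return first
```

For example, two disks with 256 MiB zones pass the check, while
combining a 256 MiB zone disk with a 128 MiB zone disk is rejected.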

* Patch series organization

Patch 1 introduces the HMZONED incompatible feature flag to indicate
that the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones
of the device (zone types and write pointer positions).

Patches 4 to 8 disable features which are not compatible with the
sequential write constraints of zoned block devices. These include
RAID5/6, space_cache, NODATACOW, TREE_LOG, and fallocate.

Patches 9 and 10 tweak the extent buffer allocation for HMZONED mode
to implement sequential block allocation in block groups and chunks.

Patches 11 and 12 handle the case where the write pointers of the
devices that compose a block group (e.g., RAID1) do not match.

Patch 13 implements a zone reset for unused block groups.

Patch 14 restricts the possible locations of super blocks to conventional
zones to preserve the existing update in-place mechanism for the super
blocks.

Patches 15 to 21 implement the serialization of allocation and
submit_bio for several types of IO (non-compressed data, compressed
data, direct IO, and metadata). These include re-dirtying once-freed
metadata blocks to prevent write holes.

Patches 22 and 23 disable features which are not compatible with the
serialization to prevent deadlocks. These include MIXED_BG and
INODE_MAP_CACHE.

Patches 24 to 26 tweak some btrfs features to work in HMZONED
mode. These include device-replace, relocation, and repairing IO
errors.

Finally, patch 27 adds the HMZONED feature to the list of supported
features.

* Patch testing note

This series is based on kdave/for-5.3-rc2.

Also, you need to cherry-pick the following commits to disable write
plugging with that branch. As described in commit b49773e7bcf3
("block: Disable write plugging for zoned block devices"), without
these commits, write plugging can reorder BIOs submitted from multiple
contexts, e.g., multiple extent_write_cached_pages().

0c8cf8c2a553 ("block: initialize the write priority in blk_rq_bio_prep")
f924cddebc90 ("block: remove blk_init_request_from_bio")
14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio")
c05f42206f4d ("blk-mq: remove blk_mq_put_ctx()")
970d168de636 ("blk-mq: simplify blk_mq_make_request()")
b49773e7bcf3 ("block: Disable write plugging for zoned block devices")

Furthermore, you need to apply the following patch if you run xfstests
with tcmu-loop disks. Without this patch, xfstests btrfs/003 fails to
"_devmgt_add" after "_devmgt_remove".

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

You can use tcmu-runner [1] to create an emulated zoned device backed
by a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v3:
 - Serialize allocation and submit_bio instead of bio buffering in btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extents that cover a super block position are banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes()
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (27):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disallow RAID5/6 in HMZONED mode
  btrfs: disallow space_cache in HMZONED mode
  btrfs: disallow NODATACOW in HMZONED mode
  btrfs: disable tree-log in HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: align device extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirroed BGs readonly only if we have at least one
    writable BG
  btrfs: ensure metadata space available on/after degraded mount in
    HMZONED
  btrfs: reset zones of unused block groups
  btrfs: limit super block locations in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: serialize data allocation and submit IOs
  btrfs: implement atomic compressed IO submission
  btrfs: support direct write IO in HMZONED
  btrfs: serialize meta IOs on HMZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async checksum/submit on HMZONED mode
  btrfs: disallow mixed-bg in HMZONED mode
  btrfs: disallow inode_cache in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable relocation in HMZONED mode
  btrfs: relocate block group to repair IO failure in HMZONED
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/compression.c      |   5 +-
 fs/btrfs/ctree.h            |  37 +-
 fs/btrfs/dev-replace.c      | 155 +++++++
 fs/btrfs/dev-replace.h      |   3 +
 fs/btrfs/disk-io.c          |  29 ++
 fs/btrfs/extent-tree.c      | 277 +++++++++++--
 fs/btrfs/extent_io.c        |  22 +-
 fs/btrfs/extent_io.h        |   2 +
 fs/btrfs/file.c             |   4 +
 fs/btrfs/free-space-cache.c |  35 ++
 fs/btrfs/free-space-cache.h |   5 +
 fs/btrfs/hmzoned.c          | 785 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          | 198 +++++++++
 fs/btrfs/inode.c            |  88 +++-
 fs/btrfs/ioctl.c            |   3 +
 fs/btrfs/relocation.c       |  39 +-
 fs/btrfs/scrub.c            |  89 +++-
 fs/btrfs/space-info.c       |  13 +-
 fs/btrfs/space-info.h       |   4 +-
 fs/btrfs/super.c            |   7 +
 fs/btrfs/sysfs.c            |   4 +
 fs/btrfs/transaction.c      |  10 +
 fs/btrfs/transaction.h      |   3 +
 fs/btrfs/volumes.c          | 207 +++++++++-
 fs/btrfs/volumes.h          |   5 +
 include/uapi/linux/btrfs.h  |   1 +
 27 files changed, 1980 insertions(+), 52 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

-- 
2.22.0



Thread overview: 39+ messages
2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
2019-08-16  4:49   ` Anand Jain
2019-08-08  9:30 ` [PATCH v3 02/27] btrfs: Get zone information of zoned block devices Naohiro Aota
2019-08-16  4:44   ` Anand Jain
2019-08-16 14:19     ` Damien Le Moal
2019-08-16 23:47       ` Anand Jain
2019-08-16 23:55         ` Damien Le Moal
2019-08-08  9:30 ` [PATCH v3 03/27] btrfs: Check and enable HMZONED mode Naohiro Aota
2019-08-16  5:46   ` Anand Jain
2019-08-16 14:23     ` Damien Le Moal
2019-08-16 23:56       ` Anand Jain
2019-08-17  0:05         ` Damien Le Moal
2019-08-20  5:07         ` Naohiro Aota
2019-08-20 13:05           ` David Sterba
2019-08-08  9:30 ` [PATCH v3 04/27] btrfs: disallow RAID5/6 in " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 05/27] btrfs: disallow space_cache " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 06/27] btrfs: disallow NODATACOW " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 07/27] btrfs: disable tree-log " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 08/27] btrfs: disable fallocate " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 09/27] btrfs: align device extent allocation to zone boundary Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 10/27] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 13/27] btrfs: reset zones of unused block groups Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 14/27] btrfs: limit super block locations in HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 15/27] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 16/27] btrfs: serialize data allocation and submit IOs Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 17/27] btrfs: implement atomic compressed IO submission Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 18/27] btrfs: support direct write IO in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 19/27] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 20/27] btrfs: wait existing extents before truncating Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 21/27] btrfs: avoid async checksum/submit on HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 22/27] btrfs: disallow mixed-bg in " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 23/27] btrfs: disallow inode_cache " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 24/27] btrfs: support dev-replace " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 25/27] btrfs: enable relocation " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 26/27] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 27/27] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
