linux-btrfs.vger.kernel.org archive mirror
* [PATCH v3 00/27] btrfs zoned block device support
@ 2019-08-08  9:30 Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
                   ` (26 more replies)
  0 siblings, 27 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

This series adds zoned block device support to btrfs.

* Summary of changes from v2

The most significant change from v2 is the serialization of sequential
block allocation and submit_bio using a per block group mutex instead of
waiting and sorting BIOs in a buffer. This per block group mutex is now
locked before allocation and released after all BIO submission
finishes. The same method is used for both data and metadata IOs.

Because a mutex is used instead of a submit buffer, we must disable
EXTENT_PREALLOC entirely in HMZONED mode to prevent deadlocks. As a
result, INODE_MAP_CACHE and MIXED_BG are disabled in HMZONED mode, and
the relocation inode is reworked to use btrfs_wait_ordered_range() after
each relocation instead of relying on a preallocated file region.

Furthermore, asynchronous checksum is disabled, in line with the
serialized block allocation and BIO submission. This allows preserving
sequential write IO order without introducing any new functionality
such as submit buffers. Async submit will be removed once we merge
the cgroup writeback support patch series.

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring
that writes be issued in LBA order from each zone's write pointer
position. This patch series ensures that the sequential write
constraint of sequential zones is respected while fundamentally not
changing btrfs block and I/O management for blocks stored in
conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation
of blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunks. To do so, an allocation
pointer is added to block groups and used as the allocation hint.  The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.
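
The allocation-pointer scheme described above can be illustrated with a
minimal userspace sketch. This is not the code from the series: the
struct and function names (bg_sketch, bg_alloc_seq, bg_free,
bg_free_space) are hypothetical, and a block group is reduced to a
single zone-aligned region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified block group: one zone-aligned region. */
struct bg_sketch {
	uint64_t start;     /* first byte of the block group */
	uint64_t length;    /* size in bytes */
	uint64_t alloc_ptr; /* offset of the next allocatable byte */
};

/*
 * Allocate num_bytes sequentially. On success, store the absolute
 * start address in *out and advance the allocation pointer.
 * Fails when the space after the pointer is too small.
 */
static bool bg_alloc_seq(struct bg_sketch *bg, uint64_t num_bytes,
			 uint64_t *out)
{
	assert(bg->alloc_ptr <= bg->length);
	if (num_bytes == 0 || num_bytes > bg->length - bg->alloc_ptr)
		return false;

	*out = bg->start + bg->alloc_ptr;
	bg->alloc_ptr += num_bytes;
	return true;
}

/*
 * Freeing never moves the pointer back: space freed below the
 * pointer is ignored in this sketch (it only becomes reusable
 * after a zone reset, which is not modeled here).
 */
static void bg_free(struct bg_sketch *bg, uint64_t start, uint64_t num_bytes)
{
	(void)bg;
	(void)start;
	(void)num_bytes;
}

/* Only the region after the allocation pointer counts as free. */
static uint64_t bg_free_space(const struct bg_sketch *bg)
{
	return bg->length - bg->alloc_ptr;
}
```

In this sketch two consecutive allocations always return adjacent,
increasing addresses, and freeing the first one does not change what
bg_free_space() reports.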

While the introduction of the allocation pointer ensures that blocks
will be allocated sequentially, I/Os to write out newly allocated
blocks may be issued out of order, causing errors when writing to
sequential zones. To preserve the ordering, this patch series adds
mutexes around allocation and submit_bio to serialize
them. This series also disables async checksum and submit to avoid
mixing the BIOs.
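
As a rough userspace illustration of that serialization (not the
kernel implementation), the sketch below holds one mutex across both
the allocation and the submission, so the order in which offsets reach
the device matches the order in which they were allocated. All names
here (zoned_bg, fake_submit, zoned_bg_write) are hypothetical, and
fake_submit() merely records offsets in place of submit_bio().

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define LOG_MAX 16

/* Hypothetical block group with its own serialization mutex. */
struct zoned_bg {
	pthread_mutex_t io_lock;     /* held across allocation AND submission */
	uint64_t alloc_ptr;          /* next byte to allocate */
	uint64_t submitted[LOG_MAX]; /* offsets in the order they were submitted */
	unsigned int nr_submitted;
};

/* Stand-in for submit_bio(): record the offset reaching the device. */
static void fake_submit(struct zoned_bg *bg, uint64_t offset)
{
	assert(bg->nr_submitted < LOG_MAX);
	bg->submitted[bg->nr_submitted++] = offset;
}

/*
 * Allocate len bytes and submit the write while holding the lock.
 * No other writer can allocate a later offset and submit it first.
 */
static uint64_t zoned_bg_write(struct zoned_bg *bg, uint64_t len)
{
	uint64_t offset;

	pthread_mutex_lock(&bg->io_lock);
	offset = bg->alloc_ptr;
	bg->alloc_ptr += len;
	fake_submit(bg, offset);
	pthread_mutex_unlock(&bg->io_lock);

	return offset;
}
```

If the lock were dropped between the allocation and the submission, two
writers could allocate offsets A < B and then submit B before A, which
is exactly the reordering a sequential zone rejects.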

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.

* Patch series organization

Patch 1 introduces the HMZONED incompatible feature flag to indicate
that the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones
of the device (zone type and write pointer position).

Patches 4 to 8 disable features which are not compatible with the
sequential write constraints of zoned block devices. These include
RAID5/6, space_cache, NODATACOW, TREE_LOG, and fallocate.

Patches 9 and 10 tweak the extent buffer allocation for HMZONED mode
to implement sequential block allocation in block groups and chunks.

Patches 11 and 12 handle the case where the write pointers of the
devices composing a block group (e.g., RAID1) do not match.

Patch 13 implements a zone reset for unused block groups.

Patch 14 restricts the possible locations of super blocks to conventional
zones to preserve the existing update in-place mechanism for the super
blocks.

Patches 15 to 21 implement the serialization of allocation and
submit_bio for several types of IO (non-compressed data, compressed
data, direct IO, and metadata). These include re-dirtying once-freed
metadata blocks to prevent write holes.

Patches 22 and 23 disable features which are not compatible with the
serialization to prevent deadlocks. These include MIXED_BG and
INODE_MAP_CACHE.

Patches 24 to 26 tweak some btrfs features to work with HMZONED
mode. These include device-replace, relocation, and repairing IO
errors.

Finally, patch 27 adds the HMZONED feature to the list of supported
features.

* Patch testing note

This series is based on kdave/for-5.3-rc2.

Also, you need to cherry-pick the following commits to disable write
plugging with that branch. As described in commit b49773e7bcf3
("block: Disable write plugging for zoned block devices"), without
these commits, write plugging can reorder BIOs submitted from multiple
contexts, e.g., multiple extent_write_cached_pages().

0c8cf8c2a553 ("block: initialize the write priority in blk_rq_bio_prep")
f924cddebc90 ("block: remove blk_init_request_from_bio")
14ccb66b3f58 ("block: remove the bi_phys_segments field in struct bio")
c05f42206f4d ("blk-mq: remove blk_mq_put_ctx()")
970d168de636 ("blk-mq: simplify blk_mq_make_request()")
b49773e7bcf3 ("block: Disable write plugging for zoned block devices")

Furthermore, you need to apply the following patch if you run xfstests
with tcmu-loop disks. xfstests btrfs/003 failed to "_devmgt_add" after
"_devmgt_remove" without this patch.

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

You can use tcmu-runner [1] to create an emulated zoned device backed
by a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v3:
 - Serialize allocation and submit_bio instead of bio buffering in btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extents that cover super block positions are banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (27):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disallow RAID5/6 in HMZONED mode
  btrfs: disallow space_cache in HMZONED mode
  btrfs: disallow NODATACOW in HMZONED mode
  btrfs: disable tree-log in HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: align device extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirroed BGs readonly only if we have at least one
    writable BG
  btrfs: ensure metadata space available on/after degraded mount in
    HMZONED
  btrfs: reset zones of unused block groups
  btrfs: limit super block locations in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: serialize data allocation and submit IOs
  btrfs: implement atomic compressed IO submission
  btrfs: support direct write IO in HMZONED
  btrfs: serialize meta IOs on HMZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async checksum/submit on HMZONED mode
  btrfs: disallow mixed-bg in HMZONED mode
  btrfs: disallow inode_cache in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable relocation in HMZONED mode
  btrfs: relocate block group to repair IO failure in HMZONED
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/compression.c      |   5 +-
 fs/btrfs/ctree.h            |  37 +-
 fs/btrfs/dev-replace.c      | 155 +++++++
 fs/btrfs/dev-replace.h      |   3 +
 fs/btrfs/disk-io.c          |  29 ++
 fs/btrfs/extent-tree.c      | 277 +++++++++++--
 fs/btrfs/extent_io.c        |  22 +-
 fs/btrfs/extent_io.h        |   2 +
 fs/btrfs/file.c             |   4 +
 fs/btrfs/free-space-cache.c |  35 ++
 fs/btrfs/free-space-cache.h |   5 +
 fs/btrfs/hmzoned.c          | 785 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          | 198 +++++++++
 fs/btrfs/inode.c            |  88 +++-
 fs/btrfs/ioctl.c            |   3 +
 fs/btrfs/relocation.c       |  39 +-
 fs/btrfs/scrub.c            |  89 +++-
 fs/btrfs/space-info.c       |  13 +-
 fs/btrfs/space-info.h       |   4 +-
 fs/btrfs/super.c            |   7 +
 fs/btrfs/sysfs.c            |   4 +
 fs/btrfs/transaction.c      |  10 +
 fs/btrfs/transaction.h      |   3 +
 fs/btrfs/volumes.c          | 207 +++++++++-
 fs/btrfs/volumes.h          |   5 +
 include/uapi/linux/btrfs.h  |   1 +
 27 files changed, 1980 insertions(+), 52 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

-- 
2.22.0



* [PATCH v3 01/27] btrfs: introduce HMZONED feature flag
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-16  4:49   ` Anand Jain
  2019-08-08  9:30 ` [PATCH v3 02/27] btrfs: Get zone information of zoned block devices Naohiro Aota
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

This patch introduces the HMZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index e6493b068294..ad708a9edd0b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -193,6 +193,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -207,6 +208,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
+	BTRFS_FEAT_ATTR_PTR(hmzoned),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	NULL
 };
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index c195896d478f..2d5e8f801135 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.22.0



* [PATCH v3 02/27] btrfs: Get zone information of zoned block devices
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-16  4:44   ` Anand Jain
  2019-08-08  9:30 ` [PATCH v3 03/27] btrfs: Check and enable HMZONED mode Naohiro Aota
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

If a zoned block device is found, get its zone information (number of zones
and zone size) using the new helper function btrfs_get_dev_zone_info(). To
avoid costly run-time zone report commands to test the zone type
during block allocation, attach the seq_zones bitmap to the device
structure to indicate whether a zone is sequential or accepts random writes.
It also attaches the empty_zones bitmap to indicate whether a zone is empty.

This patch also introduces the helper function btrfs_dev_is_sequential() to
test if the zone storing a block is a sequential write required zone and
btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
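
The arithmetic behind the zone count and the bitmap lookup can be
sketched in plain userspace C. This is only an illustration under the
same power-of-2 zone size assumption as the patch; the helper names
(zone_count, zone_size_shift, bitmap_set, bitmap_test,
pos_is_sequential) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9 /* 512-byte sectors */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of zones covering nr_sectors; a smaller trailing zone counts. */
static uint32_t zone_count(uint64_t nr_sectors, uint64_t zone_sectors)
{
	/* Zone size must be a non-zero power of 2. */
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);
	return (uint32_t)((nr_sectors + zone_sectors - 1) / zone_sectors);
}

/* log2 of the zone size in bytes, given the zone size in sectors. */
static unsigned int zone_size_shift(uint64_t zone_sectors)
{
	unsigned int shift = SKETCH_SECTOR_SHIFT;

	while (zone_sectors > 1) {
		zone_sectors >>= 1;
		shift++;
	}
	return shift;
}

static void bitmap_set(unsigned long *map, uint32_t nr)
{
	map[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

static bool bitmap_test(const unsigned long *map, uint32_t nr)
{
	return (map[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* A byte position maps to zone number pos >> shift. */
static bool pos_is_sequential(const unsigned long *seq_zones,
			      unsigned int shift, uint64_t pos)
{
	return bitmap_test(seq_zones, (uint32_t)(pos >> shift));
}
```

For example, with 256 MiB zones (524288 sectors) the shift is 28, so a
byte position of 3 * 256 MiB + 4096 falls in zone 3, and a device whose
size is ten full zones plus 100 sectors is counted as 11 zones.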

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/hmzoned.c | 162 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  79 ++++++++++++++++++++++
 fs/btrfs/volumes.c |  18 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 262 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 76a843198bcb..8d93abb31074 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o
+	   block-rsv.o delalloc-space.o hmzoned.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
new file mode 100644
index 000000000000..bfd04792dd62
--- /dev/null
+++ b/fs/btrfs/hmzoned.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota	<naohiro.aota@wdc.com>
+ *	Damien Le Moal	<damien.lemoal@wdc.com>
+ */
+
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "hmzoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES   4096
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone **zones_ret,
+			       unsigned int *nr_zones, gfp_t gfp_mask)
+{
+	struct blk_zone *zones = *zones_ret;
+	int ret;
+
+	if (!zones) {
+		zones = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+		if (!zones)
+			return -ENOMEM;
+	}
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
+				  zones, nr_zones, gfp_mask);
+	if (ret != 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	if (!*nr_zones)
+		return -EIO;
+
+	*zones_ret = zones;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
+				       sizeof(*zone_info->seq_zones),
+				       GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zone_info->empty_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
+					 sizeof(*zone_info->empty_zones),
+					 GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
+					  &zones, &nr_zones, GFP_KERNEL);
+		if (ret)
+			goto out;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	device->zone_info = zone_info;
+
+	btrfs_info_in_rcu(
+		device->fs_info,
+		"host-%s zoned block device %s, %u zones of %llu sectors",
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+
+out:
+	kfree(zones);
+
+	if (ret) {
+		kfree(zone_info->seq_zones);
+		kfree(zone_info->empty_zones);
+		kfree(zone_info);
+	}
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	kfree(zone_info->seq_zones);
+	kfree(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
new file mode 100644
index 000000000000..ffc70842135e
--- /dev/null
+++ b/fs/btrfs/hmzoned.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota	<naohiro.aota@wdc.com>
+ *	Damien Le Moal	<damien.lemoal@wdc.com>
+ */
+
+#ifndef BTRFS_HMZONED_H
+#define BTRFS_HMZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d74b74ca07af..8e5a894e7bde 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -29,6 +29,7 @@
 #include "sysfs.h"
 #include "tree-checker.h"
 #include "space-info.h"
+#include "hmzoned.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -342,6 +343,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }
 
@@ -847,6 +849,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_brelse;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2598,6 +2605,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);
 
+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2614,8 +2629,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 					 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2756,6 +2769,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7f6aa1816409..5da1f354db93 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -57,6 +57,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 
+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -77,6 +79,8 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
-- 
2.22.0



* [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 02/27] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-16  5:46   ` Anand Jain
  2019-08-08  9:30 ` [PATCH v3 04/27] btrfs: disallow RAID5/6 in " Naohiro Aota
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check if the HMZONED flag is enabled on the file system
and if the file system consists of zoned devices with equal zone size.

Additionally, as updates to the space cache are in-place, the space cache
cannot be located over sequential zones and there is no guarantee that the
device will have enough conventional zones to store this cache. Resolve
this problem by disabling the space cache completely. This does not
introduce any problems with sequential block groups: all the free space is
located after the allocation pointer and there is no free space before the
pointer, so there is no need for such a cache.

For the same reason, NODATACOW is also disabled.

INODE_MAP_CACHE is also disabled to avoid preallocation in the
INODE_MAP_CACHE inode.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  8 +++++
 fs/btrfs/disk-io.c     |  8 +++++
 fs/btrfs/hmzoned.c     | 67 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 18 ++++++++++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 ++++
 7 files changed, 110 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 299e11e6c554..a00ce8c4d678 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -713,6 +713,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
 
+	/* Zone size when in HMZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6b2e9aa83ffa..2cc3ac4d101d 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -20,6 +20,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "hmzoned.h"
 
 static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 				       int scrub_ret);
@@ -201,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "zone type of target device mismatch with the filesystem!");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);
 
 	devices = &fs_info->fs_devices->devices;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f7ee70b3d1a..8854ff2e5fa5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -40,6 +40,7 @@
 #include "compression.h"
 #include "tree-checker.h"
 #include "ref-verify.h"
+#include "hmzoned.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -3123,6 +3124,13 @@ int open_ctree(struct super_block *sb,
 
 	btrfs_free_extra_devids(fs_devices, 1);
 
+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+				ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index bfd04792dd62..512674d8f488 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -160,3 +160,70 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 
 	return 0;
 }
+
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		if (!device->bdev)
+			continue;
+		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
+		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
+		     incompat_hmzoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && incompat_hmzoned) {
+		/* No zoned block device found on HMZONED FS */
+		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!hmzoned_devices && !incompat_hmzoned)
+		goto out;
+
+	fs_info->zone_size = zone_size;
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index ffc70842135e..29cfdcabff2f 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_HMZONED_H
 #define BTRFS_HMZONED_H
 
+#include <linux/blkdev.h>
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -25,6 +27,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone, gfp_t gfp_mask);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -76,4 +79,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }
 
+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 78de9d5d80c6..d7879a5a2536 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -43,6 +43,7 @@
 #include "free-space-cache.h"
 #include "backref.h"
 #include "space-info.h"
+#include "hmzoned.h"
 #include "tests/btrfs-tests.h"
 
 #include "qgroup.h"
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e5a894e7bde..755b2ec1e0de 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2572,6 +2572,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);
-- 
2.22.0


* [PATCH v3 04/27] btrfs: disallow RAID5/6 in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (2 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 03/27] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 05/27] btrfs: disallow space_cache " Naohiro Aota
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

Supporting the RAID5/6 profiles in HMZONED mode is not trivial. For
example, non-full stripe writes require overwriting parity blocks. A
non-full stripe write stores parity computed from the data present at that
moment. A later write to the same stripe then tries to overwrite the parity
block with the new parity value. However, sequential zones do not allow
such in-place parity overwrites.
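
The parity overwrite can be illustrated with a toy XOR parity computation.
This is a standalone userspace sketch, not code from this series;
ex_parity() and the one-byte "blocks" are hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR parity over n one-byte "data blocks" of a single stripe. */
static uint8_t ex_parity(const uint8_t *blocks, size_t n)
{
	uint8_t parity = 0;

	assert(blocks != NULL);
	for (size_t i = 0; i < n; i++)
		parity ^= blocks[i];
	return parity;
}
```

With a two-block stripe, writing only the first block stores the parity of
that partial stripe. Writing the second block later yields a different
value for the same parity location, which a sequential zone cannot accept
in place.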

Furthermore, using RAID5/6 on SMR drives, which usually have a huge
capacity, incurs a large rebuild overhead. The increased rebuild time can
lead to a higher volume failure rate (e.g. an additional drive failure
during rebuild).

Thus, let's disable the RAID5/6 profiles in HMZONED mode for now.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 512674d8f488..641c83f6ea73 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -222,6 +222,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	/* RAID56 is not allowed */
+	if (btrfs_fs_incompat(fs_info, RAID56)) {
+		btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
 out:
-- 
2.22.0


* [PATCH v3 05/27] btrfs: disallow space_cache in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (3 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 04/27] btrfs: disallow RAID5/6 in " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 06/27] btrfs: disallow NODATACOW " Naohiro Aota
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

As updates to the space cache are in-place, the space cache cannot be
located over sequential zones, and there is no guarantee that the device
will have enough conventional zones to store this cache. Resolve this
problem by disabling the space cache completely. This does not introduce
any problems with sequential block groups: all the free space is located
after the allocation pointer and there is none before it, so there is no
need for such a cache.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h |  1 +
 fs/btrfs/super.c   | 10 ++++++++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 641c83f6ea73..99a03ab3b5de 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -234,3 +234,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_fs_incompat(info, HMZONED))
+		return 0;
+
+	/*
+	 * SPACE CACHE writing is not CoWed. Disable that to avoid
+	 * write errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+		  "cannot enable disk space caching with HMZONED mode");
+		return -EINVAL;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 29cfdcabff2f..83579b2dc0a4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -28,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d7879a5a2536..496d8b74f9a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -440,8 +440,12 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_fs_incompat(info, HMZONED))
+			WARN_ON(1);
+		else
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -877,6 +881,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		ret = -EINVAL;
 
 	}
+	if (!ret)
+		ret = btrfs_check_mountopts_hmzoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
-- 
2.22.0


* [PATCH v3 06/27] btrfs: disallow NODATACOW in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (4 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 05/27] btrfs: disallow space_cache " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 07/27] btrfs: disable tree-log " Naohiro Aota
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

NODATACOW implies overwriting the file data on a device, which is
impossible in sequential write required zones. Disable NODATACOW both
globally, by rejecting the mount option, and per file, by masking
FS_NOCOW_FL.
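
As a minimal illustration of the per-file part, the sketch below clears the
NOCOW bit from a set of inode flags when the filesystem is in HMZONED mode.
It is standalone userspace C: EX_NOCOW_FL mirrors the value of FS_NOCOW_FL
from <linux/fs.h>, and ex_mask_fsflags() is a hypothetical stand-in for the
masking helper, not the kernel function itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit value as FS_NOCOW_FL in <linux/fs.h>; redefined to stay standalone. */
#define EX_NOCOW_FL 0x00800000u

/* Drop the NOCOW request in HMZONED mode, leave all other flags untouched. */
static uint32_t ex_mask_fsflags(uint32_t flags, bool hmzoned)
{
	if (hmzoned)
		flags &= ~EX_NOCOW_FL;
	return flags;
}
```

Because the bit is masked rather than rejected, a request to set NOCOW on a
file is dropped while the other requested flags still apply.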

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/ioctl.c   | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 99a03ab3b5de..0770b1f58bd9 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -250,5 +250,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info,
+		  "cannot enable nodatacow with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d0743ec1231d..06783c489023 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -93,6 +93,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 		unsigned int flags)
 {
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		flags &= ~FS_NOCOW_FL;
+
 	if (S_ISDIR(inode->i_mode))
 		return flags;
 	else if (S_ISREG(inode->i_mode))
-- 
2.22.0


* [PATCH v3 07/27] btrfs: disable tree-log in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (5 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 06/27] btrfs: disallow NODATACOW " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 08/27] btrfs: disable fallocate " Naohiro Aota
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

Extent buffers for the tree-log tree are scattered among the extent
buffers of other metadata, and btrfs_sync_log() writes out only the
tree-log buffers. This behavior breaks the sequential write rule, which is
mandatory in sequential write required zones.

Moreover, tree-logging offers little benefit in HMZONED mode until
tree-log buffers can be allocated sequentially. So, disable tree-log
entirely in HMZONED mode.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/super.c   | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 0770b1f58bd9..e07e76af1e82 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -256,5 +256,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (!btrfs_test_opt(info, NOTREELOG)) {
+		btrfs_err(info,
+		  "cannot enable tree log with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 496d8b74f9a2..396238e099bc 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -447,6 +447,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
 	}
 
+	if (btrfs_fs_incompat(info, HMZONED))
+		btrfs_set_and_info(info, NOTREELOG,
+				   "disabling tree log with HMZONED mode");
+
 	/*
 	 * Even the options are empty, we still need to do extra check
 	 * against new flags
-- 
2.22.0


* [PATCH v3 08/27] btrfs: disable fallocate in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (6 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 07/27] btrfs: disable tree-log " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 09/27] btrfs: align device extent allocation to zone boundary Naohiro Aota
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

fallocate() is implemented by allocating actual extents rather than by
making reservations. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break the
POSIX semantics for the fallocated file. To avoid this, report fallocate()
as not supported in HMZONED mode for now.

In the future, we may be able to implement an "in-memory" fallocate() in
HMZONED mode by utilizing space_info->bytes_may_use or similar.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58a18ed11546..7474010a997d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3023,6 +3023,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in HMZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
-- 
2.22.0


* [PATCH v3 09/27] btrfs: align device extent allocation to zone boundary
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (7 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 08/27] btrfs: disable fallocate " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 10/27] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

In HMZONED mode, align the device extents to zone boundaries so that a zone
reset affects only the device extent and does not change the state of
blocks in the neighboring device extents. Also, check that a region
allocation always covers empty zones of the same type and does not overlap
any location of the super block copies.

This patch also adds a check in verify_one_dev_extent() that the device
extent is aligned to a zone boundary.
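
The two checks can be sketched in standalone userspace C as follows. The
ex_* names are hypothetical; the super block offsets (64 KiB, 64 MiB,
256 GiB) and the 4 KiB super block size are the usual btrfs values, and the
zone size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_SUPER_INFO_SIZE 4096ULL

/* Locations of the primary super block and its two mirrors. */
static const uint64_t ex_sb_offsets[] = {
	64ULL << 10, 64ULL << 20, 256ULL << 30,
};

/* Round pos up to the next multiple of zone_size (must be a power of two). */
static uint64_t ex_zone_align(uint64_t pos, uint64_t zone_size)
{
	assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/* True if the byte range [pos, pos + len) overlaps any super block copy. */
static bool ex_overlaps_super(uint64_t pos, uint64_t len)
{
	size_t n = sizeof(ex_sb_offsets) / sizeof(ex_sb_offsets[0]);

	for (size_t i = 0; i < n; i++) {
		uint64_t sb = ex_sb_offsets[i];

		if (sb + EX_SUPER_INFO_SIZE > pos && pos + len > sb)
			return true;
	}
	return false;
}
```

With 256 MiB zones, for example, the first zone holds two super block
copies, while the second zone is free of them.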

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c |  6 ++++
 fs/btrfs/hmzoned.c     | 56 ++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 10 ++++++
 fs/btrfs/volumes.c     | 72 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 144 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d3b58e388535..3a36646dfaa8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7637,6 +7637,12 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free = div64_u64(min_free, dev_min);
 	}
 
+	/* We cannot allocate size less than zone_size anyway */
+	if (index == BTRFS_RAID_DUP)
+		min_free = max_t(u64, min_free, 2 * fs_info->zone_size);
+	else
+		min_free = max_t(u64, min_free, fs_info->zone_size);
+
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 dev_offset;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index e07e76af1e82..7d334b236cd3 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -12,6 +12,7 @@
 #include "volumes.h"
 #include "hmzoned.h"
 #include "rcu-string.h"
+#include "disk-io.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -264,3 +265,58 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 
 	return 0;
 }
+
+/*
+ * btrfs_check_allocatable_zones - check if specified region is
+ *                                 suitable for allocation
+ * @device:	the device to allocate a region
+ * @pos:	the position of the region
+ * @num_bytes:	the size of the region
+ *
+ * In non-ZONED device, anywhere is suitable for allocation. In ZONED
+ * device, check if
+ * 1) the region is not on non-empty zones,
+ * 2) all zones in the region have the same zone type,
+ * 3) it does not contain super block location, if the zones are
+ *    sequential.
+ */
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u64 nzones, begin, end;
+	u64 sb_pos;
+	u8 shift;
+	int i;
+
+	if (!zinfo)
+		return true;
+
+	shift = zinfo->zone_size_shift;
+	nzones = num_bytes >> shift;
+	begin = pos >> shift;
+	end = begin + nzones;
+
+	ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	if (end > zinfo->nr_zones)
+		return false;
+
+	/* check if zones in the region are all empty */
+	if (find_next_zero_bit(zinfo->empty_zones, end, begin) != end)
+		return false;
+
+	if (btrfs_dev_is_sequential(device, pos)) {
+		for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+			sb_pos = btrfs_sb_offset(i);
+			if (!(sb_pos + BTRFS_SUPER_INFO_SIZE <= pos ||
+			      pos + num_bytes <= sb_pos))
+				return false;
+		}
+
+		return find_next_zero_bit(zinfo->seq_zones, end, begin) == end;
+	}
+
+	return find_next_bit(zinfo->seq_zones, end, begin) == end;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 83579b2dc0a4..396ece5f9410 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -29,6 +29,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -95,4 +97,12 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
 	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
 }
 
+static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
+{
+	if (!device->zone_info)
+		return pos;
+
+	return ALIGN(pos, device->zone_info->zone_size);
+}
+
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 755b2ec1e0de..265a1496e459 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1572,6 +1572,7 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	u64 max_hole_size;
 	u64 extent_end;
 	u64 search_end = device->total_bytes;
+	u64 zone_size = 0;
 	int ret;
 	int slot;
 	struct extent_buffer *l;
@@ -1582,6 +1583,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	 * at an offset of at least 1MB.
 	 */
 	search_start = max_t(u64, search_start, SZ_1M);
+	/*
+	 * For a zoned block device, skip the first zone of the device
+	 * entirely.
+	 */
+	if (device->zone_info)
+		zone_size = device->zone_info->zone_size;
+	search_start = max_t(u64, search_start, zone_size);
+	search_start = btrfs_zone_align(device, search_start);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -1646,12 +1655,21 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 			 */
 			if (contains_pending_extent(device, &search_start,
 						    hole_size)) {
+				search_start = btrfs_zone_align(device,
+								search_start);
 				if (key.offset >= search_start)
 					hole_size = key.offset - search_start;
 				else
 					hole_size = 0;
 			}
 
+			if (!btrfs_check_allocatable_zones(device, search_start,
+							   num_bytes)) {
+				search_start += zone_size;
+				btrfs_release_path(path);
+				goto again;
+			}
+
 			if (hole_size > max_hole_size) {
 				max_hole_start = search_start;
 				max_hole_size = hole_size;
@@ -1691,6 +1709,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 		hole_size = search_end - search_start;
 
 		if (contains_pending_extent(device, &search_start, hole_size)) {
+			search_start = btrfs_zone_align(device, search_start);
+			btrfs_release_path(path);
+			goto again;
+		}
+
+		if (!btrfs_check_allocatable_zones(device, search_start,
+						   num_bytes)) {
+			search_start += zone_size;
 			btrfs_release_path(path);
 			goto again;
 		}
@@ -1708,6 +1734,7 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 		ret = 0;
 
 out:
+	ASSERT(zone_size == 0 || IS_ALIGNED(max_hole_start, zone_size));
 	btrfs_free_path(path);
 	*start = max_hole_start;
 	if (len)
@@ -4964,6 +4991,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	int hmzoned = btrfs_fs_incompat(info, HMZONED);
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -5004,10 +5032,20 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		BUG();
 	}
 
+	if (hmzoned) {
+		max_stripe_size = info->zone_size;
+		max_chunk_size = round_down(max_chunk_size, info->zone_size);
+	}
+
 	/* We don't want a chunk larger than 10% of writable space */
 	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
 			     max_chunk_size);
 
+	if (hmzoned)
+		max_chunk_size = max(round_down(max_chunk_size,
+						info->zone_size),
+				     info->zone_size);
+
 	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
 			       GFP_NOFS);
 	if (!devices_info)
@@ -5042,6 +5080,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		if (total_avail == 0)
 			continue;
 
+		if (hmzoned && total_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		ret = find_free_dev_extent(device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);
@@ -5060,6 +5101,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
+		if (hmzoned && max_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		if (ndevs == fs_devices->rw_devices) {
 			WARN(1, "%s: found more than %llu devices\n",
 			     __func__, fs_devices->rw_devices);
@@ -5093,6 +5137,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 
 	ndevs = min(ndevs, devs_max);
 
+again:
 	/*
 	 * The primary goal is to maximize the number of stripes, so use as
 	 * many devices as possible, even if the stripes are not maximum sized.
@@ -5116,6 +5161,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * we try to reduce stripe_size.
 	 */
 	if (stripe_size * data_stripes > max_chunk_size) {
+		if (hmzoned) {
+			/*
+			 * stripe_size is fixed in HMZONED. Reduce ndevs
+			 * instead.
+			 */
+			ASSERT(nparity == 0);
+			ndevs = div_u64(max_chunk_size * ncopies,
+					stripe_size * dev_stripes);
+			goto again;
+		}
+
 		/*
 		 * Reduce stripe_size, round it up to a 16MB boundary again and
 		 * then use it, unless it ends up being even bigger than the
@@ -5129,6 +5185,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	/* align to BTRFS_STRIPE_LEN */
 	stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
+	ASSERT(!hmzoned || stripe_size == info->zone_size);
+
 	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
 	if (!map) {
 		ret = -ENOMEM;
@@ -7755,6 +7813,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 		ret = -EUCLEAN;
 		goto out;
 	}
+
+	if (dev->zone_info) {
+		u64 zone_size = dev->zone_info->zone_size;
+
+		if (!IS_ALIGNED(physical_offset, zone_size) ||
+		    !IS_ALIGNED(physical_len, zone_size)) {
+			btrfs_err(fs_info,
+"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
+				  devid, physical_offset, physical_len);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
 out:
 	free_extent_map(em);
 	return ret;
-- 
2.22.0


* [PATCH v3 10/27] btrfs: do sequential extent allocation in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (8 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 09/27] btrfs: align device extent allocation to zone boundary Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

On HMZONED drives, writes must always be sequential and directed at a block
group zone write pointer position. Thus, block allocation in a block group
must also be done sequentially using an allocation pointer equal to the
block group zone write pointer plus the number of blocks allocated but not
yet written.
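
A minimal sketch of such an allocation pointer, in standalone userspace C.
struct ex_seq_bg and ex_alloc_seq() are hypothetical simplifications;
locking and reservation accounting are left out on purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified sequential block group. */
struct ex_seq_bg {
	uint64_t start;		/* logical start address of the block group */
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* write pointer plus allocated-but-unwritten bytes */
};

/*
 * Hand out num_bytes at the allocation pointer and advance it.
 * Returns 0 and stores the address in *found, or 1 if the remaining
 * tail of the block group is too small.
 */
static int ex_alloc_seq(struct ex_seq_bg *bg, uint64_t num_bytes,
			uint64_t *found)
{
	uint64_t avail;

	assert(bg->alloc_offset <= bg->length);
	avail = bg->length - bg->alloc_offset;
	if (avail < num_bytes)
		return 1;

	*found = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	return 0;
}
```

Successive calls return back-to-back addresses, so the resulting writes can
be issued in the order the zone requires.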

The sequential allocation function find_free_extent_seq() bypasses the
checks in find_free_extent() and increases the reserved byte counter by
itself. A region cannot be given back once it has been allocated
sequentially, since doing so might race with other allocations and leave an
allocation hole, which breaks the sequential write rule.

Furthermore, this commit introduces two new fields in struct
btrfs_block_group_cache. "wp_broken" indicates that the write pointer is
broken (e.g. not synced on a RAID1 block group) and marks that block group
read only. "zone_unusable" keeps track of the size of regions in a block
group that were once allocated and then freed. Such regions are not usable
until the underlying zones are reset.

This commit also introduces "bytes_zone_unusable" to track such unusable
bytes in a space_info. Pinned bytes are always reclaimed to
"bytes_zone_unusable". They are not usable until the zones are reset.
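
The accounting can be pictured with a small standalone userspace C sketch.
struct ex_zoned_bg and its helpers are hypothetical and only model the
counters; they are not the structures added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified counters of a sequential block group. */
struct ex_zoned_bg {
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* bytes handed out so far */
	uint64_t used;		/* bytes still referenced */
	uint64_t zone_unusable;	/* bytes freed but not reusable before a reset */
};

/* Freeing moves bytes from "used" to "zone_unusable". */
static void ex_free_bytes(struct ex_zoned_bg *bg, uint64_t len)
{
	assert(len <= bg->used);
	bg->used -= len;
	bg->zone_unusable += len;
}

/* Only the tail after the allocation offset can still be allocated. */
static uint64_t ex_allocatable(const struct ex_zoned_bg *bg)
{
	assert(bg->alloc_offset <= bg->length);
	return bg->length - bg->alloc_offset;
}
```

Freeing an extent leaves the allocatable size unchanged: the freed bytes
only become available again after the zone is reset.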

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h            |  25 ++++
 fs/btrfs/extent-tree.c      | 179 +++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c |  35 ++++++
 fs/btrfs/free-space-cache.h |   5 +
 fs/btrfs/hmzoned.c          | 231 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          |   1 +
 fs/btrfs/space-info.c       |  13 +-
 fs/btrfs/space-info.h       |   4 +-
 fs/btrfs/sysfs.c            |   2 +
 9 files changed, 471 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a00ce8c4d678..3d31a1960c4d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -482,6 +482,20 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Block group allocation types */
+enum btrfs_alloc_type {
+
+	/* Regular first fit allocation */
+	BTRFS_ALLOC_FIT		= 0,
+
+	/*
+	 * Sequential allocation: this is for HMZONED mode and
+	 * will result in ignoring free space before a block
+	 * group allocation offset.
+	 */
+	BTRFS_ALLOC_SEQ		= 1,
+};
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -521,6 +535,7 @@ struct btrfs_block_group_cache {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int wp_broken:1;
 
 	int disk_cache_state;
 
@@ -594,6 +609,16 @@ struct btrfs_block_group_cache {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	enum btrfs_alloc_type alloc_type;
+	u64 zone_unusable;
+	/*
+	 * Allocation offset for the block group to implement
+	 * sequential allocation. This is used only with HMZONED mode
+	 * enabled and if the block group resides on a sequential
+	 * zone.
+	 */
+	u64 alloc_offset;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a36646dfaa8..d2aacffe14d6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -31,6 +31,8 @@
 #include "space-info.h"
 #include "block-rsv.h"
 #include "delalloc-space.h"
+#include "rcu-string.h"
+#include "hmzoned.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -543,6 +545,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	struct btrfs_caching_control *caching_ctl;
 	int ret = 0;
 
+	ASSERT(cache->alloc_type == BTRFS_ALLOC_FIT);
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
@@ -4429,6 +4433,20 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
 }
 
+static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				       u64 ram_bytes, u64 num_bytes,
+				       int delalloc)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+
+	cache->reserved += num_bytes;
+	space_info->bytes_reserved += num_bytes;
+	btrfs_space_info_update_bytes_may_use(cache->fs_info, space_info,
+					      -ram_bytes);
+	if (delalloc)
+		cache->delalloc_bytes += num_bytes;
+}
+
 /**
  * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
@@ -4447,18 +4465,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
 
+	/* should be handled by find_free_extent_seq */
+	ASSERT(cache->alloc_type != BTRFS_ALLOC_SEQ);
+
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
-	if (cache->ro) {
+	if (cache->ro)
 		ret = -EAGAIN;
-	} else {
-		cache->reserved += num_bytes;
-		space_info->bytes_reserved += num_bytes;
-		btrfs_space_info_update_bytes_may_use(cache->fs_info,
-						      space_info, -ram_bytes);
-		if (delalloc)
-			cache->delalloc_bytes += num_bytes;
-	}
+	else
+		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
+					   delalloc);
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -4576,9 +4592,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 			cache = btrfs_lookup_block_group(fs_info, start);
 			BUG_ON(!cache); /* Logic error */
 
-			cluster = fetch_cluster_info(fs_info,
-						     cache->space_info,
-						     &empty_cluster);
+			if (cache->alloc_type == BTRFS_ALLOC_FIT)
+				cluster = fetch_cluster_info(fs_info,
+							     cache->space_info,
+							     &empty_cluster);
+			else
+				cluster = NULL;
+
 			empty_cluster <<= 1;
 		}
 
@@ -4618,7 +4638,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		space_info->max_extent_size = 0;
 		percpu_counter_add_batch(&space_info->total_bytes_pinned,
 			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
-		if (cache->ro) {
+		if (cache->alloc_type == BTRFS_ALLOC_SEQ) {
+			/* need reset before reusing in ALLOC_SEQ BG */
+			space_info->bytes_zone_unusable += len;
+			readonly = true;
+		} else if (cache->ro) {
 			space_info->bytes_readonly += len;
 			readonly = true;
 		}
@@ -5464,6 +5488,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg,
 	return 0;
 }
 
+/*
+ * Simple allocator for sequential only block group. It only allows
+ * sequential allocation. No need to play with trees. This function
+ * also reserves the bytes as in btrfs_add_reserved_bytes.
+ */
+
+static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
+				struct find_free_extent_ctl *ffe_ctl)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
+	u64 start = cache->key.objectid;
+	u64 num_bytes = ffe_ctl->num_bytes;
+	u64 avail;
+	int ret = 0;
+
+	/* Sanity check */
+	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return 1;
+
+	spin_lock(&space_info->lock);
+	spin_lock(&cache->lock);
+
+	if (cache->ro) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	spin_lock(&ctl->tree_lock);
+	avail = cache->key.offset - cache->alloc_offset;
+	if (avail < num_bytes) {
+		ffe_ctl->max_extent_size = avail;
+		spin_unlock(&ctl->tree_lock);
+		ret = 1;
+		goto out;
+	}
+
+	ffe_ctl->found_offset = start + cache->alloc_offset;
+	cache->alloc_offset += num_bytes;
+	ctl->free_space -= num_bytes;
+	spin_unlock(&ctl->tree_lock);
+
+	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
+			  cache->fs_info->stripesize));
+	ffe_ctl->search_start = ffe_ctl->found_offset;
+	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
+				   ffe_ctl->delalloc);
+
+out:
+	spin_unlock(&cache->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
+
 /*
  * Return >0 means caller needs to re-search for free extent
  * Return 0 means we have the needed free extent.
@@ -5764,6 +5842,17 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
 			goto loop;
 
+		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+			ret = find_free_extent_seq(block_group, &ffe_ctl);
+			if (ret)
+				goto loop;
+			/*
+			 * find_free_extent_seq should ensure that
+			 * everything is OK and reserve the extent.
+			 */
+			goto nocheck;
+		}
+
 		/*
 		 * Ok we want to try and use the cluster allocator, so
 		 * lets look there
@@ -5819,6 +5908,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 					     num_bytes);
 			goto loop;
 		}
+nocheck:
 		btrfs_inc_block_group_reservations(block_group);
 
 		/* we are all good, lets return */
@@ -7370,7 +7460,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 	}
 
 	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
-		    cache->bytes_super - btrfs_block_group_used(&cache->item);
+		    cache->bytes_super - cache->zone_unusable -
+		    btrfs_block_group_used(&cache->item);
 	sinfo_used = btrfs_space_info_used(sinfo, true);
 
 	if (sinfo_used + num_bytes + min_allocable_bytes <=
@@ -7519,6 +7610,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
 	if (!--cache->ro) {
 		num_bytes = cache->key.offset - cache->reserved -
 			    cache->pinned - cache->bytes_super -
+			    cache->zone_unusable -
 			    btrfs_block_group_used(&cache->item);
 		sinfo->bytes_readonly -= num_bytes;
 		list_del_init(&cache->ro_list);
@@ -7989,6 +8081,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
+	cache->alloc_type = BTRFS_ALLOC_FIT;
 
 	return cache;
 }
@@ -8061,6 +8154,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	int need_clear = 0;
 	u64 cache_gen;
 	u64 feature;
+	u64 unusable = 0;
 	int mixed;
 
 	feature = btrfs_super_incompat_flags(info->super_copy);
@@ -8130,6 +8224,14 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		key.objectid = found_key.objectid + found_key.offset;
 		btrfs_release_path(path);
 
+		ret = btrfs_load_block_group_zone_info(cache);
+		if (ret) {
+			btrfs_err(info, "failed to load zone info of bg %llu",
+				  cache->key.objectid);
+			btrfs_put_block_group(cache);
+			goto error;
+		}
+
 		/*
 		 * We need to exclude the super stripes now so that the space
 		 * info has super bytes accounted for, otherwise we'll think
@@ -8166,6 +8268,31 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			free_excluded_extents(cache);
 		}
 
+		if (cache->alloc_type == BTRFS_ALLOC_SEQ) {
+			u64 free;
+
+			WARN_ON(cache->bytes_super != 0);
+			if (!cache->wp_broken) {
+				unusable = cache->alloc_offset -
+					btrfs_block_group_used(&cache->item);
+				free = cache->key.offset - cache->alloc_offset;
+			} else {
+				unusable = cache->key.offset -
+					btrfs_block_group_used(&cache->item);
+				free = 0;
+			}
+			/* we only need ->free_space in ALLOC_SEQ BGs */
+			cache->last_byte_to_unpin = (u64)-1;
+			cache->cached = BTRFS_CACHE_FINISHED;
+			cache->free_space_ctl->free_space = free;
+			cache->zone_unusable = unusable;
+			/*
+			 * Should not have any excluded extents. Just
+			 * in case, though.
+			 */
+			free_excluded_extents(cache);
+		}
+
 		ret = btrfs_add_block_group_cache(info, cache);
 		if (ret) {
 			btrfs_remove_free_space_cache(cache);
@@ -8176,7 +8303,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		trace_btrfs_add_block_group(info, cache, 0);
 		btrfs_update_space_info(info, cache->flags, found_key.offset,
 					btrfs_block_group_used(&cache->item),
-					cache->bytes_super, &space_info);
+					cache->bytes_super, unusable,
+					&space_info);
 
 		cache->space_info = space_info;
 
@@ -8189,6 +8317,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			ASSERT(list_empty(&cache->bg_list));
 			btrfs_mark_bg_unused(cache);
 		}
+
+		if (cache->wp_broken)
+			inc_block_group_ro(cache, 1);
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
@@ -8282,6 +8413,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	cache->last_byte_to_unpin = (u64)-1;
 	cache->cached = BTRFS_CACHE_FINISHED;
 	cache->needs_free_space = 1;
+
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	ret = exclude_super_stripes(cache);
 	if (ret) {
 		/*
@@ -8326,7 +8464,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	 */
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
-				cache->bytes_super, &cache->space_info);
+				cache->bytes_super, 0, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
@@ -8576,12 +8714,17 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		WARN_ON(block_group->space_info->total_bytes
 			< block_group->key.offset);
 		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->key.offset);
+			< block_group->key.offset - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->key.offset * factor);
 	}
 	block_group->space_info->total_bytes -= block_group->key.offset;
-	block_group->space_info->bytes_readonly -= block_group->key.offset;
+	block_group->space_info->bytes_readonly -=
+		(block_group->key.offset - block_group->zone_unusable);
+	block_group->space_info->bytes_zone_unusable -=
+		block_group->zone_unusable;
 	block_group->space_info->disk_total -= block_group->key.offset * factor;
 
 	spin_unlock(&block_group->space_info->lock);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 062be9dde4c6..2aeb3620645c 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes)
 {
 	struct btrfs_free_space *info;
+	struct btrfs_block_group_cache *block_group = ctl->private;
 	int ret = 0;
 
+	ASSERT(!block_group || block_group->alloc_type != BTRFS_ALLOC_SEQ);
+
 	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
 	if (!info)
 		return -ENOMEM;
@@ -2376,6 +2379,30 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
+			       u64 bytenr, u64 size)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 offset = bytenr - block_group->key.objectid;
+	u64 to_free, to_unusable;
+
+	spin_lock(&ctl->tree_lock);
+	if (block_group->wp_broken)
+		to_free = 0;
+	else if (offset >= block_group->alloc_offset)
+		to_free = size;
+	else if (offset + size <= block_group->alloc_offset)
+		to_free = 0;
+	else
+		to_free = offset + size - block_group->alloc_offset;
+	to_unusable = size - to_free;
+
+	ctl->free_space += to_free;
+	block_group->zone_unusable += to_unusable;
+	spin_unlock(&ctl->tree_lock);
+	return 0;
+}
+
 int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 			    u64 offset, u64 bytes)
 {
@@ -2384,6 +2411,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 	int ret;
 	bool re_search = false;
 
+	ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ);
+
 	spin_lock(&ctl->tree_lock);
 
 again:
@@ -2619,6 +2648,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 	u64 align_gap = 0;
 	u64 align_gap_len = 0;
 
+	ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ);
+
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
 				block_group->full_stripe_len, max_extent_size);
@@ -2738,6 +2769,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 	struct rb_node *node;
 	u64 ret = 0;
 
+	ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ);
+
 	spin_lock(&cluster->lock);
 	if (bytes > cluster->max_size)
 		goto out;
@@ -3384,6 +3417,8 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 {
 	int ret;
 
+	ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ);
+
 	*trimmed = 0;
 
 	spin_lock(&block_group->lock);
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 8760acb55ffd..d30667784f73 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -73,10 +73,15 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group);
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
 			   u64 bytenr, u64 size);
+int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
+			       u64 bytenr, u64 size);
 static inline int
 btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
 		     u64 bytenr, u64 size)
 {
+	if (block_group->alloc_type == BTRFS_ALLOC_SEQ)
+		return __btrfs_add_free_space_seq(block_group, bytenr, size);
+
 	return __btrfs_add_free_space(block_group->fs_info,
 				      block_group->free_space_ctl,
 				      bytenr, size);
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 7d334b236cd3..89631f5f01f2 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -17,6 +17,9 @@
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
 
+/* Invalid allocation pointer value for missing devices */
+#define WP_MISSING_DEV ((u64)-1)
+
 static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 			       struct blk_zone **zones_ret,
 			       unsigned int *nr_zones, gfp_t gfp_mask)
@@ -320,3 +323,231 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 
 	return find_next_bit(zinfo->seq_zones, end, begin) == end;
 }
+
+int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_device *device;
+	u64 logical = cache->key.objectid;
+	u64 length = cache->key.offset;
+	u64 physical = 0;
+	int ret = 0, alloc_type;
+	int i, j;
+	u64 *alloc_offsets = NULL;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
+			  logical, length);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, logical, length);
+	read_unlock(&em_tree->lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = em->map_lookup;
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	alloc_type = -1;
+	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
+				GFP_NOFS);
+	if (!alloc_offsets) {
+		free_extent_map(em);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		bool is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->bdev == NULL) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (alloc_type == -1)
+			alloc_type = is_sequential ?
+					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
+
+		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
+		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
+			btrfs_err(fs_info, "found block group of mixed zone types");
+			ret = -EIO;
+			goto out;
+		}
+
+		if (!is_sequential)
+			continue;
+
+		/*
+		 * This zone will be used for allocation, so mark this
+		 * zone non-empty.
+		 */
+		btrfs_dev_clear_zone_empty(device, physical);
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
+		if (ret == -EIO || ret == -EOPNOTSUPP) {
+			ret = 0;
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		} else if (ret) {
+			goto out;
+		}
+
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			btrfs_err(
+				fs_info, "Offline/readonly zone %llu",
+				physical >> device->zone_info->zone_size_shift);
+			alloc_offsets[i] = WP_MISSING_DEV;
+			break;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = fs_info->zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] =
+				((zone.wp - zone.start) << SECTOR_SHIFT);
+			break;
+		}
+	}
+
+	if (alloc_type == BTRFS_ALLOC_FIT)
+		goto out;
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+		cache->alloc_offset = WP_MISSING_DEV;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV)
+				continue;
+			if (cache->alloc_offset == WP_MISSING_DEV)
+				cache->alloc_offset = alloc_offsets[i];
+			if (alloc_offsets[i] == cache->alloc_offset)
+				continue;
+
+			btrfs_err(fs_info,
+				  "write pointer mismatch: block group %llu",
+				  logical);
+			cache->wp_broken = 1;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID0:
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV) {
+				btrfs_err(fs_info,
+					  "cannot recover write pointer: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[0] < alloc_offsets[i]) {
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			cache->alloc_offset += alloc_offsets[i];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+		/*
+		 * Pass1: check write pointer of RAID1 level: each pointer
+		 * should be equal.
+		 */
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+			u64 offset = WP_MISSING_DEV;
+
+			for (j = 0; j < map->sub_stripes; j++) {
+				if (alloc_offsets[base + j] == WP_MISSING_DEV)
+					continue;
+				if (offset == WP_MISSING_DEV)
+					offset = alloc_offsets[base+j];
+				if (alloc_offsets[base + j] == offset)
+					continue;
+
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+			}
+			for (j = 0; j < map->sub_stripes; j++)
+				alloc_offsets[base + j] = offset;
+		}
+
+		/* Pass2: check write pointer of RAID0 level */
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+
+			if (alloc_offsets[base] == WP_MISSING_DEV) {
+				btrfs_err(fs_info,
+					  "cannot recover write pointer: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[0] < alloc_offsets[base]) {
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			cache->alloc_offset += alloc_offsets[base];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* RAID5/6 is not supported yet */
+	default:
+		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
+			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	cache->alloc_type = alloc_type;
+	kfree(alloc_offsets);
+	free_extent_map(em);
+
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 396ece5f9410..399d9e9543aa 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -31,6 +31,7 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
+int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index ab7b9ec4c240..4c6457bd1b9c 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -15,6 +15,7 @@ u64 btrfs_space_info_used(struct btrfs_space_info *s_info,
 	ASSERT(s_info);
 	return s_info->bytes_used + s_info->bytes_reserved +
 		s_info->bytes_pinned + s_info->bytes_readonly +
+		s_info->bytes_zone_unusable +
 		(may_use_included ? s_info->bytes_may_use : 0);
 }
 
@@ -133,7 +134,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
@@ -149,6 +150,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
 	found->bytes_readonly += bytes_readonly;
+	found->bytes_zone_unusable += bytes_zone_unusable;
 	if (total_bytes > 0)
 		found->full = 0;
 	btrfs_space_info_add_new_bytes(info, found,
@@ -372,10 +374,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 		   info->total_bytes - btrfs_space_info_used(info, true),
 		   info->full ? "" : "not ");
 	btrfs_info(fs_info,
-		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu",
+		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu",
 		info->total_bytes, info->bytes_used, info->bytes_pinned,
 		info->bytes_reserved, info->bytes_may_use,
-		info->bytes_readonly);
+		info->bytes_readonly, info->bytes_zone_unusable);
 	spin_unlock(&info->lock);
 
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
@@ -392,10 +394,11 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(cache, &info->block_groups[index], list) {
 		spin_lock(&cache->lock);
 		btrfs_info(fs_info,
-			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s",
+			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved zone_unusable %llu %s",
 			cache->key.objectid, cache->key.offset,
 			btrfs_block_group_used(&cache->item), cache->pinned,
-			cache->reserved, cache->ro ? "[readonly]" : "");
+			cache->reserved, cache->zone_unusable,
+			cache->ro ? "[readonly]" : "");
 		btrfs_dump_free_space(cache, bytes);
 		spin_unlock(&cache->lock);
 	}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index c2b54b8e1a14..b3837b2c41e4 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -17,6 +17,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 bytes_zone_unusable;	/* total bytes that are unusable until
+					   resetting the device zone */
 
 	u64 max_extent_size;	/* This will hold the maximum extent size of
 				   the space info if we had an ENOSPC in the
@@ -115,7 +117,7 @@ void btrfs_space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
 					       u64 flags);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index ad708a9edd0b..37733ec8e437 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -349,6 +349,7 @@ SPACE_INFO_ATTR(bytes_pinned);
 SPACE_INFO_ATTR(bytes_reserved);
 SPACE_INFO_ATTR(bytes_may_use);
 SPACE_INFO_ATTR(bytes_readonly);
+SPACE_INFO_ATTR(bytes_zone_unusable);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
 BTRFS_ATTR(space_info, total_bytes_pinned,
@@ -362,6 +363,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, bytes_reserved),
 	BTRFS_ATTR_PTR(space_info, bytes_may_use),
 	BTRFS_ATTR_PTR(space_info, bytes_readonly),
+	BTRFS_ATTR_PTR(space_info, bytes_zone_unusable),
 	BTRFS_ATTR_PTR(space_info, disk_used),
 	BTRFS_ATTR_PTR(space_info, disk_total),
 	BTRFS_ATTR_PTR(space_info, total_bytes_pinned),
-- 
2.22.0
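
For readers following the accounting in this patch, here is a minimal
user-space sketch of the sequential allocator and of how freed ranges are
split between reusable and zone-unusable space. It is a simplified model, not
the kernel code: `struct seq_bg`, `seq_alloc()` and `seq_free()` are
hypothetical names, offsets are relative to the block group start, and
locking, reservation and the wp_broken case are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the block group fields the patch touches. */
struct seq_bg {
	uint64_t length;        /* block group size (cache->key.offset) */
	uint64_t alloc_offset;  /* bytes handed out so far, tracks the write pointer */
	uint64_t free_space;    /* bytes still allocatable */
	uint64_t zone_unusable; /* freed bytes that need a zone reset before reuse */
};

/*
 * Allocate num_bytes at the allocation pointer.
 * Returns 0 and stores the offset in *found, or 1 if the group lacks room.
 */
static int seq_alloc(struct seq_bg *bg, uint64_t num_bytes, uint64_t *found)
{
	uint64_t avail = bg->length - bg->alloc_offset;

	if (avail < num_bytes)
		return 1;

	*found = bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	bg->free_space -= num_bytes;
	return 0;
}

/*
 * Return [offset, offset + size) to the group. Only the part at or beyond
 * alloc_offset is reusable; the rest stays unusable until the zone is reset.
 */
static void seq_free(struct seq_bg *bg, uint64_t offset, uint64_t size)
{
	uint64_t to_free;

	assert(offset + size <= bg->length);

	if (offset >= bg->alloc_offset)
		to_free = size;
	else if (offset + size <= bg->alloc_offset)
		to_free = 0;
	else
		to_free = offset + size - bg->alloc_offset;

	bg->free_space += to_free;
	bg->zone_unusable += size - to_free;
}
```

In this model, freeing an extent that lies below the allocation pointer never
increases free_space, which is why such bytes are counted as zone_unusable.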



* [PATCH v3 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (9 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 10/27] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

If the btrfs volume has mirrored block groups, it unconditionally makes
un-mirrored block groups read only. When all of the mirrored block groups
are themselves read only, this leaves the volume without any writable block
group. So, check if we have at least one writable mirrored block group
before setting un-mirrored block groups read only.
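
The decision described above can be sketched in user space as follows. This
is a simplified model under stated assumptions: `struct bg_model`,
`enum profile_model` and both functions are hypothetical names, and, as in
the patch, every profile other than single and RAID0 is treated as
"mirrored".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of profiles for illustration only. */
enum profile_model { PROFILE_SINGLE, PROFILE_RAID0, PROFILE_RAID1, PROFILE_DUP };

struct bg_model {
	enum profile_model profile;
	bool ro;
};

static bool is_mirrored(enum profile_model p)
{
	return p != PROFILE_SINGLE && p != PROFILE_RAID0;
}

/* True if at least one mirrored block group is still writable. */
static bool have_writable_mirrored_bg(const struct bg_model *bgs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (is_mirrored(bgs[i].profile) && !bgs[i].ro)
			return true;
	return false;
}

/*
 * Mark un-mirrored groups read only, but only when a writable mirrored
 * group remains. Returns the number of groups newly marked read only.
 */
static size_t restrict_unmirrored(struct bg_model *bgs, size_t n)
{
	size_t marked = 0;

	if (!have_writable_mirrored_bg(bgs, n))
		return 0;

	for (size_t i = 0; i < n; i++) {
		if (!is_mirrored(bgs[i].profile) && !bgs[i].ro) {
			bgs[i].ro = true;
			marked++;
		}
	}
	return marked;
}
```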

This change is necessary to handle e.g. xfstests btrfs/124 case.

When we mount a degraded RAID1 FS, write to it, and then re-mount it with
all devices present, the write pointers of the corresponding zones of the
written BG differ. We mark such a block group as "wp_broken" and make it
read only. In this situation, the only RAID1 BGs are read only because of
"wp_broken", and the un-mirrored BGs are also marked read only because RAID1
BGs exist. As a result, all the BGs are read only, and we cannot even start
a rebalance to fix the situation.
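
To illustrate how a write pointer mismatch is detected for mirrored stripes,
here is a small user-space sketch modeled on the single/DUP/RAID1 case of
btrfs_load_block_group_zone_info() from the previous patch. The function name
`mirrored_alloc_offset()` is hypothetical; the WP_MISSING_DEV sentinel has the
same value as in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same sentinel value as the patch's ((u64)-1). */
#define WP_MISSING_DEV UINT64_MAX

/*
 * Derive one allocation offset from the per-stripe write pointer offsets of
 * a mirrored block group. Stripes on missing devices are skipped. If two
 * present stripes disagree, *wp_broken is set. Returns the first valid
 * offset, or WP_MISSING_DEV if no stripe has one.
 */
static uint64_t mirrored_alloc_offset(const uint64_t *wp, size_t n,
				      bool *wp_broken)
{
	uint64_t offset = WP_MISSING_DEV;

	*wp_broken = false;
	for (size_t i = 0; i < n; i++) {
		if (wp[i] == WP_MISSING_DEV)
			continue;
		if (offset == WP_MISSING_DEV)
			offset = wp[i];
		else if (wp[i] != offset)
			*wp_broken = true;
	}
	return offset;
}
```

A degraded mount followed by writes advances only one stripe's pointer, so
the two offsets differ on the next full mount and the group is flagged.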

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d2aacffe14d6..d0d887448bb5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8142,6 +8142,27 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+/*
+ * have_mirrored_block_group - check if we have at least one writable
+ *                             mirrored Block Group
+ */
+static bool have_mirrored_block_group(struct btrfs_space_info *space_info)
+{
+	struct btrfs_block_group_cache *cache;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE)
+			continue;
+		list_for_each_entry(cache, &space_info->block_groups[i],
+				    list) {
+			if (!cache->ro)
+				return true;
+		}
+	}
+	return false;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -8329,6 +8350,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		       BTRFS_BLOCK_GROUP_RAID56_MASK |
 		       BTRFS_BLOCK_GROUP_DUP)))
 			continue;
+
+		if (!have_mirrored_block_group(space_info))
+			continue;
+
 		/*
 		 * avoid allocating from un-mirrored block group if there are
 		 * mirrored block groups.
-- 
2.22.0



* [PATCH v3 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (10 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 13/27] btrfs: reset zones of unused block groups Naohiro Aota
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

On or after a degraded mount, we might have no writable metadata block group
due to broken write pointers. If you, e.g., balance the FS before writing any
data, alloc_tree_block_no_bg_flush() (called from insert_balance_item())
fails to allocate a tree block for it because the global reservation fails.
We can reproduce this situation with xfstests btrfs/124.

While we can work around the failure by writing some data so that, as a
result of the write, a new metadata block group gets allocated, that is bad
practice to rely on.

This commit avoids such failures by ensuring that a read-write mounted volume
has non-zero metadata space. If metadata space is empty, it forces a new
metadata block group allocation.
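
The "metadata space is empty" test can be sketched in user space like this.
It is a simplified model under stated assumptions: `struct space_info_model`
and `metadata_chunk_needed()` are hypothetical names, the field set mirrors
the counters summed by btrfs_space_info_used() after the previous patch, and
the chunk allocation itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the space_info counters used by the check. */
struct space_info_model {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_zone_unusable;
	uint64_t bytes_may_use;
};

/* Sum of everything that is not available for new allocations. */
static uint64_t space_info_used(const struct space_info_model *s,
				bool may_use_included)
{
	return s->bytes_used + s->bytes_reserved + s->bytes_pinned +
	       s->bytes_readonly + s->bytes_zone_unusable +
	       (may_use_included ? s->bytes_may_use : 0);
}

/* True when no metadata space is left and a new chunk must be forced. */
static bool metadata_chunk_needed(const struct space_info_model *s)
{
	uint64_t used = space_info_used(s, true);

	assert(used <= s->total_bytes);
	return s->total_bytes - used == 0;
}
```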

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c |  9 +++++++++
 fs/btrfs/hmzoned.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  1 +
 3 files changed, 55 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8854ff2e5fa5..65b3198c6e83 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3287,6 +3287,15 @@ int open_ctree(struct super_block *sb,
 		}
 	}
 
+	ret = btrfs_hmzoned_check_metadata_space(fs_info);
+	if (ret) {
+		btrfs_warn(fs_info, "failed to allocate metadata space: %d",
+			   ret);
+		btrfs_warn(fs_info, "try remount with readonly");
+		close_ctree(fs_info);
+		return ret;
+	}
+
 	down_read(&fs_info->cleanup_work_sem);
 	if ((ret = btrfs_orphan_cleanup(fs_info->fs_root)) ||
 	    (ret = btrfs_orphan_cleanup(fs_info->tree_root))) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 89631f5f01f2..38cc1bbfe118 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -13,6 +13,8 @@
 #include "hmzoned.h"
 #include "rcu-string.h"
 #include "disk-io.h"
+#include "space-info.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -551,3 +553,46 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache)
 
 	return ret;
 }
+
+/*
+ * On/After degraded mount, we might have no writable metadata block
+ * group due to broken write pointers. If you e.g. balance the FS
+ * before writing any data, alloc_tree_block_no_bg_flush() (called
+ * from insert_balance_item()) fails to allocate a tree block for
+ * it. To avoid such situations, ensure we have some metadata BG here.
+ */
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_space_info *info;
+	u64 left;
+	int ret;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+	spin_lock(&info->lock);
+	left = info->total_bytes - btrfs_space_info_used(info, true);
+	spin_unlock(&info->lock);
+
+	if (left)
+		return 0;
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = btrfs_alloc_chunk(trans, btrfs_metadata_alloc_profile(fs_info));
+	if (ret) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return btrfs_commit_transaction(trans);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 399d9e9543aa..e95139d4c072 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -32,6 +32,7 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache);
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
-- 
2.22.0



* [PATCH v3 13/27] btrfs: reset zones of unused block groups
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (11 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 14/27] btrfs: limit super block locations in HMZONED mode Naohiro Aota
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

For an HMZONED volume, a block group maps to a zone of the device. For
deleted unused block groups, the zone of the block group can be reset to
rewind the zone write pointer to the start of the zone.
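
As an illustration of the unit conversions involved in a reset, the
user-space sketch below converts a byte range into the 512-byte sector range
passed to the block layer and into zone indexes. `struct reset_range` and
`zone_reset_range()` are hypothetical names, and the sketch assumes a
power-of-two zone size and a zone-aligned range.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, as in the kernel */

/* Hypothetical result type; not part of the patch. */
struct reset_range {
	uint64_t sector;     /* first sector of the range */
	uint64_t nr_sectors; /* length of the range in sectors */
	uint64_t first_zone; /* index of the first zone covered */
	uint64_t nr_zones;   /* number of zones covered */
};

/* Convert a zone-aligned byte range into sector and zone units. */
static struct reset_range zone_reset_range(uint64_t physical, uint64_t length,
					   unsigned int zone_size_shift)
{
	uint64_t zone_size = UINT64_C(1) << zone_size_shift;
	struct reset_range r;

	/* Assumption of this sketch: the range covers whole zones only. */
	assert((physical & (zone_size - 1)) == 0);
	assert(length > 0 && (length & (zone_size - 1)) == 0);

	r.sector = physical >> SECTOR_SHIFT;
	r.nr_sectors = length >> SECTOR_SHIFT;
	r.first_zone = physical >> zone_size_shift;
	r.nr_zones = length >> zone_size_shift;
	return r;
}
```

For example, with 256 MiB zones (zone_size_shift of 28), the fourth zone
starts at byte 3 << 28, which is sector 1572864 and zone index 3.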

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 27 +++++++++++++++++++--------
 fs/btrfs/hmzoned.c     | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h     | 18 ++++++++++++++++++
 3 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d0d887448bb5..8665aba61bb9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1936,6 +1936,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			struct btrfs_device *dev = stripe->dev;
+			u64 physical = stripe->physical;
+			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
 
@@ -1943,19 +1946,23 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
-			if (!blk_queue_discard(req_q))
+
+			/* zone reset in HMZONED mode */
+			if (btrfs_can_zone_reset(dev, physical, length))
+				ret = btrfs_reset_device_zone(dev, physical,
+							      length, &bytes);
+			else if (blk_queue_discard(req_q))
+				ret = btrfs_issue_discard(dev->bdev, physical,
+							  length, &bytes);
+			else
 				continue;
 
-			ret = btrfs_issue_discard(stripe->dev->bdev,
-						  stripe->physical,
-						  stripe->length,
-						  &bytes);
 			if (!ret)
 				discarded_bytes += bytes;
 			else if (ret != -EOPNOTSUPP)
 				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
-
 			/*
 			 * Just in case we get back EOPNOTSUPP for some reason,
 			 * just ignore the return value so we don't screw up
@@ -8985,8 +8992,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
-		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD);
+		/*
+		 * DISCARD can flip during remount. In HMZONED mode,
+		 * we need to reset sequential required zones.
+		 */
+		trimming = btrfs_test_opt(fs_info, DISCARD) ||
+				btrfs_fs_incompat(fs_info, HMZONED);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 38cc1bbfe118..5968ef621fa7 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -596,3 +596,21 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
 
 	return btrfs_commit_transaction(trans);
 }
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	ret = blkdev_reset_zones(device->bdev,
+				 physical >> SECTOR_SHIFT,
+				 length >> SECTOR_SHIFT,
+				 GFP_NOFS);
+	if (!ret) {
+		*bytes = length;
+		set_bit(physical >> device->zone_info->zone_size_shift,
+			device->zone_info->empty_zones);
+	}
+
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index e95139d4c072..40b4151fc935 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -32,6 +32,8 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache);
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -107,4 +109,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
 	return ALIGN(pos, device->zone_info->zone_size);
 }
 
+static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
+					u64 physical, u64 length)
+{
+	u64 zone_size;
+
+	if (!btrfs_dev_is_sequential(device, physical))
+		return false;
+
+	zone_size = device->zone_info->zone_size;
+	if (!IS_ALIGNED(physical, zone_size) ||
+	    !IS_ALIGNED(length, zone_size))
+		return false;
+
+	return true;
+}
+
 #endif
-- 
2.22.0



* [PATCH v3 14/27] btrfs: limit super block locations in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (12 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 13/27] btrfs: reset zones of unused block groups Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 15/27] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

When in HMZONED mode, make sure that device super blocks are located in
randomly writable zones of zoned block devices. That is, do not write super
blocks in sequential write required zones of host-managed zoned block
devices, as updating them in place would not be possible.
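
A rough userspace sketch of the effect follows. The offsets mirror
btrfs_sb_offset() (64KiB, 64MiB and 256GiB); struct sb_dev_model and the
assumption that conventional zones come first on the device are hypothetical
simplifications made for this example only. The device size check done by
the real write path is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SB_PRIMARY_OFFSET	(64ULL * 1024)	/* value of BTRFS_SUPER_INFO_OFFSET */
#define SB_MIRROR_MAX		3		/* value of BTRFS_SUPER_MIRROR_MAX */
#define SB_MIRROR_SHIFT		12		/* value of BTRFS_SUPER_MIRROR_SHIFT */

/* Byte offset of super block copy 'mirror' (0 is the primary copy). */
static uint64_t sb_offset(int mirror)
{
	assert(mirror >= 0 && mirror < SB_MIRROR_MAX);
	if (mirror == 0)
		return SB_PRIMARY_OFFSET;
	return (16ULL * 1024) << (SB_MIRROR_SHIFT * mirror);
}

/*
 * Hypothetical device model: the first conv_zones zones are conventional,
 * all later zones are sequential write required.
 */
struct sb_dev_model {
	bool zoned;
	uint64_t zone_size;
	uint64_t conv_zones;
};

/* Same rule as btrfs_check_super_location(): skip sequential zones. */
static bool check_super_location(const struct sb_dev_model *dev, uint64_t pos)
{
	if (!dev->zoned)
		return true;
	return pos / dev->zone_size < dev->conv_zones;
}

/* Count how many super block copies would actually be written. */
static int writable_super_copies(const struct sb_dev_model *dev)
{
	int i, n = 0;

	for (i = 0; i < SB_MIRROR_MAX; i++)
		if (check_super_location(dev, sb_offset(i)))
			n++;
	return n;
}
```

With 256MiB zones and a single conventional zone, only the first two copies
land in a randomly writable zone in this model; the 256GiB copy is skipped.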

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     |  4 ++++
 fs/btrfs/extent-tree.c |  8 ++++++++
 fs/btrfs/hmzoned.h     | 12 ++++++++++++
 fs/btrfs/scrub.c       |  3 +++
 4 files changed, 27 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 65b3198c6e83..a0a3709de2e6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3547,6 +3547,8 @@ static int write_dev_supers(struct btrfs_device *device,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(device, bytenr))
+			continue;
 
 		btrfs_set_super_bytenr(sb, bytenr);
 
@@ -3613,6 +3615,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(device, bytenr))
+			continue;
 
 		bh = __find_get_block(device->bdev,
 				      bytenr / BTRFS_BDEV_BLOCKSIZE,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8665aba61bb9..de9d3028833e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -238,6 +238,14 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
 			if (logical[nr] + stripe_len <= cache->key.objectid)
 				continue;
 
+			/* shouldn't have super stripes in sequential zones */
+			if (cache->alloc_type == BTRFS_ALLOC_SEQ) {
+				btrfs_err(fs_info,
+		"sequential allocation bg %llu should not have super blocks",
+					  cache->key.objectid);
+				return -EUCLEAN;
+			}
+
 			start = logical[nr];
 			if (start < cache->key.objectid) {
 				start = cache->key.objectid;
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 40b4151fc935..9de26d6b8c4e 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -10,6 +10,7 @@
 #define BTRFS_HMZONED_H
 
 #include <linux/blkdev.h>
+#include "volumes.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -125,4 +126,15 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline bool btrfs_check_super_location(struct btrfs_device *device,
+					      u64 pos)
+{
+	/*
+	 * On a non-zoned device, any address is OK. On a zoned device,
+	 * only zones that are not SEQUENTIAL WRITE REQUIRED are usable.
+	 */
+	return device->zone_info == NULL ||
+		!btrfs_dev_is_sequential(device, pos);
+}
+
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0c99cf9fb595..e15d846c700a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -18,6 +18,7 @@
 #include "check-integrity.h"
 #include "rcu-string.h"
 #include "raid56.h"
+#include "hmzoned.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -3732,6 +3733,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >
 		    scrub_dev->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(scrub_dev, bytenr))
+			continue;
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
-- 
2.22.0



* [PATCH v3 15/27] btrfs: redirty released extent buffers in sequential BGs
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (13 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 14/27] btrfs: limit super block locations in HMZONED mode Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 16/27] btrfs: serialize data allocation and submit IOs Naohiro Aota
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

Tree-manipulating operations such as merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that the pages in the
node are not uselessly written out. On HMZONED drives, however, this
optimization blocks subsequent IOs, because cancelling the write out of the
freed blocks breaks the sequential write sequence expected by the device.

This patch introduces a list of clean and unwritten extent buffers that
have been released in a transaction. Btrfs redirties these buffers so that
btree_write_cache_pages() can send proper bios to the disk.

In addition, it clears the entire content of the extent buffer so as not to
confuse raw block scanners such as btrfsck. Because the content is cleared,
csum_dirty_buffer() would complain about a bytenr mismatch, so skip the
check and the checksum using the newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.
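
To see why skipping a freed block stalls later writes, consider the toy
model below. It is a userspace sketch with hypothetical names
(struct seq_zone_model, writeback()), not kernel code: a sequential zone only
accepts a write at its write pointer, so a hole left by a cancelled write
makes every later block in the zone unwritable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one sequential zone; offsets are in blocks. */
struct seq_zone_model {
	uint64_t wp;	/* device write pointer */
};

/* A sequential zone accepts a write only at its write pointer. */
static bool zone_write(struct seq_zone_model *z, uint64_t block)
{
	if (block != z->wp)
		return false;	/* unaligned write, the device rejects it */
	z->wp++;
	return true;
}

/*
 * Write back 'nr' blocks allocated consecutively from the write pointer.
 * Blocks with freed[i] set were released before writeback. With
 * redirty == false they are skipped; with redirty == true they are still
 * written (as zero-filled blocks). Returns the number of successful writes.
 */
static int writeback(struct seq_zone_model *z, const bool *freed, int nr,
		     bool redirty)
{
	uint64_t start = z->wp;
	int i, ok = 0;

	for (i = 0; i < nr; i++) {
		if (freed[i] && !redirty)
			continue;	/* leaves a hole below later blocks */
		if (zone_write(z, start + i))
			ok++;
	}
	return ok;
}
```

With three blocks where the middle one was freed, the skipping variant
completes only the first write, while the redirtying variant completes all
three and leaves the write pointer at the end of the range.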

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     |  5 +++++
 fs/btrfs/extent-tree.c | 11 ++++++++++-
 fs/btrfs/extent_io.c   |  2 ++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/hmzoned.c     | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  3 +++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 8 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a0a3709de2e6..e0a80997b6ee 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -513,6 +513,9 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 	if (page != eb->pages[0])
 		return 0;
 
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags))
+		return 0;
+
 	found_start = btrfs_header_bytenr(eb);
 	/*
 	 * Please do not consolidate these warnings into a single if.
@@ -4577,6 +4580,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(fs_info,
 				    fs_info->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state =TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de9d3028833e..bc95a73a762d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5084,8 +5084,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
-			if (!ret)
+			if (!ret) {
+				btrfs_redirty_list_add(trans->transaction, buf);
 				goto out;
+			}
 		}
 
 		pin = 0;
@@ -5097,6 +5099,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			btrfs_redirty_list_add(trans->transaction, buf);
+			pin_down_extent(cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
+			goto out;
+		}
+
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aea990473392..4e67b16c9f80 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -23,6 +23,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "hmzoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4863,6 +4864,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 401423b16976..c63b58438f90 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -58,6 +58,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -186,6 +187,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	int spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 5968ef621fa7..4c296d282e67 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -614,3 +614,37 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 
 	return ret;
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 9de26d6b8c4e..3a73c3c5e1da 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -36,6 +36,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache);
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e3adb714c04b..45bd7c25bebf 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "dev-replace.h"
 #include "qgroup.h"
+#include "hmzoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -257,6 +258,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2269,6 +2272,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks
+	 * allocated in this transaction. So it's now safe to free the
+	 * redirtied extent buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 2c5a6f6e5bb0..09329d2901b7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -85,6 +85,9 @@ struct btrfs_transaction {
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	struct btrfs_fs_info *fs_info;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
-- 
2.22.0



* [PATCH v3 16/27] btrfs: serialize data allocation and submit IOs
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (14 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 15/27] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 17/27] btrfs: implement atomic compressed IO submission Naohiro Aota
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block group mutex,
"zone_io_lock", which find_free_extent_seq() takes. The lock is kept held
even after returning from find_free_extent(). It is released once submission
of the IOs corresponding to the allocation has completed.

Implementing such behavior under __extent_writepage_io is almost impossible
because once pages are unlocked we cannot tell whether submission of the IOs
for an allocated region has finished. Instead, this commit adds
run_delalloc_hmzoned() to write out non-compressed data IOs at once using
extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock_logical() to unlock the block group for new
allocation.
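
The lock lifetime described above can be pictured with the single-threaded
model below. It is only a sketch: struct bg_model, alloc_seq() and
submit_and_unlock() are hypothetical names, and a boolean flag stands in for
the zone_io_lock mutex so that the hand-off from allocation to submission is
visible without threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one sequential data block group. */
struct bg_model {
	uint64_t start;		/* logical start of the block group */
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* next free byte, relative to start */
	uint64_t wp;		/* device write pointer, relative to start */
	bool io_locked;		/* stands in for the zone_io_lock mutex */
};

/*
 * Allocate num_bytes sequentially. Returns 0 on success and leaves the
 * block group locked; returns 1 and drops the lock when there is no room.
 */
static int alloc_seq(struct bg_model *bg, uint64_t num_bytes,
		     uint64_t *logical)
{
	assert(!bg->io_locked);	/* a real mutex would sleep here instead */
	bg->io_locked = true;

	if (bg->length - bg->alloc_offset < num_bytes) {
		bg->io_locked = false;
		return 1;
	}

	*logical = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	return 0;
}

/* Submit the IO for an allocation, then release the block group. */
static bool submit_and_unlock(struct bg_model *bg, uint64_t logical,
			      uint64_t num_bytes)
{
	bool ok;

	assert(bg->io_locked);
	ok = (logical - bg->start == bg->wp);	/* must hit the write pointer */
	if (ok)
		bg->wp += num_bytes;
	bg->io_locked = false;
	return ok;
}
```

Because no second allocation can start before the first one has been
submitted, every submission in this model lands exactly on the write
pointer.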

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c |  5 +++++
 fs/btrfs/hmzoned.h     | 34 +++++++++++++++++++++++++++++++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3d31a1960c4d..1e924c0d1210 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -619,6 +619,7 @@ struct btrfs_block_group_cache {
 	 * zone.
 	 */
 	u64 alloc_offset;
+	struct mutex zone_io_lock;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bc95a73a762d..5b1a9e607555 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5532,6 +5532,7 @@ static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
 	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
 		return 1;
 
+	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
 
@@ -5563,6 +5564,9 @@ static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
 out:
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+	/* on success, the lock is released after submit_bio */
+	if (ret)
+		btrfs_hmzoned_data_io_unlock(cache);
 	return ret;
 }
 
@@ -8104,6 +8108,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	mutex_init(&cache->zone_io_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 	cache->alloc_type = BTRFS_ALLOC_FIT;
 
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 3a73c3c5e1da..a8e7286708d4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -39,6 +39,7 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -140,4 +141,37 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
 		!btrfs_dev_is_sequential(device, pos);
 }
 
+
+static inline void btrfs_hmzoned_data_io_lock(
+	struct btrfs_block_group_cache *cache)
+{
+	/* No need to lock metadata BGs or non-sequential BGs */
+	if (!(cache->flags & BTRFS_BLOCK_GROUP_DATA) ||
+	    cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return;
+	mutex_lock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock(
+	struct btrfs_block_group_cache *cache)
+{
+	if (!(cache->flags & BTRFS_BLOCK_GROUP_DATA) ||
+	    cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return;
+	mutex_unlock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock_logical(
+	struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	btrfs_hmzoned_data_io_unlock(cache);
+	btrfs_put_block_group(cache);
+}
+
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ee582a36653d..d504200c9767 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -48,6 +48,7 @@
 #include "qgroup.h"
 #include "dedupe.h"
 #include "delalloc-space.h"
+#include "hmzoned.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -1279,6 +1280,39 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 	return 0;
 }
 
+static noinline int run_delalloc_hmzoned(struct inode *inode,
+					 struct page *locked_page, u64 start,
+					 u64 end, int *page_started,
+					 unsigned long *nr_written)
+{
+	struct extent_map *em;
+	u64 logical;
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end,
+			     end, page_started, nr_written, 0, NULL);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start + 1,
+			      0);
+	ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE);
+	logical = em->block_start;
+	free_extent_map(em);
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1645,17 +1679,24 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
 	unsigned int write_flags = wbc_to_write_flags(wbc);
+	int do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !hmzoned) {
 		ret = cow_file_range(inode, locked_page, start, end, end,
 				      page_started, nr_written, 1, NULL);
+	} else if (!do_compress && hmzoned) {
+		ret = run_delalloc_hmzoned(inode, locked_page, start, end,
+					   page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
-- 
2.22.0



* [PATCH v3 17/27] btrfs: implement atomic compressed IO submission
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (15 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 16/27] btrfs: serialize data allocation and submit IOs Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 18/27] btrfs: support direct write IO in HMZONED Naohiro Aota
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

As with non-compressed IO submission, we must unlock the block group once
the IOs are submitted so that the next allocation can proceed.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d504200c9767..283ac11849b1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -776,13 +776,26 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			 * and IO for us.  Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
+			if (!page_started && !ret) {
+				struct extent_map *em;
+				u64 logical;
+
+				em = btrfs_get_extent(BTRFS_I(inode), NULL, 0,
+						      async_extent->start,
+						      async_extent->ram_size,
+						      0);
+				logical = em->block_start;
+				free_extent_map(em);
+
 				extent_write_locked_range(inode,
 						  async_extent->start,
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret)
+
+				btrfs_hmzoned_data_io_unlock_logical(fs_info,
+								     logical);
+			} else if (ret)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -883,6 +896,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -890,6 +904,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 out_free_reserve:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
 out_free:
 	extent_clear_unlock_delalloc(inode, async_extent->start,
 				     async_extent->start +
-- 
2.22.0



* [PATCH v3 18/27] btrfs: support direct write IO in HMZONED
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (16 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 17/27] btrfs: implement atomic compressed IO submission Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 19/27] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

As with the other IO submission paths, we must unlock the block group after
submission so that the next allocation can proceed.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 283ac11849b1..d7be97c6a069 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8519,6 +8519,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	int ret = 0;
+	u64 disk_bytenr;
 
 	bio = btrfs_bio_clone(dio_bio);
 
@@ -8562,7 +8563,11 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			dio_data->unsubmitted_oe_range_end;
 	}
 
+	disk_bytenr = dip->disk_bytenr;
 	ret = btrfs_submit_direct_hook(dip);
+	if (write)
+		btrfs_hmzoned_data_io_unlock_logical(
+			btrfs_sb(inode->i_sb), disk_bytenr);
 	if (!ret)
 		return;
 
-- 
2.22.0



* [PATCH v3 19/27] btrfs: serialize meta IOs on HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (17 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 18/27] btrfs: support direct write IO in HMZONED Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 20/27] btrfs: wait existing extents before truncating Naohiro Aota
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

As in the data IO path, we must serialize write IOs for metadata. We
cannot hold a mutex from allocation to submission because metadata blocks
are allocated at an earlier stage to build up B-trees.

Thus, this commit adds hmzoned_meta_io_lock and holds it during metadata IO
submission in btree_write_cache_pages() to serialize IOs. Furthermore, this
commit adds a per-block group metadata IO submission pointer,
"meta_write_pointer", to ensure sequential writing.
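
The write pointer check can be sketched in userspace as follows. The names
struct meta_bg_model and write_cache_pages() are hypothetical, and the model
handles a single block group only; it follows the idea of
btrfs_check_meta_write_pointer() in this patch, where writeback stops at the
first extent buffer that does not start at the expected position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one metadata block group. */
struct meta_bg_model {
	uint64_t start;			/* logical start of the block group */
	uint64_t length;		/* size of the block group in bytes */
	bool sequential;		/* true for sequential allocation */
	uint64_t meta_write_pointer;	/* logical address of the next write */
};

/* Accept an extent buffer only if it starts at the expected position. */
static bool check_meta_write_pointer(struct meta_bg_model *bg,
				     uint64_t eb_start, uint64_t eb_len)
{
	assert(eb_start >= bg->start && eb_start - bg->start < bg->length);

	if (!bg->sequential)
		return true;
	if (bg->meta_write_pointer != eb_start)
		return false;

	bg->meta_write_pointer = eb_start + eb_len;
	return true;
}

/*
 * Submit extent buffers in ascending address order and stop at the first
 * gap. Returns the number of buffers submitted.
 */
static int write_cache_pages(struct meta_bg_model *bg,
			     const uint64_t *eb_starts, int nr,
			     uint64_t eb_len)
{
	int i;

	for (i = 0; i < nr; i++)
		if (!check_meta_write_pointer(bg, eb_starts[i], eb_len))
			break;
	return i;
}
```

A buffer that is rejected here is not lost in this model: it stays dirty and
can be submitted by a later pass once the buffers in front of it have been
written.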

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h     |  3 +++
 fs/btrfs/disk-io.c   |  1 +
 fs/btrfs/extent_io.c | 17 ++++++++++++++++-
 fs/btrfs/hmzoned.c   | 45 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h   | 17 +++++++++++++++++
 5 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1e924c0d1210..a6a03fc5e4c5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -620,6 +620,7 @@ struct btrfs_block_group_cache {
 	 */
 	u64 alloc_offset;
 	struct mutex zone_io_lock;
+	u64 meta_write_pointer;
 };
 
 /* delayed seq elem */
@@ -1108,6 +1109,8 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	struct mutex hmzoned_meta_io_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0a80997b6ee..63dd4670aba6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2703,6 +2703,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
+	mutex_init(&fs_info->hmzoned_meta_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4e67b16c9f80..ff963b2214aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3892,7 +3892,9 @@ int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+	struct btrfs_fs_info *fs_info = tree->fs_info;
 	struct extent_buffer *eb, *prev_eb = NULL;
+	struct btrfs_block_group_cache *cache = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3922,6 +3924,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
+	btrfs_hmzoned_meta_io_lock(fs_info);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
@@ -3965,6 +3968,14 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!ret)
 				continue;
 
+			if (!btrfs_check_meta_write_pointer(fs_info, eb,
+							    &cache)) {
+				ret = 0;
+				done = 1;
+				free_extent_buffer(eb);
+				break;
+			}
+
 			prev_eb = eb;
 			ret = lock_extent_buffer_for_io(eb, &epd);
 			if (!ret) {
@@ -3999,12 +4010,16 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
+	if (cache)
+		btrfs_put_block_group(cache);
 	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
-		return ret;
+		goto out;
 	}
 	ret = flush_write_bio(&epd);
+out:
+	btrfs_hmzoned_meta_io_unlock(fs_info);
 	return ret;
 }
 
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 4c296d282e67..4b13c6c47849 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -548,6 +548,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache)
 
 out:
 	cache->alloc_type = alloc_type;
+	if (!ret)
+		cache->meta_write_pointer =
+			cache->alloc_offset + cache->key.objectid;
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
@@ -648,3 +651,45 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	}
 	spin_unlock(&trans->releasing_ebs_lock);
 }
+
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group_cache **cache_ret)
+{
+	struct btrfs_block_group_cache *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return true;
+
+	cache = *cache_ret;
+
+	if (cache &&
+	    (eb->start < cache->key.objectid ||
+	     cache->key.objectid + cache->key.offset <= eb->start)) {
+		btrfs_put_block_group(cache);
+		cache = NULL;
+		*cache_ret = NULL;
+	}
+
+	if (!cache)
+		cache = btrfs_lookup_block_group(fs_info,
+						 eb->start);
+
+	if (cache) {
+		*cache_ret = cache;
+
+		if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+			return true;
+
+		if (cache->meta_write_pointer != eb->start) {
+			btrfs_put_block_group(cache);
+			cache = NULL;
+			*cache_ret = NULL;
+			return false;
+		}
+
+		cache->meta_write_pointer = eb->start + eb->len;
+	}
+
+	return true;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index a8e7286708d4..c68c4b8056a4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -40,6 +40,9 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group_cache **cache_ret);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -174,4 +177,18 @@ static inline void btrfs_hmzoned_data_io_unlock_logical(
 	btrfs_put_block_group(cache);
 }
 
+static inline void btrfs_hmzoned_meta_io_lock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_lock(&fs_info->hmzoned_meta_io_lock);
+}
+
+static inline void btrfs_hmzoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_unlock(&fs_info->hmzoned_meta_io_lock);
+}
+
 #endif
-- 
2.22.0



* [PATCH v3 20/27] btrfs: wait existing extents before truncating
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (18 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 19/27] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 21/27] btrfs: avoid async checksum/submit on HMZONED mode Naohiro Aota
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

When truncating a file, file buffers which have already been allocated but
not yet written may be truncated.  Truncating these buffers can break the
sequential write pattern in a block group if, for example, the truncated
blocks are followed by blocks allocated to another file. To avoid this
problem, always wait for all unwritten buffers to be written out before
proceeding with the truncate.
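
The failure mode can be illustrated with a small userspace sketch. This is
not kernel code: struct zone_model, zone_write() and the two scenario
functions are hypothetical names made up for illustration. They model a
sequential zone that rejects any write not starting at its write pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one sequential zone; not a kernel structure. */
struct zone_model {
	uint64_t start;	/* first byte of the zone */
	uint64_t len;	/* zone size in bytes */
	uint64_t wp;	/* current write pointer (absolute byte offset) */
};

/* Accept a write only if it begins exactly at the write pointer. */
static bool zone_write(struct zone_model *z, uint64_t pos, uint64_t len)
{
	assert(len > 0);
	if (pos != z->wp || pos + len > z->start + z->len)
		return false;
	z->wp += len;
	return true;
}

/*
 * File A has [0, 8K) allocated and file B has [8K, 12K) right after it.
 * If the unwritten tail [4K, 8K) of A is dropped by a truncate, the write
 * for B lands beyond the write pointer and is rejected.
 */
static bool truncate_without_wait(void)
{
	struct zone_model z = { .start = 0, .len = 1 << 20, .wp = 0 };

	if (!zone_write(&z, 0, 4096))	/* first half of A is written */
		return false;
	/* [4096, 8192) is truncated away and never written */
	return zone_write(&z, 8192, 4096);	/* B: rejected, wp is 4096 */
}

/* Waiting for the write-out of A first keeps the stream sequential. */
static bool truncate_after_wait(void)
{
	struct zone_model z = { .start = 0, .len = 1 << 20, .wp = 0 };

	return zone_write(&z, 0, 4096) &&
	       zone_write(&z, 4096, 4096) &&	/* A fully written out */
	       zone_write(&z, 8192, 4096);	/* B follows at wp */
}
```

In this model truncate_without_wait() fails on the write for B, while
truncate_after_wait() succeeds because every allocated block reaches the
zone in order.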

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d7be97c6a069..95f4ce8ac8d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5236,6 +5236,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_end_write_no_snapshotting(root);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to
-- 
2.22.0



* [PATCH v3 21/27] btrfs: avoid async checksum/submit on HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (19 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 20/27] btrfs: wait existing extents before truncating Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 22/27] btrfs: disallow mixed-bg in " Naohiro Aota
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

In HMZONED mode, btrfs uses the per-block-group zone_io_lock to serialize
data write IOs and the per-FS hmzoned_meta_io_lock to serialize metadata
write IOs.

Even with this serialization, write bios sent from
{btree,btrfs}_write_cache_pages can be reordered by async checksum workers,
as these workers are per CPU and not per zone.

To preserve write BIO ordering, disable async checksum on HMZONED.
This does not result in lower performance with HDDs as a single CPU core is
fast enough to do checksum for a single zone write stream with the maximum
possible bandwidth of the device. If multiple zones are being written
simultaneously, HDD seek overhead lowers the achievable maximum bandwidth,
so per-zone checksum serialization again does not affect performance.

In addition, this commit disables async_submit in
btrfs_submit_compressed_write() for the same reason. This part will become
unnecessary once btrfs gets the "btrfs: fix cgroup writeback support"
series.
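
As a rough sketch of the resulting decision (a userspace model, not the
kernel code), the rule applied in btrfs_submit_bio_hook() boils down to the
following. struct submit_state and use_async_submit() are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-filesystem and per-inode state. */
struct submit_state {
	bool hmzoned;		/* HMZONED incompat flag is set */
	int sync_writers;	/* synchronous writers on the inode */
};

/* Return true if checksum/submit may be handed to async workers. */
static bool use_async_submit(const struct submit_state *s)
{
	assert(s);
	/* Per-CPU workers could reorder bios within a zone. */
	if (s->hmzoned)
		return false;
	/* Same rule as before on non-HMZONED filesystems. */
	return s->sync_writers == 0;
}
```

The HMZONED check comes first so that it overrides every other condition
that would otherwise allow the async path.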

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/compression.c | 5 +++--
 fs/btrfs/disk-io.c     | 2 ++
 fs/btrfs/inode.c       | 9 ++++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..058dea5e432f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -322,6 +322,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	struct block_device *bdev;
 	blk_status_t ret;
 	int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+	int async_submit = !btrfs_fs_incompat(fs_info, HMZONED);
 
 	WARN_ON(!PAGE_ALIGNED(start));
 	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
@@ -377,7 +378,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
-			ret = btrfs_map_bio(fs_info, bio, 0, 1);
+			ret = btrfs_map_bio(fs_info, bio, 0, async_submit);
 			if (ret) {
 				bio->bi_status = ret;
 				bio_endio(bio);
@@ -408,7 +409,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, bio, 0, 1);
+	ret = btrfs_map_bio(fs_info, bio, 0, async_submit);
 	if (ret) {
 		bio->bi_status = ret;
 		bio_endio(bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 63dd4670aba6..a8d7e81ccad1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -873,6 +873,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95f4ce8ac8d0..bb0ae3107e60 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2075,7 +2075,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	enum btrfs_wq_endio_type metadata = BTRFS_WQ_ENDIO_DATA;
 	blk_status_t ret = 0;
 	int skip_sum;
-	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
+	int async = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+		!btrfs_fs_incompat(fs_info, HMZONED);
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -8383,7 +8384,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
 	if (async_submit)
-		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
+		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+			!btrfs_fs_incompat(fs_info, HMZONED);
 
 	if (!write) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
@@ -8448,7 +8450,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	}
 
 	/* async crcs make it difficult to collect full stripe writes. */
-	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
+	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK ||
+	    btrfs_fs_incompat(fs_info, HMZONED))
 		async_submit = 0;
 	else
 		async_submit = 1;
-- 
2.22.0



* [PATCH v3 22/27] btrfs: disallow mixed-bg in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (20 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 21/27] btrfs: avoid async checksum/submit on HMZONED mode Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 23/27] btrfs: disallow inode_cache " Naohiro Aota
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

Placing both data and metadata in a block group is impossible in HMZONED
mode. For data, we can allocate space for it and write it immediately
after the allocation. For metadata, however, we cannot do so, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer
that has not been written yet. Writing out the data buffer will then break
the sequential write rule.

This commit checks for and disallows MIXED_BG in HMZONED mode.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 4b13c6c47849..123d9c804c21 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -235,6 +235,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "HMZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
 out:
-- 
2.22.0



* [PATCH v3 23/27] btrfs: disallow inode_cache in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (21 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 22/27] btrfs: disallow mixed-bg in " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 24/27] btrfs: support dev-replace " Naohiro Aota
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

inode_cache uses pre-allocation to write its cache data. However,
pre-allocation is completely disabled in HMZONED mode.

We could technically enable inode_cache in the same way as relocation.
However, inode_cache is rarely used and the man page discourages using it.
So, let's just disable it for now.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 123d9c804c21..8529106321ac 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -275,6 +275,12 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) {
+		btrfs_err(info,
+		  "cannot enable inode map caching with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }
 
-- 
2.22.0



* [PATCH v3 24/27] btrfs: support dev-replace in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (22 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 23/27] btrfs: disallow inode_cache " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 25/27] btrfs: enable relocation " Naohiro Aota
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

Currently, dev-replace copies all the device extents on the source device
to the target device, and it also clones new incoming write I/Os from users
to the source device into the target device.

Cloning incoming I/Os can break the sequential write rule on the target
device. When a write is mapped to the middle of a block group, that I/O is
directed to the middle of a zone on the target device, which breaks the
sequential write rule.

However, the cloning function cannot simply be disabled, since incoming
I/Os targeting already copied device extents must be cloned so that the I/O
is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region.  Since there is a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is neither cloned (by handle_ops_on_dev_replace) nor copied (by the
dev-replace process).

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as
targets of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.
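
The division of labor driven by the flag can be sketched as follows. This
is a simplified userspace model of the HMZONED case only: struct bg_model
and the three helpers are hypothetical names, not the functions added by
this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a block group during dev-replace. */
struct bg_model {
	bool to_copy;	/* set for block groups existing at replace start */
};

/* Write path: should an incoming write be cloned to the target? */
static bool clone_write_to_target(const struct bg_model *bg)
{
	assert(bg);
	/* A marked block group will be copied as a whole later. */
	return !bg->to_copy;
}

/* Copy path: should the dev-replace copier process this block group? */
static bool copy_block_group(const struct bg_model *bg)
{
	assert(bg);
	return bg->to_copy;
}

/* The copier finished this block group; later writes must be cloned. */
static void finish_copy(struct bg_model *bg)
{
	assert(bg);
	bg->to_copy = false;
}
```

At any point in time a block group is handled by exactly one of the two
paths, so the target device never sees a cloned write ahead of data that
the copier has yet to lay down.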

This patch also handles empty regions between used extents. Since
dev-replace copies only the used extents on the source device, we have to
fill the gaps to honor the sequential write rule on the target device.
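
The arithmetic behind the gap filling can be sketched in userspace as
below. gap_to_zero() and zone_wp_to_physical() are hypothetical helpers
written for this illustration. The sketch assumes that the zone start and
write pointer are reported in 512-byte sectors, as in struct blk_zone.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_MODEL 9	/* 512-byte sectors */

/*
 * Number of bytes to zero-fill so that the next copied extent starts
 * exactly at the write pointer of the target device.
 */
static uint64_t gap_to_zero(uint64_t write_pointer, uint64_t next_physical)
{
	return next_physical > write_pointer ?
	       next_physical - write_pointer : 0;
}

/*
 * Translate a zone write pointer, given in sectors relative to the zone
 * start, into a byte position within the device extent being copied.
 */
static uint64_t zone_wp_to_physical(uint64_t physical_start,
				    uint64_t zone_start_sect,
				    uint64_t zone_wp_sect)
{
	assert(zone_wp_sect >= zone_start_sect);
	return physical_start +
	       ((zone_wp_sect - zone_start_sect) << SECTOR_SHIFT_MODEL);
}
```

For example, if used extents sit at [0, 64K) and [192K, 256K), the write
pointer is at 64K after the first copy and gap_to_zero(65536, 196608)
gives 131072 bytes to zero out before the second extent can be written.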

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/dev-replace.c | 147 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/extent-tree.c |  20 +++++-
 fs/btrfs/hmzoned.c     |  77 +++++++++++++++++++++
 fs/btrfs/hmzoned.h     |   4 ++
 fs/btrfs/scrub.c       |  83 ++++++++++++++++++++++-
 fs/btrfs/volumes.c     |  40 ++++++++++-
 8 files changed, 370 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a6a03fc5e4c5..1282840a2db8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -536,6 +536,7 @@ struct btrfs_block_group_cache {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2cc3ac4d101d..7ef1654aed9d 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -264,6 +264,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
 
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
@@ -398,6 +402,143 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group_cache *cache;
+	struct extent_buffer *l;
+	int slot;
+	int ret;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+
+	return ret;
+}
+
+void btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group_cache *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->key.objectid;
+	int num_extents, cur_extent;
+	int i;
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	BUG_ON(IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* skip stripes that are not on the source device */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1) {
+		if (cur_extent == 0) {
+			/*
+			 * first stripe on this device. Keep this BG
+			 * readonly until we finish all the stripes.
+			 */
+			btrfs_inc_block_group_ro(cache);
+		} else if (cur_extent == num_extents - 1) {
+			/* last stripe on this device */
+			btrfs_dec_block_group_ro(cache);
+			spin_lock(&cache->lock);
+			cache->to_copy = 0;
+			spin_unlock(&cache->lock);
+		}
+	} else {
+		spin_lock(&cache->lock);
+		cache->to_copy = 0;
+		spin_unlock(&cache->lock);
+	}
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -439,6 +580,12 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 78c5d8f1adda..5ba60345dbf8 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+void btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group_cache *cache,
+				      u64 physical);
 
 #endif
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5b1a9e607555..e68872571f18 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@
 #include "delalloc-space.h"
 #include "rcu-string.h"
 #include "hmzoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1949,6 +1950,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1958,15 +1961,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			req_q = bdev_get_queue(stripe->dev->bdev);
 
 			/* zone reset in HMZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(
+					    dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(
+					dev_replace->tgtdev,
+					physical, length, &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;
 
+next:
 			if (!ret)
 				discarded_bytes += bytes;
 			else if (ret != -EOPNOTSUPP)
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 8529106321ac..76230ad80a68 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -706,3 +706,80 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 
 	return true;
 }
+
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev,
+				    physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT,
+				    GFP_NOFS, 0);
+}
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone, GFP_NOFS);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+
+	return ret;
+}
+
+int btrfs_sync_hmzone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_hmzoned_issue_zeroout(tgt_dev, physical_pos, length);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index c68c4b8056a4..b0bb96404a24 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -43,6 +43,10 @@ void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group_cache **cache_ret);
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length);
+int btrfs_sync_hmzone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e15d846c700a..9f3484597338 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -167,6 +167,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio        *wr_curr_bio;
 	struct mutex            wr_lock;
@@ -1648,6 +1649,23 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	sbio = sctx->wr_curr_bio;
 	if (sbio->page_count == 0) {
 		struct bio *bio;
+		u64 physical = spage->physical_for_dev_replace;
+
+		if (btrfs_fs_incompat(sctx->fs_info, HMZONED) &&
+		    sctx->write_pointer < physical) {
+			u64 length = physical - sctx->write_pointer;
+
+			ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev,
+							  sctx->write_pointer,
+							  length);
+			if (ret == -EOPNOTSUPP)
+				ret = 0;
+			if (ret) {
+				mutex_unlock(&sctx->wr_lock);
+				return ret;
+			}
+			sctx->write_pointer = physical;
+		}
 
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
@@ -1710,6 +1728,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -3043,6 +3065,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+void sync_replace_for_hmzoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->bios_in_flight) == 0);
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3174,6 +3211,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3346,6 +3391,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			if (sctx->is_dev_replace)
+				sync_replace_for_hmzoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3413,6 +3461,26 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (btrfs_fs_incompat(fs_info, HMZONED) && sctx->is_dev_replace &&
+	    ret >= 0) {
+		wait_event(sctx->list_wait,
+			   atomic_read(&sctx->bios_in_flight) == 0);
+
+		mutex_lock(&sctx->wr_lock);
+		if (sctx->write_pointer < physical_end) {
+			ret = btrfs_sync_hmzone_write_pointer(
+				sctx->wr_tgtdev, base + offset,
+				map->stripes[num].physical,
+				sctx->write_pointer);
+			if (ret)
+				btrfs_err(fs_info, "failed to recover write pointer");
+		}
+		mutex_unlock(&sctx->wr_lock);
+		btrfs_dev_clear_zone_empty(sctx->wr_tgtdev,
+					   map->stripes[num].physical);
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
@@ -3554,6 +3622,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		spin_lock(&cache->lock);
+		if (sctx->is_dev_replace && !cache->to_copy) {
+			spin_unlock(&cache->lock);
+			ro_set = 0;
+			goto done;
+		}
+		spin_unlock(&cache->lock);
+
 		/*
 		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
 		 * to avoid deadlock caused by:
@@ -3588,7 +3664,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			ret = btrfs_wait_ordered_roots(fs_info, U64_MAX,
 						       cache->key.objectid,
 						       cache->key.offset);
-			if (ret > 0) {
+			if (ret >= 0) {
 				struct btrfs_trans_handle *trans;
 
 				trans = btrfs_join_transaction(root);
@@ -3664,6 +3740,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace)
+			btrfs_finish_block_group_to_copy(
+				dev_replace->srcdev, cache, found_key.offset);
+
+done:
 		down_write(&fs_info->dev_replace.rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 265a1496e459..07e7528fb23e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1592,6 +1592,9 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	search_start = max_t(u64, search_start, zone_size);
 	search_start = btrfs_zone_align(device, search_start);
 
+	WARN_ON(device->zone_info &&
+		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -5894,9 +5897,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+	bool ret;
+
+	/* non-HMZONED mode does not use "to_copy" flag */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5909,6 +5932,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * A block group which has "to_copy" set will
+		 * eventually be copied by the dev-replace process, so
+		 * we can avoid cloning the IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -5936,6 +5968,10 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				index_where_to_add++;
 				max_errors++;
 				tgtdev_indexes++;
+
+				/* mark this zone as non-empty */
+				btrfs_dev_clear_zone_empty(new->dev,
+							   new->physical);
 			}
 		}
 		num_stripes = index_where_to_add;
@@ -6321,8 +6357,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;
-- 
2.22.0


* [PATCH v3 25/27] btrfs: enable relocation in HMZONED mode
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (23 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 24/27] btrfs: support dev-replace " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 26/27] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 27/27] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

To serialize allocation and submit_bio, we introduced a mutex around them.
As a result, preallocation must be completely disabled to avoid a deadlock.

Since the current relocation process relies on preallocation to move file
data extents, it must be handled in another way. In HMZONED mode, we just
truncate the inode to the size that we wanted to pre-allocate. Then, we
flush dirty pages on the file before finishing the relocation process.
run_delalloc_hmzoned() will handle all the allocation and submit IOs to
the underlying layers.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 7f219851fa23..d852e3389ee2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3152,6 +3152,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	/*
+	 * In HMZONED, we cannot preallocate the file region. Instead,
+	 * we dirty and fiemap_write the region.
+	 */
+
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->i_ctime = current_time(inode);
+		i_size_write(inode, end);
+		btrfs_ordered_update_i_size(inode, end, NULL);
+		ret = btrfs_update_inode(trans, root, inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+		ret = btrfs_end_transaction(trans);
+
+		goto out;
+	}
+
 	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
@@ -3340,6 +3368,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		btrfs_throttle(fs_info);
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		WARN_ON(ret);
+	}
 out:
 	kfree(ra);
 	return ret;
@@ -4180,8 +4212,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_fs_incompat(trans->fs_info, HMZONED))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4196,8 +4232,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-					  BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);
-- 
2.22.0


* [PATCH v3 26/27] btrfs: relocate block group to repair IO failure in HMZONED
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (24 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 25/27] btrfs: enable relocation " Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  2019-08-08  9:30 ` [PATCH v3 27/27] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it
back to the damaged blocks. This repair, however, violates the sequential
write required rule.

We can consider three methods to repair an IO failure in HMZONED mode:
(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with
    the new extent
(3) Relocate the corresponding block group

Method (1) is the most similar to the behavior on regular devices.
However, it also wipes non-damaged data in the same device extent, and so
it unnecessarily degrades non-damaged data.

Method (2) is much like device replacing but done within the same device.
It is safe because it keeps the device extent until the replacement
finishes. However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function would have to be extended to support replacing a device
extent position within one device.

Method (3) invokes relocation of the damaged block group, so it is
straightforward to implement. It relocates all the mirrored device extents,
so it is, potentially, a more costly operation than method (1) or (2). But
it relocates only the extents in use, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

To protect a block group from being relocated multiple times by multiple IO
errors, this commit introduces a "relocating_repair" bit to show that it is
now being relocated to repair IO failures. It also uses a new kthread,
"btrfs-relocating-repair", so as not to block the IO path with the
relocation process.

This commit also supports repairing in the scrub process.
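
The "relocating_repair" guard described above can be illustrated outside the
kernel. The following is only a minimal userspace sketch, not the code in
this patch: the patch sets the bit under cache->lock, whereas this sketch
swaps in a C11 atomic_flag test-and-set to get the same "first caller wins"
behavior. The names demo_block_group and demo_request_repair are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per block group state. */
struct demo_block_group {
	atomic_flag relocating_repair;	/* set once a repair is pending */
	int repairs_started;		/* how many repair threads we would start */
};

/*
 * Returns true if this caller claimed the repair and should start the
 * relocation thread, false if a repair is already pending for this
 * block group.
 */
static bool demo_request_repair(struct demo_block_group *bg)
{
	assert(bg != NULL);

	/* test-and-set returns the previous value: true means already set */
	if (atomic_flag_test_and_set(&bg->relocating_repair))
		return false;

	bg->repairs_started++;
	return true;
}
```

With this shape, a burst of IO errors on the same block group results in a
single relocation request. The sketch never clears the flag, on the
assumption that the block group goes away once it has been relocated.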

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h     |  1 +
 fs/btrfs/extent_io.c |  3 ++
 fs/btrfs/scrub.c     |  3 ++
 fs/btrfs/volumes.c   | 72 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h   |  1 +
 5 files changed, 80 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1282840a2db8..144cf9c13320 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -537,6 +537,7 @@ struct btrfs_block_group_cache {
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff963b2214aa..0d3b61606b15 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2187,6 +2187,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
 
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9f3484597338..6dd5fa4ad657 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !sctx->is_dev_replace)
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07e7528fb23e..20109f20f102 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -8006,3 +8006,75 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group_cache *cache =
+		(struct btrfs_block_group_cache *) data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->key.objectid;
+	btrfs_put_block_group(cache);
+
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
+		btrfs_info(fs_info,
+			   "skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* ensure Block Group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	return ret;
+}
+
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+
+	/* do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5da1f354db93..ccb139d1f9c4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -593,5 +593,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical);
 
 #endif
-- 
2.22.0


* [PATCH v3 27/27] btrfs: enable to mount HMZONED incompat flag
  2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
                   ` (25 preceding siblings ...)
  2019-08-08  9:30 ` [PATCH v3 26/27] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
@ 2019-08-08  9:30 ` Naohiro Aota
  26 siblings, 0 replies; 39+ messages in thread
From: Naohiro Aota @ 2019-08-08  9:30 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel, Naohiro Aota

This final patch adds the HMZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount an HMZONED flagged
file system.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 144cf9c13320..b9dc9d4e152d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -294,7 +294,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.22.0


* Re: [PATCH v3 02/27] btrfs: Get zone information of zoned block devices
  2019-08-08  9:30 ` [PATCH v3 02/27] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-08-16  4:44   ` Anand Jain
  2019-08-16 14:19     ` Damien Le Moal
  0 siblings, 1 reply; 39+ messages in thread
From: Anand Jain @ 2019-08-16  4:44 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel

On 8/8/19 5:30 PM, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
> avoid costly run-time zone report commands to test the device zones type
> during block allocation, attach the seq_zones bitmap to the device
> structure to indicate if a zone is sequential or accept random writes. Also
> it attaches the empty_zones bitmap to indicate if a zone is empty or not.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone and
> btrfs_dev_is_empty_zone() to test if the zone is a empty zone.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/Makefile  |   2 +-
>   fs/btrfs/hmzoned.c | 162 +++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/hmzoned.h |  79 ++++++++++++++++++++++
>   fs/btrfs/volumes.c |  18 ++++-
>   fs/btrfs/volumes.h |   4 ++
>   5 files changed, 262 insertions(+), 3 deletions(-)
>   create mode 100644 fs/btrfs/hmzoned.c
>   create mode 100644 fs/btrfs/hmzoned.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 76a843198bcb..8d93abb31074 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>   	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>   	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
> -	   block-rsv.o delalloc-space.o
> +	   block-rsv.o delalloc-space.o hmzoned.o
>   
>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
> new file mode 100644
> index 000000000000..bfd04792dd62
> --- /dev/null
> +++ b/fs/btrfs/hmzoned.c
> @@ -0,0 +1,162 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> + * Authors:
> + *	Naohiro Aota	<naohiro.aota@wdc.com>
> + *	Damien Le Moal	<damien.lemoal@wdc.com>
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "hmzoned.h"
> +#include "rcu-string.h"
> +
> +/* Maximum number of zones to report per blkdev_report_zones() call */
> +#define BTRFS_REPORT_NR_ZONES   4096
> +
> +static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> +			       struct blk_zone **zones_ret,
> +			       unsigned int *nr_zones, gfp_t gfp_mask)
> +{
> +	struct blk_zone *zones = *zones_ret;
> +	int ret;
> +
> +	if (!zones) {
> +		zones = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
> +		if (!zones)
> +			return -ENOMEM;
> +	}
> +
> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
> +				  zones, nr_zones, gfp_mask);
> +	if (ret != 0) {
> +		btrfs_err_in_rcu(device->fs_info,
> +				 "get zone at %llu on %s failed %d", pos,
> +				 rcu_str_deref(device->name), ret);
> +		return ret;
> +	}
> +	if (!*nr_zones)
> +		return -EIO;
> +
> +	*zones_ret = zones;
> +
> +	return 0;
> +}
> +
> +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> +{
> +	struct btrfs_zoned_device_info *zone_info = NULL;
> +	struct block_device *bdev = device->bdev;
> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> +	sector_t sector = 0;
> +	struct blk_zone *zones = NULL;
> +	unsigned int i, nreported = 0, nr_zones;
> +	unsigned int zone_sectors;
> +	int ret;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
> +	if (!zone_info)
> +		return -ENOMEM;
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +	ASSERT(is_power_of_2(zone_sectors));
> +	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
> +	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
> +	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> +		zone_info->nr_zones++;
> +
> +	zone_info->seq_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
> +				       sizeof(*zone_info->seq_zones),
> +				       GFP_KERNEL);
> +	if (!zone_info->seq_zones) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	zone_info->empty_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
> +					 sizeof(*zone_info->empty_zones),
> +					 GFP_KERNEL);
> +	if (!zone_info->empty_zones) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/* Get zones type */
> +	while (sector < nr_sectors) {
> +		nr_zones = BTRFS_REPORT_NR_ZONES;
> +		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
> +					  &zones, &nr_zones, GFP_KERNEL);


How many zones do we see in a disk? Not many, I presume.
Here the allocation for %zones is inconsistent for each zone. Unless
there are substantial performance benefits, a consistent flow of
alloc/free is fine, as it makes the code easier to read and verify.
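
For a rough sense of scale, the zone count arithmetic in the quoted hunk can
be reproduced in userspace. This is only a sketch under stated assumptions:
the 256 MiB zone size and the 14 TiB capacity are example values picked for
illustration, not figures from this series, and the demo_* names are
hypothetical. Only the 4096 batch size comes from BTRFS_REPORT_NR_ZONES.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: integer log2 of a power-of-two value. */
static unsigned int demo_ilog2(uint64_t v)
{
	unsigned int shift = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return shift;
}

/* Zones covering nr_sectors, counting a partial last zone as one zone. */
static uint64_t demo_nr_zones(uint64_t nr_sectors, uint64_t zone_sectors)
{
	uint64_t nr = nr_sectors >> demo_ilog2(zone_sectors);

	if (nr_sectors & (zone_sectors - 1))
		nr++;
	return nr;
}

/* Minimum number of report calls if each call returns up to batch zones. */
static uint64_t demo_nr_report_calls(uint64_t nr_zones, uint64_t batch)
{
	return (nr_zones + batch - 1) / batch;
}
```

With 512-byte sectors, a 256 MiB zone is 524288 sectors and a 14 TiB device
is 30064771072 sectors, so demo_nr_zones() gives 57344 zones and
demo_nr_report_calls(57344, 4096) gives 14. A report call may return fewer
zones than requested, so 14 is a lower bound on the loop iterations.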


Thanks, Anand

> +		if (ret)
> +			goto out;
> +
> +		for (i = 0; i < nr_zones; i++) {
> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> +				set_bit(nreported, zone_info->seq_zones);
> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> +				set_bit(nreported, zone_info->empty_zones);
> +			nreported++;
> +		}
> +		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
> +	}
> +
> +	if (nreported != zone_info->nr_zones) {
> +		btrfs_err_in_rcu(device->fs_info,
> +				 "inconsistent number of zones on %s (%u / %u)",
> +				 rcu_str_deref(device->name), nreported,
> +				 zone_info->nr_zones);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	device->zone_info = zone_info;
> +
> +	btrfs_info_in_rcu(
> +		device->fs_info,
> +		"host-%s zoned block device %s, %u zones of %llu sectors",
> +		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> +		rcu_str_deref(device->name), zone_info->nr_zones,
> +		zone_info->zone_size >> SECTOR_SHIFT);
> +
> +out:
> +	kfree(zones);
> +
> +	if (ret) {
> +		kfree(zone_info->seq_zones);
> +		kfree(zone_info->empty_zones);
> +		kfree(zone_info);
> +	}
> +
> +	return ret;
> +}
> +
> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
> +{
> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
> +
> +	if (!zone_info)
> +		return;
> +
> +	kfree(zone_info->seq_zones);
> +	kfree(zone_info->empty_zones);
> +	kfree(zone_info);
> +	device->zone_info = NULL;
> +}
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +		       struct blk_zone *zone, gfp_t gfp_mask)
> +{
> +	unsigned int nr_zones = 1;
> +	int ret;
> +
> +	ret = btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
> +	if (ret != 0 || !nr_zones)
> +		return ret ? ret : -EIO;
> +
> +	return 0;
> +}
> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
> new file mode 100644
> index 000000000000..ffc70842135e
> --- /dev/null
> +++ b/fs/btrfs/hmzoned.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> + * Authors:
> + *	Naohiro Aota	<naohiro.aota@wdc.com>
> + *	Damien Le Moal	<damien.lemoal@wdc.com>
> + */
> +
> +#ifndef BTRFS_HMZONED_H
> +#define BTRFS_HMZONED_H
> +
> +struct btrfs_zoned_device_info {
> +	/*
> +	 * Number of zones, zone size and types of zones if bdev is a
> +	 * zoned block device.
> +	 */
> +	u64 zone_size;
> +	u8  zone_size_shift;
> +	u32 nr_zones;
> +	unsigned long *seq_zones;
> +	unsigned long *empty_zones;
> +};
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +		       struct blk_zone *zone, gfp_t gfp_mask);
> +int btrfs_get_dev_zone_info(struct btrfs_device *device);
> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
> +
> +static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> +{
> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
> +
> +	if (!zone_info)
> +		return false;
> +
> +	return test_bit(pos >> zone_info->zone_size_shift,
> +			zone_info->seq_zones);
> +}
> +
> +static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
> +{
> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
> +
> +	if (!zone_info)
> +		return true;
> +
> +	return test_bit(pos >> zone_info->zone_size_shift,
> +			zone_info->empty_zones);
> +}
> +
> +static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
> +						u64 pos, bool set)
> +{
> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
> +	unsigned int zno;
> +
> +	if (!zone_info)
> +		return;
> +
> +	zno = pos >> zone_info->zone_size_shift;
> +	if (set)
> +		set_bit(zno, zone_info->empty_zones);
> +	else
> +		clear_bit(zno, zone_info->empty_zones);
> +}
> +
> +static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
> +					    u64 pos)
> +{
> +	btrfs_dev_set_empty_zone_bit(device, pos, true);
> +}
> +
> +static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
> +					      u64 pos)
> +{
> +	btrfs_dev_set_empty_zone_bit(device, pos, false);
> +}
> +
> +#endif
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d74b74ca07af..8e5a894e7bde 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -29,6 +29,7 @@
>   #include "sysfs.h"
>   #include "tree-checker.h"
>   #include "space-info.h"
> +#include "hmzoned.h"
>   
>   const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>   	[BTRFS_RAID_RAID10] = {
> @@ -342,6 +343,7 @@ void btrfs_free_device(struct btrfs_device *device)
>   	rcu_string_free(device->name);
>   	extent_io_tree_release(&device->alloc_state);
>   	bio_put(device->flush_bio);
> +	btrfs_destroy_dev_zone_info(device);
>   	kfree(device);
>   }
>   
> @@ -847,6 +849,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>   	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>   	device->mode = flags;
>   
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zone_info(device);
> +	if (ret != 0)
> +		goto error_brelse;
> +
>   	fs_devices->open_devices++;
>   	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>   	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
> @@ -2598,6 +2605,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	}
>   	rcu_assign_pointer(device->name, name);
>   
> +	device->fs_info = fs_info;
> +	device->bdev = bdev;
> +
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zone_info(device);
> +	if (ret)
> +		goto error_free_device;
> +
>   	trans = btrfs_start_transaction(root, 0);
>   	if (IS_ERR(trans)) {
>   		ret = PTR_ERR(trans);
> @@ -2614,8 +2629,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   					 fs_info->sectorsize);
>   	device->disk_total_bytes = device->total_bytes;
>   	device->commit_total_bytes = device->total_bytes;
> -	device->fs_info = fs_info;
> -	device->bdev = bdev;
>   	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>   	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
>   	device->mode = FMODE_EXCL;
> @@ -2756,6 +2769,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   		sb->s_flags |= SB_RDONLY;
>   	if (trans)
>   		btrfs_end_transaction(trans);
> +	btrfs_destroy_dev_zone_info(device);
>   error_free_device:
>   	btrfs_free_device(device);
>   error:
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 7f6aa1816409..5da1f354db93 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -57,6 +57,8 @@ struct btrfs_io_geometry {
>   #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
>   #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>   
> +struct btrfs_zoned_device_info;
> +
>   struct btrfs_device {
>   	struct list_head dev_list; /* device_list_mutex */
>   	struct list_head dev_alloc_list; /* chunk mutex */
> @@ -77,6 +79,8 @@ struct btrfs_device {
>   
>   	struct block_device *bdev;
>   
> +	struct btrfs_zoned_device_info *zone_info;
> +
>   	/* the mode sent to blkdev_get */
>   	fmode_t mode;
>   
> 


* Re: [PATCH v3 01/27] btrfs: introduce HMZONED feature flag
  2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
@ 2019-08-16  4:49   ` Anand Jain
  0 siblings, 0 replies; 39+ messages in thread
From: Anand Jain @ 2019-08-16  4:49 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel

On 8/8/19 5:30 PM, Naohiro Aota wrote:
> This patch introduces the HMZONED incompat flag. The flag indicates that
> the volume management will satisfy the constraints imposed by host-managed
> zoned block devices.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-08  9:30 ` [PATCH v3 03/27] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-08-16  5:46   ` Anand Jain
  2019-08-16 14:23     ` Damien Le Moal
  0 siblings, 1 reply; 39+ messages in thread
From: Anand Jain @ 2019-08-16  5:46 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Matias Bjorling, Johannes Thumshirn, Hannes Reinecke,
	linux-fsdevel

On 8/8/19 5:30 PM, Naohiro Aota wrote:
> HMZONED mode cannot be used together with the RAID5/6 profile for now.
> Introduce the function btrfs_check_hmzoned_mode() to check this. This
> function will also check if HMZONED flag is enabled on the file system and
> if the file system consists of zoned devices with equal zone size.
> 
> Additionally, as updates to the space cache are in-place, the space cache
> cannot be located over sequential zones and there is no guarantees that the
> device will have enough conventional zones to store this cache. Resolve
> this problem by disabling completely the space cache.  This does not
> introduces any problems with sequential block groups: all the free space is
> located after the allocation pointer and no free space before the pointer.
> There is no need to have such cache.
> 
> For the same reason, NODATACOW is also disabled.
> 
> Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
> INODE_MAP_CACHE inode.

  This is a list of features incompatible with zoned devices. It needs
  better documentation; maybe a table with the reason for each is better.
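
  One possible shape for such a table, sketched as a standalone C array.
  The entries are taken only from the commit message quoted above; the
  struct, array, and function names are hypothetical and do not exist in
  btrfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one feature disabled in HMZONED mode, and why. */
struct demo_hmzoned_restriction {
	const char *feature;
	const char *reason;
};

static const struct demo_hmzoned_restriction demo_restrictions[] = {
	{ "RAID5/6 profiles",
	  "cannot be used together with HMZONED mode for now" },
	{ "space cache",
	  "updated in place; enough conventional zones are not guaranteed" },
	{ "NODATACOW",
	  "same reason as the space cache: data is updated in place" },
	{ "INODE_MAP_CACHE",
	  "avoids preallocation in the INODE_MAP_CACHE inode" },
};

/* Look up the reason for a feature; NULL means it is not in the table. */
static const char *demo_restriction_reason(const char *feature)
{
	size_t n = sizeof(demo_restrictions) / sizeof(demo_restrictions[0]);
	size_t i;

	assert(feature != NULL);
	for (i = 0; i < n; i++) {
		if (strcmp(demo_restrictions[i].feature, feature) == 0)
			return demo_restrictions[i].reason;
	}
	return NULL;
}
```

  Keeping the feature and its reason side by side would let one table feed
  both the documentation and the mount-time error messages.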


> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/ctree.h       |  3 ++
>   fs/btrfs/dev-replace.c |  8 +++++
>   fs/btrfs/disk-io.c     |  8 +++++
>   fs/btrfs/hmzoned.c     | 67 ++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/hmzoned.h     | 18 ++++++++++++
>   fs/btrfs/super.c       |  1 +
>   fs/btrfs/volumes.c     |  5 ++++
>   7 files changed, 110 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 299e11e6c554..a00ce8c4d678 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -713,6 +713,9 @@ struct btrfs_fs_info {
>   	struct btrfs_root *uuid_root;
>   	struct btrfs_root *free_space_root;
>   
> +	/* Zone size when in HMZONED mode */
> +	u64 zone_size;
> +
>   	/* the log root tree is a directory of all the other log roots */
>   	struct btrfs_root *log_root_tree;
>   
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 6b2e9aa83ffa..2cc3ac4d101d 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -20,6 +20,7 @@
>   #include "rcu-string.h"
>   #include "dev-replace.h"
>   #include "sysfs.h"
> +#include "hmzoned.h"
>   
>   static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>   				       int scrub_ret);
> @@ -201,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>   		return PTR_ERR(bdev);
>   	}
>   
> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> +		btrfs_err(fs_info,
> +			  "zone type of target device mismatch with the filesystem!");
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
>   	sync_blockdev(bdev);
>   
>   	devices = &fs_info->fs_devices->devices;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5f7ee70b3d1a..8854ff2e5fa5 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -40,6 +40,7 @@
>   #include "compression.h"
>   #include "tree-checker.h"
>   #include "ref-verify.h"
> +#include "hmzoned.h"
>   
>   #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
>   				 BTRFS_HEADER_FLAG_RELOC |\
> @@ -3123,6 +3124,13 @@ int open_ctree(struct super_block *sb,
>   
>   	btrfs_free_extra_devids(fs_devices, 1);
>   
> +	ret = btrfs_check_hmzoned_mode(fs_info);
> +	if (ret) {
> +		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
> +				ret);
> +		goto fail_block_groups;
> +	}
> +
>   	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>   	if (ret) {
>   		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
> index bfd04792dd62..512674d8f488 100644
> --- a/fs/btrfs/hmzoned.c
> +++ b/fs/btrfs/hmzoned.c
> @@ -160,3 +160,70 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   
>   	return 0;
>   }
> +
> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 hmzoned_devices = 0;
> +	u64 nr_devices = 0;
> +	u64 zone_size = 0;
> +	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
> +	int ret = 0;
> +
> +	/* Count zoned devices */
> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
> +		if (!device->bdev)
> +			continue;
> +		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
> +		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
> +		     incompat_hmzoned)) {
> +			hmzoned_devices++;
> +			if (!zone_size) {
> +				zone_size = device->zone_info->zone_size;
> +			} else if (device->zone_info->zone_size != zone_size) {
> +				btrfs_err(fs_info,
> +					  "Zoned block devices must have equal zone sizes");
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +		}
> +		nr_devices++;
> +	}
> +
> +	if (!hmzoned_devices && incompat_hmzoned) {
> +		/* No zoned block device found on HMZONED FS */
> +		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
> +		ret = -EINVAL;
> +		goto out;


  When does HMZONED get enabled? I presume during mkfs. Where are
  the related btrfs-progs patches? Searching for them doesn't show up
  anything in the ML. Looks like I am missing something, and the cover
  letter did not say anything about the progs part either.

Thanks, Anand

> +	}
> +
> +	if (!hmzoned_devices && !incompat_hmzoned)
> +		goto out;
> +
> +	fs_info->zone_size = zone_size;
> +
> +	if (hmzoned_devices != nr_devices) {
> +		btrfs_err(fs_info,
> +			  "zoned devices cannot be mixed with regular devices");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
> +	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
> +	 * check the alignment here.
> +	 */
> +	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
> +		btrfs_err(fs_info,
> +			  "zone size is not aligned to BTRFS_STRIPE_LEN");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
> +		   fs_info->zone_size);
> +out:
> +	return ret;
> +}
> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
> index ffc70842135e..29cfdcabff2f 100644
> --- a/fs/btrfs/hmzoned.h
> +++ b/fs/btrfs/hmzoned.h
> @@ -9,6 +9,8 @@
>   #ifndef BTRFS_HMZONED_H
>   #define BTRFS_HMZONED_H
>   
> +#include <linux/blkdev.h>
> +
>   struct btrfs_zoned_device_info {
>   	/*
>   	 * Number of zones, zone size and types of zones if bdev is a
> @@ -25,6 +27,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   		       struct blk_zone *zone, gfp_t gfp_mask);
>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
>   
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>   {
> @@ -76,4 +79,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>   	btrfs_dev_set_empty_zone_bit(device, pos, false);
>   }
>   
> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
> +						struct block_device *bdev)
> +{
> +	u64 zone_size;
> +
> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> +		/* Do not allow non-zoned device */
> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
> +	}
> +
> +	/* Do not allow Host Manged zoned device */
> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
> +}
> +
>   #endif
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 78de9d5d80c6..d7879a5a2536 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -43,6 +43,7 @@
>   #include "free-space-cache.h"
>   #include "backref.h"
>   #include "space-info.h"
> +#include "hmzoned.h"
>   #include "tests/btrfs-tests.h"
>   
>   #include "qgroup.h"
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8e5a894e7bde..755b2ec1e0de 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2572,6 +2572,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>   	if (IS_ERR(bdev))
>   		return PTR_ERR(bdev);
>   
> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
>   	if (fs_devices->seeding) {
>   		seeding_dev = 1;
>   		down_write(&sb->s_umount);
> 



* Re: [PATCH v3 02/27] btrfs: Get zone information of zoned block devices
  2019-08-16  4:44   ` Anand Jain
@ 2019-08-16 14:19     ` Damien Le Moal
  2019-08-16 23:47       ` Anand Jain
  0 siblings, 1 reply; 39+ messages in thread
From: Damien Le Moal @ 2019-08-16 14:19 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel

On 2019/08/15 21:47, Anand Jain wrote:
> On 8/8/19 5:30 PM, Naohiro Aota wrote:
>> If a zoned block device is found, get its zone information (number of zones
>> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
>> avoid costly run-time zone report commands to test the device zones type
>> during block allocation, attach the seq_zones bitmap to the device
>> structure to indicate if a zone is sequential or accept random writes. Also
>> it attaches the empty_zones bitmap to indicate if a zone is empty or not.
>>
>> This patch also introduces the helper function btrfs_dev_is_sequential() to
>> test if the zone storing a block is a sequential write required zone and
>> btrfs_dev_is_empty_zone() to test if the zone is a empty zone.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/Makefile  |   2 +-
>>   fs/btrfs/hmzoned.c | 162 +++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/hmzoned.h |  79 ++++++++++++++++++++++
>>   fs/btrfs/volumes.c |  18 ++++-
>>   fs/btrfs/volumes.h |   4 ++
>>   5 files changed, 262 insertions(+), 3 deletions(-)
>>   create mode 100644 fs/btrfs/hmzoned.c
>>   create mode 100644 fs/btrfs/hmzoned.h
>>
>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>> index 76a843198bcb..8d93abb31074 100644
>> --- a/fs/btrfs/Makefile
>> +++ b/fs/btrfs/Makefile
>> @@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>>   	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>>   	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>> -	   block-rsv.o delalloc-space.o
>> +	   block-rsv.o delalloc-space.o hmzoned.o
>>   
>>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>> new file mode 100644
>> index 000000000000..bfd04792dd62
>> --- /dev/null
>> +++ b/fs/btrfs/hmzoned.c
>> @@ -0,0 +1,162 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>> + * Authors:
>> + *	Naohiro Aota	<naohiro.aota@wdc.com>
>> + *	Damien Le Moal	<damien.lemoal@wdc.com>
>> + */
>> +
>> +#include <linux/slab.h>
>> +#include <linux/blkdev.h>
>> +#include "ctree.h"
>> +#include "volumes.h"
>> +#include "hmzoned.h"
>> +#include "rcu-string.h"
>> +
>> +/* Maximum number of zones to report per blkdev_report_zones() call */
>> +#define BTRFS_REPORT_NR_ZONES   4096
>> +
>> +static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
>> +			       struct blk_zone **zones_ret,
>> +			       unsigned int *nr_zones, gfp_t gfp_mask)
>> +{
>> +	struct blk_zone *zones = *zones_ret;
>> +	int ret;
>> +
>> +	if (!zones) {
>> +		zones = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
>> +		if (!zones)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
>> +				  zones, nr_zones, gfp_mask);
>> +	if (ret != 0) {
>> +		btrfs_err_in_rcu(device->fs_info,
>> +				 "get zone at %llu on %s failed %d", pos,
>> +				 rcu_str_deref(device->name), ret);
>> +		return ret;
>> +	}
>> +	if (!*nr_zones)
>> +		return -EIO;
>> +
>> +	*zones_ret = zones;
>> +
>> +	return 0;
>> +}
>> +
>> +int btrfs_get_dev_zone_info(struct btrfs_device *device)
>> +{
>> +	struct btrfs_zoned_device_info *zone_info = NULL;
>> +	struct block_device *bdev = device->bdev;
>> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
>> +	sector_t sector = 0;
>> +	struct blk_zone *zones = NULL;
>> +	unsigned int i, nreported = 0, nr_zones;
>> +	unsigned int zone_sectors;
>> +	int ret;
>> +
>> +	if (!bdev_is_zoned(bdev))
>> +		return 0;
>> +
>> +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
>> +	if (!zone_info)
>> +		return -ENOMEM;
>> +
>> +	zone_sectors = bdev_zone_sectors(bdev);
>> +	ASSERT(is_power_of_2(zone_sectors));
>> +	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
>> +	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
>> +	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
>> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
>> +		zone_info->nr_zones++;
>> +
>> +	zone_info->seq_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
>> +				       sizeof(*zone_info->seq_zones),
>> +				       GFP_KERNEL);
>> +	if (!zone_info->seq_zones) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	zone_info->empty_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
>> +					 sizeof(*zone_info->empty_zones),
>> +					 GFP_KERNEL);
>> +	if (!zone_info->empty_zones) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	/* Get zones type */
>> +	while (sector < nr_sectors) {
>> +		nr_zones = BTRFS_REPORT_NR_ZONES;
>> +		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
>> +					  &zones, &nr_zones, GFP_KERNEL);
> 
> 
> How many zones do we see in a disk? Not many I presume.

A 15 TB SMR drive with 256 MB zones (which is a fairly common value for products
out there) has over 55,000 zones. "Not many" is subjective... I personally
consider 55,000 a large number, and one should take care to write appropriate
code to manage that many objects.
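
For reference, the zone count can be reproduced with a few lines of userspace C.
This is only a sketch under stated assumptions: a capacity of exactly 15 * 10^12
bytes and 256 MiB zones. The helper names count_zones() and count_report_calls()
are made up for illustration; the round-up for a trailing partial zone mirrors
the nr_zones computation in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9          /* 512-byte sectors, as in the kernel */
#define REPORT_NR_ZONES 4096u   /* same value as BTRFS_REPORT_NR_ZONES in the patch */

/* Hypothetical helper: number of zones, counting a trailing partial zone. */
static uint64_t count_zones(uint64_t nr_sectors, uint64_t zone_sectors)
{
	uint64_t nr_zones;

	/* The patch asserts a power-of-2 zone size; require the same here. */
	assert(zone_sectors && (zone_sectors & (zone_sectors - 1)) == 0);

	nr_zones = nr_sectors / zone_sectors;
	if (nr_sectors & (zone_sectors - 1))
		nr_zones++;
	return nr_zones;
}

/* Hypothetical helper: report calls needed if each call returns a full batch. */
static uint64_t count_report_calls(uint64_t nr_zones)
{
	return (nr_zones + REPORT_NR_ZONES - 1) / REPORT_NR_ZONES;
}
```

With nr_sectors = 15000000000000 >> SECTOR_SHIFT and zone_sectors =
(256 << 20) >> SECTOR_SHIFT, count_zones() gives 55880 zones, and
count_report_calls(55880) gives 14 calls, assuming every call returns a full
batch of 4096 descriptors.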

> Here the allocation for %zones is inconsistent for each zone, unless
> there is substantial performance benefits, a consistent flow of
> alloc/free is fine as it makes the code easy to read and verify.

I do not understand your comment here. btrfs_get_dev_zones() will allocate and
fill the zones array with at most BTRFS_REPORT_NR_ZONES zone descriptors on the
first call. On subsequent calls, the same array is reused until information on
all zones of the disk is obtained. "the allocation for %zones is inconsistent
for each zone" does not make much sense. What exactly do you mean?
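
To make that flow concrete, here is a minimal userspace sketch of the
allocate-once, reuse-afterwards pattern. All names (struct zone_desc,
fake_report(), get_zones(), scan_all_zones()) are hypothetical stand-ins, not
the kernel API, and fake_report() only pretends to be a zone report. One
deliberate difference from the patch: the sketch stores the pointer back as soon
as it is allocated, so the caller can always free it.

```c
#include <assert.h>
#include <stdlib.h>

#define REPORT_BATCH 4096u   /* stands in for BTRFS_REPORT_NR_ZONES */

struct zone_desc { unsigned long long start, len; };

/* Pretend zone report: fills up to *nr descriptors starting at zone 'first'. */
static void fake_report(unsigned int first, unsigned int total,
			struct zone_desc *zones, unsigned int *nr)
{
	unsigned int i, n;

	assert(first <= total);
	n = total - first < *nr ? total - first : *nr;
	for (i = 0; i < n; i++) {
		zones[i].start = first + i;
		zones[i].len = 1;
	}
	*nr = n;
}

/* Allocates *zones_ret on the first call only; later calls reuse it. */
static int get_zones(unsigned int first, unsigned int total,
		     struct zone_desc **zones_ret, unsigned int *nr)
{
	struct zone_desc *zones = *zones_ret;

	if (!zones) {
		zones = calloc(*nr, sizeof(*zones));
		if (!zones)
			return -1;
		*zones_ret = zones;   /* publish now so the caller can free it */
	}
	fake_report(first, total, zones, nr);
	return *nr ? 0 : -1;
}

/* Caller: walks all zones with one buffer and frees it once at the end. */
static int scan_all_zones(unsigned int total, unsigned int *allocations_seen)
{
	struct zone_desc *zones = NULL, *prev = NULL;
	unsigned int done = 0, nr;
	int ret = 0;

	*allocations_seen = 0;
	while (done < total) {
		nr = REPORT_BATCH;
		ret = get_zones(done, total, &zones, &nr);
		if (ret)
			break;
		if (zones != prev) {   /* pointer changed: a new allocation happened */
			(*allocations_seen)++;
			prev = zones;
		}
		done += nr;
	}
	free(zones);
	return ret ? -1 : (int)done;
}
```

For a device with 55880 zones, scan_all_zones() goes through 14 batches but
observes a single allocation, which is the point of passing the array back
through zones_ret.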

> 
> 
> Thanks, Anand
> 
>> +		if (ret)
>> +			goto out;
>> +
>> +		for (i = 0; i < nr_zones; i++) {
>> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
>> +				set_bit(nreported, zone_info->seq_zones);
>> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
>> +				set_bit(nreported, zone_info->empty_zones);
>> +			nreported++;
>> +		}
>> +		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
>> +	}
>> +
>> +	if (nreported != zone_info->nr_zones) {
>> +		btrfs_err_in_rcu(device->fs_info,
>> +				 "inconsistent number of zones on %s (%u / %u)",
>> +				 rcu_str_deref(device->name), nreported,
>> +				 zone_info->nr_zones);
>> +		ret = -EIO;
>> +		goto out;
>> +	}
>> +
>> +	device->zone_info = zone_info;
>> +
>> +	btrfs_info_in_rcu(
>> +		device->fs_info,
>> +		"host-%s zoned block device %s, %u zones of %llu sectors",
>> +		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
>> +		rcu_str_deref(device->name), zone_info->nr_zones,
>> +		zone_info->zone_size >> SECTOR_SHIFT);
>> +
>> +out:
>> +	kfree(zones);
>> +
>> +	if (ret) {
>> +		kfree(zone_info->seq_zones);
>> +		kfree(zone_info->empty_zones);
>> +		kfree(zone_info);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
>> +{
>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>> +
>> +	if (!zone_info)
>> +		return;
>> +
>> +	kfree(zone_info->seq_zones);
>> +	kfree(zone_info->empty_zones);
>> +	kfree(zone_info);
>> +	device->zone_info = NULL;
>> +}
>> +
>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>> +		       struct blk_zone *zone, gfp_t gfp_mask)
>> +{
>> +	unsigned int nr_zones = 1;
>> +	int ret;
>> +
>> +	ret = btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
>> +	if (ret != 0 || !nr_zones)
>> +		return ret ? ret : -EIO;
>> +
>> +	return 0;
>> +}
>> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
>> new file mode 100644
>> index 000000000000..ffc70842135e
>> --- /dev/null
>> +++ b/fs/btrfs/hmzoned.h
>> @@ -0,0 +1,79 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>> + * Authors:
>> + *	Naohiro Aota	<naohiro.aota@wdc.com>
>> + *	Damien Le Moal	<damien.lemoal@wdc.com>
>> + */
>> +
>> +#ifndef BTRFS_HMZONED_H
>> +#define BTRFS_HMZONED_H
>> +
>> +struct btrfs_zoned_device_info {
>> +	/*
>> +	 * Number of zones, zone size and types of zones if bdev is a
>> +	 * zoned block device.
>> +	 */
>> +	u64 zone_size;
>> +	u8  zone_size_shift;
>> +	u32 nr_zones;
>> +	unsigned long *seq_zones;
>> +	unsigned long *empty_zones;
>> +};
>> +
>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>> +		       struct blk_zone *zone, gfp_t gfp_mask);
>> +int btrfs_get_dev_zone_info(struct btrfs_device *device);
>> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>> +
>> +static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>> +{
>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>> +
>> +	if (!zone_info)
>> +		return false;
>> +
>> +	return test_bit(pos >> zone_info->zone_size_shift,
>> +			zone_info->seq_zones);
>> +}
>> +
>> +static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
>> +{
>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>> +
>> +	if (!zone_info)
>> +		return true;
>> +
>> +	return test_bit(pos >> zone_info->zone_size_shift,
>> +			zone_info->empty_zones);
>> +}
>> +
>> +static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
>> +						u64 pos, bool set)
>> +{
>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>> +	unsigned int zno;
>> +
>> +	if (!zone_info)
>> +		return;
>> +
>> +	zno = pos >> zone_info->zone_size_shift;
>> +	if (set)
>> +		set_bit(zno, zone_info->empty_zones);
>> +	else
>> +		clear_bit(zno, zone_info->empty_zones);
>> +}
>> +
>> +static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
>> +					    u64 pos)
>> +{
>> +	btrfs_dev_set_empty_zone_bit(device, pos, true);
>> +}
>> +
>> +static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>> +					      u64 pos)
>> +{
>> +	btrfs_dev_set_empty_zone_bit(device, pos, false);
>> +}
>> +
>> +#endif
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d74b74ca07af..8e5a894e7bde 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -29,6 +29,7 @@
>>   #include "sysfs.h"
>>   #include "tree-checker.h"
>>   #include "space-info.h"
>> +#include "hmzoned.h"
>>   
>>   const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>   	[BTRFS_RAID_RAID10] = {
>> @@ -342,6 +343,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>   	rcu_string_free(device->name);
>>   	extent_io_tree_release(&device->alloc_state);
>>   	bio_put(device->flush_bio);
>> +	btrfs_destroy_dev_zone_info(device);
>>   	kfree(device);
>>   }
>>   
>> @@ -847,6 +849,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>>   	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>>   	device->mode = flags;
>>   
>> +	/* Get zone type information of zoned block devices */
>> +	ret = btrfs_get_dev_zone_info(device);
>> +	if (ret != 0)
>> +		goto error_brelse;
>> +
>>   	fs_devices->open_devices++;
>>   	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>>   	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
>> @@ -2598,6 +2605,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	}
>>   	rcu_assign_pointer(device->name, name);
>>   
>> +	device->fs_info = fs_info;
>> +	device->bdev = bdev;
>> +
>> +	/* Get zone type information of zoned block devices */
>> +	ret = btrfs_get_dev_zone_info(device);
>> +	if (ret)
>> +		goto error_free_device;
>> +
>>   	trans = btrfs_start_transaction(root, 0);
>>   	if (IS_ERR(trans)) {
>>   		ret = PTR_ERR(trans);
>> @@ -2614,8 +2629,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   					 fs_info->sectorsize);
>>   	device->disk_total_bytes = device->total_bytes;
>>   	device->commit_total_bytes = device->total_bytes;
>> -	device->fs_info = fs_info;
>> -	device->bdev = bdev;
>>   	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>>   	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
>>   	device->mode = FMODE_EXCL;
>> @@ -2756,6 +2769,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   		sb->s_flags |= SB_RDONLY;
>>   	if (trans)
>>   		btrfs_end_transaction(trans);
>> +	btrfs_destroy_dev_zone_info(device);
>>   error_free_device:
>>   	btrfs_free_device(device);
>>   error:
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 7f6aa1816409..5da1f354db93 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -57,6 +57,8 @@ struct btrfs_io_geometry {
>>   #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
>>   #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>>   
>> +struct btrfs_zoned_device_info;
>> +
>>   struct btrfs_device {
>>   	struct list_head dev_list; /* device_list_mutex */
>>   	struct list_head dev_alloc_list; /* chunk mutex */
>> @@ -77,6 +79,8 @@ struct btrfs_device {
>>   
>>   	struct block_device *bdev;
>>   
>> +	struct btrfs_zoned_device_info *zone_info;
>> +
>>   	/* the mode sent to blkdev_get */
>>   	fmode_t mode;
>>   
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-16  5:46   ` Anand Jain
@ 2019-08-16 14:23     ` Damien Le Moal
  2019-08-16 23:56       ` Anand Jain
  0 siblings, 1 reply; 39+ messages in thread
From: Damien Le Moal @ 2019-08-16 14:23 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel

On 2019/08/15 22:46, Anand Jain wrote:
> On 8/8/19 5:30 PM, Naohiro Aota wrote:
>> HMZONED mode cannot be used together with the RAID5/6 profile for now.
>> Introduce the function btrfs_check_hmzoned_mode() to check this. This
>> function will also check if HMZONED flag is enabled on the file system and
>> if the file system consists of zoned devices with equal zone size.
>>
>> Additionally, as updates to the space cache are in-place, the space cache
>> cannot be located over sequential zones and there is no guarantees that the
>> device will have enough conventional zones to store this cache. Resolve
>> this problem by disabling completely the space cache.  This does not
>> introduces any problems with sequential block groups: all the free space is
>> located after the allocation pointer and no free space before the pointer.
>> There is no need to have such cache.
>>
>> For the same reason, NODATACOW is also disabled.
>>
>> Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
>> INODE_MAP_CACHE inode.
> 
>   A list of incompatibility features with zoned devices. This need better
>   documentation, may be a table and its reason is better.

Are you referring to the format of the commit message itself? Or would you like
to see documentation added to Documentation/filesystems/btrfs.txt?

> 
> 
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/ctree.h       |  3 ++
>>   fs/btrfs/dev-replace.c |  8 +++++
>>   fs/btrfs/disk-io.c     |  8 +++++
>>   fs/btrfs/hmzoned.c     | 67 ++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/hmzoned.h     | 18 ++++++++++++
>>   fs/btrfs/super.c       |  1 +
>>   fs/btrfs/volumes.c     |  5 ++++
>>   7 files changed, 110 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 299e11e6c554..a00ce8c4d678 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -713,6 +713,9 @@ struct btrfs_fs_info {
>>   	struct btrfs_root *uuid_root;
>>   	struct btrfs_root *free_space_root;
>>   
>> +	/* Zone size when in HMZONED mode */
>> +	u64 zone_size;
>> +
>>   	/* the log root tree is a directory of all the other log roots */
>>   	struct btrfs_root *log_root_tree;
>>   
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index 6b2e9aa83ffa..2cc3ac4d101d 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -20,6 +20,7 @@
>>   #include "rcu-string.h"
>>   #include "dev-replace.h"
>>   #include "sysfs.h"
>> +#include "hmzoned.h"
>>   
>>   static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>   				       int scrub_ret);
>> @@ -201,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>   		return PTR_ERR(bdev);
>>   	}
>>   
>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>> +		btrfs_err(fs_info,
>> +			  "zone type of target device mismatch with the filesystem!");
>> +		ret = -EINVAL;
>> +		goto error;
>> +	}
>> +
>>   	sync_blockdev(bdev);
>>   
>>   	devices = &fs_info->fs_devices->devices;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5f7ee70b3d1a..8854ff2e5fa5 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -40,6 +40,7 @@
>>   #include "compression.h"
>>   #include "tree-checker.h"
>>   #include "ref-verify.h"
>> +#include "hmzoned.h"
>>   
>>   #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
>>   				 BTRFS_HEADER_FLAG_RELOC |\
>> @@ -3123,6 +3124,13 @@ int open_ctree(struct super_block *sb,
>>   
>>   	btrfs_free_extra_devids(fs_devices, 1);
>>   
>> +	ret = btrfs_check_hmzoned_mode(fs_info);
>> +	if (ret) {
>> +		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
>> +				ret);
>> +		goto fail_block_groups;
>> +	}
>> +
>>   	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>>   	if (ret) {
>>   		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>> index bfd04792dd62..512674d8f488 100644
>> --- a/fs/btrfs/hmzoned.c
>> +++ b/fs/btrfs/hmzoned.c
>> @@ -160,3 +160,70 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>   
>>   	return 0;
>>   }
>> +
>> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 hmzoned_devices = 0;
>> +	u64 nr_devices = 0;
>> +	u64 zone_size = 0;
>> +	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
>> +	int ret = 0;
>> +
>> +	/* Count zoned devices */
>> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>> +		if (!device->bdev)
>> +			continue;
>> +		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
>> +		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
>> +		     incompat_hmzoned)) {
>> +			hmzoned_devices++;
>> +			if (!zone_size) {
>> +				zone_size = device->zone_info->zone_size;
>> +			} else if (device->zone_info->zone_size != zone_size) {
>> +				btrfs_err(fs_info,
>> +					  "Zoned block devices must have equal zone sizes");
>> +				ret = -EINVAL;
>> +				goto out;
>> +			}
>> +		}
>> +		nr_devices++;
>> +	}
>> +
>> +	if (!hmzoned_devices && incompat_hmzoned) {
>> +		/* No zoned block device found on HMZONED FS */
>> +		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
>> +		ret = -EINVAL;
>> +		goto out;
> 
> 
>   When does HMZONED get enabled? I presume during mkfs. Where are the
>   related btrfs-progs patches? Searching the ML for them doesn't turn
>   up anything. It looks like I am missing something, and the cover
>   letter did not say anything about the progs part either.
> 
> Thanks, Anand
> 
>> +	}
>> +
>> +	if (!hmzoned_devices && !incompat_hmzoned)
>> +		goto out;
>> +
>> +	fs_info->zone_size = zone_size;
>> +
>> +	if (hmzoned_devices != nr_devices) {
>> +		btrfs_err(fs_info,
>> +			  "zoned devices cannot be mixed with regular devices");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
>> +	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
>> +	 * check the alignment here.
>> +	 */
>> +	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
>> +		btrfs_err(fs_info,
>> +			  "zone size is not aligned to BTRFS_STRIPE_LEN");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
>> +		   fs_info->zone_size);
>> +out:
>> +	return ret;
>> +}
>> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
>> index ffc70842135e..29cfdcabff2f 100644
>> --- a/fs/btrfs/hmzoned.h
>> +++ b/fs/btrfs/hmzoned.h
>> @@ -9,6 +9,8 @@
>>   #ifndef BTRFS_HMZONED_H
>>   #define BTRFS_HMZONED_H
>>   
>> +#include <linux/blkdev.h>
>> +
>>   struct btrfs_zoned_device_info {
>>   	/*
>>   	 * Number of zones, zone size and types of zones if bdev is a
>> @@ -25,6 +27,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>   		       struct blk_zone *zone, gfp_t gfp_mask);
>>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
>>   
>>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>>   {
>> @@ -76,4 +79,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>>   	btrfs_dev_set_empty_zone_bit(device, pos, false);
>>   }
>>   
>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
>> +						struct block_device *bdev)
>> +{
>> +	u64 zone_size;
>> +
>> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>> +		/* Do not allow non-zoned device */
>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>> +	}
>> +
>> +	/* Do not allow Host Manged zoned device */
>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>> +}
>> +
>>   #endif
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 78de9d5d80c6..d7879a5a2536 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -43,6 +43,7 @@
>>   #include "free-space-cache.h"
>>   #include "backref.h"
>>   #include "space-info.h"
>> +#include "hmzoned.h"
>>   #include "tests/btrfs-tests.h"
>>   
>>   #include "qgroup.h"
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 8e5a894e7bde..755b2ec1e0de 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2572,6 +2572,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	if (IS_ERR(bdev))
>>   		return PTR_ERR(bdev);
>>   
>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>> +		ret = -EINVAL;
>> +		goto error;
>> +	}
>> +
>>   	if (fs_devices->seeding) {
>>   		seeding_dev = 1;
>>   		down_write(&sb->s_umount);
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 02/27] btrfs: Get zone information of zoned block devices
  2019-08-16 14:19     ` Damien Le Moal
@ 2019-08-16 23:47       ` Anand Jain
  2019-08-16 23:55         ` Damien Le Moal
  0 siblings, 1 reply; 39+ messages in thread
From: Anand Jain @ 2019-08-16 23:47 UTC (permalink / raw)
  To: Damien Le Moal, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel



On 8/16/19 10:19 PM, Damien Le Moal wrote:
> On 2019/08/15 21:47, Anand Jain wrote:
>> On 8/8/19 5:30 PM, Naohiro Aota wrote:
>>> If a zoned block device is found, get its zone information (number of zones
>>> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
>>> avoid costly run-time zone report commands to test the device zones type
>>> during block allocation, attach the seq_zones bitmap to the device
>>> structure to indicate if a zone is sequential or accept random writes. Also
>>> it attaches the empty_zones bitmap to indicate if a zone is empty or not.
>>>
>>> This patch also introduces the helper function btrfs_dev_is_sequential() to
>>> test if the zone storing a block is a sequential write required zone and
>>> btrfs_dev_is_empty_zone() to test if the zone is a empty zone.
>>>
>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>    fs/btrfs/Makefile  |   2 +-
>>>    fs/btrfs/hmzoned.c | 162 +++++++++++++++++++++++++++++++++++++++++++++
>>>    fs/btrfs/hmzoned.h |  79 ++++++++++++++++++++++
>>>    fs/btrfs/volumes.c |  18 ++++-
>>>    fs/btrfs/volumes.h |   4 ++
>>>    5 files changed, 262 insertions(+), 3 deletions(-)
>>>    create mode 100644 fs/btrfs/hmzoned.c
>>>    create mode 100644 fs/btrfs/hmzoned.h
>>>
>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>>> index 76a843198bcb..8d93abb31074 100644
>>> --- a/fs/btrfs/Makefile
>>> +++ b/fs/btrfs/Makefile
>>> @@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>>    	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>>>    	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>>>    	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>>> -	   block-rsv.o delalloc-space.o
>>> +	   block-rsv.o delalloc-space.o hmzoned.o
>>>    
>>>    btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>>    btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>>> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>>> new file mode 100644
>>> index 000000000000..bfd04792dd62
>>> --- /dev/null
>>> +++ b/fs/btrfs/hmzoned.c
>>> @@ -0,0 +1,162 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>>> + * Authors:
>>> + *	Naohiro Aota	<naohiro.aota@wdc.com>
>>> + *	Damien Le Moal	<damien.lemoal@wdc.com>
>>> + */
>>> +
>>> +#include <linux/slab.h>
>>> +#include <linux/blkdev.h>
>>> +#include "ctree.h"
>>> +#include "volumes.h"
>>> +#include "hmzoned.h"
>>> +#include "rcu-string.h"
>>> +
>>> +/* Maximum number of zones to report per blkdev_report_zones() call */
>>> +#define BTRFS_REPORT_NR_ZONES   4096
>>> +
>>> +static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
>>> +			       struct blk_zone **zones_ret,
>>> +			       unsigned int *nr_zones, gfp_t gfp_mask)
>>> +{
>>> +	struct blk_zone *zones = *zones_ret;
>>> +	int ret;
>>> +
>>> +	if (!zones) {
>>> +		zones = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
>>> +		if (!zones)
>>> +			return -ENOMEM;
>>> +	}
>>> +
>>> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
>>> +				  zones, nr_zones, gfp_mask);
>>> +	if (ret != 0) {
>>> +		btrfs_err_in_rcu(device->fs_info,
>>> +				 "get zone at %llu on %s failed %d", pos,
>>> +				 rcu_str_deref(device->name), ret);
>>> +		return ret;
>>> +	}
>>> +	if (!*nr_zones)
>>> +		return -EIO;
>>> +
>>> +	*zones_ret = zones;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +int btrfs_get_dev_zone_info(struct btrfs_device *device)
>>> +{
>>> +	struct btrfs_zoned_device_info *zone_info = NULL;
>>> +	struct block_device *bdev = device->bdev;
>>> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
>>> +	sector_t sector = 0;
>>> +	struct blk_zone *zones = NULL;
>>> +	unsigned int i, nreported = 0, nr_zones;
>>> +	unsigned int zone_sectors;
>>> +	int ret;
>>> +
>>> +	if (!bdev_is_zoned(bdev))
>>> +		return 0;
>>> +
>>> +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
>>> +	if (!zone_info)
>>> +		return -ENOMEM;
>>> +
>>> +	zone_sectors = bdev_zone_sectors(bdev);
>>> +	ASSERT(is_power_of_2(zone_sectors));
>>> +	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
>>> +	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
>>> +	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
>>> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
>>> +		zone_info->nr_zones++;
>>> +
>>> +	zone_info->seq_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
>>> +				       sizeof(*zone_info->seq_zones),
>>> +				       GFP_KERNEL);
>>> +	if (!zone_info->seq_zones) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	zone_info->empty_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
>>> +					 sizeof(*zone_info->empty_zones),
>>> +					 GFP_KERNEL);
>>> +	if (!zone_info->empty_zones) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	/* Get zones type */
>>> +	while (sector < nr_sectors) {
>>> +		nr_zones = BTRFS_REPORT_NR_ZONES;
>>> +		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
>>> +					  &zones, &nr_zones, GFP_KERNEL);
>>
>>
>> How many zones do we see in a disk? Not many I presume.
> 
> A 15 TB SMR drive with 256 MB zones (which is a fairly common value for products
> out there) has over 55,000 zones. "Not many" is subjective... I personally
> consider 55000 a large number and that one should take care to write appropriate
> code to manage that many objects.

  Agree that's pretty large.
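
To put that number in perspective, here is a minimal user-space sketch (not
part of the patch) of the arithmetic the patch uses for zone_info->nr_zones:
shift by log2 of the zone size in sectors, then round up for a partial last
zone. The helper name zone_count() is made up for this illustration, and
reading "15 TB" as decimal bytes with 256 MiB zones is an assumption.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * Number of zones covering nr_sectors, counting a smaller last zone.
 * zone_sectors is assumed to be a non-zero power of two, which is what
 * the patch asserts with is_power_of_2().
 */
static uint64_t zone_count(uint64_t nr_sectors, uint64_t zone_sectors)
{
	unsigned int shift = 0;
	uint64_t n;

	/* Open-coded ilog2() for a power-of-two value. */
	while ((1ULL << shift) < zone_sectors)
		shift++;

	n = nr_sectors >> shift;
	if (nr_sectors & (zone_sectors - 1))
		n++;	/* partial zone at the end of the device */

	return n;
}
```

Under those assumptions,
zone_count(15000000000000ULL >> SECTOR_SHIFT, (256ULL << 20) >> SECTOR_SHIFT)
gives 55880 zones, so each one-bit-per-zone bitmap (seq_zones, empty_zones)
takes roughly 7 KB.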

>> Here the allocation for %zones is inconsistent for each zone, unless
>> there is substantial performance benefits, a consistent flow of
>> alloc/free is fine as it makes the code easy to read and verify.
> 
> I do not understand your comment here. btrfs_get_dev_zones() will allocate and
> fill the zones array with at most BTRFS_REPORT_NR_ZONES zone descriptors on the
> first call. On subsequent calls, the same array is reused until information on
> all zones of the disk is obtained. "the allocation for %zones is inconsistent
> for each zone" does not make much sense. What exactly do you mean?

  btrfs_get_dev_zones() allocates the memory for %zones_ret and expects
  its caller, btrfs_get_dev_zone_info(), to free it. Instead, can we do
  both the alloc and the free in the caller, btrfs_get_dev_zone_info()?
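
A rough user-space sketch of that shape, where the caller owns the report
buffer for its whole lifetime and the helper only fills it. This is not the
patch's code: struct zone_desc, report_zones(), count_zones() and
REPORT_NR_ZONES are hypothetical stand-ins, and the batch size is arbitrary.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct blk_zone; only the fields used here. */
struct zone_desc {
	unsigned long long start;
	unsigned long long len;
};

#define REPORT_NR_ZONES 4096	/* arbitrary batch size for this sketch */

/*
 * Hypothetical stand-in for the device report call. Fills at most *nr
 * descriptors, starting at zone index 'first' of a device with 'total'
 * zones, into a buffer owned by the caller. Stores the number filled
 * in *nr.
 */
static int report_zones(unsigned int total, unsigned int first,
			struct zone_desc *zones, unsigned int *nr)
{
	unsigned int i, n = 0;

	for (i = first; i < total && n < *nr; i++, n++) {
		zones[n].start = (unsigned long long)i * 100;
		zones[n].len = 100;
	}
	*nr = n;

	return n ? 0 : -EIO;
}

/* The caller allocates the buffer once, reuses it per batch, and frees it. */
static int count_zones(unsigned int total, unsigned int *counted)
{
	struct zone_desc *zones;
	unsigned int first = 0, nr;
	int ret = 0;

	zones = calloc(REPORT_NR_ZONES, sizeof(*zones));
	if (!zones)
		return -ENOMEM;

	*counted = 0;
	while (first < total) {
		nr = REPORT_NR_ZONES;
		ret = report_zones(total, first, zones, &nr);
		if (ret)
			break;
		*counted += nr;
		first += nr;
	}

	free(zones);	/* allocation and free sit in the same function */
	return ret;
}
```

With allocation and free paired in one function, the helper no longer needs
a double pointer, and every error path in the caller frees the same buffer.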

Thanks, Anand

>>
>>
>> Thanks, Anand
>>
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		for (i = 0; i < nr_zones; i++) {
>>> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
>>> +				set_bit(nreported, zone_info->seq_zones);
>>> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
>>> +				set_bit(nreported, zone_info->empty_zones);
>>> +			nreported++;
>>> +		}
>>> +		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
>>> +	}
>>> +
>>> +	if (nreported != zone_info->nr_zones) {
>>> +		btrfs_err_in_rcu(device->fs_info,
>>> +				 "inconsistent number of zones on %s (%u / %u)",
>>> +				 rcu_str_deref(device->name), nreported,
>>> +				 zone_info->nr_zones);
>>> +		ret = -EIO;
>>> +		goto out;
>>> +	}
>>> +
>>> +	device->zone_info = zone_info;
>>> +
>>> +	btrfs_info_in_rcu(
>>> +		device->fs_info,
>>> +		"host-%s zoned block device %s, %u zones of %llu sectors",
>>> +		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
>>> +		rcu_str_deref(device->name), zone_info->nr_zones,
>>> +		zone_info->zone_size >> SECTOR_SHIFT);
>>> +
>>> +out:
>>> +	kfree(zones);
>>> +
>>> +	if (ret) {
>>> +		kfree(zone_info->seq_zones);
>>> +		kfree(zone_info->empty_zones);
>>> +		kfree(zone_info);
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
>>> +{
>>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>>> +
>>> +	if (!zone_info)
>>> +		return;
>>> +
>>> +	kfree(zone_info->seq_zones);
>>> +	kfree(zone_info->empty_zones);
>>> +	kfree(zone_info);
>>> +	device->zone_info = NULL;
>>> +}
>>> +
>>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>> +		       struct blk_zone *zone, gfp_t gfp_mask)
>>> +{
>>> +	unsigned int nr_zones = 1;
>>> +	int ret;
>>> +
>>> +	ret = btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
>>> +	if (ret != 0 || !nr_zones)
>>> +		return ret ? ret : -EIO;
>>> +
>>> +	return 0;
>>> +}
>>> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
>>> new file mode 100644
>>> index 000000000000..ffc70842135e
>>> --- /dev/null
>>> +++ b/fs/btrfs/hmzoned.h
>>> @@ -0,0 +1,79 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>>> + * Authors:
>>> + *	Naohiro Aota	<naohiro.aota@wdc.com>
>>> + *	Damien Le Moal	<damien.lemoal@wdc.com>
>>> + */
>>> +
>>> +#ifndef BTRFS_HMZONED_H
>>> +#define BTRFS_HMZONED_H
>>> +
>>> +struct btrfs_zoned_device_info {
>>> +	/*
>>> +	 * Number of zones, zone size and types of zones if bdev is a
>>> +	 * zoned block device.
>>> +	 */
>>> +	u64 zone_size;
>>> +	u8  zone_size_shift;
>>> +	u32 nr_zones;
>>> +	unsigned long *seq_zones;
>>> +	unsigned long *empty_zones;
>>> +};
>>> +
>>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>> +		       struct blk_zone *zone, gfp_t gfp_mask);
>>> +int btrfs_get_dev_zone_info(struct btrfs_device *device);
>>> +void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>>> +
>>> +static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>>> +{
>>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>>> +
>>> +	if (!zone_info)
>>> +		return false;
>>> +
>>> +	return test_bit(pos >> zone_info->zone_size_shift,
>>> +			zone_info->seq_zones);
>>> +}
>>> +
>>> +static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
>>> +{
>>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>>> +
>>> +	if (!zone_info)
>>> +		return true;
>>> +
>>> +	return test_bit(pos >> zone_info->zone_size_shift,
>>> +			zone_info->empty_zones);
>>> +}
>>> +
>>> +static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
>>> +						u64 pos, bool set)
>>> +{
>>> +	struct btrfs_zoned_device_info *zone_info = device->zone_info;
>>> +	unsigned int zno;
>>> +
>>> +	if (!zone_info)
>>> +		return;
>>> +
>>> +	zno = pos >> zone_info->zone_size_shift;
>>> +	if (set)
>>> +		set_bit(zno, zone_info->empty_zones);
>>> +	else
>>> +		clear_bit(zno, zone_info->empty_zones);
>>> +}
>>> +
>>> +static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
>>> +					    u64 pos)
>>> +{
>>> +	btrfs_dev_set_empty_zone_bit(device, pos, true);
>>> +}
>>> +
>>> +static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>>> +					      u64 pos)
>>> +{
>>> +	btrfs_dev_set_empty_zone_bit(device, pos, false);
>>> +}
>>> +
>>> +#endif
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index d74b74ca07af..8e5a894e7bde 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -29,6 +29,7 @@
>>>    #include "sysfs.h"
>>>    #include "tree-checker.h"
>>>    #include "space-info.h"
>>> +#include "hmzoned.h"
>>>    
>>>    const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>>    	[BTRFS_RAID_RAID10] = {
>>> @@ -342,6 +343,7 @@ void btrfs_free_device(struct btrfs_device *device)
>>>    	rcu_string_free(device->name);
>>>    	extent_io_tree_release(&device->alloc_state);
>>>    	bio_put(device->flush_bio);
>>> +	btrfs_destroy_dev_zone_info(device);
>>>    	kfree(device);
>>>    }
>>>    
>>> @@ -847,6 +849,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>>>    	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>>>    	device->mode = flags;
>>>    
>>> +	/* Get zone type information of zoned block devices */
>>> +	ret = btrfs_get_dev_zone_info(device);
>>> +	if (ret != 0)
>>> +		goto error_brelse;
>>> +
>>>    	fs_devices->open_devices++;
>>>    	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>>>    	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
>>> @@ -2598,6 +2605,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    	}
>>>    	rcu_assign_pointer(device->name, name);
>>>    
>>> +	device->fs_info = fs_info;
>>> +	device->bdev = bdev;
>>> +
>>> +	/* Get zone type information of zoned block devices */
>>> +	ret = btrfs_get_dev_zone_info(device);
>>> +	if (ret)
>>> +		goto error_free_device;
>>> +
>>>    	trans = btrfs_start_transaction(root, 0);
>>>    	if (IS_ERR(trans)) {
>>>    		ret = PTR_ERR(trans);
>>> @@ -2614,8 +2629,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    					 fs_info->sectorsize);
>>>    	device->disk_total_bytes = device->total_bytes;
>>>    	device->commit_total_bytes = device->total_bytes;
>>> -	device->fs_info = fs_info;
>>> -	device->bdev = bdev;
>>>    	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>>>    	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
>>>    	device->mode = FMODE_EXCL;
>>> @@ -2756,6 +2769,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    		sb->s_flags |= SB_RDONLY;
>>>    	if (trans)
>>>    		btrfs_end_transaction(trans);
>>> +	btrfs_destroy_dev_zone_info(device);
>>>    error_free_device:
>>>    	btrfs_free_device(device);
>>>    error:
>>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>>> index 7f6aa1816409..5da1f354db93 100644
>>> --- a/fs/btrfs/volumes.h
>>> +++ b/fs/btrfs/volumes.h
>>> @@ -57,6 +57,8 @@ struct btrfs_io_geometry {
>>>    #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
>>>    #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>>>    
>>> +struct btrfs_zoned_device_info;
>>> +
>>>    struct btrfs_device {
>>>    	struct list_head dev_list; /* device_list_mutex */
>>>    	struct list_head dev_alloc_list; /* chunk mutex */
>>> @@ -77,6 +79,8 @@ struct btrfs_device {
>>>    
>>>    	struct block_device *bdev;
>>>    
>>> +	struct btrfs_zoned_device_info *zone_info;
>>> +
>>>    	/* the mode sent to blkdev_get */
>>>    	fmode_t mode;
>>>    
>>>
>>
>>
> 
> 


* Re: [PATCH v3 02/27] btrfs: Get zone information of zoned block devices
  2019-08-16 23:47       ` Anand Jain
@ 2019-08-16 23:55         ` Damien Le Moal
  0 siblings, 0 replies; 39+ messages in thread
From: Damien Le Moal @ 2019-08-16 23:55 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel

On 2019/08/16 16:48, Anand Jain wrote:
[...]
>>> How many zones do we see in a disk? Not many I presume.
>>
>> A 15 TB SMR drive with 256 MB zones (which is a fairly common value for products
>> out there) has over 55,000 zones. "Not many" is subjective... I personally
>> consider 55000 a large number and that one should take care to write appropriate
>> code to manage that many objects.
> 
>   Agree that's pretty large.
> 
>>> Here the allocation for %zones is inconsistent for each zone, unless
>>> there is substantial performance benefits, a consistent flow of
>>> alloc/free is fine as it makes the code easy to read and verify.
>>
>> I do not understand your comment here. btrfs_get_dev_zones() will allocate and
>> fill the zones array with at most BTRFS_REPORT_NR_ZONES zone descriptors on the
>> first call. On subsequent calls, the same array is reused until information on
>> all zones of the disk is obtained. "the allocation for %zones is inconsistent
>> for each zone" does not make much sense. What exactly do you mean?
> 
>   btrfs_get_dev_zones() allocates the memory for %zones_ret and expects
>   its caller, btrfs_get_dev_zone_info(), to free it. Instead, can we do
>   both the alloc and the free in the caller, btrfs_get_dev_zone_info()?

Got it. Yes, we can change that. Thanks.

> 
> Thanks, Anand

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-16 14:23     ` Damien Le Moal
@ 2019-08-16 23:56       ` Anand Jain
  2019-08-17  0:05         ` Damien Le Moal
  2019-08-20  5:07         ` Naohiro Aota
  0 siblings, 2 replies; 39+ messages in thread
From: Anand Jain @ 2019-08-16 23:56 UTC (permalink / raw)
  To: Damien Le Moal, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel



On 8/16/19 10:23 PM, Damien Le Moal wrote:
> On 2019/08/15 22:46, Anand Jain wrote:
>> On 8/8/19 5:30 PM, Naohiro Aota wrote:
>>> HMZONED mode cannot be used together with the RAID5/6 profile for now.
>>> Introduce the function btrfs_check_hmzoned_mode() to check this. This
>>> function will also check if HMZONED flag is enabled on the file system and
>>> if the file system consists of zoned devices with equal zone size.
>>>
>>> Additionally, as updates to the space cache are in-place, the space cache
>>> cannot be located over sequential zones and there is no guarantees that the
>>> device will have enough conventional zones to store this cache. Resolve
>>> this problem by disabling completely the space cache.  This does not
>>> introduces any problems with sequential block groups: all the free space is
>>> located after the allocation pointer and no free space before the pointer.
>>> There is no need to have such cache.
>>>
>>> For the same reason, NODATACOW is also disabled.
>>>
>>> Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
>>> INODE_MAP_CACHE inode.
>>
>>    A list of incompatibility features with zoned devices. This need better
>>    documentation, may be a table and its reason is better.
> 
> Are you referring to the format of the commit message itself ? Or would you like
> to see a documentation added to Documentation/filesystems/btrfs.txt ?

  Documenting it in the commit changelog is fine, but a list format
  would be clearer, since we now have a set of features that are
  incompatible with zoned devices.

more below..


>>
>>
>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>    fs/btrfs/ctree.h       |  3 ++
>>>    fs/btrfs/dev-replace.c |  8 +++++
>>>    fs/btrfs/disk-io.c     |  8 +++++
>>>    fs/btrfs/hmzoned.c     | 67 ++++++++++++++++++++++++++++++++++++++++++
>>>    fs/btrfs/hmzoned.h     | 18 ++++++++++++
>>>    fs/btrfs/super.c       |  1 +
>>>    fs/btrfs/volumes.c     |  5 ++++
>>>    7 files changed, 110 insertions(+)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 299e11e6c554..a00ce8c4d678 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -713,6 +713,9 @@ struct btrfs_fs_info {
>>>    	struct btrfs_root *uuid_root;
>>>    	struct btrfs_root *free_space_root;
>>>    
>>> +	/* Zone size when in HMZONED mode */
>>> +	u64 zone_size;
>>> +
>>>    	/* the log root tree is a directory of all the other log roots */
>>>    	struct btrfs_root *log_root_tree;
>>>    
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index 6b2e9aa83ffa..2cc3ac4d101d 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -20,6 +20,7 @@
>>>    #include "rcu-string.h"
>>>    #include "dev-replace.h"
>>>    #include "sysfs.h"
>>> +#include "hmzoned.h"
>>>    
>>>    static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>    				       int scrub_ret);
>>> @@ -201,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>>    		return PTR_ERR(bdev);
>>>    	}
>>>    
>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>> +		btrfs_err(fs_info,
>>> +			  "zone type of target device mismatch with the filesystem!");
>>> +		ret = -EINVAL;
>>> +		goto error;
>>> +	}
>>> +
>>>    	sync_blockdev(bdev);
>>>    
>>>    	devices = &fs_info->fs_devices->devices;
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 5f7ee70b3d1a..8854ff2e5fa5 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -40,6 +40,7 @@
>>>    #include "compression.h"
>>>    #include "tree-checker.h"
>>>    #include "ref-verify.h"
>>> +#include "hmzoned.h"
>>>    
>>>    #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
>>>    				 BTRFS_HEADER_FLAG_RELOC |\
>>> @@ -3123,6 +3124,13 @@ int open_ctree(struct super_block *sb,
>>>    
>>>    	btrfs_free_extra_devids(fs_devices, 1);
>>>    
>>> +	ret = btrfs_check_hmzoned_mode(fs_info);
>>> +	if (ret) {
>>> +		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
>>> +				ret);
>>> +		goto fail_block_groups;
>>> +	}
>>> +
>>>    	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>>>    	if (ret) {
>>>    		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>>> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>>> index bfd04792dd62..512674d8f488 100644
>>> --- a/fs/btrfs/hmzoned.c
>>> +++ b/fs/btrfs/hmzoned.c
>>> @@ -160,3 +160,70 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>>    
>>>    	return 0;
>>>    }
>>> +
>>> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>>> +{
>>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> +	struct btrfs_device *device;
>>> +	u64 hmzoned_devices = 0;
>>> +	u64 nr_devices = 0;
>>> +	u64 zone_size = 0;
>>> +	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
>>> +	int ret = 0;
>>> +
>>> +	/* Count zoned devices */
>>> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>> +		if (!device->bdev)
>>> +			continue;
>>> +		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
>>> +		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
>>> +		     incompat_hmzoned)) {
>>> +			hmzoned_devices++;
>>> +			if (!zone_size) {
>>> +				zone_size = device->zone_info->zone_size;
>>> +			} else if (device->zone_info->zone_size != zone_size) {
>>> +				btrfs_err(fs_info,
>>> +					  "Zoned block devices must have equal zone sizes");
>>> +				ret = -EINVAL;
>>> +				goto out;
>>> +			}
>>> +		}
>>> +		nr_devices++;
>>> +	}
>>> +
>>> +	if (!hmzoned_devices && incompat_hmzoned) {
>>> +		/* No zoned block device found on HMZONED FS */
>>> +		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
>>> +		ret = -EINVAL;
>>> +		goto out;
>>
>>
>>    When does the HMZONED gets enabled? I presume during mkfs. Where are
>>    the related btrfs-progs patches? Searching for the related btrfs-progs
>>    patches doesn't show up anything in the ML. Looks like I am missing
>>    something, nor the cover letter said anything about the progs part.


  Any idea about this comment above?

Thanks, Anand


>> Thanks, Anand
>>
>>> +	}
>>> +
>>> +	if (!hmzoned_devices && !incompat_hmzoned)
>>> +		goto out;
>>> +
>>> +	fs_info->zone_size = zone_size;
>>> +
>>> +	if (hmzoned_devices != nr_devices) {
>>> +		btrfs_err(fs_info,
>>> +			  "zoned devices cannot be mixed with regular devices");
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>> +	/*
>>> +	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
>>> +	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
>>> +	 * check the alignment here.
>>> +	 */
>>> +	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
>>> +		btrfs_err(fs_info,
>>> +			  "zone size is not aligned to BTRFS_STRIPE_LEN");
>>> +		ret = -EINVAL;
>>> +		goto out;
>>> +	}
>>> +
>>> +	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
>>> +		   fs_info->zone_size);
>>> +out:
>>> +	return ret;
>>> +}
>>> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
>>> index ffc70842135e..29cfdcabff2f 100644
>>> --- a/fs/btrfs/hmzoned.h
>>> +++ b/fs/btrfs/hmzoned.h
>>> @@ -9,6 +9,8 @@
>>>    #ifndef BTRFS_HMZONED_H
>>>    #define BTRFS_HMZONED_H
>>>    
>>> +#include <linux/blkdev.h>
>>> +
>>>    struct btrfs_zoned_device_info {
>>>    	/*
>>>    	 * Number of zones, zone size and types of zones if bdev is a
>>> @@ -25,6 +27,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>>    		       struct blk_zone *zone, gfp_t gfp_mask);
>>>    int btrfs_get_dev_zone_info(struct btrfs_device *device);
>>>    void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>>> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
>>>    
>>>    static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>>>    {
>>> @@ -76,4 +79,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
>>>    	btrfs_dev_set_empty_zone_bit(device, pos, false);
>>>    }
>>>    
>>> +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
>>> +						struct block_device *bdev)
>>> +{
>>> +	u64 zone_size;
>>> +
>>> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>>> +		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>>> +		/* Do not allow non-zoned device */
>>> +		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
>>> +	}
>>> +
>>> +	/* Do not allow Host Managed zoned device */
>>> +	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
>>> +}
>>> +
>>>    #endif
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 78de9d5d80c6..d7879a5a2536 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -43,6 +43,7 @@
>>>    #include "free-space-cache.h"
>>>    #include "backref.h"
>>>    #include "space-info.h"
>>> +#include "hmzoned.h"
>>>    #include "tests/btrfs-tests.h"
>>>    
>>>    #include "qgroup.h"
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 8e5a894e7bde..755b2ec1e0de 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -2572,6 +2572,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>>    	if (IS_ERR(bdev))
>>>    		return PTR_ERR(bdev);
>>>    
>>> +	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
>>> +		ret = -EINVAL;
>>> +		goto error;
>>> +	}
>>> +
>>>    	if (fs_devices->seeding) {
>>>    		seeding_dev = 1;
>>>    		down_write(&sb->s_umount);
>>>
>>
>>
> 
> 


* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-16 23:56       ` Anand Jain
@ 2019-08-17  0:05         ` Damien Le Moal
  2019-08-20  5:07         ` Naohiro Aota
  1 sibling, 0 replies; 39+ messages in thread
From: Damien Le Moal @ 2019-08-17  0:05 UTC (permalink / raw)
  To: Anand Jain, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel

On 2019/08/16 16:57, Anand Jain wrote:
> 
> 
> On 8/16/19 10:23 PM, Damien Le Moal wrote:
>> On 2019/08/15 22:46, Anand Jain wrote:
>>> On 8/8/19 5:30 PM, Naohiro Aota wrote:
>>>> HMZONED mode cannot be used together with the RAID5/6 profile for now.
>>>> Introduce the function btrfs_check_hmzoned_mode() to check this. This
>>>> function will also check if HMZONED flag is enabled on the file system and
>>>> if the file system consists of zoned devices with equal zone size.
>>>>
>>>> Additionally, as updates to the space cache are in-place, the space cache
>>>> cannot be located over sequential zones and there is no guarantees that the
>>>> device will have enough conventional zones to store this cache. Resolve
>>>> this problem by disabling completely the space cache.  This does not
>>>> introduces any problems with sequential block groups: all the free space is
>>>> located after the allocation pointer and no free space before the pointer.
>>>> There is no need to have such cache.
>>>>
>>>> For the same reason, NODATACOW is also disabled.
>>>>
>>>> Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
>>>> INODE_MAP_CACHE inode.
>>>
>>>    A list of incompatibility features with zoned devices. This need better
>>>    documentation, may be a table and its reason is better.
>>
>> Are you referring to the format of the commit message itself ? Or would you like
>> to see a documentation added to Documentation/filesystems/btrfs.txt ?
> 
>   Documenting in the commit change log is fine. But it can be better
>   documented in a listed format as it looks like we have a list of
>   features which will be incompatible with zoned devices.

OK. We can update the btrfs.txt doc file.

> 
> more below..
[...]
>>>> +	if (!hmzoned_devices && incompat_hmzoned) {
>>>> +		/* No zoned block device found on HMZONED FS */
>>>> +		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
>>>> +		ret = -EINVAL;
>>>> +		goto out;
>>>
>>>
>>>    When does the HMZONED gets enabled? I presume during mkfs. Where are
>>>    the related btrfs-progs patches? Searching for the related btrfs-progs
>>>    patches doesn't show up anything in the ML. Looks like I am missing
>>>    something, nor the cover letter said anything about the progs part.
> 
> 
>   Any idea about this comment above?

Yep, the feature is set at format time if some of the devices in the volume are
zoned. The btrfs-progs changes to handle that are ready too.

Naohiro, please re-post btrfs-progs too !
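
For readers following the review, the mount-time checks quoted in this
thread (btrfs_check_hmzoned_mode()) can be approximated in user space as
below. This is only a sketch under assumptions: struct dev_desc, enum
zoned_model and check_zoned_mode() are hypothetical names, STRIPE_LEN is
assumed to be 64 KiB, error messages are dropped, and the skip of devices
without a bdev is left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the block layer zoned models. */
enum zoned_model { ZONED_NONE, ZONED_HA, ZONED_HM };

struct dev_desc {
	enum zoned_model model;
	uint64_t zone_size;	/* bytes; meaningful only for zoned devices */
};

#define STRIPE_LEN (64ULL * 1024)	/* assumed value of BTRFS_STRIPE_LEN */

/*
 * Returns 0 and stores the common zone size (0 when zoned mode is not in
 * use), or -EINVAL when the device set cannot be used.
 */
static int check_zoned_mode(const struct dev_desc *devs, size_t nr_devices,
			    bool incompat_flag, uint64_t *zone_size_ret)
{
	uint64_t zone_size = 0;
	size_t zoned = 0, i;

	*zone_size_ret = 0;

	/* Host-aware devices count as zoned only when the flag is set. */
	for (i = 0; i < nr_devices; i++) {
		if (devs[i].model == ZONED_HM ||
		    (devs[i].model == ZONED_HA && incompat_flag)) {
			zoned++;
			if (!zone_size)
				zone_size = devs[i].zone_size;
			else if (devs[i].zone_size != zone_size)
				return -EINVAL;	/* unequal zone sizes */
		}
	}

	if (!zoned)
		/* Flag set but no zoned device is an error; otherwise a
		 * regular file system. */
		return incompat_flag ? -EINVAL : 0;

	if (zoned != nr_devices)
		return -EINVAL;	/* zoned and regular devices mixed */

	if (zone_size % STRIPE_LEN)
		return -EINVAL;	/* zone size not stripe aligned */

	*zone_size_ret = zone_size;
	return 0;
}
```

Note that a host-aware device on a file system without the flag falls
through as a regular device, which matches the counting loop in the patch.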

> 
> Thanks, Anand

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-16 23:56       ` Anand Jain
  2019-08-17  0:05         ` Damien Le Moal
@ 2019-08-20  5:07         ` Naohiro Aota
  2019-08-20 13:05           ` David Sterba
  1 sibling, 1 reply; 39+ messages in thread
From: Naohiro Aota @ 2019-08-20  5:07 UTC (permalink / raw)
  To: Anand Jain
  Cc: Damien Le Moal, linux-btrfs, David Sterba, Chris Mason,
	Josef Bacik, Nikolay Borisov, Matias Bjorling,
	Johannes Thumshirn, Hannes Reinecke, linux-fsdevel

On Sat, Aug 17, 2019 at 07:56:50AM +0800, Anand Jain wrote:
>
>
>On 8/16/19 10:23 PM, Damien Le Moal wrote:
>>On 2019/08/15 22:46, Anand Jain wrote:
>>>On 8/8/19 5:30 PM, Naohiro Aota wrote:
>>>>HMZONED mode cannot be used together with the RAID5/6 profile for now.
>>>>Introduce the function btrfs_check_hmzoned_mode() to check this. This
>>>>function will also check if HMZONED flag is enabled on the file system and
>>>>if the file system consists of zoned devices with equal zone size.
>>>>
>>>>Additionally, as updates to the space cache are in-place, the space cache
>>>>cannot be located over sequential zones and there is no guarantees that the
>>>>device will have enough conventional zones to store this cache. Resolve
>>>>this problem by disabling completely the space cache.  This does not
>>>>introduces any problems with sequential block groups: all the free space is
>>>>located after the allocation pointer and no free space before the pointer.
>>>>There is no need to have such cache.
>>>>
>>>>For the same reason, NODATACOW is also disabled.
>>>>
>>>>Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
>>>>INODE_MAP_CACHE inode.
>>>
>>>   A list of incompatibility features with zoned devices. This need better
>>>   documentation, may be a table and its reason is better.
>>
>>Are you referring to the format of the commit message itself ? Or would you like
>>to see a documentation added to Documentation/filesystems/btrfs.txt ?
>
> Documenting in the commit change log is fine. But it can be better
> documented in a listed format as it looks like we have a list of
> features which will be incompatible with zoned devices.
>
>more below..

Sure. I will add a table in the next version.

btrfs.txt does not seem to have much in it. I'm considering writing a new
page in the wiki, like:

https://btrfs.wiki.kernel.org/index.php/Feature:Skinny_Metadata

>>>>+	if (!hmzoned_devices && incompat_hmzoned) {
>>>>+		/* No zoned block device found on HMZONED FS */
>>>>+		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
>>>>+		ret = -EINVAL;
>>>>+		goto out;
>>>
>>>
>>>   When does the HMZONED gets enabled? I presume during mkfs. Where are
>>>   the related btrfs-progs patches? Searching for the related btrfs-progs
>>>   patches doesn't show up anything in the ML. Looks like I am missing
>>>   something, nor the cover letter said anything about the progs part.
>
>
> Any idea about this comment above?
>
>Thanks, Anand

I just posted the updated version of the userland-side series:
https://lore.kernel.org/linux-btrfs/20190820045258.1571640-1-naohiro.aota@wdc.com/T/

Thanks,
Naohiro


* Re: [PATCH v3 03/27] btrfs: Check and enable HMZONED mode
  2019-08-20  5:07         ` Naohiro Aota
@ 2019-08-20 13:05           ` David Sterba
  0 siblings, 0 replies; 39+ messages in thread
From: David Sterba @ 2019-08-20 13:05 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Anand Jain, Chris Mason, David Sterba, Hannes Reinecke,
	Nikolay Borisov, Johannes Thumshirn, Josef Bacik, linux-btrfs,
	linux-fsdevel, Damien Le Moal, Matias Bjorling

On Tue, Aug 20, 2019 at 02:07:37PM +0900, Naohiro Aota wrote:
> >>>>cannot be located over sequential zones and there is no guarantees that the
> >>>>device will have enough conventional zones to store this cache. Resolve
> >>>>this problem by disabling completely the space cache.  This does not
> >>>>introduces any problems with sequential block groups: all the free space is
> >>>>located after the allocation pointer and no free space before the pointer.
> >>>>There is no need to have such cache.
> >>>>
> >>>>For the same reason, NODATACOW is also disabled.
> >>>>
> >>>>Also INODE_MAP_CACHE is also disabled to avoid preallocation in the
> >>>>INODE_MAP_CACHE inode.
> >>>
> >>>   A list of incompatibility features with zoned devices. This need better
> >>>   documentation, may be a table and its reason is better.
> >>
> >>Are you referring to the format of the commit message itself ? Or would you like
> >>to see a documentation added to Documentation/filesystems/btrfs.txt ?
> >
> > Documenting in the commit change log is fine. But it can be better
> > documented in a listed format as it looks like we have a list of
> > features which will be incompatible with zoned devices.
> >
> >more below..
> 
> Sure. I will add a table in the next version.
> 
> btrfs.txt seems not to have much there.

We don't use the in-kernel documentation; it's either the wiki or the
manual pages in btrfs-progs. The section 5 page covers some generic
topics, e.g. the swapfile limitations are there.


end of thread, other threads:[~2019-08-20 13:05 UTC | newest]

Thread overview: 39+ messages
2019-08-08  9:30 [PATCH v3 00/27] btrfs zoned block device support Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 01/27] btrfs: introduce HMZONED feature flag Naohiro Aota
2019-08-16  4:49   ` Anand Jain
2019-08-08  9:30 ` [PATCH v3 02/27] btrfs: Get zone information of zoned block devices Naohiro Aota
2019-08-16  4:44   ` Anand Jain
2019-08-16 14:19     ` Damien Le Moal
2019-08-16 23:47       ` Anand Jain
2019-08-16 23:55         ` Damien Le Moal
2019-08-08  9:30 ` [PATCH v3 03/27] btrfs: Check and enable HMZONED mode Naohiro Aota
2019-08-16  5:46   ` Anand Jain
2019-08-16 14:23     ` Damien Le Moal
2019-08-16 23:56       ` Anand Jain
2019-08-17  0:05         ` Damien Le Moal
2019-08-20  5:07         ` Naohiro Aota
2019-08-20 13:05           ` David Sterba
2019-08-08  9:30 ` [PATCH v3 04/27] btrfs: disallow RAID5/6 in " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 05/27] btrfs: disallow space_cache " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 06/27] btrfs: disallow NODATACOW " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 07/27] btrfs: disable tree-log " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 08/27] btrfs: disable fallocate " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 09/27] btrfs: align device extent allocation to zone boundary Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 10/27] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 13/27] btrfs: reset zones of unused block groups Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 14/27] btrfs: limit super block locations in HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 15/27] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 16/27] btrfs: serialize data allocation and submit IOs Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 17/27] btrfs: implement atomic compressed IO submission Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 18/27] btrfs: support direct write IO in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 19/27] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 20/27] btrfs: wait existing extents before truncating Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 21/27] btrfs: avoid async checksum/submit on HMZONED mode Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 22/27] btrfs: disallow mixed-bg in " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 23/27] btrfs: disallow inode_cache " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 24/27] btrfs: support dev-replace " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 25/27] btrfs: enable relocation " Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 26/27] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
2019-08-08  9:30 ` [PATCH v3 27/27] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
