linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v6 00/28] btrfs: zoned block device support
@ 2019-12-13  4:08 Naohiro Aota
  2019-12-13  4:08 ` [PATCH v6 01/28] btrfs: introduce HMZONED feature flag Naohiro Aota
                   ` (29 more replies)
  0 siblings, 30 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This series adds zoned block device support to btrfs.

Changes:
 - Changed -EINVAL to -EOPNOTSUPP to reject incompatible features
   within HMZONED mode (David)
 - Use bitmap helpers (Johannes)
 - Fix calculation of a string length
 - Code cleanup

The userland series is unchanged from the last version:
https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/T/

* Patch series description

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order from each zone's write pointer
position. This patch series ensures that the sequential write
constraint of sequential zones is respected while fundamentally not
changing btrfs block and I/O management for blocks stored in
conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned
block devices so that chunks are always aligned to a zone. Allocation
of blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunk. To do so, an allocation
pointer is added to block groups and used as the allocation hint.  The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation
regardless of the chunk usage.
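
As an illustration of the idea only, here is a minimal sketch; the
struct and function names are invented for the example and are not the
series' actual identifiers:

  /*
   * Illustrative sketch: allocation from a block group only ever moves
   * an allocation pointer forward, so writes within the chunk stay
   * sequential even when blocks below the pointer have been freed.
   */
  struct example_block_group {
          u64 start;         /* logical start of the chunk */
          u64 length;        /* chunk length, aligned to the zone size */
          u64 alloc_offset;  /* allocation pointer within the chunk */
  };

  static u64 example_alloc(struct example_block_group *bg, u64 num_bytes)
  {
          u64 ret;

          if (bg->alloc_offset + num_bytes > bg->length)
                  return 0;       /* no room; the caller tries another bg */

          ret = bg->start + bg->alloc_offset;
          bg->alloc_offset += num_bytes;   /* the pointer never moves back */
          return ret;
  }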

While the introduction of the allocation pointer ensures that blocks
will be allocated sequentially, I/Os to write out newly allocated
blocks may be issued out of order, causing errors when writing to
sequential zones.  To preserve the ordering, this patch series adds
mutexes around allocation and submit_bio and serializes them. This
series also disables async checksum and async submit to avoid mixing
the BIOs.

The zone of a chunk is reset to allow reuse of the zone only when the
block group is being freed, that is, when all the chunks of the block
group are unused.
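
Roughly, the idea looks like the sketch below. btrfs_dev_is_sequential()
and btrfs_dev_set_zone_empty() are the helpers added in patch 2, while
reset_dev_zone() is only a placeholder for the block layer zone reset
call, whose exact interface depends on the kernel version:

  /*
   * Sketch: once a block group becomes unused, reset the zone backing
   * each of its device extents so the space can be reused.
   * reset_dev_zone() is a placeholder, not a real helper.
   */
  static int example_reset_bg_zone(struct btrfs_device *device, u64 physical)
  {
          int ret;

          if (!btrfs_dev_is_sequential(device, physical))
                  return 0;           /* conventional zones need no reset */

          ret = reset_dev_zone(device, physical);      /* placeholder */
          if (ret)
                  return ret;

          btrfs_dev_set_zone_empty(device, physical);
          return 0;
  }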

For btrfs volumes composed of multiple zoned disks, a restriction is
added to ensure that all disks have the same zone size. This
restriction matches the existing constraint that all chunks in a block
group must have the same size.

* Serialization of write IOs

The most significant change from v2 is the serialization of sequential
block allocation and submit_bio using a per block group mutex instead
of waiting and sorting BIOs in a buffer. This per block group mutex is
locked before allocation and released after all BIO submissions
finish. The same method is used for both data and metadata IOs.
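
In outline, the scheme looks like the following sketch; the lock and the
allocation/submission helpers are illustrative names, not the actual
series code:

  /*
   * Sketch: the same per block group mutex is held over both the extent
   * allocation and the bio submission, so bios reach the device in the
   * same order as the allocation pointer advances.
   */
  struct example_zoned_bg {
          struct mutex zone_io_lock;  /* serializes alloc + submit */
          /* ... allocation pointer etc. as sketched earlier ... */
  };

  static int example_write_extent(struct example_zoned_bg *bg, u64 num_bytes)
  {
          u64 start;
          int ret;

          mutex_lock(&bg->zone_io_lock);    /* one writer per block group */
          start = example_alloc_extent(bg, num_bytes);        /* illustrative */
          ret = example_submit_extent_bio(bg, start, num_bytes); /* LBA order */
          mutex_unlock(&bg->zone_io_lock);  /* held across the submission */

          return ret;
  }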

Since a mutex is used instead of a submit buffer, EXTENT_PREALLOC must
be disabled entirely in HMZONED mode to prevent deadlocks. As a
result, INODE_MAP_CACHE and MIXED_BG are disabled in HMZONED mode, and
the relocation inode is reworked to use btrfs_wait_ordered_range()
after each relocation instead of relying on a preallocated file region.
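
A minimal sketch of the reworked relocation flow, assuming an
illustrative relocate_one_extent() helper (btrfs_wait_ordered_range() is
the existing btrfs function):

  /*
   * Sketch: write the relocated data CoW style and wait for the ordered
   * extent to complete, instead of preallocating the target range
   * (which would leave a write hole in a sequential zone).
   */
  static int example_relocate_range(struct reloc_control *rc,
                                    struct inode *inode, u64 start, u64 len)
  {
          int ret;

          /* CoW write of the relocated data, no preallocation involved */
          ret = relocate_one_extent(rc, inode, start, len);  /* illustrative */
          if (ret)
                  return ret;

          /* make sure the data is on disk before the next relocation step */
          return btrfs_wait_ordered_range(inode, start, len);
  }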

Furthermore, asynchronous checksumming is disabled and done inline with
the serialized block allocation and BIO submission. This preserves the
sequential write IO order without introducing any new functionality
such as submit buffers. Async submit will be removed once the cgroup
writeback support patch series is merged.

* Enabling tree-log

The tree-log feature does not work in HMZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which happens at a different time than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes, which HMZONED mode must avoid.

This series introduces a dedicated block group for tree-log blocks to
create two metadata writing streams, one for tree-log blocks and the
other for metadata blocks. As a result, each write stream can now be
written to devices separately and sequentially.

* Log-structured superblock

The superblock (and its copies) is the only data structure in btrfs
which has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place a superblock in such a
zone.

This series implements superblock log writing. It uses two zones as a
circular buffer to write updated superblocks. Once the first zone is
filled up, writing continues into the second zone and the first one is
reset. We can determine the position of the latest superblock by
reading the write pointer information from a device.
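
The lookup of the latest superblock can be sketched as below; this is a
simplified version of the sb_write_pointer() logic introduced in patch 8,
with most of the error handling trimmed:

  /*
   * Sketch: two zones form a circular superblock log. If the first zone
   * is not full, its write pointer marks the end of the log; otherwise
   * the log continues in the second zone. The latest superblock ends at
   * the write pointer.
   */
  static int example_latest_sb(struct blk_zone *zones, u64 *bytenr_ret)
  {
          u64 wp;

          if (zones[0].cond == BLK_ZONE_COND_EMPTY &&
              zones[1].cond == BLK_ZONE_COND_EMPTY)
                  return -ENOENT;                 /* nothing written yet */

          if (zones[0].cond != BLK_ZONE_COND_FULL)
                  wp = (u64)zones[0].wp << SECTOR_SHIFT;
          else
                  wp = (u64)zones[1].wp << SECTOR_SHIFT;

          *bytenr_ret = wp - BTRFS_SUPER_INFO_SIZE;
          return 0;
  }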

* Patch series organization

Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones of
the device (zone types and write pointer positions).

Patches 4 to 7 disable features which are not compatible with the
sequential write constraints of zoned block devices. These include
RAID5/6, space_cache, NODATACOW, and fallocate.

Patch 8 implements the log-structured superblock writing.

Patches 9 and 10 tweak the extent allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patches 11 and 12 handle the case where the write pointers of the
devices composing a block group (e.g. a RAID1 block group) do not match.

Patch 13 implements a zone reset for unused block groups.

Patches 14 to 20 implement the serialization of allocation and submit_bio
for several types of IO (non-compressed data, compressed data, direct IO,
and metadata). These include re-dirtying once-freed metadata blocks to
prevent write holes.

Patches 21 and 22 disable features which are not compatible with the
serialization to prevent deadlocks. These include MIXED_BG and
INODE_MAP_CACHE.

Patches 23 to 27 tweak some btrfs features to work with HMZONED mode.
These include device-replace, relocation, repairing IO errors, and
tree-log.

Finally, patch 28 adds the HMZONED feature to the list of supported
features.

* Patch testing note

This series is based on kdave/for-5.5.

** Zone-aware util-linux

Since the log-structured superblock feature changed the location of the
superblock magic, the current util-linux (libblkid) cannot detect
HMZONED btrfs anymore. You need to apply a to-be-posted patch to
util-linux to make it "zone aware".

** Testing device

You can use tcmu-runner [1] to create an emulated zoned device backed
by a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

You can also attach SMR disks on a host machine to a guest VM. Here is
a guide:
http://zonedstorage.io/projects/qemu/

** xfstests

We ran xfstests on HMZONED btrfs and, excluding the cases that are
currently known to fail, all test cases pass.

Cases that can be ignored:
1) failing also with the regular btrfs on regular devices,
2) trying to test the fallocate feature without checking it with
   "_require_xfs_io_command "falloc"",
3) trying to test incompatible features for HMZONED btrfs (e.g. RAID5/6)
4) trying to use an incompatible setup for HMZONED btrfs (e.g. dm-linear
   not aligned to a zone boundary, swap)
5) trying to create a file system with too small a size (we require at
   least 9 zones to create an HMZONED btrfs)
6) dropping the original MKFS_OPTIONS ("-O hmzoned"), so they cannot
   create an HMZONED btrfs (btrfs/003)
7) hitting ENOSPC caused by the larger metadata block group size [2]

I will send a patch series for xfstests to handle these cases (2-6)
properly.

[2] For example, generic/275 tries to fill a file system and checks the
    ENOSPC behavior. It creates a 2 GB (= 8 zones * 256 MB/zone) file
    system and tries to fill the FS using df's block count. Since we
    use 5 zones (1 for a superblock, 2 * 2 for metadata/system DUP), we
    cannot fill the FS beyond 51%, and the test fails with "could not
    sufficiently fill filesystem."

Also, you need to apply the following patch if you run xfstests with
tcmu devices. Without this patch, xfstests btrfs/003 fails at
"_devmgt_add" after "_devmgt_remove".

https://marc.info/?l=linux-scsi&m=156498625421698&w=2

v4 https://lwn.net/Articles/797061/
v3 https://lore.kernel.org/linux-btrfs/20190808093038.4163421-1-naohiro.aota@wdc.com/
v2 https://lore.kernel.org/linux-btrfs/20190607131025.31996-1-naohiro.aota@wdc.com/
v1 https://lore.kernel.org/linux-btrfs/20180809180450.5091-1-naota@elisp.net/

Changelog
v6:
 - Use bitmap helpers (Johannes)
 - Code cleanup (Johannes)
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v5:
 - Rebased on kdave/for-5.5
 - Enable the tree-log feature.
 - Treat conventional zones as sequential zones, so we can now allow
   mixed allocation of conventional zone and sequential write required
   zone to construct a block group.
 - Implement log-structured superblock
   - No need for one conventional zone at the beginning of a device.
 - Fix deadlock of direct IO writing
 - Fix building with !CONFIG_BLK_DEV_ZONED (Johannes)
 - Fix leak of zone_info (Johannes)
v4:
 - Move memory allocation of zone information out of
   btrfs_get_dev_zones() (Anand)
 - Add disabled features table in commit log (Anand)
 - Ensure "max_chunk_size >= devs_min * data_stripes * zone_size"
v3:
 - Serialize allocation and submit_bio instead of bio buffering in
   btrfs_map_bio().
 -- Disable async checksum/submit in HMZONED mode
 - Introduce helper functions and hmzoned.c/h (Josef, David)
 - Add support for repairing IO failure
 - Add support for NOCOW direct IO write (Josef)
 - Disable preallocation entirely
 -- Disable INODE_MAP_CACHE
 -- relocation is reworked not to rely on preallocation in HMZONED mode
 - Disable NODATACOW
 - Disable MIXED_BG
 - Device extent that cover super block position is banned (David)
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (28):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disallow RAID5/6 in HMZONED mode
  btrfs: disallow space_cache in HMZONED mode
  btrfs: disallow NODATACOW in HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: implement log-structured superblock for HMZONED mode
  btrfs: align device extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirroed BGs readonly only if we have at least one
    writable BG
  btrfs: ensure metadata space available on/after degraded mount in
    HMZONED
  btrfs: reset zones of unused block groups
  btrfs: redirty released extent buffers in HMZONED mode
  btrfs: serialize data allocation and submit IOs
  btrfs: implement atomic compressed IO submission
  btrfs: support direct write IO in HMZONED
  btrfs: serialize meta IOs on HMZONED mode
  btrfs: wait existing extents before truncating
  btrfs: avoid async checksum on HMZONED mode
  btrfs: disallow mixed-bg in HMZONED mode
  btrfs: disallow inode_cache in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable relocation in HMZONED mode
  btrfs: relocate block group to repair IO failure in HMZONED
  btrfs: split alloc_log_tree()
  btrfs: enable tree-log on HMZONED mode
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/Makefile           |    1 +
 fs/btrfs/block-group.c      |  124 +++-
 fs/btrfs/block-group.h      |   15 +
 fs/btrfs/ctree.h            |   10 +-
 fs/btrfs/dev-replace.c      |  186 +++++
 fs/btrfs/dev-replace.h      |    3 +
 fs/btrfs/disk-io.c          |   74 +-
 fs/btrfs/disk-io.h          |    2 +
 fs/btrfs/extent-tree.c      |  194 ++++-
 fs/btrfs/extent_io.c        |   33 +-
 fs/btrfs/extent_io.h        |    2 +
 fs/btrfs/file.c             |    4 +
 fs/btrfs/free-space-cache.c |   38 +
 fs/btrfs/free-space-cache.h |    2 +
 fs/btrfs/hmzoned.c          | 1329 +++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          |  300 ++++++++
 fs/btrfs/inode.c            |  107 ++-
 fs/btrfs/ioctl.c            |    3 +
 fs/btrfs/relocation.c       |   39 +-
 fs/btrfs/scrub.c            |  148 +++-
 fs/btrfs/space-info.c       |   13 +-
 fs/btrfs/space-info.h       |    4 +-
 fs/btrfs/super.c            |   12 +-
 fs/btrfs/sysfs.c            |    4 +
 fs/btrfs/transaction.c      |   10 +
 fs/btrfs/transaction.h      |    3 +
 fs/btrfs/tree-log.c         |   49 +-
 fs/btrfs/volumes.c          |  226 +++++-
 fs/btrfs/volumes.h          |    5 +
 include/uapi/linux/btrfs.h  |    1 +
 30 files changed, 2858 insertions(+), 83 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

-- 
2.24.0



* [PATCH v6 01/28] btrfs: introduce HMZONED feature flag
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13  4:08 ` [PATCH v6 02/28] btrfs: Get zone information of zoned block devices Naohiro Aota
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This patch introduces the HMZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 5ebbe8a5ee76..230c7ad90e22 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -260,6 +260,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -275,6 +276,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(hmzoned),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 7a8bc8b920f5..62c22bf1f702 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -271,6 +271,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 12)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.24.0



* [PATCH v6 02/28] btrfs: Get zone information of zoned block devices
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
  2019-12-13  4:08 ` [PATCH v6 01/28] btrfs: introduce HMZONED feature flag Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:18   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 03/28] btrfs: Check and enable HMZONED mode Naohiro Aota
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

If a zoned block device is found, get its zone information (number of zones
and zone size) using the new helper function btrfs_get_dev_zone_info().  To
avoid costly run-time zone report commands to test the device zone types
during block allocation, attach the seq_zones bitmap to the device
structure to indicate whether a zone is sequential or accepts random
writes. Also attach the empty_zones bitmap to indicate whether a zone is
empty or not.

This patch also introduces the helper function btrfs_dev_is_sequential() to
test if the zone storing a block is a sequential write required zone and
btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
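
For example, a caller could consult the new bitmaps roughly like this
(an illustrative sketch, not part of the patch):

  /*
   * Sketch: decide how a write at 'physical' on this device has to be
   * issued, using the helpers introduced by this patch.
   */
  static void example_check_zone(struct btrfs_device *device, u64 physical)
  {
          if (!btrfs_dev_is_sequential(device, physical)) {
                  /* conventional zone: random writes are allowed */
                  return;
          }

          /* sequential zone: writes must go to the zone write pointer */
          if (btrfs_dev_is_empty_zone(device, physical)) {
                  /* first write to this zone: it is no longer empty */
                  btrfs_dev_clear_zone_empty(device, physical);
          }
  }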

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/Makefile  |   1 +
 fs/btrfs/hmzoned.c | 168 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  92 +++++++++++++++++++++++++
 fs/btrfs/volumes.c |  18 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 281 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..64aaeed397a4 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+btrfs-$(CONFIG_BLK_DEV_ZONED) += hmzoned.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
new file mode 100644
index 000000000000..6a13763d2916
--- /dev/null
+++ b/fs/btrfs/hmzoned.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota	<naohiro.aota@wdc.com>
+ *	Damien Le Moal	<damien.lemoal@wdc.com>
+ */
+
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "hmzoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES   4096
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones, unsigned int *nr_zones)
+{
+	int ret;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, zones,
+				  nr_zones);
+	if (ret != 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	if (!*nr_zones)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+	char devstr[sizeof(device->fs_info->sb->s_id) +
+		    sizeof(" (device )") - 1];
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto free_zone_info;
+	}
+
+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto free_seq_zones;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones) {
+		ret = -ENOMEM;
+		goto free_empty_zones;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
+					  &nr_zones);
+		if (ret)
+			goto free_zones;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto free_zones;
+	}
+
+	kfree(zones);
+
+	device->zone_info = zone_info;
+
+	devstr[0] = 0;
+	if (device->fs_info)
+		snprintf(devstr, sizeof(devstr), " (device %s)",
+			 device->fs_info->sb->s_id);
+
+	rcu_read_lock();
+	pr_info(
+"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
+		devstr,
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+	rcu_read_unlock();
+
+	return 0;
+
+free_zones:
+	kfree(zones);
+free_empty_zones:
+	bitmap_free(zone_info->empty_zones);
+free_seq_zones:
+	bitmap_free(zone_info->seq_zones);
+free_zone_info:
+	kfree(zone_info);
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	bitmap_free(zone_info->seq_zones);
+	bitmap_free(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
new file mode 100644
index 000000000000..0f8006f39aaf
--- /dev/null
+++ b/fs/btrfs/hmzoned.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota	<naohiro.aota@wdc.com>
+ *	Damien Le Moal	<damien.lemoal@wdc.com>
+ */
+
+#ifndef BTRFS_HMZONED_H
+#define BTRFS_HMZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+#ifdef CONFIG_BLK_DEV_ZONED
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+				     struct blk_zone *zone)
+{
+	return 0;
+}
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	return 0;
+}
+static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+#endif
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..18ea8dfce244 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -30,6 +30,7 @@
 #include "tree-checker.h"
 #include "space-info.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -366,6 +367,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }
 
@@ -650,6 +652,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_brelse;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2421,6 +2428,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);
 
+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2437,8 +2452,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 					 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2571,6 +2584,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..70cabe65f72a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -53,6 +53,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 
+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -66,6 +68,8 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
-- 
2.24.0



* [PATCH v6 03/28] btrfs: Check and enable HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
  2019-12-13  4:08 ` [PATCH v6 01/28] btrfs: introduce HMZONED feature flag Naohiro Aota
  2019-12-13  4:08 ` [PATCH v6 02/28] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:21   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 04/28] btrfs: disallow RAID5/6 in " Naohiro Aota
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check if the HMZONED flag is enabled on the file system
and if the file system consists of zoned devices with equal zone sizes.

Additionally, as updates to the space cache are in-place, the space cache
cannot be located over sequential zones and there are no guarantees that
the device will have enough conventional zones to store this cache. Resolve
this problem by completely disabling the space cache.  This does not
introduce any problems in HMZONED mode: all the free space is located after
the allocation pointer and no free space is located before the pointer.
There is no need for such a cache.

For the same reason, NODATACOW is also disabled.

INODE_MAP_CACHE is also disabled to avoid preallocation in the
INODE_MAP_CACHE inode.

In summary, HMZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID5/6           | 1) Non-full stripe write cause overwriting of       |
|                   | parity block                                        |
|                   | 2) Rebuilding on high capacity volume (usually with |
|                   | SMR) can lead to higher failure rate                |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
| INODE_MAP_CACHE   | Need pre-allocation. (and will be deprecated?)      |
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   | data writes                                         |
| async checksum    | Not to mix up bios by multiple workers              |

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  8 +++++
 fs/btrfs/disk-io.c     |  8 +++++
 fs/btrfs/hmzoned.c     | 77 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 26 ++++++++++++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 +++
 7 files changed, 128 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b2e8fd8a8e59..44517802b9e5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -541,6 +541,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
 
+	/* Zone size when in HMZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index f639dde2a679..9286c6e0b636 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -21,6 +21,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "hmzoned.h"
 
 static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 				       int scrub_ret);
@@ -202,6 +203,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "zone type of target device mismatch with the filesystem!");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);
 
 	devices = &fs_info->fs_devices->devices;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0edfdc9c82b..ff418e393f82 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -3082,6 +3083,13 @@ int __cold open_ctree(struct super_block *sb,
 
 	btrfs_free_extra_devids(fs_devices, 1);
 
+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+				ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 6a13763d2916..0182bfb9c903 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -166,3 +166,80 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 
 	return 0;
 }
+
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		enum blk_zoned_model model;
+
+		if (!device->bdev)
+			continue;
+
+		model = bdev_zoned_model(device->bdev);
+		if (model == BLK_ZONED_HM ||
+		    (model == BLK_ZONED_HA && incompat_hmzoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && !incompat_hmzoned)
+		goto out;
+
+	if (!hmzoned_devices && incompat_hmzoned) {
+		/* No zoned block device found on HMZONED FS */
+		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices && !incompat_hmzoned) {
+		btrfs_err(fs_info,
+			  "Enable HMZONED mode to mount HMZONED device");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fs_info->zone_size = zone_size;
+
+	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 0f8006f39aaf..8e17f64ff986 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_HMZONED_H
 #define BTRFS_HMZONED_H
 
+#include <linux/blkdev.h>
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -26,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -37,6 +40,14 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	return 0;
 }
 static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	btrfs_err(fs_info, "Zoned block devices support is not enabled");
+	return -EOPNOTSUPP;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -89,4 +100,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }
 
+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a98c3c71fc54..616f5abec267 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -44,6 +44,7 @@
 #include "backref.h"
 #include "space-info.h"
 #include "sysfs.h"
+#include "hmzoned.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 18ea8dfce244..ab3590b310af 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2395,6 +2395,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);
-- 
2.24.0



* [PATCH v6 04/28] btrfs: disallow RAID5/6 in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (2 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 03/28] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:21   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 05/28] btrfs: disallow space_cache " Naohiro Aota
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Supporting the RAID5/6 profile in HMZONED mode is not trivial. For example,
non-full stripe writes will overwrite parity blocks. When we do a non-full
stripe write, it writes the parity block with the data at that moment.
Then, another write to the same stripe will try to overwrite the parity
block with a new parity value. However, sequential zones do not allow such
parity overwriting.

Furthermore, using RAID5/6 on SMR drives, which usually have a huge
capacity, incurs a large rebuild overhead. Such overhead can lead to a
higher volume failure rate (e.g. an additional drive failure during
rebuild) because of the increased rebuild time.

Thus, let's disable the RAID5/6 profile in HMZONED mode for now.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 0182bfb9c903..1b24facd46b8 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -236,6 +236,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	/* RAID56 is not allowed */
+	if (btrfs_fs_incompat(fs_info, RAID56)) {
+		btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
-- 
2.24.0



* [PATCH v6 05/28] btrfs: disallow space_cache in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (3 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 04/28] btrfs: disallow RAID5/6 in " Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:24   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 06/28] btrfs: disallow NODATACOW " Naohiro Aota
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

As updates to the space cache v1 are in-place, the space cache cannot be
located over sequential zones and there are no guarantees that the device
will have enough conventional zones to store this cache. Resolve this
problem by completely disabling space cache v1.  This does not introduce
any problems with sequential block groups: all the free space is located
after the allocation pointer and no free space is located before the
pointer. There is no need for such a cache.

Note: we can technically use the free-space-tree (space cache v2) in
HMZONED mode. But since HMZONED mode now always allocates extents in a
block group sequentially regardless of the underlying device zone type,
there is no point in enabling and maintaining the tree.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h |  5 +++++
 fs/btrfs/super.c   | 11 +++++++++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 1b24facd46b8..d62f11652973 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -250,3 +250,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_fs_incompat(info, HMZONED))
+		return 0;
+
+	/*
+	 * SPACE CACHE writing is not CoWed. Disable that to avoid write
+	 * errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+			  "space cache v1 not supported in HMZONED mode");
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 8e17f64ff986..d9ebe11afdf5 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -29,6 +29,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -48,6 +49,10 @@ static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 	btrfs_err(fs_info, "Zoned block devices support is not enabled");
 	return -EOPNOTSUPP;
 }
+static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 616f5abec267..1424c3c6e3cf 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -442,8 +442,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_fs_incompat(info, HMZONED))
+			btrfs_info(info,
+			"ignoring existing space cache in HMZONED mode");
+		else
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -879,6 +884,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		ret = -EINVAL;
 
 	}
+	if (!ret)
+		ret = btrfs_check_mountopts_hmzoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
-- 
2.24.0



* [PATCH v6 06/28] btrfs: disallow NODATACOW in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (4 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 05/28] btrfs: disallow space_cache " Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:25   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 07/28] btrfs: disable fallocate " Naohiro Aota
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

NODATACOW implies overwriting the file data on a device, which is
impossible in sequential write required zones. Disable the NODATACOW global
mount option and the per-file NODATACOW attribute by masking FS_NOCOW_FL.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/ioctl.c   | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index d62f11652973..21b8737dd289 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -266,5 +266,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info,
+		  "cannot enable nodatacow with HMZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a1ee0b775e65..a67421eb8bd5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -94,6 +94,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 		unsigned int flags)
 {
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		flags &= ~FS_NOCOW_FL;
+
 	if (S_ISDIR(inode->i_mode))
 		return flags;
 	else if (S_ISREG(inode->i_mode))
-- 
2.24.0



* [PATCH v6 07/28] btrfs: disable fallocate in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (5 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 06/28] btrfs: disallow NODATACOW " Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:26   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 08/28] btrfs: implement log-structured superblock for " Naohiro Aota
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

fallocate() is implemented by reserving an actual extent instead of a
reservation. This can result in exposing the sequential write constraint
of host-managed zoned block devices to the application, which would break
the POSIX semantics for the fallocated file.  To avoid this, report
fallocate() as not supported when in HMZONED mode for now.

In the future, we may be able to implement "in-memory" fallocate() in
HMZONED mode by utilizing space_info->bytes_may_use or so.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0cb43b682789..22373d00428b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3170,6 +3170,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in HMZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
-- 
2.24.0



* [PATCH v6 08/28] btrfs: implement log-structured superblock for HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (6 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 07/28] btrfs: disable fallocate " Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:38   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary Naohiro Aota
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

The superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a sequential
write required zone, we cannot place a superblock in such a zone. One easy
solution is limiting the superblock and its copies to conventional zones.
However, this method has two downsides. The first is a reduced number of
superblock copies: the location of the second copy of the superblock is at
256GB, which is in a sequential write required zone on typical devices on
the market today, so the number of superblocks would be limited to two.
The second downside is that we cannot support devices which have no
conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses two
zones as a circular buffer to write updated superblocks. Once the first
zone is filled up, writing continues into the second zone and the first one
is reset. We can determine the position of the latest superblock by reading
the write pointer information from a device.

The following zones are reserved as the circular buffer on HMZONED btrfs.

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone 1024 or the zone at 256GB, whichever is smaller,
  and the zone next to it

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |   9 ++
 fs/btrfs/disk-io.c     |  19 ++-
 fs/btrfs/hmzoned.c     | 276 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  40 ++++++
 fs/btrfs/scrub.c       |   3 +
 fs/btrfs/volumes.c     |  18 ++-
 6 files changed, 354 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6934a5b8708f..acfa0a9d3c5a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1519,6 +1519,7 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 static int exclude_super_stripes(struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
+	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
 	u64 bytenr;
 	u64 *logical;
 	int stripe_len;
@@ -1549,6 +1550,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache)
 			if (logical[nr] + stripe_len <= cache->start)
 				continue;
 
+			/* shouldn't have super stripes in sequential zones */
+			if (hmzoned) {
+				btrfs_err(fs_info,
+		"sequential allocation bg %llu should not have super blocks",
+					  cache->start);
+				return -EUCLEAN;
+			}
+
 			start = logical[nr];
 			if (start < cache->start) {
 				start = cache->start;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff418e393f82..deca9fd70771 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3386,8 +3386,12 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
 	struct buffer_head *bh;
 	struct btrfs_super_block *super;
 	u64 bytenr;
+	u64 bytenr_orig;
+
+	bytenr_orig = btrfs_sb_offset(copy_num);
+	if (btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr))
+		return -EUCLEAN;
 
-	bytenr = btrfs_sb_offset(copy_num);
 	if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
@@ -3400,7 +3404,7 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
 		return -EIO;
 
 	super = (struct btrfs_super_block *)bh->b_data;
-	if (btrfs_super_bytenr(super) != bytenr ||
+	if (btrfs_super_bytenr(super) != bytenr_orig ||
 		    btrfs_super_magic(super) != BTRFS_MAGIC) {
 		brelse(bh);
 		return -EINVAL;
@@ -3466,7 +3470,7 @@ static int write_dev_supers(struct btrfs_device *device,
 	int i;
 	int ret;
 	int errors = 0;
-	u64 bytenr;
+	u64 bytenr, bytenr_orig;
 	int op_flags;
 
 	if (max_mirrors == 0)
@@ -3475,12 +3479,13 @@ static int write_dev_supers(struct btrfs_device *device,
 	shash->tfm = fs_info->csum_shash;
 
 	for (i = 0; i < max_mirrors; i++) {
-		bytenr = btrfs_sb_offset(i);
+		bytenr_orig = btrfs_sb_offset(i);
+		bytenr = btrfs_sb_log_location(device, i, WRITE);
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
 
-		btrfs_set_super_bytenr(sb, bytenr);
+		btrfs_set_super_bytenr(sb, bytenr_orig);
 
 		crypto_shash_init(shash);
 		crypto_shash_update(shash, (const char *)sb + BTRFS_CSUM_SIZE,
@@ -3518,6 +3523,8 @@ static int write_dev_supers(struct btrfs_device *device,
 		ret = btrfsic_submit_bh(REQ_OP_WRITE, op_flags, bh);
 		if (ret)
 			errors++;
+		else if (btrfs_advance_sb_log(device, i))
+			errors++;
 	}
 	return errors < i ? 0 : -1;
 }
@@ -3541,7 +3548,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 		max_mirrors = BTRFS_SUPER_MIRROR_MAX;
 
 	for (i = 0; i < max_mirrors; i++) {
-		bytenr = btrfs_sb_offset(i);
+		bytenr = btrfs_sb_log_location(device, i, READ);
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 21b8737dd289..a74011650145 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -16,6 +16,26 @@
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
 
+static int sb_write_pointer(struct blk_zone *zone, u64 *wp_ret);
+
+static inline u32 sb_zone_number(u64 zone_size, int mirror)
+{
+	ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
+
+	switch (mirror) {
+	case 0:
+		return 0;
+	case 1:
+		return 16;
+	case 2:
+		return min(btrfs_sb_offset(mirror) / zone_size, 1024ULL);
+	default:
+		BUG();
+	}
+
+	return 0;
+}
+
 static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
 			       struct blk_zone *zones, unsigned int *nr_zones)
 {
@@ -109,6 +129,39 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 		goto free_zones;
 	}
 
+	nr_zones = 2;
+	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+		u32 sb_zone = sb_zone_number(zone_info->zone_size, i);
+		u64 sb_wp;
+
+		if (sb_zone + 1 >= zone_info->nr_zones)
+			continue;
+
+		sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT);
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
+					  &zone_info->sb_zones[2 * i],
+					  &nr_zones);
+		if (ret)
+			goto free_zones;
+		if (nr_zones != 2) {
+			btrfs_err_in_rcu(device->fs_info,
+			"failed to read SB log zone info at device %s zone %u",
+					 rcu_str_deref(device->name), sb_zone);
+			ret = -EIO;
+			goto free_zones;
+		}
+
+		ret = sb_write_pointer(&zone_info->sb_zones[2 * i], &sb_wp);
+		if (ret != -ENOENT && ret) {
+			btrfs_err_in_rcu(device->fs_info,
+				"SB log zone corrupted: device %s zone %u",
+					 rcu_str_deref(device->name), sb_zone);
+			ret = -EUCLEAN;
+			goto free_zones;
+		}
+	}
+
+
 	kfree(zones);
 
 	device->zone_info = zone_info;
@@ -274,3 +327,226 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 
 	return 0;
 }
+
+static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
+{
+	bool empty[2];
+	bool full[2];
+	sector_t sector;
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	}
+
+	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
+	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
+	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
+	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
+
+	/*
+	 * Possible state of log buffer zones
+	 *
+	 *   E I F
+	 * E * x 0
+	 * I 0 x 0
+	 * F 1 1 x
+	 *
+	 * Row: zones[0]
+	 * Col: zones[1]
+	 * State:
+	 *   E: Empty, I: In-Use, F: Full
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* cannot determine which zone has the newer superblock */
+		return -EUCLEAN;
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
+			       u64 *bytenr_ret)
+{
+	struct blk_zone zones[2];
+	unsigned int nr_zones_rep = 2;
+	unsigned int zone_sectors;
+	u32 sb_zone;
+	int ret;
+	u64 wp;
+	u64 zone_size;
+	u8 zone_sectors_shift;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	u32 nr_zones;
+
+	if (!bdev_is_zoned(bdev)) {
+		*bytenr_ret = btrfs_sb_offset(mirror);
+		return 0;
+	}
+
+	ASSERT(rw == READ || rw == WRITE);
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	if (!is_power_of_2(zone_sectors))
+		return -EINVAL;
+	zone_size = zone_sectors << SECTOR_SHIFT;
+	zone_sectors_shift = ilog2(zone_sectors);
+	nr_zones = nr_sectors >> zone_sectors_shift;
+
+	sb_zone = sb_zone_number(zone_size, mirror);
+	if (sb_zone + 1 >= nr_zones)
+		return -ENOENT;
+
+	ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift, zones,
+				  &nr_zones_rep);
+	if (ret)
+		return ret;
+	if (nr_zones_rep != 2)
+		return -EIO;
+
+	ret = sb_write_pointer(zones, &wp);
+	if (ret != -ENOENT && ret)
+		return -EUCLEAN;
+
+	if (rw == READ && ret != -ENOENT) {
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
+		wp -= BTRFS_SUPER_INFO_SIZE;
+	}
+	*bytenr_ret = wp;
+
+	return 0;
+}
+
+u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u64 base, wp;
+	u32 zone_num;
+	int ret;
+
+	if (!zinfo)
+		return btrfs_sb_offset(mirror);
+
+	zone_num = sb_zone_number(zinfo->zone_size, mirror);
+	if (zone_num + 1 >= zinfo->nr_zones)
+		return U64_MAX - BTRFS_SUPER_INFO_SIZE;
+
+	base = (u64)zone_num << zinfo->zone_size_shift;
+	if (!test_bit(zone_num, zinfo->seq_zones))
+		return base;
+
+	/* sb_zones should be kept valid during runtime */
+	ret = sb_write_pointer(&zinfo->sb_zones[2 * mirror], &wp);
+	if (ret != -ENOENT && ret)
+		return U64_MAX - BTRFS_SUPER_INFO_SIZE;
+	if (rw == WRITE || ret == -ENOENT)
+		return wp;
+	if (wp == base)
+		wp = base + zinfo->zone_size * 2;
+	return wp - BTRFS_SUPER_INFO_SIZE;
+}
+
+static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
+				  int mirror)
+{
+	u32 zone_num;
+
+	if (!zinfo)
+		return false;
+
+	zone_num = sb_zone_number(zinfo->zone_size, mirror);
+	if (zone_num + 1 >= zinfo->nr_zones)
+		return false;
+
+	if (!test_bit(zone_num, zinfo->seq_zones))
+		return false;
+
+	return true;
+}
+
+int btrfs_advance_sb_log(struct btrfs_device *device, int mirror)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	struct blk_zone *zone;
+	struct blk_zone *reset = NULL;
+	int ret;
+
+	if (!is_sb_log_zone(zinfo, mirror))
+		return 0;
+
+	zone = &zinfo->sb_zones[2 * mirror];
+	if (zone->cond != BLK_ZONE_COND_FULL) {
+		if (zone->cond == BLK_ZONE_COND_EMPTY)
+			zone->cond = BLK_ZONE_COND_IMP_OPEN;
+		zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT);
+		if (zone->wp == zone->start + zone->len) {
+			zone->cond = BLK_ZONE_COND_FULL;
+			reset = zone + 1;
+			goto reset;
+		}
+		return 0;
+	}
+
+	zone++;
+	ASSERT(zone->cond != BLK_ZONE_COND_FULL);
+	if (zone->cond == BLK_ZONE_COND_EMPTY)
+		zone->cond = BLK_ZONE_COND_IMP_OPEN;
+	zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT);
+	if (zone->wp == zone->start + zone->len) {
+		zone->cond = BLK_ZONE_COND_FULL;
+		reset = zone - 1;
+	}
+
+reset:
+	if (!reset || reset->cond == BLK_ZONE_COND_EMPTY)
+		return 0;
+
+	ASSERT(reset->cond == BLK_ZONE_COND_FULL);
+
+	ret = blkdev_reset_zones(device->bdev, reset->start, reset->len,
+				 GFP_NOFS);
+	if (!ret) {
+		reset->cond = BLK_ZONE_COND_EMPTY;
+		reset->wp = reset->start;
+	}
+	return ret;
+}
+
+int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
+{
+	sector_t zone_sectors;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	u8 zone_sectors_shift;
+	u32 sb_zone;
+	u32 nr_zones;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	zone_sectors_shift = ilog2(zone_sectors);
+	nr_zones = nr_sectors >> zone_sectors_shift;
+
+	sb_zone = sb_zone_number(zone_sectors << SECTOR_SHIFT, mirror);
+	if (sb_zone + 1 >= nr_zones)
+		return -ENOENT;
+
+	return blkdev_reset_zones(bdev,
+				  sb_zone << zone_sectors_shift,
+				  zone_sectors * 2,
+				  GFP_NOFS);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index d9ebe11afdf5..55041a26ae3c 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -10,6 +10,8 @@
 #define BTRFS_HMZONED_H
 
 #include <linux/blkdev.h>
+#include "volumes.h"
+#include "disk-io.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -21,6 +23,7 @@ struct btrfs_zoned_device_info {
 	u32 nr_zones;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;
+	struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
 };
 
 #ifdef CONFIG_BLK_DEV_ZONED
@@ -30,6 +33,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
+int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
+			       u64 *bytenr_ret);
+u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw);
+int btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
+int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -53,6 +61,27 @@ static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 {
 	return 0;
 }
+static inline int btrfs_sb_log_location_bdev(struct block_device *bdev,
+					     int mirror, int rw,
+					     u64 *bytenr_ret)
+{
+	*bytenr_ret = btrfs_sb_offset(mirror);
+	return 0;
+}
+static inline u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror,
+					int rw)
+{
+	return btrfs_sb_offset(mirror);
+}
+static inline int btrfs_advance_sb_log(struct btrfs_device *device, int mirror)
+{
+	return 0;
+}
+static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
+					   int mirror)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -120,4 +149,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
 	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
 }
 
+static inline bool btrfs_check_super_location(struct btrfs_device *device,
+					      u64 pos)
+{
+	/*
+	 * On a non-zoned device, any address is OK. On a zoned device,
+	 * only non-SEQUENTIAL WRITE REQUIRED zones can hold a super block.
+	 */
+	return device->zone_info == NULL ||
+		!btrfs_dev_is_sequential(device, pos);
+}
+
 #endif
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 21de630b0730..af7cec962619 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -19,6 +19,7 @@
 #include "rcu-string.h"
 #include "raid56.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -3709,6 +3710,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >
 		    scrub_dev->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(scrub_dev, bytenr))
+			continue;
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ab3590b310af..a260648cecca 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1218,12 +1218,17 @@ static void btrfs_release_disk_super(struct page *page)
 	put_page(page);
 }
 
-static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
+static int btrfs_read_disk_super(struct block_device *bdev, int mirror,
 				 struct page **page,
 				 struct btrfs_super_block **disk_super)
 {
 	void *p;
 	pgoff_t index;
+	u64 bytenr;
+	u64 bytenr_orig = btrfs_sb_offset(mirror);
+
+	if (btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr))
+		return 1;
 
 	/* make sure our super fits in the device */
 	if (bytenr + PAGE_SIZE >= i_size_read(bdev->bd_inode))
@@ -1250,7 +1255,7 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr,
 	/* align our pointer to the offset of the super block */
 	*disk_super = p + offset_in_page(bytenr);
 
-	if (btrfs_super_bytenr(*disk_super) != bytenr ||
+	if (btrfs_super_bytenr(*disk_super) != bytenr_orig ||
 	    btrfs_super_magic(*disk_super) != BTRFS_MAGIC) {
 		btrfs_release_disk_super(*page);
 		return 1;
@@ -1287,7 +1292,6 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	struct btrfs_device *device = NULL;
 	struct block_device *bdev;
 	struct page *page;
-	u64 bytenr;
 
 	lockdep_assert_held(&uuid_mutex);
 
@@ -1297,14 +1301,13 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags,
 	 * So, we need to add a special mount option to scan for
 	 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 	 */
-	bytenr = btrfs_sb_offset(0);
 	flags |= FMODE_EXCL;
 
 	bdev = blkdev_get_by_path(path, flags, holder);
 	if (IS_ERR(bdev))
 		return ERR_CAST(bdev);
 
-	if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super)) {
+	if (btrfs_read_disk_super(bdev, 0, &page, &disk_super)) {
 		device = ERR_PTR(-EINVAL);
 		goto error_bdev_put;
 	}
@@ -7371,6 +7374,11 @@ void btrfs_scratch_superblocks(struct block_device *bdev, const char *device_pat
 		if (btrfs_read_dev_one_super(bdev, copy_num, &bh))
 			continue;
 
+		if (bdev_is_zoned(bdev)) {
+			btrfs_reset_sb_log_zones(bdev, copy_num);
+			continue;
+		}
+
 		disk_super = (struct btrfs_super_block *)bh->b_data;
 
 		memset(&disk_super->magic, 0, sizeof(disk_super->magic));
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (7 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 08/28] btrfs: implement log-structured superblock for " Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-13 16:52   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

In HMZONED mode, align the device extents to zone boundaries so that a zone
reset affects only the device extent and does not change the state of
blocks in the neighboring device extents. Also, check that an allocated
region covers only empty zones and does not overlap any of the super block
zone locations.

This patch also adds a verification in verify_one_dev_extent() to check
that the device extent is aligned to the zone boundary.
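
The following sketch (illustration only, not part of the patch) shows the
zone-boundary arithmetic this description relies on; the helper names and
the power-of-two zone size are assumptions made for the example. A candidate
device offset is rounded up to the next zone boundary, and a device extent
passes the verification only if both its start and its length sit on zone
boundaries.

#include <stdint.h>
#include <stdbool.h>

/* Round a candidate device offset up to the next zone boundary. */
static uint64_t zone_align_up(uint64_t pos, uint64_t zone_size)
{
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/* A device extent is acceptable only if start and length are zone aligned. */
static bool dev_extent_zone_aligned(uint64_t start, uint64_t len,
				    uint64_t zone_size)
{
	return (start & (zone_size - 1)) == 0 && (len & (zone_size - 1)) == 0;
}

/* Example: with 256 MiB zones, an offset of 260 MiB rounds up to 512 MiB. */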

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 55 ++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h | 15 +++++++++
 fs/btrfs/volumes.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 148 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index a74011650145..6263c8aee082 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -12,6 +12,7 @@
 #include "volumes.h"
 #include "hmzoned.h"
 #include "rcu-string.h"
+#include "disk-io.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -550,3 +551,57 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 				  zone_sectors * 2,
 				  GFP_NOFS);
 }
+
+/*
+ * btrfs_check_allocatable_zones - check if the specified region is
+ *                                 suitable for allocation
+ * @device:	the device to allocate a region
+ * @pos:	the position of the region
+ * @num_bytes:	the size of the region
+ *
+ * In non-ZONED device, anywhere is suitable for allocation. In ZONED
+ * device, check if
+ * 1) the region is not on non-empty sequential zones,
+ * 2) all zones in the region have the same zone type,
+ * 3) it does not contain super block location.
+ */
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u64 nzones, begin, end;
+	u64 sb_pos;
+	u8 shift;
+	int i;
+
+	if (!zinfo)
+		return true;
+
+	shift = zinfo->zone_size_shift;
+	nzones = num_bytes >> shift;
+	begin = pos >> shift;
+	end = begin + nzones;
+
+	ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	if (end > zinfo->nr_zones)
+		return false;
+
+	/* check if zones in the region are all empty */
+	if (btrfs_dev_is_sequential(device, pos) &&
+	    find_next_zero_bit(zinfo->empty_zones, end, begin) != end)
+		return false;
+
+	if (btrfs_dev_is_sequential(device, pos)) {
+		for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+			sb_pos = sb_zone_number(zinfo->zone_size, i);
+			if (!(end < sb_pos || sb_pos + 1 < begin))
+				return false;
+		}
+
+		return find_next_zero_bit(zinfo->seq_zones, end, begin) == end;
+	}
+
+	return find_next_bit(zinfo->seq_zones, end, begin) == end;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 55041a26ae3c..d54b4ae8cf8b 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -38,6 +38,8 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
 u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw);
 int btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
 int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
+bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
+				   u64 num_bytes);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -82,6 +84,11 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
 {
 	return 0;
 }
+static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device,
+						 u64 pos, u64 num_bytes)
+{
+	return true;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -160,4 +167,12 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
 		!btrfs_dev_is_sequential(device, pos);
 }
 
+static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
+{
+	if (!device->zone_info)
+		return pos;
+
+	return ALIGN(pos, device->zone_info->zone_size);
+}
+
 #endif
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a260648cecca..d5b280b59733 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1393,6 +1393,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	u64 max_hole_size;
 	u64 extent_end;
 	u64 search_end = device->total_bytes;
+	u64 zone_size = 0;
 	int ret;
 	int slot;
 	struct extent_buffer *l;
@@ -1403,6 +1404,15 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	 * at an offset of at least 1MB.
 	 */
 	search_start = max_t(u64, search_start, SZ_1M);
+	/*
+	 * For a zoned block device, skip the first zone of the device
+	 * entirely.
+	 */
+	if (device->zone_info) {
+		zone_size = device->zone_info->zone_size;
+		search_start = max_t(u64, search_start, zone_size);
+		search_start = btrfs_zone_align(device, search_start);
+	}
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -1467,12 +1477,21 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 			 */
 			if (contains_pending_extent(device, &search_start,
 						    hole_size)) {
+				search_start = btrfs_zone_align(device,
+								search_start);
 				if (key.offset >= search_start)
 					hole_size = key.offset - search_start;
 				else
 					hole_size = 0;
 			}
 
+			if (!btrfs_check_allocatable_zones(device, search_start,
+							   num_bytes)) {
+				search_start += zone_size;
+				btrfs_release_path(path);
+				goto again;
+			}
+
 			if (hole_size > max_hole_size) {
 				max_hole_start = search_start;
 				max_hole_size = hole_size;
@@ -1512,6 +1531,14 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 		hole_size = search_end - search_start;
 
 		if (contains_pending_extent(device, &search_start, hole_size)) {
+			search_start = btrfs_zone_align(device, search_start);
+			btrfs_release_path(path);
+			goto again;
+		}
+
+		if (!btrfs_check_allocatable_zones(device, search_start,
+						   num_bytes)) {
+			search_start += zone_size;
 			btrfs_release_path(path);
 			goto again;
 		}
@@ -1529,6 +1556,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 		ret = 0;
 
 out:
+	ASSERT(zone_size == 0 || IS_ALIGNED(max_hole_start, zone_size));
 	btrfs_free_path(path);
 	*start = max_hole_start;
 	if (len)
@@ -4778,6 +4806,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	bool hmzoned = btrfs_fs_incompat(info, HMZONED);
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -4819,10 +4848,25 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		BUG();
 	}
 
+	if (hmzoned) {
+		max_stripe_size = info->zone_size;
+		max_chunk_size = round_down(max_chunk_size, info->zone_size);
+	}
+
 	/* We don't want a chunk larger than 10% of writable space */
 	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
 			     max_chunk_size);
 
+	if (hmzoned) {
+		int min_num_stripes = devs_min * dev_stripes;
+		int min_data_stripes = (min_num_stripes - nparity) / ncopies;
+		u64 min_chunk_size = min_data_stripes * info->zone_size;
+
+		max_chunk_size = max(round_down(max_chunk_size,
+						info->zone_size),
+				     min_chunk_size);
+	}
+
 	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
 			       GFP_NOFS);
 	if (!devices_info)
@@ -4857,6 +4901,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		if (total_avail == 0)
 			continue;
 
+		if (hmzoned && total_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		ret = find_free_dev_extent(device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);
@@ -4875,6 +4922,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
+		if (hmzoned && max_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		if (ndevs == fs_devices->rw_devices) {
 			WARN(1, "%s: found more than %llu devices\n",
 			     __func__, fs_devices->rw_devices);
@@ -4893,6 +4943,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+again:
 	/*
 	 * Round down to number of usable stripes, devs_increment can be any
 	 * number so we can't use round_down()
@@ -4934,6 +4985,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * we try to reduce stripe_size.
 	 */
 	if (stripe_size * data_stripes > max_chunk_size) {
+		if (hmzoned) {
+			/*
+			 * stripe_size is fixed in HMZONED. Reduce ndevs
+			 * instead.
+			 */
+			ASSERT(nparity == 0);
+			ndevs = div_u64(max_chunk_size * ncopies,
+					stripe_size * dev_stripes);
+			goto again;
+		}
+
 		/*
 		 * Reduce stripe_size, round it up to a 16MB boundary again and
 		 * then use it, unless it ends up being even bigger than the
@@ -4947,6 +5009,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	/* align to BTRFS_STRIPE_LEN */
 	stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
+	ASSERT(!hmzoned || stripe_size == info->zone_size);
+
 	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
 	if (!map) {
 		ret = -ENOMEM;
@@ -7541,6 +7605,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 		ret = -EUCLEAN;
 		goto out;
 	}
+
+	if (dev->zone_info) {
+		u64 zone_size = dev->zone_info->zone_size;
+
+		if (!IS_ALIGNED(physical_offset, zone_size) ||
+		    !IS_ALIGNED(physical_len, zone_size)) {
+			btrfs_err(fs_info,
+"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
+				  devid, physical_offset, physical_len);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
 out:
 	free_extent_map(em);
 	return ret;
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (8 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-17 19:19   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 11/28] btrfs: make unmirrored BGs readonly only if we have at least one writable BG Naohiro Aota
                   ` (19 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

On HMZONED drives, writes must always be sequential and directed at a block
group zone write pointer position. Thus, block allocation in a block group
must also be done sequentially using an allocation pointer equal to the
block group zone write pointer plus the number of blocks allocated but not
yet written.
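
As a rough sketch of that arithmetic (illustration only; the struct and
function names below are made up for the example and are not the patch's
code): the allocation pointer starts at the zone write pointer and only ever
advances, so a request either fits in the space remaining past the pointer
or the block group is skipped.

#include <stdint.h>

struct seq_bg_example {
	uint64_t start;		/* logical start of the block group */
	uint64_t length;	/* block group size */
	uint64_t alloc_offset;	/* write pointer + allocated-but-unwritten bytes */
};

/* Return the logical address of the allocation, or UINT64_MAX if it does not fit. */
static uint64_t seq_alloc_example(struct seq_bg_example *bg, uint64_t num_bytes)
{
	uint64_t found;

	if (bg->length - bg->alloc_offset < num_bytes)
		return UINT64_MAX;	/* caller moves on to another block group */

	found = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;	/* the pointer never moves backwards */
	return found;
}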

The sequential allocation function find_free_extent_zoned() bypasses the
checks in find_free_extent() and increases the reserved byte counter by
itself. Once a region is allocated by the sequential allocator it can never
be reverted, since reverting it might race with other allocations and leave
an allocation hole, which breaks the sequential write rule.

Furthermore, this commit introduces two new members to struct
btrfs_block_group. "wp_broken" indicates that the write pointer is broken
(e.g. not synced on a RAID1 block group) and marks that block group read
only. "zone_unusable" keeps track of the size of regions that were once
allocated and then freed in a block group. Such regions are never usable
until the underlying zones are reset.

This commit also introduces "bytes_zone_unusable" to track such unusable
bytes in a space_info. Pinned bytes are always reclaimed to
"bytes_zone_unusable". They are not usable until the zones are reset.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c      |  74 ++++--
 fs/btrfs/block-group.h      |  11 +
 fs/btrfs/extent-tree.c      |  80 +++++-
 fs/btrfs/free-space-cache.c |  38 +++
 fs/btrfs/free-space-cache.h |   2 +
 fs/btrfs/hmzoned.c          | 467 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h          |   8 +
 fs/btrfs/space-info.c       |  13 +-
 fs/btrfs/space-info.h       |   4 +-
 fs/btrfs/sysfs.c            |   2 +
 10 files changed, 672 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index acfa0a9d3c5a..5c04422f6f5a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -14,6 +14,7 @@
 #include "sysfs.h"
 #include "tree-log.h"
 #include "delalloc-space.h"
+#include "hmzoned.h"
 
 /*
  * Return target flags in extended format or 0 if restripe for this chunk_type
@@ -677,6 +678,9 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only
 	struct btrfs_caching_control *caching_ctl;
 	int ret = 0;
 
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
@@ -1048,12 +1052,15 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		WARN_ON(block_group->space_info->total_bytes
 			< block_group->length);
 		WARN_ON(block_group->space_info->bytes_readonly
-			< block_group->length);
+			< block_group->length - block_group->zone_unusable);
+		WARN_ON(block_group->space_info->bytes_zone_unusable
+			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->length * factor);
 	}
 	block_group->space_info->total_bytes -= block_group->length;
-	block_group->space_info->bytes_readonly -= block_group->length;
+	block_group->space_info->bytes_readonly -=
+		(block_group->length - block_group->zone_unusable);
 	block_group->space_info->disk_total -= block_group->length * factor;
 
 	spin_unlock(&block_group->space_info->lock);
@@ -1210,7 +1217,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	}
 
 	num_bytes = cache->length - cache->reserved - cache->pinned -
-		    cache->bytes_super - cache->used;
+		    cache->bytes_super - cache->zone_unusable - cache->used;
 	sinfo_used = btrfs_space_info_used(sinfo, true);
 
 	/*
@@ -1736,6 +1743,13 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 			goto error;
 	}
 
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_err(info, "failed to load zone info of bg %llu",
+			  cache->start);
+		goto error;
+	}
+
 	/*
 	 * We need to exclude the super stripes now so that the space info has
 	 * super bytes accounted for, otherwise we'll think we have more space
@@ -1766,6 +1780,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		btrfs_free_excluded_extents(cache);
 	}
 
+	btrfs_calc_zone_unusable(cache);
+
 	ret = btrfs_add_block_group_cache(info, cache);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
@@ -1773,7 +1789,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	}
 	trace_btrfs_add_block_group(info, cache, 0);
 	btrfs_update_space_info(info, cache->flags, key->offset,
-				cache->used, cache->bytes_super, &space_info);
+				cache->used, cache->bytes_super,
+				cache->zone_unusable, &space_info);
 
 	cache->space_info = space_info;
 
@@ -1786,6 +1803,10 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 		ASSERT(list_empty(&cache->bg_list));
 		btrfs_mark_bg_unused(cache);
 	}
+
+	if (cache->wp_broken)
+		inc_block_group_ro(cache, 1);
+
 	return 0;
 error:
 	btrfs_put_block_group(cache);
@@ -1924,6 +1945,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	cache->last_byte_to_unpin = (u64)-1;
 	cache->cached = BTRFS_CACHE_FINISHED;
 	cache->needs_free_space = 1;
+
+	ret = btrfs_load_block_group_zone_info(cache);
+	if (ret) {
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	ret = exclude_super_stripes(cache);
 	if (ret) {
 		/* We may have excluded something, so call this just in case */
@@ -1965,7 +1993,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	 */
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
-				cache->bytes_super, &cache->space_info);
+				cache->bytes_super, 0, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
@@ -2121,7 +2149,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache)
 	spin_lock(&cache->lock);
 	if (!--cache->ro) {
 		num_bytes = cache->length - cache->reserved -
-			    cache->pinned - cache->bytes_super - cache->used;
+			    cache->pinned - cache->bytes_super -
+			    cache->zone_unusable - cache->used;
 		sinfo->bytes_readonly -= num_bytes;
 		list_del_init(&cache->ro_list);
 	}
@@ -2760,6 +2789,21 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+void __btrfs_add_reserved_bytes(struct btrfs_block_group *cache, u64 ram_bytes,
+				u64 num_bytes, int delalloc)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+
+	cache->reserved += num_bytes;
+	space_info->bytes_reserved += num_bytes;
+	trace_btrfs_space_reservation(cache->fs_info, "space_info",
+				      space_info->flags, num_bytes, 1);
+	btrfs_space_info_update_bytes_may_use(cache->fs_info, space_info,
+					      -ram_bytes);
+	if (delalloc)
+		cache->delalloc_bytes += num_bytes;
+}
+
 /**
  * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
@@ -2778,20 +2822,16 @@ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
 
+	/* should be handled by find_free_extent_zoned() */
+	ASSERT(!btrfs_fs_incompat(cache->fs_info, HMZONED));
+
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
-	if (cache->ro) {
+	if (cache->ro)
 		ret = -EAGAIN;
-	} else {
-		cache->reserved += num_bytes;
-		space_info->bytes_reserved += num_bytes;
-		trace_btrfs_space_reservation(cache->fs_info, "space_info",
-					      space_info->flags, num_bytes, 1);
-		btrfs_space_info_update_bytes_may_use(cache->fs_info,
-						      space_info, -ram_bytes);
-		if (delalloc)
-			cache->delalloc_bytes += num_bytes;
-	}
+	else
+		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
+					   delalloc);
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 9b409676c4b2..347605654021 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -82,6 +82,7 @@ struct btrfs_block_group {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int wp_broken:1;
 
 	int disk_cache_state;
 
@@ -156,6 +157,14 @@ struct btrfs_block_group {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	u64 zone_unusable;
+	/*
+	 * Allocation offset for the block group to implement
+	 * sequential allocation. This is used only with HMZONED mode
+	 * enabled.
+	 */
+	u64 alloc_offset;
 };
 
 #ifdef CONFIG_BTRFS_DEBUG
@@ -216,6 +225,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
 			     u64 bytenr, u64 num_bytes, int alloc);
 int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
 			     u64 ram_bytes, u64 num_bytes, int delalloc);
+void __btrfs_add_reserved_bytes(struct btrfs_block_group *cache, u64 ram_bytes,
+				u64 num_bytes, int delalloc);
 void btrfs_free_reserved_bytes(struct btrfs_block_group *cache,
 			       u64 num_bytes, int delalloc);
 int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 153f71a5bba9..3781a3778696 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -32,6 +32,8 @@
 #include "block-rsv.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "rcu-string.h"
+#include "hmzoned.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2824,9 +2826,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 			cache = btrfs_lookup_block_group(fs_info, start);
 			BUG_ON(!cache); /* Logic error */
 
-			cluster = fetch_cluster_info(fs_info,
-						     cache->space_info,
-						     &empty_cluster);
+			if (!btrfs_fs_incompat(fs_info, HMZONED))
+				cluster = fetch_cluster_info(fs_info,
+							     cache->space_info,
+							     &empty_cluster);
+
 			empty_cluster <<= 1;
 		}
 
@@ -2863,7 +2867,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		space_info->max_extent_size = 0;
 		percpu_counter_add_batch(&space_info->total_bytes_pinned,
 			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
-		if (cache->ro) {
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			/* needs a zone reset before reuse in a zoned block group */
+			space_info->bytes_zone_unusable += len;
+			readonly = true;
+		} else if (cache->ro) {
 			space_info->bytes_readonly += len;
 			readonly = true;
 		}
@@ -3657,6 +3665,57 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
 	return 0;
 }
 
+/*
+ * Simple allocator for sequential only block group. It only allows
+ * sequential allocation. No need to play with trees. This function
+ * also reserve the bytes as in btrfs_add_reserved_bytes.
+ */
+
+static int find_free_extent_zoned(struct btrfs_block_group *cache,
+				  struct find_free_extent_ctl *ffe_ctl)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
+	u64 start = cache->start;
+	u64 num_bytes = ffe_ctl->num_bytes;
+	u64 avail;
+	int ret = 0;
+
+	ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED));
+
+	spin_lock(&space_info->lock);
+	spin_lock(&cache->lock);
+
+	if (cache->ro) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	avail = cache->length - cache->alloc_offset;
+	if (avail < num_bytes) {
+		ffe_ctl->max_extent_size = avail;
+		ret = 1;
+		goto out;
+	}
+
+	ffe_ctl->found_offset = start + cache->alloc_offset;
+	cache->alloc_offset += num_bytes;
+	spin_lock(&ctl->tree_lock);
+	ctl->free_space -= num_bytes;
+	spin_unlock(&ctl->tree_lock);
+
+	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
+			  cache->fs_info->stripesize));
+	ffe_ctl->search_start = ffe_ctl->found_offset;
+	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
+				   ffe_ctl->delalloc);
+
+out:
+	spin_unlock(&cache->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
+
 /*
  * Return >0 means caller needs to re-search for free extent
  * Return 0 means we have the needed free extent.
@@ -3803,6 +3862,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_group *block_group = NULL;
 	struct find_free_extent_ctl ffe_ctl = {0};
 	struct btrfs_space_info *space_info;
+	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
 	bool use_cluster = true;
 	bool full_search = false;
 
@@ -3965,6 +4025,17 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
 			goto loop;
 
+		if (hmzoned) {
+			ret = find_free_extent_zoned(block_group, &ffe_ctl);
+			if (ret)
+				goto loop;
+			/*
+			 * find_free_extent_zoned() should ensure that
+			 * everything is OK and reserve the extent.
+			 */
+			goto nocheck;
+		}
+
 		/*
 		 * Ok we want to try and use the cluster allocator, so
 		 * lets look there
@@ -4020,6 +4091,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 					     num_bytes);
 			goto loop;
 		}
+nocheck:
 		btrfs_inc_block_group_reservations(block_group);
 
 		/* we are all good, lets return */
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 3283da419200..e068325fcfc0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2336,6 +2336,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	struct btrfs_free_space *info;
 	int ret = 0;
 
+	ASSERT(!btrfs_fs_incompat(fs_info, HMZONED));
+
 	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
 	if (!info)
 		return -ENOMEM;
@@ -2384,9 +2386,36 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+int __btrfs_add_free_space_seq(struct btrfs_block_group *block_group,
+			       u64 bytenr, u64 size)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 offset = bytenr - block_group->start;
+	u64 to_free, to_unusable;
+
+	spin_lock(&ctl->tree_lock);
+	if (block_group->wp_broken)
+		to_free = 0;
+	else if (offset >= block_group->alloc_offset)
+		to_free = size;
+	else if (offset + size <= block_group->alloc_offset)
+		to_free = 0;
+	else
+		to_free = offset + size - block_group->alloc_offset;
+	to_unusable = size - to_free;
+
+	ctl->free_space += to_free;
+	block_group->zone_unusable += to_unusable;
+	spin_unlock(&ctl->tree_lock);
+	return 0;
+}
+
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size)
 {
+	if (btrfs_fs_incompat(block_group->fs_info, HMZONED))
+		return __btrfs_add_free_space_seq(block_group, bytenr, size);
+
 	return __btrfs_add_free_space(block_group->fs_info,
 				      block_group->free_space_ctl,
 				      bytenr, size);
@@ -2400,6 +2429,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group,
 	int ret;
 	bool re_search = false;
 
+	if (btrfs_fs_incompat(block_group->fs_info, HMZONED))
+		return 0;
+
 	spin_lock(&ctl->tree_lock);
 
 again:
@@ -2635,6 +2667,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group,
 	u64 align_gap = 0;
 	u64 align_gap_len = 0;
 
+	ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED));
+
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
 				block_group->full_stripe_len, max_extent_size);
@@ -2754,6 +2788,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group,
 	struct rb_node *node;
 	u64 ret = 0;
 
+	ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED));
+
 	spin_lock(&cluster->lock);
 	if (bytes > cluster->max_size)
 		goto out;
@@ -3401,6 +3437,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group,
 {
 	int ret;
 
+	ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED));
+
 	*trimmed = 0;
 
 	spin_lock(&block_group->lock);
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index ba9a23241101..0d3812bbb793 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -84,6 +84,8 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group);
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
 			   u64 bytenr, u64 size);
+int __btrfs_add_free_space_seq(struct btrfs_block_group *block_group,
+			       u64 bytenr, u64 size);
 int btrfs_add_free_space(struct btrfs_block_group *block_group,
 			 u64 bytenr, u64 size);
 int btrfs_remove_free_space(struct btrfs_block_group *block_group,
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 6263c8aee082..b067fa84b9a1 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -8,14 +8,21 @@
 
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/sched/mm.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "hmzoned.h"
 #include "rcu-string.h"
 #include "disk-io.h"
+#include "block-group.h"
+#include "locking.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
+/* Invalid allocation pointer value for missing devices */
+#define WP_MISSING_DEV ((u64)-1)
+/* Pseudo write pointer value for conventional zone */
+#define WP_CONVENTIONAL ((u64)-2)
 
 static int sb_write_pointer(struct blk_zone *zone, u64 *wp_ret);
 
@@ -605,3 +612,463 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 
 	return find_next_bit(zinfo->seq_zones, end, begin) == end;
 }
+
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
+{
+	u64 unusable, free;
+
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED))
+		return;
+
+	WARN_ON(cache->bytes_super != 0);
+	if (!cache->wp_broken) {
+		unusable = cache->alloc_offset - cache->used;
+		free = cache->length - cache->alloc_offset;
+	} else {
+		unusable = cache->length - cache->used;
+		free = 0;
+	}
+	/* we only need ->free_space in ALLOC_SEQ BGs */
+	cache->last_byte_to_unpin = (u64)-1;
+	cache->cached = BTRFS_CACHE_FINISHED;
+	cache->free_space_ctl->free_space = free;
+	cache->zone_unusable = unusable;
+	/*
+	 * Should not have any excluded extents. Just
+	 * in case, though.
+	 */
+	btrfs_free_excluded_extents(cache);
+}
+
+static int emulate_write_pointer(struct btrfs_block_group *cache,
+				 u64 *offset_ret)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_path *path;
+	struct extent_buffer *leaf;
+	struct btrfs_key search_key;
+	struct btrfs_key found_key;
+	int slot;
+	int ret;
+	u64 length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	search_key.objectid = cache->start + cache->length;
+	search_key.type = 0;
+	search_key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	ASSERT(ret != 0);
+	slot = path->slots[0];
+	leaf = path->nodes[0];
+	ASSERT(slot != 0);
+	slot--;
+	btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+	if (found_key.objectid < cache->start) {
+		*offset_ret = 0;
+	} else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+		struct btrfs_key extent_item_key;
+
+		if (found_key.objectid != cache->start) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+
+		length = 0;
+
+		/* metadata may have METADATA_ITEM_KEY */
+		if (slot == 0) {
+			btrfs_set_path_blocking(path);
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0)
+				goto out;
+			if (ret == 0) {
+				slot = btrfs_header_nritems(leaf) - 1;
+				btrfs_item_key_to_cpu(leaf, &extent_item_key,
+						      slot);
+			}
+		} else {
+			btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1);
+			ret = 0;
+		}
+
+		if (ret == 0 &&
+		    extent_item_key.objectid == cache->start) {
+			if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY)
+				length = fs_info->nodesize;
+			else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY)
+				length = extent_item_key.offset;
+			else {
+				ret = -EUCLEAN;
+				goto out;
+			}
+		}
+
+		*offset_ret = length;
+	} else if (found_key.type == BTRFS_EXTENT_ITEM_KEY ||
+		   found_key.type == BTRFS_METADATA_ITEM_KEY) {
+
+		if (found_key.type == BTRFS_EXTENT_ITEM_KEY)
+			length = found_key.offset;
+		else
+			length = fs_info->nodesize;
+
+		if (!(found_key.objectid >= cache->start &&
+		       found_key.objectid + length <=
+		       cache->start + cache->length)) {
+			ret = -EUCLEAN;
+			goto out;
+		}
+		*offset_ret = found_key.objectid + length - cache->start;
+	} else {
+		ret = -EUCLEAN;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static u64 offset_in_dev_extent(struct map_lookup *map, u64 *alloc_offsets,
+				u64 logical, int idx)
+{
+	u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
+	u64 stripe_nr = logical / map->stripe_len;
+	u64 full_stripes_cnt;
+	u32 rest_stripes_cnt;
+	u64 stripe_start, offset;
+	int data_stripes = map->num_stripes / map->sub_stripes;
+	int stripe_idx;
+	int i;
+
+	ASSERT(profile == BTRFS_BLOCK_GROUP_RAID0 ||
+	       profile == BTRFS_BLOCK_GROUP_RAID10);
+
+	full_stripes_cnt = div_u64_rem(stripe_nr, data_stripes,
+				       &rest_stripes_cnt);
+	stripe_idx = idx / map->sub_stripes;
+
+	if (stripe_idx < rest_stripes_cnt)
+		return map->stripe_len * (full_stripes_cnt + 1);
+
+	for (i = idx + map->sub_stripes; i < map->num_stripes;
+	     i += map->sub_stripes) {
+		if (alloc_offsets[i] != WP_CONVENTIONAL &&
+		    alloc_offsets[i] > map->stripe_len * full_stripes_cnt)
+			return map->stripe_len * (full_stripes_cnt + 1);
+	}
+
+	stripe_start = (full_stripes_cnt * data_stripes + stripe_idx) *
+		map->stripe_len;
+	if (stripe_start >= logical)
+		return full_stripes_cnt * map->stripe_len;
+	offset = min_t(u64, logical - stripe_start, map->stripe_len);
+
+	return full_stripes_cnt * map->stripe_len + offset;
+}
+
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_device *device;
+	u64 logical = cache->start;
+	u64 length = cache->length;
+	u64 physical = 0;
+	int ret;
+	int i, j;
+	unsigned int nofs_flag;
+	u64 *alloc_offsets = NULL;
+	u64 emulated_offset = 0;
+	u32 num_sequential = 0, num_conventional = 0;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
+			  logical, length);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, logical, length);
+	read_unlock(&em_tree->lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = em->map_lookup;
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
+				GFP_NOFS);
+	if (!alloc_offsets) {
+		free_extent_map(em);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		bool is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->bdev == NULL) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (is_sequential)
+			num_sequential++;
+		else
+			num_conventional++;
+
+		if (!is_sequential) {
+			alloc_offsets[i] = WP_CONVENTIONAL;
+			continue;
+		}
+
+		/*
+		 * This zone will be used for allocation, so mark this
+		 * zone non-empty.
+		 */
+		btrfs_dev_clear_zone_empty(device, physical);
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+		nofs_flag = memalloc_nofs_save();
+		ret = btrfs_get_dev_zone(device, physical, &zone);
+		memalloc_nofs_restore(nofs_flag);
+		if (ret == -EIO || ret == -EOPNOTSUPP) {
+			ret = 0;
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		} else if (ret) {
+			goto out;
+		}
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			btrfs_err(
+				fs_info, "Offline/readonly zone %llu",
+				physical >> device->zone_info->zone_size_shift);
+			alloc_offsets[i] = WP_MISSING_DEV;
+			break;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = fs_info->zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] =
+				((zone.wp - zone.start) << SECTOR_SHIFT);
+			break;
+		}
+	}
+
+	if (num_conventional > 0) {
+		ret = emulate_write_pointer(cache, &emulated_offset);
+		if (ret || map->num_stripes == num_conventional) {
+			if (!ret)
+				cache->alloc_offset = emulated_offset;
+			goto out;
+		}
+	}
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+		cache->alloc_offset = WP_MISSING_DEV;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV ||
+			    alloc_offsets[i] == WP_CONVENTIONAL)
+				continue;
+			if (cache->alloc_offset == WP_MISSING_DEV)
+				cache->alloc_offset = alloc_offsets[i];
+			if (alloc_offsets[i] == cache->alloc_offset)
+				continue;
+
+			cache->wp_broken = 1;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID0:
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV) {
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[i] == WP_CONVENTIONAL)
+				alloc_offsets[i] =
+					offset_in_dev_extent(map, alloc_offsets,
+							     emulated_offset,
+							     i);
+
+			/* sanity check */
+			if (i > 0) {
+				if ((alloc_offsets[i] % BTRFS_STRIPE_LEN != 0 &&
+				     alloc_offsets[i - 1] %
+					     BTRFS_STRIPE_LEN != 0) ||
+				    (alloc_offsets[i - 1] < alloc_offsets[i]) ||
+				    (alloc_offsets[i - 1] - alloc_offsets[i] >
+						BTRFS_STRIPE_LEN)) {
+					cache->wp_broken = 1;
+					continue;
+				}
+			}
+
+			cache->alloc_offset += alloc_offsets[i];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+		/*
+		 * Pass1: check write pointer of RAID1 level: each pointer
+		 * should be equal.
+		 */
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+			u64 offset = WP_MISSING_DEV;
+			int fill = 0, num_conventional = 0;
+
+			for (j = 0; j < map->sub_stripes; j++) {
+				if (alloc_offsets[base+j] == WP_MISSING_DEV) {
+					fill++;
+					continue;
+				}
+				if (alloc_offsets[base+j] == WP_CONVENTIONAL) {
+					fill++;
+					num_conventional++;
+					continue;
+				}
+				if (offset == WP_MISSING_DEV)
+					offset = alloc_offsets[base+j];
+				if (alloc_offsets[base + j] == offset)
+					continue;
+
+				cache->wp_broken = 1;
+				goto out;
+			}
+			if (!fill)
+				continue;
+			/* this RAID0 stripe is free on conventional zones */
+			if (num_conventional == map->sub_stripes)
+				offset = WP_CONVENTIONAL;
+			/* fill WP_MISSING_DEV or WP_CONVENTIONAL */
+			for (j = 0; j < map->sub_stripes; j++)
+				alloc_offsets[base + j] = offset;
+		}
+
+		/* Pass2: check write pointer of RAID0 level */
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i * map->sub_stripes;
+
+			if (alloc_offsets[base] == WP_MISSING_DEV) {
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[base] == WP_CONVENTIONAL)
+				alloc_offsets[base] =
+					offset_in_dev_extent(map, alloc_offsets,
+							     emulated_offset,
+							     base);
+
+			/* sanity check */
+			if (i > 0) {
+				int prev = base - map->sub_stripes;
+
+				if ((alloc_offsets[base] %
+					     BTRFS_STRIPE_LEN != 0 &&
+				     alloc_offsets[prev] %
+					     BTRFS_STRIPE_LEN != 0) ||
+				    (alloc_offsets[prev] <
+					     alloc_offsets[base]) ||
+				    (alloc_offsets[prev] - alloc_offsets[base] >
+						BTRFS_STRIPE_LEN)) {
+					cache->wp_broken = 1;
+					continue;
+				}
+			}
+
+			cache->alloc_offset += alloc_offsets[base];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* RAID5/6 is not supported yet */
+	default:
+		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
+			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	/* an extent is allocated after the write pointer */
+	if (num_conventional && emulated_offset > cache->alloc_offset) {
+		btrfs_err(fs_info,
+			  "got wrong write pointer in BG %llu: %llu > %llu",
+			  logical, emulated_offset, cache->alloc_offset);
+		cache->wp_broken = 1;
+		ret = -EIO;
+	}
+
+	if (cache->wp_broken) {
+		char buf[128] = {'\0'};
+
+		btrfs_describe_block_groups(cache->flags, buf, sizeof(buf));
+		btrfs_err(fs_info, "broken write pointer: block group %llu %s",
+			  logical, buf);
+		for (i = 0; i < map->num_stripes; i++) {
+			char *note;
+
+			device = map->stripes[i].dev;
+			physical = map->stripes[i].physical;
+
+			if (device->bdev == NULL)
+				note = " (missing)";
+			else if (!btrfs_dev_is_sequential(device, physical))
+				note = " (conventional)";
+			else
+				note = "";
+
+			btrfs_err_in_rcu(fs_info,
+		"stripe %d dev %s physical %llu write_pointer[i] = %llu%s",
+					 i, rcu_str_deref(device->name),
+					 physical, alloc_offsets[i], note);
+		}
+	}
+
+	kfree(alloc_offsets);
+	free_extent_map(em);
+
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index d54b4ae8cf8b..4ed985d027cc 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -40,6 +40,8 @@ int btrfs_advance_sb_log(struct btrfs_device *device, int mirror);
 int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror);
 bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
+void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
+int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -89,6 +91,12 @@ static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device,
 {
 	return true;
 }
+static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
+static inline int btrfs_load_block_group_zone_info(
+	struct btrfs_block_group *cache)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..322036e49831 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -16,6 +16,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 	ASSERT(s_info);
 	return s_info->bytes_used + s_info->bytes_reserved +
 		s_info->bytes_pinned + s_info->bytes_readonly +
+		s_info->bytes_zone_unusable +
 		(may_use_included ? s_info->bytes_may_use : 0);
 }
 
@@ -112,7 +113,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
@@ -128,6 +129,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
 	found->bytes_readonly += bytes_readonly;
+	found->bytes_zone_unusable += bytes_zone_unusable;
 	if (total_bytes > 0)
 		found->full = 0;
 	btrfs_try_granting_tickets(info, found);
@@ -267,10 +269,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 		   info->total_bytes - btrfs_space_info_used(info, true),
 		   info->full ? "" : "not ");
 	btrfs_info(fs_info,
-		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu",
+		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu",
 		info->total_bytes, info->bytes_used, info->bytes_pinned,
 		info->bytes_reserved, info->bytes_may_use,
-		info->bytes_readonly);
+		info->bytes_readonly, info->bytes_zone_unusable);
 
 	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
 	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
@@ -299,9 +301,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(cache, &info->block_groups[index], list) {
 		spin_lock(&cache->lock);
 		btrfs_info(fs_info,
-			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s",
+			"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s",
 			cache->start, cache->length, cache->used, cache->pinned,
-			cache->reserved, cache->ro ? "[readonly]" : "");
+			cache->reserved, cache->zone_unusable,
+			cache->ro ? "[readonly]" : "");
 		btrfs_dump_free_space(cache, bytes);
 		spin_unlock(&cache->lock);
 	}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index 1a349e3f9cc1..a1a5f6c2611b 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -17,6 +17,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 bytes_zone_unusable;	/* total bytes that are unusable until
+					   resetting the device zone */
 
 	u64 max_extent_size;	/* This will hold the maximum extent size of
 				   the space info if we had an ENOSPC in the
@@ -111,7 +113,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned");
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
-			     u64 bytes_readonly,
+			     u64 bytes_readonly, u64 bytes_zone_unusable,
 			     struct btrfs_space_info **space_info);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
 					       u64 flags);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 230c7ad90e22..c479708537fc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -458,6 +458,7 @@ SPACE_INFO_ATTR(bytes_pinned);
 SPACE_INFO_ATTR(bytes_reserved);
 SPACE_INFO_ATTR(bytes_may_use);
 SPACE_INFO_ATTR(bytes_readonly);
+SPACE_INFO_ATTR(bytes_zone_unusable);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
 BTRFS_ATTR(space_info, total_bytes_pinned,
@@ -471,6 +472,7 @@ static struct attribute *space_info_attrs[] = {
 	BTRFS_ATTR_PTR(space_info, bytes_reserved),
 	BTRFS_ATTR_PTR(space_info, bytes_may_use),
 	BTRFS_ATTR_PTR(space_info, bytes_readonly),
+	BTRFS_ATTR_PTR(space_info, bytes_zone_unusable),
 	BTRFS_ATTR_PTR(space_info, disk_used),
 	BTRFS_ATTR_PTR(space_info, disk_total),
 	BTRFS_ATTR_PTR(space_info, total_bytes_pinned),
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 11/28] btrfs: make unmirrored BGs readonly only if we have at least one writable BG
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (9 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-17 19:25   ` Josef Bacik
  2019-12-13  4:08 ` [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

If the btrfs volume has mirrored block groups, btrfs unconditionally marks
un-mirrored block groups read only. When mirrored block groups exist but
none of them is writable, this leaves us with no writable block group at
all. So, check that we have at least one writable mirrored block group
before setting un-mirrored block groups read only.

This change is necessary to handle e.g. the xfstests btrfs/124 case.

When we mount a degraded RAID1 FS, write to it, and then re-mount it with
the full set of devices, the write pointers of the corresponding zones of
the written block group differ. We mark such a block group as "wp_broken"
and make it read only. In this situation, all the RAID1 block groups are
read only because of "wp_broken", and the un-mirrored block groups are also
marked read only because RAID1 block groups exist. As a result, every block
group is now read only, so we cannot even start a rebalance to fix the
situation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5c04422f6f5a..b286359f3876 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1813,6 +1813,27 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+/*
+ * have_mirrored_block_group - check if we have at least one writable
+ *                             mirrored Block Group
+ */
+static bool have_mirrored_block_group(struct btrfs_space_info *space_info)
+{
+	struct btrfs_block_group *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE)
+			continue;
+		list_for_each_entry(block_group, &space_info->block_groups[i],
+				    list) {
+			if (!block_group->ro)
+				return true;
+		}
+	}
+	return false;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1861,6 +1882,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		       BTRFS_BLOCK_GROUP_RAID56_MASK |
 		       BTRFS_BLOCK_GROUP_DUP)))
 			continue;
+
+		if (!have_mirrored_block_group(space_info))
+			continue;
+
 		/*
 		 * Avoid allocating from un-mirrored block group if there are
 		 * mirrored block groups.
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (10 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 11/28] btrfs: make unmirrored BGs readonly only if we have at least one writable BG Naohiro Aota
@ 2019-12-13  4:08 ` Naohiro Aota
  2019-12-17 19:32   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 13/28] btrfs: reset zones of unused block groups Naohiro Aota
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:08 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

On or after a degraded mount, we might have no writable metadata block
group due to broken write pointers. If you e.g. balance the FS before
writing any data, alloc_tree_block_no_bg_flush() (called from
insert_balance_item()) fails to allocate a tree block because of a global
reservation failure. We can reproduce this situation with xfstests
btrfs/124.

While we could work around the failure by writing some data first, so that
a new metadata block group gets allocated as a side effect, that is a bad
practice to rely on.

This commit avoids such failures by ensuring that a read-write mounted
volume has non-zero metadata space. If the metadata space is empty, it
forces allocation of a new metadata block group.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c |  9 +++++++++
 fs/btrfs/hmzoned.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  6 ++++++
 3 files changed, 60 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index deca9fd70771..7f4c6a92079a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3246,6 +3246,15 @@ int __cold open_ctree(struct super_block *sb,
 		}
 	}
 
+	ret = btrfs_hmzoned_check_metadata_space(fs_info);
+	if (ret) {
+		btrfs_warn(fs_info, "failed to allocate metadata space: %d",
+			   ret);
+		btrfs_warn(fs_info, "try remount with readonly");
+		close_ctree(fs_info);
+		return ret;
+	}
+
 	down_read(&fs_info->cleanup_work_sem);
 	if ((ret = btrfs_orphan_cleanup(fs_info->fs_root)) ||
 	    (ret = btrfs_orphan_cleanup(fs_info->tree_root))) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index b067fa84b9a1..1a2a296e988a 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -16,6 +16,8 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "locking.h"
+#include "space-info.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -1072,3 +1074,46 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 
 	return ret;
 }
+
+/*
+ * On/After degraded mount, we might have no writable metadata block
+ * group due to broken write pointers. If you e.g. balance the FS
+ * before writing any data, alloc_tree_block_no_bg_flush() (called
+ * from insert_balance_item()) fails to allocate a tree block for
+ * it. To avoid such situations, ensure we have some metadata BG here.
+ */
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_space_info *info;
+	u64 left;
+	int ret;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+	spin_lock(&info->lock);
+	left = info->total_bytes - btrfs_space_info_used(info, true);
+	spin_unlock(&info->lock);
+
+	if (left)
+		return 0;
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = btrfs_alloc_chunk(trans, btrfs_metadata_alloc_profile(fs_info));
+	if (ret) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return btrfs_commit_transaction(trans);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 4ed985d027cc..8ac758074afd 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -42,6 +42,7 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -97,6 +98,11 @@ static inline int btrfs_load_block_group_zone_info(
 {
 	return 0;
 }
+static inline int btrfs_hmzoned_check_metadata_space(
+	struct btrfs_fs_info *fs_info)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 13/28] btrfs: reset zones of unused block groups
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (11 preceding siblings ...)
  2019-12-13  4:08 ` [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:33   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode Naohiro Aota
                   ` (16 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

For an HMZONED volume, a block group maps to a zone of the device. When an
unused block group is deleted, the zone of the block group can be reset,
rewinding the zone write pointer back to the start of the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  8 ++++++--
 fs/btrfs/extent-tree.c | 17 ++++++++++++-----
 fs/btrfs/hmzoned.c     | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h     | 23 +++++++++++++++++++++++
 4 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b286359f3876..e78d34a4fb56 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1369,8 +1369,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
-		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD);
+		/*
+		 * DISCARD can flip during remount. In HMZONED mode,
+		 * we need to reset sequential required zones.
+		 */
+		trimming = btrfs_test_opt(fs_info, DISCARD) ||
+				btrfs_fs_incompat(fs_info, HMZONED);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3781a3778696..b41a45855bc4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1338,6 +1338,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			struct btrfs_device *dev = stripe->dev;
+			u64 physical = stripe->physical;
+			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
 
@@ -1345,14 +1348,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
-			if (!blk_queue_discard(req_q))
+			/* zone reset in HMZONED mode */
+			if (btrfs_can_zone_reset(dev, physical, length))
+				ret = btrfs_reset_device_zone(dev, physical,
+							      length, &bytes);
+			else if (blk_queue_discard(req_q))
+				ret = btrfs_issue_discard(dev->bdev, physical,
+							  length, &bytes);
+			else
 				continue;
 
-			ret = btrfs_issue_discard(stripe->dev->bdev,
-						  stripe->physical,
-						  stripe->length,
-						  &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 1a2a296e988a..0ca84d888e53 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -1117,3 +1117,21 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
 
 	return btrfs_commit_transaction(trans);
 }
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	ret = blkdev_reset_zones(device->bdev, physical >> SECTOR_SHIFT,
+				 length >> SECTOR_SHIFT, GFP_NOFS);
+	if (!ret) {
+		*bytes = length;
+		while (length) {
+			btrfs_dev_set_zone_empty(device, physical);
+			length -= device->zone_info->zone_size;
+		}
+	}
+
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 8ac758074afd..e1fa6a2f2557 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -43,6 +43,8 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -103,6 +105,11 @@ static inline int btrfs_hmzoned_check_metadata_space(
 {
 	return 0;
 }
+static inline int btrfs_reset_device_zone(struct btrfs_device *device,
+					  u64 physical, u64 length, u64 *bytes)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -189,4 +196,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
 	return ALIGN(pos, device->zone_info->zone_size);
 }
 
+static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
+					u64 physical, u64 length)
+{
+	u64 zone_size;
+
+	if (!btrfs_dev_is_sequential(device, physical))
+		return false;
+
+	zone_size = device->zone_info->zone_size;
+	if (!IS_ALIGNED(physical, zone_size) ||
+	    !IS_ALIGNED(length, zone_size))
+		return false;
+
+	return true;
+}
+
 #endif
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (12 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 13/28] btrfs: reset zones of unused block groups Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:41   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs Naohiro Aota
                   ` (15 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On HMZONED drives, however, this
optimization blocks the following IOs, as cancelling the write out of the
freed blocks breaks the sequential write sequence expected by the device.

This patch introduces a list of clean and unwritten extent buffers that
have been released in a transaction. Btrfs redirties the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides that, it clears the entire content of the extent buffer so as not
to confuse raw block scanners such as btrfsck. With the content cleared,
csum_dirty_buffer() would complain about a bytenr mismatch, so skip the
check and the checksum using the newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.
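
Roughly, the intended lifecycle is the following (a simplified sketch using
only the helpers added by this patch; reference counting and locking are
omitted):

  /* A clean, never-written tree block is released within a transaction: */
  btrfs_redirty_list_add(trans->transaction, eb);
          /* redirties eb, zeroes its content, sets EXTENT_BUFFER_NO_CHECK
           * and queues it on trans->releasing_ebs */

  /* After every tree block of the transaction has been written out: */
  btrfs_free_redirty_list(cur_trans);
          /* drops the references taken above */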

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     |  8 ++++++++
 fs/btrfs/extent-tree.c | 12 +++++++++++-
 fs/btrfs/extent_io.c   |  3 +++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/hmzoned.c     | 36 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  6 ++++++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 8 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f4c6a92079a..fbbc313f9f46 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -525,6 +525,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
+
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+		WARN_ON(found_start != 0);
+		return 0;
+	}
+
 	/*
 	 * Please do not consolidate these warnings into a single if.
 	 * It is useful to know what went wrong.
@@ -4521,6 +4527,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(fs_info,
 				    fs_info->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state =TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b41a45855bc4..e61f69eef4a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3301,8 +3301,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
-			if (!ret)
+			if (!ret) {
+				btrfs_redirty_list_add(trans->transaction, buf);
 				goto out;
+			}
 		}
 
 		pin = 0;
@@ -3314,6 +3316,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			btrfs_redirty_list_add(trans->transaction, buf);
+			pin_down_extent(cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
+			goto out;
+		}
+
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
@@ -4524,6 +4533,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_tree_lock(buf);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags);
 
 	btrfs_set_lock_blocking_write(buf);
 	set_extent_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index eb8bd0258360..6e25c8790ef4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -24,6 +24,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "hmzoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4889,6 +4890,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
@@ -5686,6 +5688,7 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
+	WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
 
 	offset = offset_in_page(start_offset + start);
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a8551a1f56e2..51a15e93a5cd 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -29,6 +29,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -115,6 +116,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	int spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 0ca84d888e53..0c0ee9a46009 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -1135,3 +1135,39 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 
 	return ret;
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	set_extent_bits_nowait(&trans->dirty_pages, eb->start,
+			       eb->start + eb->len - 1, EXTENT_DIRTY);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index e1fa6a2f2557..ddec6aed7283 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -45,6 +45,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -110,6 +113,9 @@ static inline int btrfs_reset_device_zone(struct btrfs_device *device,
 {
 	return 0;
 }
+static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+					  struct extent_buffer *eb) { }
+static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 19de6e2041dc..39628c370bdb 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -21,6 +21,7 @@
 #include "dev-replace.h"
 #include "qgroup.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -329,6 +330,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2336,6 +2339,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks
+	 * allocated in this transaction. So it's now safe to free the
+	 * redirtied extent buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 49f7196368f5..3d60d2213c70 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -84,6 +84,9 @@ struct btrfs_transaction {
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	struct btrfs_fs_info *fs_info;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (13 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:49   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 16/28] btrfs: implement atomic compressed IO submission Naohiro Aota
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block group mutex
"zone_io_lock", and find_free_extent_zoned() takes the lock. The lock is
kept even after returning from find_free_extent(). It is released only when
submitting the IOs corresponding to the allocation has completed.

Implementing such behavior under __extent_writepage_io() is almost
impossible because once pages are unlocked we cannot tell when submitting
the IOs for an allocated region has finished. Instead, this commit adds
run_delalloc_hmzoned() to write out non-compressed data IOs at once using
extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock() to unlock the block group for new
allocation.
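
Put together, the expected flow for a non-compressed buffered write is
roughly the following (a condensed sketch of the code added below; error
handling is omitted):

  run_delalloc_hmzoned(inode, locked_page, start, end, ...)
  {
          /* allocation; find_free_extent_zoned() takes and keeps
           * cache->zone_io_lock on success */
          cow_file_range(inode, locked_page, start, end,
                         page_started, nr_written, 0);

          /* submit the bios for the just-allocated range ... */
          extent_write_locked_range(inode, start, end, WB_SYNC_ALL);

          /* ... and only then let the next allocation into this
           * block group */
          btrfs_hmzoned_data_io_unlock_logical(fs_info, logical);
  }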

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  1 +
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent-tree.c |  4 ++++
 fs/btrfs/hmzoned.h     | 36 +++++++++++++++++++++++++++++++++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e78d34a4fb56..6f7d29171adf 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1642,6 +1642,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	mutex_init(&cache->zone_io_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
 	return cache;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 347605654021..57c8d6f4b3d1 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -165,6 +165,7 @@ struct btrfs_block_group {
 	 * enabled.
 	 */
 	u64 alloc_offset;
+	struct mutex zone_io_lock;
 };
 
 #ifdef CONFIG_BTRFS_DEBUG
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e61f69eef4a8..d1f326b6c4d4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3699,6 +3699,7 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 
 	ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED));
 
+	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
 
@@ -3729,6 +3730,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 out:
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+	/* if succeeds, unlock after submit_bio */
+	if (ret)
+		btrfs_hmzoned_data_io_unlock(cache);
 	return ret;
 }
 
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index ddec6aed7283..f6682ead575b 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -12,6 +12,7 @@
 #include <linux/blkdev.h>
 #include "volumes.h"
 #include "disk-io.h"
+#include "block-group.h"
 
 struct btrfs_zoned_device_info {
 	/*
@@ -48,6 +49,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -116,6 +118,8 @@ static inline int btrfs_reset_device_zone(struct btrfs_device *device,
 static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
+static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode,
+						   u64 start, u64 len) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -218,4 +222,36 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
 	return true;
 }
 
+static inline void btrfs_hmzoned_data_io_lock(
+	struct btrfs_block_group *cache)
+{
+	/* No need to lock metadata BGs or non-sequential BGs */
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_lock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock(
+	struct btrfs_block_group *cache)
+{
+	if (!btrfs_fs_incompat(cache->fs_info, HMZONED) ||
+	    !(cache->flags & BTRFS_BLOCK_GROUP_DATA))
+		return;
+	mutex_unlock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock_logical(
+	struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	btrfs_hmzoned_data_io_unlock(cache);
+	btrfs_put_block_group(cache);
+}
+
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 56032c518b26..3677c36999d8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -49,6 +49,7 @@
 #include "qgroup.h"
 #include "delalloc-space.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -1325,6 +1326,39 @@ static int cow_file_range_async(struct inode *inode,
 	return 0;
 }
 
+static noinline int run_delalloc_hmzoned(struct inode *inode,
+					 struct page *locked_page, u64 start,
+					 u64 end, int *page_started,
+					 unsigned long *nr_written)
+{
+	struct extent_map *em;
+	u64 logical;
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end,
+			     page_started, nr_written, 0);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start + 1,
+			      0);
+	ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE);
+	logical = em->block_start;
+	free_extent_map(em);
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1737,17 +1771,24 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	int do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !hmzoned) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1);
+	} else if (!do_compress && hmzoned) {
+		ret = run_delalloc_hmzoned(inode, locked_page, start, end,
+					   page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 16/28] btrfs: implement atomic compressed IO submission
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (14 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-13  4:09 ` [PATCH v6 17/28] btrfs: support direct write IO in HMZONED Naohiro Aota
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Just as with non-compressed IO submission, we must unlock the block group
for the next allocation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3677c36999d8..e09089e24a8f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -793,13 +793,25 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			 * and IO for us.  Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
+			if (!page_started && !ret) {
+				struct extent_map *em;
+				u64 logical;
+
+				em = btrfs_get_extent(BTRFS_I(inode), NULL, 0,
+						      async_extent->start,
+						      async_extent->ram_size,
+						      0);
+				logical = em->block_start;
+				free_extent_map(em);
+
 				extent_write_locked_range(inode,
 						  async_extent->start,
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret && async_chunk->locked_page)
+				btrfs_hmzoned_data_io_unlock_logical(fs_info,
+								     logical);
+			} else if (ret && async_chunk->locked_page)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -899,6 +911,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -906,6 +919,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 out_free_reserve:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
 out_free:
 	extent_clear_unlock_delalloc(inode, async_extent->start,
 				     async_extent->start +
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 17/28] btrfs: support direct write IO in HMZONED
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (15 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 16/28] btrfs: implement atomic compressed IO submission Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-13  4:09 ` [PATCH v6 18/28] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Just as with other IO submission, we must unlock the block group for the
next allocation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e09089e24a8f..44658590c6e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@ struct btrfs_dio_data {
 	u64 reserve;
 	u64 unsubmitted_oe_range_start;
 	u64 unsubmitted_oe_range_end;
+	u64 alloc_end;
 	int overwrite;
 };
 
@@ -7787,6 +7788,12 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		}
 	}
 
+	if (dio_data->alloc_end) {
+		btrfs_hmzoned_data_io_unlock_logical(fs_info,
+						     dio_data->alloc_end - 1);
+		dio_data->alloc_end = 0;
+	}
+
 	/* this will cow the extent */
 	len = bh_result->b_size;
 	free_extent_map(em);
@@ -7818,6 +7825,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	WARN_ON(dio_data->reserve < len);
 	dio_data->reserve -= len;
 	dio_data->unsubmitted_oe_range_end = start + len;
+	dio_data->alloc_end = em->block_start + (start - em->start) + len;
 	current->journal_info = dio_data;
 out:
 	return ret;
@@ -8585,6 +8593,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	int ret = 0;
+	u64 disk_bytenr, len;
 
 	bio = btrfs_bio_clone(dio_bio);
 
@@ -8628,7 +8637,18 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			dio_data->unsubmitted_oe_range_end;
 	}
 
+	disk_bytenr = dip->disk_bytenr;
+	len = dip->bytes;
 	ret = btrfs_submit_direct_hook(dip);
+	if (write) {
+		struct btrfs_dio_data *dio_data = current->journal_info;
+
+		if (disk_bytenr + len == dio_data->alloc_end) {
+			btrfs_hmzoned_data_io_unlock_logical(
+				btrfs_sb(inode->i_sb), disk_bytenr);
+			dio_data->alloc_end = 0;
+		}
+	}
 	if (!ret)
 		return;
 
@@ -8804,6 +8824,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 			btrfs_delalloc_release_space(inode, data_reserved,
 					offset, count - (size_t)ret, true);
 		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
+		if (dio_data.alloc_end) {
+			pr_info("unlock final direct %llu", dio_data.alloc_end);
+			btrfs_hmzoned_data_io_unlock_logical(
+				fs_info, dio_data.alloc_end - 1);
+		}
 	}
 out:
 	if (wakeup)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 18/28] btrfs: serialize meta IOs on HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (16 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 17/28] btrfs: support direct write IO in HMZONED Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-13  4:09 ` [PATCH v6 19/28] btrfs: wait existing extents before truncating Naohiro Aota
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Just as in the data IO path, we must serialize write IOs for metadata. We
cannot add a mutex around allocation and submit because metadata blocks are
allocated in an earlier stage to build up the B-trees.

Thus, this commit adds hmzoned_meta_io_lock and holds it during metadata IO
submission in btree_write_cache_pages() to serialize the IOs. Furthermore,
this commit adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which could otherwise be
broken when writing back blocks of an unfinished transaction.
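
The resulting writeback loop then looks roughly like this (a simplified
sketch; the pagevec iteration of btree_write_cache_pages() is elided):

  btrfs_hmzoned_meta_io_lock(fs_info);
  for (each dirty extent buffer eb, in bytenr order) {
          if (!btrfs_check_meta_write_pointer(fs_info, eb, &cache))
                  break;  /* eb is not at the zone's write pointer; stop and
                             let the transaction commit fill the hole */
          if (lock_extent_buffer_for_io(eb, &epd) <= 0)
                  btrfs_revert_meta_write_pointer(cache, eb);
          else
                  write_one_eb(eb, ...);
  }
  btrfs_hmzoned_meta_io_unlock(fs_info);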

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/disk-io.c     |  1 +
 fs/btrfs/extent_io.c   | 27 +++++++++++++++++++++-
 fs/btrfs/hmzoned.c     | 52 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 27 ++++++++++++++++++++++
 6 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 57c8d6f4b3d1..8827869f1744 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -166,6 +166,7 @@ struct btrfs_block_group {
 	 */
 	u64 alloc_offset;
 	struct mutex zone_io_lock;
+	u64 meta_write_pointer;
 };
 
 #ifdef CONFIG_BTRFS_DEBUG
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 44517802b9e5..18d2d0581e68 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -905,6 +905,8 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	struct mutex hmzoned_meta_io_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fbbc313f9f46..4abadd9317d1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2707,6 +2707,7 @@ int __cold open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
+	mutex_init(&fs_info->hmzoned_meta_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e25c8790ef4..24f7b05e1f4c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3921,7 +3921,9 @@ int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+	struct btrfs_fs_info *fs_info = tree->fs_info;
 	struct extent_buffer *eb, *prev_eb = NULL;
+	struct btrfs_block_group *cache = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3951,6 +3953,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
+	btrfs_hmzoned_meta_io_lock(fs_info);
 retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
@@ -3994,12 +3997,30 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!ret)
 				continue;
 
+			if (!btrfs_check_meta_write_pointer(fs_info, eb,
+							    &cache)) {
+				/*
+				 * If for_sync, this hole will be
+				 * filled by the transaction commit.
+				 */
+				if (wbc->sync_mode == WB_SYNC_ALL &&
+				    !wbc->for_sync)
+					ret = -EAGAIN;
+				else
+					ret = 0;
+				done = 1;
+				free_extent_buffer(eb);
+				break;
+			}
+
 			prev_eb = eb;
 			ret = lock_extent_buffer_for_io(eb, &epd);
 			if (!ret) {
+				btrfs_revert_meta_write_pointer(cache, eb);
 				free_extent_buffer(eb);
 				continue;
 			} else if (ret < 0) {
+				btrfs_revert_meta_write_pointer(cache, eb);
 				done = 1;
 				free_extent_buffer(eb);
 				break;
@@ -4032,12 +4053,16 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
+	if (cache)
+		btrfs_put_block_group(cache);
 	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
-		return ret;
+		goto out;
 	}
 	ret = flush_write_bio(&epd);
+out:
+	btrfs_hmzoned_meta_io_unlock(fs_info);
 	return ret;
 }
 
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 0c0ee9a46009..1aa4c9d1032e 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -1069,6 +1069,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		}
 	}
 
+	if (!ret)
+		cache->meta_write_pointer = cache->alloc_offset + cache->start;
+
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
@@ -1171,3 +1174,52 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	}
 	spin_unlock(&trans->releasing_ebs_lock);
 }
+
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret)
+{
+	struct btrfs_block_group *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return true;
+
+	cache = *cache_ret;
+
+	if (cache &&
+	    (eb->start < cache->start ||
+	     cache->start + cache->length <= eb->start)) {
+		btrfs_put_block_group(cache);
+		cache = NULL;
+		*cache_ret = NULL;
+	}
+
+	if (!cache)
+		cache = btrfs_lookup_block_group(fs_info,
+						 eb->start);
+
+	if (cache) {
+		*cache_ret = cache;
+
+		if (cache->meta_write_pointer != eb->start) {
+			btrfs_put_block_group(cache);
+			cache = NULL;
+			*cache_ret = NULL;
+			return false;
+		}
+
+		cache->meta_write_pointer = eb->start + eb->len;
+	}
+
+	return true;
+}
+
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb)
+{
+	if (!btrfs_fs_incompat(eb->fs_info, HMZONED) || !cache)
+		return;
+
+	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
+	cache->meta_write_pointer = eb->start;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index f6682ead575b..54f1affa6919 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -50,6 +50,11 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group **cache_ret);
+void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
+				     struct extent_buffer *eb);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -120,6 +125,14 @@ static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode,
 						   u64 start, u64 len) { }
+static inline bool btrfs_check_meta_write_pointer(
+	struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
+	struct btrfs_block_group **cache_ret)
+{
+	return true;
+}
+static inline void btrfs_revert_meta_write_pointer(
+	struct btrfs_block_group *cache, struct extent_buffer *eb) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -254,4 +267,18 @@ static inline void btrfs_hmzoned_data_io_unlock_logical(
 	btrfs_put_block_group(cache);
 }
 
+static inline void btrfs_hmzoned_meta_io_lock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_lock(&fs_info->hmzoned_meta_io_lock);
+}
+
+static inline void btrfs_hmzoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_unlock(&fs_info->hmzoned_meta_io_lock);
+}
+
 #endif
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 19/28] btrfs: wait existing extents before truncating
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (17 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 18/28] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:53   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 20/28] btrfs: avoid async checksum on HMZONED mode Naohiro Aota
                   ` (10 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could break the
sequential write pattern in a block group if the truncated blocks are, for
example, followed by blocks allocated to another file. To avoid this
problem, always wait for the write out of all unwritten buffers before
proceeding with the truncate.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 44658590c6e8..e7fc217be095 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5323,6 +5323,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_end_write_no_snapshotting(root);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 20/28] btrfs: avoid async checksum on HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (18 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 19/28] btrfs: wait existing extents before truncating Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-13  4:09 ` [PATCH v6 21/28] btrfs: disallow mixed-bg in " Naohiro Aota
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

In HMZONED mode, btrfs uses the per-block group zone_io_lock to serialize
data write IOs and the per-FS hmzoned_meta_io_lock to serialize metadata
write IOs.

Even with this serialization, write bios sent from
{btree,btrfs}_write_cache_pages can be reordered by the async checksum
workers, as these workers are per CPU and not per zone.

To preserve write BIO ordering, we disable async checksum on HMZONED. This
does not result in lower performance with HDDs, as a single CPU core is
fast enough to checksum a single zone write stream at the maximum possible
bandwidth of the device. If multiple zones are written simultaneously, HDD
seek overhead lowers the achievable maximum bandwidth, so again the per
zone checksum serialization does not affect performance.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 2 ++
 fs/btrfs/inode.c   | 9 ++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4abadd9317d1..c3d8fc10d11d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -882,6 +882,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e7fc217be095..bd3384200fc9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2166,7 +2166,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	enum btrfs_wq_endio_type metadata = BTRFS_WQ_ENDIO_DATA;
 	blk_status_t ret = 0;
 	int skip_sum;
-	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
+	int async = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+		!btrfs_fs_incompat(fs_info, HMZONED);
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -8457,7 +8458,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
 	if (async_submit)
-		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
+		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+			!btrfs_fs_incompat(fs_info, HMZONED);
 
 	if (!write) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
@@ -8522,7 +8524,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	}
 
 	/* async crcs make it difficult to collect full stripe writes. */
-	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
+	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK ||
+	    btrfs_fs_incompat(fs_info, HMZONED))
 		async_submit = 0;
 	else
 		async_submit = 1;
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 21/28] btrfs: disallow mixed-bg in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (19 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 20/28] btrfs: avoid async checksum on HMZONED mode Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:56   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 22/28] btrfs: disallow inode_cache " Naohiro Aota
                   ` (8 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

Placing both data and metadata in a block group is impossible in HMZONED
mode. For data, we can allocate space for it and write it immediately after
the allocation. For metadata, however, we cannot do so, because the logical
addresses are recorded in other metadata buffers to build up the trees. As
a result, a data buffer can be placed after a metadata buffer that is not
yet written. Writing out the data buffer would break the sequential write
rule.

This commit checks for and disallows mixed block groups (MIXED_GROUPS) with
HMZONED mode.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 1aa4c9d1032e..c779232bb003 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -306,6 +306,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "HMZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 22/28] btrfs: disallow inode_cache in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (20 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 21/28] btrfs: disallow mixed-bg in " Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 19:56   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 23/28] btrfs: support dev-replace " Naohiro Aota
                   ` (7 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

inode_cache uses pre-allocation to write its cache data. However,
pre-allocation is completely disabled in HMZONED mode.

We could technically enable inode_cache in the same way as relocation.
However, inode_cache is rarely used and the man page discourages using it.
So, let's just disable it for now.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/hmzoned.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index c779232bb003..465db8e6de94 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -342,6 +342,12 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}
 
+	if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) {
+		btrfs_err(info,
+		  "cannot enable inode map caching with HMZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
 
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 23/28] btrfs: support dev-replace in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (21 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 22/28] btrfs: disallow inode_cache " Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 21:05   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 24/28] btrfs: enable relocation " Naohiro Aota
                   ` (6 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

We have two types of I/Os during the device-replace process. One is an I/O
to "copy" (by the scrub functions) all the device extents on the source
device to the destination device.  The other is an I/O to "clone" (by
handle_ops_on_dev_replace()) new incoming write I/Os from users to the
source device into the target device.

Cloning incoming I/Os can break the sequential write rule in the target
device. When a write is mapped in the middle of a block group, that I/O is
directed to the middle of a zone of the target device, which breaks the
sequential write rule.

However, the cloning function cannot simply be disabled, since incoming
I/Os targeting already copied device extents must be cloned so that the I/O
is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
targets a not-yet-copied region.  Since there is a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.

So the point is to copy only the already existing device extents. This
patch introduces mark_block_group_to_copy() to mark the existing block
groups as targets of copying. Then, handle_ops_on_dev_replace() and
dev-replace can check the flag to do their jobs.

The device-replace process in HMZONED mode must copy or clone all the
extents in the source device exactly once.  So, we need to ensure that
allocations started just before the dev-replace process have their
corresponding extent information in the B-trees.
finish_extent_writes_for_hmzoned() implements that functionality, which is
basically the code removed in commit 042528f8d840 ("Btrfs: fix block group
remaining RO forever after error during device replace").

This patch also handles the empty regions between used extents. Since
dev-replace is smart enough to copy only the used extents on the source
device, we have to
fill the gap to honor the sequential write rule in the target device.
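
Conceptually, the gap filling just pads the target zone with zeroes up to
the physical position of the next extent, so that the copy lands exactly at
the zone's write pointer. A minimal sketch of that idea (essentially the
fill_writer_pointer_gap() helper added to scrub.c below, with the HMZONED
feature check dropped for brevity):

  /* Pad the target device up to 'physical' so the next write is sequential */
  static int fill_gap_sketch(struct scrub_ctx *sctx, u64 physical)
  {
          u64 length;
          int ret;

          if (sctx->write_pointer >= physical)
                  return 0;       /* nothing to fill */

          length = physical - sctx->write_pointer;
          ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev,
                                            sctx->write_pointer, length);
          if (!ret)
                  sctx->write_pointer = physical; /* WP advanced by the zeroout */
          return ret;
  }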

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/dev-replace.c | 178 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/extent-tree.c |  20 ++++-
 fs/btrfs/hmzoned.c     |  91 +++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  16 ++++
 fs/btrfs/scrub.c       | 142 +++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c     |  36 ++++++++-
 8 files changed, 481 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 8827869f1744..323ba01ad8a9 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -83,6 +83,7 @@ struct btrfs_block_group {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 9286c6e0b636..6ac6aa0eb0b6 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -265,6 +265,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
 
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
@@ -399,6 +403,176 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group *cache;
+	struct extent_buffer *l;
+	struct btrfs_trans_handle *trans;
+	int slot;
+	int ret = 0;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	mutex_lock(&fs_info->chunk_mutex);
+
+	/* ensure we don't have a pending new block group */
+	while (fs_info->running_transaction &&
+	       !list_empty(&fs_info->running_transaction->dev_update_list)) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		trans = btrfs_attach_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			mutex_lock(&fs_info->chunk_mutex);
+			if (ret == -ENOENT)
+				continue;
+			else
+				goto out;
+		}
+
+		ret = btrfs_commit_transaction(trans);
+		mutex_lock(&fs_info->chunk_mutex);
+		if (ret)
+			goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+out:
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return ret;
+}
+
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->start;
+	int num_extents, cur_extent;
+	int i;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return true;
+
+	spin_lock(&cache->lock);
+	if (cache->removed) {
+		spin_unlock(&cache->lock);
+		return true;
+	}
+	spin_unlock(&cache->lock);
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	BUG_ON(IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* we have more device extent to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1 && cur_extent < num_extents - 1) {
+		/*
+		 * Has more stripes on this device. Keep this BG
+		 * readonly until we finish all the stripes.
+		 */
+		return false;
+	}
+
+	/* last stripe on this device */
+	spin_lock(&cache->lock);
+	cache->to_copy = 0;
+	spin_unlock(&cache->lock);
+
+	return true;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -440,6 +614,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 60b70dacc299..3911049a5f23 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical);
 
 #endif
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d1f326b6c4d4..69c4ce8ec83e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -34,6 +34,7 @@
 #include "block-group.h"
 #include "rcu-string.h"
 #include "hmzoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1343,6 +1344,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1351,15 +1354,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 			req_q = bdev_get_queue(stripe->dev->bdev);
 			/* zone reset in HMZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(
+					    dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(
+					dev_replace->tgtdev,
+					physical, length, &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;
 
+next:
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 465db8e6de94..c26a28bd159e 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -18,6 +18,7 @@
 #include "locking.h"
 #include "space-info.h"
 #include "transaction.h"
+#include "dev-replace.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -842,6 +843,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;
 
 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -868,6 +871,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);
 
+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing =
+			btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev,
+						   physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
@@ -1236,3 +1247,83 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
 }
+
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev,
+				    physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT,
+				    GFP_NOFS, 0);
+}
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	unsigned int nofs_flag;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nofs_flag = memalloc_nofs_save();
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
+}
+
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_hmzoned_issue_zeroout(tgt_dev, physical_pos, length);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 54f1affa6919..8558dd692b08 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -55,6 +55,10 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_group **cache_ret);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length);
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -133,6 +137,18 @@ static inline bool btrfs_check_meta_write_pointer(
 }
 static inline void btrfs_revert_meta_write_pointer(
 	struct btrfs_block_group *cache, struct extent_buffer *eb) { }
+static inline int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device,
+					      u64 physical, u64 length)
+{
+	return -EOPNOTSUPP;
+}
+static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
+						u64 logical,
+						u64 physical_start,
+						u64 physical_pos)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index af7cec962619..e88f32256ccc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -168,6 +168,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio        *wr_curr_bio;
 	struct mutex            wr_lock;
@@ -1627,6 +1628,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
 	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
 }
 
+static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical)
+{
+	int ret = 0;
+	u64 length;
+
+	if (!btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		return 0;
+
+	if (sctx->write_pointer < physical) {
+		length = physical - sctx->write_pointer;
+
+		ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev,
+						  sctx->write_pointer, length);
+		if (!ret)
+			sctx->write_pointer = physical;
+	}
+	return ret;
+}
+
 static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 				    struct scrub_page *spage)
 {
@@ -1649,6 +1669,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	if (sbio->page_count == 0) {
 		struct bio *bio;
 
+		ret = fill_writer_pointer_gap(sctx,
+					      spage->physical_for_dev_replace);
+		if (ret) {
+			mutex_unlock(&sctx->wr_lock);
+			return ret;
+		}
+
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
 		sbio->dev = sctx->wr_tgtdev;
@@ -1710,6 +1737,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -3040,6 +3071,46 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+static void sync_replace_for_hmzoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->bios_in_flight) == 0);
+}
+
+static int sync_write_pointer_for_hmzoned(struct scrub_ctx *sctx, u64 logical,
+					  u64 physical, u64 physical_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret = 0;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+
+	mutex_lock(&sctx->wr_lock);
+	if (sctx->write_pointer < physical_end) {
+		ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
+						    physical,
+						    sctx->write_pointer);
+		if (ret)
+			btrfs_err(fs_info, "failed to recover write pointer");
+	}
+	mutex_unlock(&sctx->wr_lock);
+	btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3052,7 +3123,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct btrfs_extent_item *extent;
 	struct blk_plug plug;
 	u64 flags;
-	int ret;
+	int ret, ret2;
 	int slot;
 	u64 nstripes;
 	struct extent_buffer *l;
@@ -3171,6 +3242,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3343,6 +3422,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			if (sctx->is_dev_replace)
+				sync_replace_for_hmzoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3410,6 +3492,15 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (sctx->is_dev_replace && ret >= 0) {
+		ret2 = sync_write_pointer_for_hmzoned(
+			sctx, base + offset,
+			map->stripes[num].physical, physical_end);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
@@ -3465,6 +3556,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	return ret;
 }
 
+static int finish_extent_writes_for_hmzoned(struct btrfs_root *root,
+					    struct btrfs_block_group *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct btrfs_trans_handle *trans;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	btrfs_wait_block_group_reservations(cache);
+	btrfs_wait_nocow_writers(cache);
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	return btrfs_commit_transaction(trans);
+}
+
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			   struct btrfs_device *scrub_dev, u64 start, u64 end)
@@ -3483,6 +3593,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	struct btrfs_key found_key;
 	struct btrfs_block_group *cache;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	bool do_chunk_alloc = btrfs_fs_incompat(fs_info, HMZONED);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -3551,6 +3662,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+
+		if (sctx->is_dev_replace &&
+		    btrfs_fs_incompat(fs_info, HMZONED)) {
+			spin_lock(&cache->lock);
+			if (!cache->to_copy) {
+				spin_unlock(&cache->lock);
+				ro_set = 0;
+				goto done;
+			}
+			spin_unlock(&cache->lock);
+		}
+
 		/*
 		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
 		 * to avoid deadlock caused by:
@@ -3579,7 +3702,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * thread can't be triggered fast enough, and use up all space
 		 * of btrfs_super_block::sys_chunk_array
 		 */
-		ret = btrfs_inc_block_group_ro(cache, false);
+		ret = btrfs_inc_block_group_ro(cache, do_chunk_alloc);
+		if (!ret && sctx->is_dev_replace) {
+			ret = finish_extent_writes_for_hmzoned(root, cache);
+			if (ret) {
+				btrfs_dec_block_group_ro(cache);
+				scrub_pause_off(fs_info);
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
 		scrub_pause_off(fs_info);
 
 		if (ret == 0) {
@@ -3641,6 +3773,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace &&
+		    !btrfs_finish_block_group_to_copy(dev_replace->srcdev,
+						      cache, found_key.offset))
+			ro_set = 0;
+
+done:
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d5b280b59733..adc9dfd655a6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1414,6 +1414,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 		search_start = btrfs_zone_align(device, search_start);
 	}
 
+	WARN_ON(device->zone_info &&
+		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -5721,9 +5724,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+	bool ret;
+
+	/* non-HMZONED mode does not use "to_copy" flag */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5736,6 +5759,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * A block group which has "to_copy" set will eventually
+		 * be copied by the dev-replace process. We can avoid
+		 * cloning the IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6146,8 +6178,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 24/28] btrfs: enable relocation in HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (22 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 23/28] btrfs: support dev-replace " Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 21:32   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

To serialize allocation and submit_bio, we introduced a mutex around them.
As a result, preallocation must be completely disabled to avoid a deadlock.

Since the current relocation process relies on preallocation to move file
data extents, it must be handled in another way. In HMZONED mode, we just
truncate the inode to the size that we wanted to pre-allocate. Then, we
flush the dirty pages on the file before finishing the relocation process.
run_delalloc_hmzoned() will handle all the allocation and submit IOs to
the underlying layers.
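
Condensed, the HMZONED relocation path therefore does two things instead of
preallocating (a simplified sketch of the prealloc_file_extent_cluster()
and relocate_file_extent_cluster() hunks below; transaction setup and error
handling are omitted):

  /* 1) Only extend i_size to cover the cluster, no preallocation */
  end = cluster->end - offset + 1;
  i_size_write(inode, end);
  btrfs_ordered_update_i_size(inode, end, NULL);
  ret = btrfs_update_inode(trans, root, inode);

  /*
   * 2) After dirtying the cluster pages, wait for the ordered extents so
   *    that run_delalloc_hmzoned() allocates and submits the blocks in order.
   */
  ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);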

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..2d17b7566df4 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	/*
+	 * In HMZONED, we cannot preallocate the file region. Instead,
+	 * we dirty and fiemap_write the region.
+	 */
+
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->i_ctime = current_time(inode);
+		i_size_write(inode, end);
+		btrfs_ordered_update_i_size(inode, end, NULL);
+		ret = btrfs_update_inode(trans, root, inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+		ret = btrfs_end_transaction(trans);
+
+		goto out;
+	}
+
 	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
@@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		btrfs_throttle(fs_info);
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		WARN_ON(ret);
+	}
 out:
 	kfree(ra);
 	return ret;
@@ -4186,8 +4218,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_fs_incompat(trans->fs_info, HMZONED))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4202,8 +4238,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-					  BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (23 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 24/28] btrfs: enable relocation " Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 22:04   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 26/28] btrfs: split alloc_log_tree() Naohiro Aota
                   ` (4 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it
back to the damaged blocks. This repairing, however, is against the
required sequential write rule.

We can consider three methods to repair an IO failure in HMZONED mode:
(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with
    the new extent
(3) Relocate the corresponding block group

Method (1) is most similar to the behavior on regular devices. However, it
also wipes non-damaged data in the same device extent, and so it
unnecessarily degrades non-damaged data.

Method (2) is much like device replacing, but done within the same device.
It is safe because it keeps the device extent until the replacing finishes.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing the device
extent position within one device.

Method (3) invokes relocation of the damaged block group, so it is
straightforward to implement. It relocates all the mirrored device extents,
so it is, potentially, a more costly operation than method (1) or (2). But
it relocates only the used extents, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

To protect a block group from being relocated multiple times on multiple IO
errors, this commit introduces a "relocating_repair" bit to indicate that
the block group is now being relocated to repair IO failures. Also, it uses
a new kthread, "btrfs-relocating-repair", so as not to block the IO path
with the relocation process.

This commit also supports repairing in the scrub process.
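
The repair entry point itself stays small; a condensed sketch of
btrfs_repair_one_hmzone() added in volumes.c below (the block-group
reference handling is trimmed for brevity):

  /*
   * Flag the block group once and let a kthread do the actual relocation,
   * so the IO completion path is not blocked by the relocation work.
   */
  cache = btrfs_lookup_block_group(fs_info, logical);
  if (!cache)
          return 0;

  spin_lock(&cache->lock);
  already = cache->relocating_repair;
  cache->relocating_repair = 1;
  spin_unlock(&cache->lock);

  if (!already)
          kthread_run(relocating_repair_kthread, cache,
                      "btrfs-relocating-repair");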

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent_io.c   |  3 ++
 fs/btrfs/scrub.c       |  3 ++
 fs/btrfs/volumes.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 79 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 323ba01ad8a9..4a5bd87345a1 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -84,6 +84,7 @@ struct btrfs_block_group {
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 24f7b05e1f4c..83f5e5883723 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2197,6 +2197,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
 
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e88f32256ccc..5ed54523f036 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !sctx->is_dev_replace)
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index adc9dfd655a6..21801aaa77c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7794,3 +7794,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group *cache = (struct btrfs_block_group *) data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->start;
+	btrfs_put_block_group(cache);
+
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
+		btrfs_info(fs_info,
+			   "skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* ensure Block Group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	return ret;
+}
+
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	/* do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 70cabe65f72a..e5a2e7fc3a08 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -576,5 +576,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical);
 
 #endif
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 26/28] btrfs: split alloc_log_tree()
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (24 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-13  4:09 ` [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode Naohiro Aota
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This is a preparation for the next patch. This commit splits
alloc_log_tree() into the part that allocates the tree structure (which
remains in alloc_log_tree()) and the part that allocates the tree node
(moved into btrfs_alloc_log_tree_node()). The latter is also exported to be
used in the next patch.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 31 +++++++++++++++++++++++++------
 fs/btrfs/disk-io.h |  2 ++
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c3d8fc10d11d..914c517d26b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1315,7 +1315,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root;
-	struct extent_buffer *leaf;
 
 	root = btrfs_alloc_root(fs_info, GFP_NOFS);
 	if (!root)
@@ -1327,6 +1326,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;
 
+	return root;
+}
+
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root)
+{
+	struct extent_buffer *leaf;
+
 	/*
 	 * DON'T set REF_COWS for log trees
 	 *
@@ -1338,26 +1345,31 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
 			NULL, 0, 0, 0);
-	if (IS_ERR(leaf)) {
-		kfree(root);
-		return ERR_CAST(leaf);
-	}
+	if (IS_ERR(leaf))
+		return PTR_ERR(leaf);
 
 	root->node = leaf;
 
 	btrfs_mark_buffer_dirty(root->node);
 	btrfs_tree_unlock(root->node);
-	return root;
+
+	return 0;
 }
 
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		kfree(log_root);
+		return ret;
+	}
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -1369,11 +1381,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log_root;
 	struct btrfs_inode_item *inode_item;
+	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
 
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		kfree(log_root);
+		return ret;
+	}
+
 	log_root->last_trans = trans->transid;
 	log_root->root_key.offset = root->root_key.objectid;
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 76f123ebb292..21e8d936c705 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -121,6 +121,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			extent_submit_bio_start_t *submit_bio_start);
 blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
 			  int mirror_num);
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (25 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 26/28] btrfs: split alloc_log_tree() Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 22:08   ` Josef Bacik
  2019-12-13  4:09 ` [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
                   ` (2 subsequent siblings)
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

The tree-log feature does not work in HMZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a result,
both writing tree-log blocks and writing other metadata blocks become
non-sequential writes, which HMZONED mode must avoid.

Also, since we can start more than one log transaction per subvolume at
the same time, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at the
time of the log transaction commit. The nodes of the global log root tree
(fs_info->log_root_tree) also have the same mixed allocation problem.

This patch assigns a dedicated block group for tree-log blocks, to separate
the two metadata writing streams (tree-log blocks and other metadata
blocks). As a result, each write stream can now be written to the devices
separately. "fs_info->treelog_bg" tracks the dedicated block group, and
btrfs assigns "treelog_bg" on demand at tree-log block allocation time.

Then, this patch serializes log transactions by waiting for a committing
transaction when someone tries to start a new transaction, to avoid the
mixed allocation problem. We must also wait for running log transactions
from another subvolume, but there is no easy way to detect which subvolume
root is running a log transaction. So, this patch forbids starting a new
log transaction when the global log root tree is already allocated by other
subvolumes.

Furthermore, this patch aligns the allocation order of the nodes of
"fs_info->log_root_tree" and the nodes of "root->log_root" with the writing
order of the nodes, by delaying the allocation of the root node of
"fs_info->log_root_tree", so that the node buffers can go out sequentially
to devices.
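
The allocator-side rule is compact; a condensed sketch of the check done in
the find_free_extent() hunk below for each candidate block group (bytenr is
the block group start):

  /*
   * Keep tree-log blocks in the dedicated block group, and keep all other
   * metadata allocations out of it.
   */
  spin_lock(&fs_info->treelog_bg_lock);
  log_bytenr = fs_info->treelog_bg;
  skip = log_bytenr &&
         ((for_treelog && bytenr != log_bytenr) ||
          (!for_treelog && bytenr == log_bytenr));
  spin_unlock(&fs_info->treelog_bg_lock);
  if (skip)
          goto loop;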

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  7 +++++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/disk-io.c     |  8 ++---
 fs/btrfs/extent-tree.c | 71 +++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/tree-log.c    | 49 ++++++++++++++++++++++++-----
 5 files changed, 116 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6f7d29171adf..93e6c617d68e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -910,6 +910,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);
 
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg == block_group->start)
+			fs_info->treelog_bg = 0;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 18d2d0581e68..cba8a169002c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -907,6 +907,8 @@ struct btrfs_fs_info {
 #endif
 
 	struct mutex hmzoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 914c517d26b0..9c2b2fbf0cdb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1360,16 +1360,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		kfree(log_root);
-		return ret;
-	}
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -2841,6 +2835,8 @@ int __cold open_ctree(struct super_block *sb,
 
 	fs_info->send_in_progress = 0;
 
+	spin_lock_init(&fs_info->treelog_bg_lock);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 69c4ce8ec83e..9b9608097f7f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3704,8 +3704,10 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
  */
 
 static int find_free_extent_zoned(struct btrfs_block_group *cache,
-				  struct find_free_extent_ctl *ffe_ctl)
+				  struct find_free_extent_ctl *ffe_ctl,
+				  bool for_treelog)
 {
+	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct btrfs_space_info *space_info = cache->space_info;
 	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
 	u64 start = cache->start;
@@ -3718,12 +3720,26 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!for_treelog || cache->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);
 
 	if (cache->ro) {
 		ret = -EAGAIN;
 		goto out;
 	}
 
+	/*
+	 * Do not allow a block group that is currently in use to become
+	 * the dedicated tree-log block group.
+	 */
+	if (for_treelog && !fs_info->treelog_bg &&
+	    (cache->used || cache->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = cache->length - cache->alloc_offset;
 	if (avail < num_bytes) {
 		ffe_ctl->max_extent_size = avail;
@@ -3731,6 +3747,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 		goto out;
 	}
 
+	if (for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = cache->start;
+
 	ffe_ctl->found_offset = start + cache->alloc_offset;
 	cache->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3738,12 +3757,15 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 	spin_unlock(&ctl->tree_lock);
 
 	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
-			  cache->fs_info->stripesize));
+			  fs_info->stripesize));
 	ffe_ctl->search_start = ffe_ctl->found_offset;
 	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
 				   ffe_ctl->delalloc);
 
 out:
+	if (ret && for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	/* if succeeds, unlock after submit_bio */
@@ -3891,7 +3913,7 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 				u64 ram_bytes, u64 num_bytes, u64 empty_size,
 				u64 hint_byte, struct btrfs_key *ins,
-				u64 flags, int delalloc)
+				u64 flags, int delalloc, bool for_treelog)
 {
 	int ret = 0;
 	struct btrfs_free_cluster *last_ptr = NULL;
@@ -3970,6 +3992,13 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		spin_unlock(&last_ptr->lock);
 	}
 
+	if (hmzoned && for_treelog) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg)
+			hint_byte = fs_info->treelog_bg;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	ffe_ctl.search_start = max(ffe_ctl.search_start,
 				   first_logical_byte(fs_info, 0));
 	ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte);
@@ -4015,8 +4044,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(block_group,
 			    &space_info->block_groups[ffe_ctl.index], list) {
 		/* If the block group is read-only, we can skip it entirely. */
-		if (unlikely(block_group->ro))
+		if (unlikely(block_group->ro)) {
+			if (hmzoned && for_treelog) {
+				spin_lock(&fs_info->treelog_bg_lock);
+				if (block_group->start == fs_info->treelog_bg)
+					fs_info->treelog_bg = 0;
+				spin_unlock(&fs_info->treelog_bg_lock);
+			}
 			continue;
+		}
 
 		btrfs_grab_block_group(block_group, delalloc);
 		ffe_ctl.search_start = block_group->start;
@@ -4062,7 +4098,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 			goto loop;
 
 		if (hmzoned) {
-			ret = find_free_extent_zoned(block_group, &ffe_ctl);
+			u64 bytenr = block_group->start;
+			u64 log_bytenr;
+			bool skip;
+
+			/*
+			 * Do not allow non-tree-log blocks in the
+			 * dedicated tree-log block group, and vice versa.
+			 */
+			spin_lock(&fs_info->treelog_bg_lock);
+			log_bytenr = fs_info->treelog_bg;
+			skip = log_bytenr &&
+				((for_treelog && bytenr != log_bytenr) ||
+				 (!for_treelog && bytenr == log_bytenr));
+			spin_unlock(&fs_info->treelog_bg_lock);
+			if (skip)
+				goto loop;
+
+			ret = find_free_extent_zoned(block_group, &ffe_ctl,
+						     for_treelog);
 			if (ret)
 				goto loop;
 			/*
@@ -4222,12 +4276,13 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
 
 	flags = get_alloc_profile_by_root(root, is_data);
 again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
 	ret = find_free_extent(fs_info, ram_bytes, num_bytes, empty_size,
-			       hint_byte, ins, flags, delalloc);
+			       hint_byte, ins, flags, delalloc, for_treelog);
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
 	} else if (ret == -ENOSPC) {
@@ -4245,8 +4300,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 
 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+			"allocation failed flags %llu, wanted %llu treelog %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6f757361db53..e155418f24ba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -18,6 +18,7 @@
 #include "compression.h"
 #include "qgroup.h"
 #include "inode-map.h"
+#include "hmzoned.h"
 
 /* magic values for the inode_only field in btrfs_log_inode:
  *
@@ -105,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
 				       u64 dirid, int del_all);
+static void wait_log_commit(struct btrfs_root *root, int transid);
 
 /*
  * tree logging is a special write ahead log used to make sure that
@@ -139,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
 	int ret = 0;
 
 	mutex_lock(&root->log_mutex);
 
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		if (btrfs_need_log_full_commit(trans)) {
 			ret = -EAGAIN;
 			goto out;
 		}
 
+		if (hmzoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
+
 		if (!root->log_start_pid) {
 			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 			root->log_start_pid = current->pid;
@@ -157,8 +168,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 		}
 	} else {
 		mutex_lock(&fs_info->tree_log_mutex);
-		if (!fs_info->log_root_tree)
+		if (hmzoned && fs_info->log_root_tree) {
+			ret = -EAGAIN;
+			mutex_unlock(&fs_info->tree_log_mutex);
+			goto out;
+		} else if (!fs_info->log_root_tree) {
 			ret = btrfs_init_log_root_tree(trans, fs_info);
+		}
 		mutex_unlock(&fs_info->tree_log_mutex);
 		if (ret)
 			goto out;
@@ -191,11 +207,19 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
  */
 static int join_running_log_trans(struct btrfs_root *root)
 {
+	bool hmzoned = btrfs_fs_incompat(root->fs_info, HMZONED);
 	int ret = -ENOENT;
 
 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		ret = 0;
+		if (hmzoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
 		atomic_inc(&root->log_writers);
 	}
 	mutex_unlock(&root->log_mutex);
@@ -2724,6 +2748,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 					btrfs_clean_tree_block(next);
 					btrfs_wait_tree_block_writeback(next);
 					btrfs_tree_unlock(next);
+					btrfs_redirty_list_add(
+						trans->transaction, next);
 				} else {
 					if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags))
 						clear_extent_buffer_dirty(next);
@@ -3128,6 +3154,11 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	mutex_lock(&log_root_tree->log_mutex);
 
+	mutex_lock(&fs_info->tree_log_mutex);
+	if (!log_root_tree->node)
+		btrfs_alloc_log_tree_node(trans, log_root_tree);
+	mutex_unlock(&fs_info->tree_log_mutex);
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3285,16 +3316,20 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};
 
-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}
 
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
 			  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+	if (trans && log->node)
+		btrfs_redirty_list_add(trans->transaction, log->node);
 	free_extent_buffer(log->node);
 	kfree(log);
 }
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (26 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode Naohiro Aota
@ 2019-12-13  4:09 ` Naohiro Aota
  2019-12-17 22:09   ` Josef Bacik
  2019-12-13  4:15 ` [PATCH RFC v2] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
  2019-12-19 20:19 ` [PATCH v6 00/28] btrfs: zoned block device support David Sterba
  29 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:09 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This final patch adds the HMZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount an HMZONED-flagged
file system.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index cba8a169002c..79c8695ba4b4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -293,7 +293,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH RFC v2] libblkid: implement zone-aware probing for HMZONED btrfs
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (27 preceding siblings ...)
  2019-12-13  4:09 ` [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
@ 2019-12-13  4:15 ` Naohiro Aota
  2019-12-19 20:19 ` [PATCH v6 00/28] btrfs: zoned block device support David Sterba
  29 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-13  4:15 UTC (permalink / raw)
  To: linux-btrfs, David Sterba, kzak
  Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
	Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel,
	Naohiro Aota

This is a work-in-progress and proof-of-concept patch to make libblkid
zone-aware. It can probe for a magic located at some offset from the
beginning of a specific zone of a device.

This patch will be split into two patches in the future:
- a patch to introduce zone-aware probing
- a patch to use it in btrfs probing

The first part introduces some new fields to struct blkid_idmag. They
indicate a magic location that is specified relative to a zone.

Also, the first part introduces `zone_size` to struct
blkid_struct_probe. It stores the size of zones of a device.

The second part use the introduced fields to probe the magic of HMZONED
btrfs. Then, it uses the write pointer position to detect the location
of the last written superblock.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 libblkid/src/blkidP.h            |   5 ++
 libblkid/src/probe.c             |  20 ++++-
 libblkid/src/superblocks/btrfs.c | 129 ++++++++++++++++++++++++++++++-
 3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/libblkid/src/blkidP.h b/libblkid/src/blkidP.h
index f9bbe008406f..9cd09520ee32 100644
--- a/libblkid/src/blkidP.h
+++ b/libblkid/src/blkidP.h
@@ -148,6 +148,10 @@ struct blkid_idmag
 
 	long		kboff;		/* kilobyte offset of superblock */
 	unsigned int	sboff;		/* byte offset within superblock */
+
+	int		is_zone;	/* indicates magic location is calculated based on zone position */
+	long		zonenum;	/* zone number which has superblock */
+	long		kboff_inzone;	/* kilobyte offset of superblock in a zone */
 };
 
 /*
@@ -195,6 +199,7 @@ struct blkid_struct_probe
 	dev_t			disk_devno;	/* devno of the whole-disk or 0 */
 	unsigned int		blkssz;		/* sector size (BLKSSZGET ioctl) */
 	mode_t			mode;		/* struct stat.sb_mode */
+	uint64_t		zone_size;	/* zone size (BLKGETZONESZ ioctl) */
 
 	int			flags;		/* private library flags */
 	int			prob_flags;	/* always zeroized by blkid_do_*() */
diff --git a/libblkid/src/probe.c b/libblkid/src/probe.c
index f6dd5573d5dd..52220bf6f0f4 100644
--- a/libblkid/src/probe.c
+++ b/libblkid/src/probe.c
@@ -94,6 +94,7 @@
 #ifdef HAVE_LINUX_CDROM_H
 #include <linux/cdrom.h>
 #endif
+#include <linux/blkzoned.h>
 #ifdef HAVE_SYS_STAT_H
 #include <sys/stat.h>
 #endif
@@ -861,6 +862,7 @@ int blkid_probe_set_device(blkid_probe pr, int fd,
 	struct stat sb;
 	uint64_t devsiz = 0;
 	char *dm_uuid = NULL;
+	uint32_t zone_size_sector;
 
 	blkid_reset_probe(pr);
 	blkid_probe_reset_buffers(pr);
@@ -887,6 +889,7 @@ int blkid_probe_set_device(blkid_probe pr, int fd,
 	pr->wipe_off = 0;
 	pr->wipe_size = 0;
 	pr->wipe_chain = NULL;
+	pr->zone_size = 0;
 
 	if (fd < 0)
 		return 1;
@@ -951,6 +954,9 @@ int blkid_probe_set_device(blkid_probe pr, int fd,
 #endif
 	free(dm_uuid);
 
+	if (S_ISBLK(sb.st_mode) && !ioctl(pr->fd, BLKGETZONESZ, &zone_size_sector))
+		pr->zone_size = zone_size_sector << 9;
+
 	DBG(LOWPROBE, ul_debug("ready for low-probing, offset=%"PRIu64", size=%"PRIu64"",
 				pr->off, pr->size));
 	DBG(LOWPROBE, ul_debug("whole-disk: %s, regfile: %s",
@@ -1009,8 +1015,16 @@ int blkid_probe_get_idmag(blkid_probe pr, const struct blkid_idinfo *id,
 	/* try to detect by magic string */
 	while(mag && mag->magic) {
 		unsigned char *buf;
+		uint64_t kboff;
+
+		if (!mag->is_zone)
+			kboff = mag->kboff;
+		else if (pr->zone_size)
+			kboff = ((mag->zonenum * pr->zone_size) >> 10) + mag->kboff_inzone;
+		else
+			goto next;
 
-		off = (mag->kboff + (mag->sboff >> 10)) << 10;
+		off = (kboff + (mag->sboff >> 10)) << 10;
 		buf = blkid_probe_get_buffer(pr, off, 1024);
 
 		if (!buf && errno)
@@ -1020,13 +1034,15 @@ int blkid_probe_get_idmag(blkid_probe pr, const struct blkid_idinfo *id,
 				buf + (mag->sboff & 0x3ff), mag->len)) {
 
 			DBG(LOWPROBE, ul_debug("\tmagic sboff=%u, kboff=%ld",
-				mag->sboff, mag->kboff));
+				mag->sboff, kboff));
 			if (offset)
 				*offset = off + (mag->sboff & 0x3ff);
 			if (res)
 				*res = mag;
 			return BLKID_PROBE_OK;
 		}
+
+next:
 		mag++;
 	}
 
diff --git a/libblkid/src/superblocks/btrfs.c b/libblkid/src/superblocks/btrfs.c
index f0fde700d896..10bdf841b6c4 100644
--- a/libblkid/src/superblocks/btrfs.c
+++ b/libblkid/src/superblocks/btrfs.c
@@ -9,6 +9,9 @@
 #include <unistd.h>
 #include <string.h>
 #include <stdint.h>
+#include <stdbool.h>
+
+#include <linux/blkzoned.h>
 
 #include "superblocks.h"
 
@@ -59,11 +62,128 @@ struct btrfs_super_block {
 	uint8_t label[256];
 } __attribute__ ((__packed__));
 
+#define BTRFS_SUPER_INFO_SIZE 4096
+#define SECTOR_SHIFT 9
+
+#define READ 0
+#define WRITE 1
+
+typedef uint64_t u64;
+typedef uint64_t sector_t;
+
+static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret)
+{
+	bool empty[2];
+	bool full[2];
+	sector_t sector;
+
+	if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	}
+
+	empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY;
+	empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY;
+	full[0] = zones[0].cond == BLK_ZONE_COND_FULL;
+	full[1] = zones[1].cond == BLK_ZONE_COND_FULL;
+
+	/*
+	 * Possible state of log buffer zones
+	 *
+	 *   E I F
+	 * E * x 0
+	 * I 0 x 0
+	 * F 1 1 x
+	 *
+	 * Row: zones[0]
+	 * Col: zones[1]
+	 * State:
+	 *   E: Empty, I: In-Use, F: Full
+	 * Log position:
+	 *   *: Special case, no superblock is written
+	 *   0: Use write pointer of zones[0]
+	 *   1: Use write pointer of zones[1]
+	 *   x: Invalid state
+	 */
+
+	if (empty[0] && empty[1]) {
+		/* special case to distinguish no superblock to read */
+		*wp_ret = zones[0].start << SECTOR_SHIFT;
+		return -ENOENT;
+	} else if (full[0] && full[1]) {
+		/* cannot determine which zone has the newer superblock */
+		return -EUCLEAN;
+	} else if (!full[0] && (empty[1] || full[1])) {
+		sector = zones[0].wp;
+	} else if (full[0]) {
+		sector = zones[1].wp;
+	} else {
+		return -EUCLEAN;
+	}
+	*wp_ret = sector << SECTOR_SHIFT;
+	return 0;
+}
+
+static int sb_log_offset(uint32_t zone_size_sector, blkid_probe pr,
+			 uint64_t *offset_ret)
+{
+	uint32_t zone_num = 0;
+	struct blk_zone_report *rep;
+	struct blk_zone *zones;
+	size_t rep_size;
+	int ret;
+	uint64_t wp;
+
+	rep_size = sizeof(struct blk_zone_report) + sizeof(struct blk_zone) * 2;
+	rep = malloc(rep_size);
+	if (!rep)
+		return -errno;
+
+	memset(rep, 0, rep_size);
+	rep->sector = zone_num * zone_size_sector;
+	rep->nr_zones = 2;
+
+	ret = ioctl(pr->fd, BLKREPORTZONE, rep);
+	if (ret) {
+		ret = -errno;
+		free(rep);
+		return ret;
+	}
+	if (rep->nr_zones != 2) {
+		free(rep);
+		return 1;
+	}
+
+	zones = (struct blk_zone *)(rep + 1);
+
+	ret = sb_write_pointer(zones, &wp);
+	if (ret != -ENOENT && ret) {
+		free(rep);
+		return -EIO;
+	}
+	if (ret != -ENOENT) {
+		if (wp == zones[0].start << SECTOR_SHIFT)
+			wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT;
+		wp -= BTRFS_SUPER_INFO_SIZE;
+	}
+	*offset_ret = wp;
+	free(rep);
+
+	return 0;
+}
+
 static int probe_btrfs(blkid_probe pr, const struct blkid_idmag *mag)
 {
 	struct btrfs_super_block *bfs;
+	uint32_t zone_size_sector;
+	int ret;
+
+	if (pr->zone_size != 0) {
+		uint64_t offset = 0;
 
-	bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+		/* pr->zone_size is in bytes; the zone ioctls work in 512B sectors */
+		zone_size_sector = pr->zone_size >> SECTOR_SHIFT;
+		ret = sb_log_offset(zone_size_sector, pr, &offset);
+		if (ret)
+			return ret;
+		bfs = (struct btrfs_super_block*)
+			blkid_probe_get_buffer(pr, offset,
+					       sizeof(struct btrfs_super_block));
+	} else {
+		bfs = blkid_probe_get_sb(pr, mag, struct btrfs_super_block);
+	}
 	if (!bfs)
 		return errno ? -errno : 1;
 
@@ -88,6 +208,13 @@ const struct blkid_idinfo btrfs_idinfo =
 	.magics		=
 	{
 	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40, .kboff = 64 },
+	  /* for HMZONED btrfs */
+	  { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 0, .kboff_inzone = 0 },
+	  { .magic = "_BHRfS_M", .len = 8, .sboff = 0x40,
+	    .is_zone = 1, .zonenum = 1, .kboff_inzone = 0 },
 	  { NULL }
 	}
 };
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 02/28] btrfs: Get zone information of zoned block devices
  2019-12-13  4:08 ` [PATCH v6 02/28] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-12-13 16:18   ` Josef Bacik
  2019-12-18  2:29     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:18 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zone_info().  To
> avoid costly run-time zone report commands to test the device zones type
> during block allocation, attach the seq_zones bitmap to the device
> structure to indicate if a zone is sequential or accept random writes. Also
> it attaches the empty_zones bitmap to indicate if a zone is empty or not.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone and
> btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/Makefile  |   1 +
>   fs/btrfs/hmzoned.c | 168 +++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/hmzoned.h |  92 +++++++++++++++++++++++++
>   fs/btrfs/volumes.c |  18 ++++-
>   fs/btrfs/volumes.h |   4 ++
>   5 files changed, 281 insertions(+), 2 deletions(-)
>   create mode 100644 fs/btrfs/hmzoned.c
>   create mode 100644 fs/btrfs/hmzoned.h
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 82200dbca5ac..64aaeed397a4 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>   btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> +btrfs-$(CONFIG_BLK_DEV_ZONED) += hmzoned.o
>   
>   btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
>   	tests/extent-buffer-tests.o tests/btrfs-tests.o \
> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
> new file mode 100644
> index 000000000000..6a13763d2916
> --- /dev/null
> +++ b/fs/btrfs/hmzoned.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> + * Authors:
> + *	Naohiro Aota	<naohiro.aota@wdc.com>
> + *	Damien Le Moal	<damien.lemoal@wdc.com>
> + */
> +
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "hmzoned.h"
> +#include "rcu-string.h"
> +
> +/* Maximum number of zones to report per blkdev_report_zones() call */
> +#define BTRFS_REPORT_NR_ZONES   4096
> +
> +static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> +			       struct blk_zone *zones, unsigned int *nr_zones)
> +{
> +	int ret;
> +
> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, zones,
> +				  nr_zones);
> +	if (ret != 0) {
> +		btrfs_err_in_rcu(device->fs_info,
> +				 "get zone at %llu on %s failed %d", pos,
> +				 rcu_str_deref(device->name), ret);
> +		return ret;
> +	}
> +	if (!*nr_zones)
> +		return -EIO;
> +
> +	return 0;
> +}
> +
> +int btrfs_get_dev_zone_info(struct btrfs_device *device)
> +{
> +	struct btrfs_zoned_device_info *zone_info = NULL;
> +	struct block_device *bdev = device->bdev;
> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> +	sector_t sector = 0;
> +	struct blk_zone *zones = NULL;
> +	unsigned int i, nreported = 0, nr_zones;
> +	unsigned int zone_sectors;
> +	int ret;
> +	char devstr[sizeof(device->fs_info->sb->s_id) +
> +		    sizeof(" (device )") - 1];
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
> +	if (!zone_info)
> +		return -ENOMEM;
> +
> +	zone_sectors = bdev_zone_sectors(bdev);
> +	ASSERT(is_power_of_2(zone_sectors));
> +	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
> +	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
> +	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> +	if (!IS_ALIGNED(nr_sectors, zone_sectors))
> +		zone_info->nr_zones++;
> +
> +	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
> +	if (!zone_info->seq_zones) {
> +		ret = -ENOMEM;
> +		goto free_zone_info;
> +	}
> +
> +	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
> +	if (!zone_info->empty_zones) {
> +		ret = -ENOMEM;
> +		goto free_seq_zones;
> +	}
> +
> +	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
> +			sizeof(struct blk_zone), GFP_KERNEL);
> +	if (!zones) {
> +		ret = -ENOMEM;
> +		goto free_empty_zones;
> +	}
> +
> +	/* Get zones type */
> +	while (sector < nr_sectors) {
> +		nr_zones = BTRFS_REPORT_NR_ZONES;
> +		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
> +					  &nr_zones);
> +		if (ret)
> +			goto free_zones;
> +
> +		for (i = 0; i < nr_zones; i++) {
> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> +				set_bit(nreported, zone_info->seq_zones);
> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> +				set_bit(nreported, zone_info->empty_zones);
> +			nreported++;
> +		}
> +		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
> +	}
> +
> +	if (nreported != zone_info->nr_zones) {
> +		btrfs_err_in_rcu(device->fs_info,
> +				 "inconsistent number of zones on %s (%u / %u)",
> +				 rcu_str_deref(device->name), nreported,
> +				 zone_info->nr_zones);
> +		ret = -EIO;
> +		goto free_zones;
> +	}
> +
> +	kfree(zones);
> +
> +	device->zone_info = zone_info;
> +
> +	devstr[0] = 0;
> +	if (device->fs_info)
> +		snprintf(devstr, sizeof(devstr), " (device %s)",
> +			 device->fs_info->sb->s_id);
> +
> +	rcu_read_lock();
> +	pr_info(
> +"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
> +		devstr,
> +		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> +		rcu_str_deref(device->name), zone_info->nr_zones,
> +		zone_info->zone_size >> SECTOR_SHIFT);
> +	rcu_read_unlock();
> +
> +	return 0;
> +
> +free_zones:
> +	kfree(zones);
> +free_empty_zones:
> +	bitmap_free(zone_info->empty_zones);
> +free_seq_zones:
> +	bitmap_free(zone_info->seq_zones);
> +free_zone_info:

bitmap_free() is just a kfree() which handles NULL pointers properly, so you only
need one goto here for cleaning up the zone_info, e.g. something like this
(untested sketch):
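
	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
			sizeof(struct blk_zone), GFP_KERNEL);
	if (!zones) {
		ret = -ENOMEM;
		goto out;
	}
	...
out:
	/* kfree() and bitmap_free() are all fine with a NULL pointer */
	kfree(zones);
	bitmap_free(zone_info->empty_zones);
	bitmap_free(zone_info->seq_zones);
	kfree(zone_info);
	return ret;

Once that's fixed you can add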

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 03/28] btrfs: Check and enable HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 03/28] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-12-13 16:21   ` Josef Bacik
  2019-12-18  4:17     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:21 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> HMZONED mode cannot be used together with the RAID5/6 profile for now.
> Introduce the function btrfs_check_hmzoned_mode() to check this. This
> function will also check if HMZONED flag is enabled on the file system and
> if the file system consists of zoned devices with equal zone size.
> 
> Additionally, as updates to the space cache are in-place, the space cache
> cannot be located over sequential zones and there is no guarantee that the
> device will have enough conventional zones to store this cache. Resolve
> this problem by completely disabling the space cache.  This does not
> introduce any problems in HMZONED mode: all the free space is located after
> the allocation pointer and no free space is located before the pointer.
> There is no need to have such cache.
> 
> For the same reason, NODATACOW is also disabled.
> 
> Also, INODE_MAP_CACHE is disabled to avoid preallocation in the
> INODE_MAP_CACHE inode.
> 
> In summary, HMZONED will disable:
> 
> | Disabled features | Reason                                              |
> |-------------------+-----------------------------------------------------|
> | RAID5/6           | 1) Non-full stripe write cause overwriting of       |
> |                   | parity block                                        |
> |                   | 2) Rebuilding on high capacity volume (usually with |
> |                   | SMR) can lead to higher failure rate                |
> |-------------------+-----------------------------------------------------|
> | space_cache (v1)  | In-place updating                                   |
> | NODATACOW         | In-place updating                                   |
> |-------------------+-----------------------------------------------------|
> | fallocate         | Reserved extent will be a write hole                |
> | INODE_MAP_CACHE   | Need pre-allocation. (and will be deprecated?)      |
> |-------------------+-----------------------------------------------------|
> | MIXED_BG          | Allocated metadata region will be write holes for   |
> |                   | data writes                                         |
> | async checksum    | Not to mix up bios by multiple workers              |
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

I assume the progs will be updated to account for these limitations as well?

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 04/28] btrfs: disallow RAID5/6 in HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 04/28] btrfs: disallow RAID5/6 in " Naohiro Aota
@ 2019-12-13 16:21   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:21 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> Supporting the RAID5/6 profile in HMZONED mode is not trivial. For example,
> non-full stripe writes will cause overwriting parity blocks. When we do a
> non-full stripe write, it writes to the parity block with the data at that
> moment. Then, another write to the stripes will try to overwrite the parity
> block with new parity value. However, sequential zones do not allow such
> parity overwriting.
> 
> Furthermore, using RAID5/6 on SMR drives, which usually have a huge
> capacity, incurs a large rebuild overhead. Such overhead can lead to a
> higher volume failure rate (e.g. additional drive failure during
> rebuild) because of the increased rebuild time.
> 
> Thus, let's disable RAID5/6 profile in HMZONED mode for now.
> 
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 05/28] btrfs: disallow space_cache in HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 05/28] btrfs: disallow space_cache " Naohiro Aota
@ 2019-12-13 16:24   ` Josef Bacik
  2019-12-18  4:28     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:24 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> As updates to the space cache v1 are in-place, the space cache cannot be
> located over sequential zones and there is no guarantee that the device
> will have enough conventional zones to store this cache. Resolve this
> problem by completely disabling the space cache v1.  This does not
> introduce any problems with sequential block groups: all the free space is
> located after the allocation pointer and there is no free space before the pointer.
> There is no need to have such cache.
> 
> Note: we can technically use free-space-tree (space cache v2) on HMZONED
> mode. But, since HMZONED mode now always allocate extents in a block group
> sequentially regardless of underlying device zone type, it's no use to
> enable and maintain the tree.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
>   fs/btrfs/hmzoned.h |  5 +++++
>   fs/btrfs/super.c   | 11 +++++++++--
>   3 files changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
> index 1b24facd46b8..d62f11652973 100644
> --- a/fs/btrfs/hmzoned.c
> +++ b/fs/btrfs/hmzoned.c
> @@ -250,3 +250,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>   out:
>   	return ret;
>   }
> +
> +int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
> +{
> +	if (!btrfs_fs_incompat(info, HMZONED))
> +		return 0;
> +
> +	/*
> +	 * SPACE CACHE writing is not CoWed. Disable that to avoid write
> +	 * errors in sequential zones.
> +	 */
> +	if (btrfs_test_opt(info, SPACE_CACHE)) {
> +		btrfs_err(info,
> +			  "space cache v1 not supportted in HMZONED mode");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
> index 8e17f64ff986..d9ebe11afdf5 100644
> --- a/fs/btrfs/hmzoned.h
> +++ b/fs/btrfs/hmzoned.h
> @@ -29,6 +29,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   int btrfs_get_dev_zone_info(struct btrfs_device *device);
>   void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>   int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
> +int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   				     struct blk_zone *zone)
> @@ -48,6 +49,10 @@ static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>   	btrfs_err(fs_info, "Zoned block devices support is not enabled");
>   	return -EOPNOTSUPP;
>   }
> +static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
> +{
> +	return 0;
> +}
>   #endif
>   
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 616f5abec267..1424c3c6e3cf 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -442,8 +442,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>   	cache_gen = btrfs_super_cache_generation(info->super_copy);
>   	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>   		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> -	else if (cache_gen)
> -		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> +	else if (cache_gen) {
> +		if (btrfs_fs_incompat(info, HMZONED))
> +			btrfs_info(info,
> +			"ignoring existing space cache in HMZONED mode");

It would be good to clear the cache gen in this case.  I assume this can happen 
if we add a hmzoned device to an existing fs with space cache already?  I'd hate 
for weird corner cases to pop up if we removed it later and still had a valid 
cache gen in place.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 06/28] btrfs: disallow NODATACOW in HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 06/28] btrfs: disallow NODATACOW " Naohiro Aota
@ 2019-12-13 16:25   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:25 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> NODATACOW implies overwriting the file data on a device, which is
> impossible in sequential required zones. Disable NODATACOW globally with
> mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL.
> 
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 07/28] btrfs: disable fallocate in HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 07/28] btrfs: disable fallocate " Naohiro Aota
@ 2019-12-13 16:26   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:26 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> fallocate() is implemented by allocating actual extents instead of just
> reserving space. This can result in exposing the sequential write constraint
> of host-managed zoned block devices to the application, which would break
> the POSIX semantic for the fallocated file.  To avoid this, report
> fallocate() as not supported when in HMZONED mode for now.
> 
> In the future, we may be able to implement "in-memory" fallocate() in
> HMZONED mode by utilizing space_info->bytes_may_use or so.
> 
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 08/28] btrfs: implement log-structured superblock for HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 08/28] btrfs: implement log-structured superblock for " Naohiro Aota
@ 2019-12-13 16:38   ` Josef Bacik
  2019-12-13 21:58     ` Damien Le Moal
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:38 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> Superblock (and its copies) is the only data structure in btrfs which has a
> fixed location on a device. Since we cannot overwrite in a sequential write
> required zone, we cannot place superblock in the zone. One easy solution is
> limiting superblock and copies to be placed only in conventional zones.
> However, this method has two downsides: one is reduced number of superblock
> copies. The location of the second copy of superblock is 256GB, which is in
> a sequential write required zone on typical devices in the market today.
> So, the number of superblock and copies is limited to be two.  Second
> downside is that we cannot support devices which have no conventional zones
> at all.
> 
> To solve these two problems, we employ superblock log writing. It uses two
> zones as a circular buffer to write updated superblocks. Once the first
> zone is filled up, start writing into the second buffer and reset the first
> one. We can determine the position of the latest superblock by reading write
> pointer information from a device.
> 
> The following zones are reserved as the circular buffer on HMZONED btrfs.
> 
> - The primary superblock: zones 0 and 1
> - The first copy: zones 16 and 17
> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>    to it
> 

So the series of events for writing is

-> get wp
-> write super block
-> advance wp
   -> if wp == end of the zone, reset the wp

now assume we crash here.  We'll go to mount the fs and the zone will look like 
it's empty because we reset the wp, and we'll be unable to mount the fs.  Am I 
missing something here?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary
  2019-12-13  4:08 ` [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary Naohiro Aota
@ 2019-12-13 16:52   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-13 16:52 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> In HMZONED mode, align the device extents to zone boundaries so that a zone
> reset affects only the device extent and does not change the state of
> blocks in the neighbor device extents. Also, check that a region allocation
> is always over empty zones and it is not over any locations of super block
> zones.
> 
> This patch also add a verification in verify_one_dev_extent() to check if
> the device extent is align to zone boundary.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 08/28] btrfs: implement log-structured superblock for HMZONED mode
  2019-12-13 16:38   ` Josef Bacik
@ 2019-12-13 21:58     ` Damien Le Moal
  2019-12-17 19:17       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Damien Le Moal @ 2019-12-13 21:58 UTC (permalink / raw)
  To: Josef Bacik, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

Josef,

On 2019/12/14 1:39, Josef Bacik wrote:
> On 12/12/19 11:08 PM, Naohiro Aota wrote:
>> Superblock (and its copies) is the only data structure in btrfs which has a
>> fixed location on a device. Since we cannot overwrite in a sequential write
>> required zone, we cannot place superblock in the zone. One easy solution is
>> limiting superblock and copies to be placed only in conventional zones.
>> However, this method has two downsides: one is reduced number of superblock
>> copies. The location of the second copy of superblock is 256GB, which is in
>> a sequential write required zone on typical devices in the market today.
>> So, the number of superblock and copies is limited to be two.  Second
>> downside is that we cannot support devices which have no conventional zones
>> at all.
>>
>> To solve these two problems, we employ superblock log writing. It uses two
>> zones as a circular buffer to write updated superblocks. Once the first
>> zone is filled up, start writing into the second buffer and reset the first
>> one. We can determine the position of the latest superblock by reading write
>> pointer information from a device.
>>
>> The following zones are reserved as the circular buffer on HMZONED btrfs.
>>
>> - The primary superblock: zones 0 and 1
>> - The first copy: zones 16 and 17
>> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>>    to it
>>
> 
> So the series of events for writing is
> 
> -> get wp
> -> write super block
> -> advance wp
>    -> if wp == end of the zone, reset the wp

In your example, the reset is for the other zone, leaving the zone that
was just filled as is. The sequence would in fact be more like this for
zones 0 & 1:

-> Get wp zone 0, if zone is full, reset it
-> write super block in zone 0
-> advance wp zone 0. If zone is full, switch to zone 1 for next update

This would come after the sequence:
-> Get wp zone 1
-> write super block in zone 1
-> advance wp zone 1. If zone is full, switch to zone 0 for next update

> 
> now assume we crash here.  We'll go to mount the fs and the zone will look like 
> it's empty because we reset the wp, and we'll be unable to mount the fs.  Am I 
> missing something here?  Thanks,

The last successful update of the super block is always present on disk
as the block right before the wp position of zone 0 or zone 1.
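
In other words, the read side boils down to something like this (rough,
untested sketch; empty[]/full[] stand for the two zone conditions, and the
complete state handling is what sb_write_pointer() in the libblkid RFC does):

	/* Both zones empty: no superblock was ever written. */
	if (empty[0] && empty[1])
		return -ENOENT;
	/* Both zones full: invalid state, cannot tell which one is newer. */
	if (full[0] && full[1])
		return -EUCLEAN;
	/* Otherwise the zone currently being filled holds the latest copy. */
	wp = full[0] ? zones[1].wp : zones[0].wp;
	/* Zone 0 freshly reset: the latest copy sits at the end of zone 1. */
	if (wp == zones[0].start)
		wp = zones[1].start + zones[1].len;
	sb_bytenr = (wp << SECTOR_SHIFT) - BTRFS_SUPER_INFO_SIZE;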

> 
> Josef
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 08/28] btrfs: implement log-structured superblock for HMZONED mode
  2019-12-13 21:58     ` Damien Le Moal
@ 2019-12-17 19:17       ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:17 UTC (permalink / raw)
  To: Damien Le Moal, Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/13/19 4:58 PM, Damien Le Moal wrote:
> Josef,
> 
> On 2019/12/14 1:39, Josef Bacik wrote:
>> On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>> Superblock (and its copies) is the only data structure in btrfs which has a
>>> fixed location on a device. Since we cannot overwrite in a sequential write
>>> required zone, we cannot place superblock in the zone. One easy solution is
>>> limiting superblock and copies to be placed only in conventional zones.
>>> However, this method has two downsides: one is reduced number of superblock
>>> copies. The location of the second copy of superblock is 256GB, which is in
>>> a sequential write required zone on typical devices in the market today.
>>> So, the number of superblock and copies is limited to be two.  Second
>>> downside is that we cannot support devices which have no conventional zones
>>> at all.
>>>
>>> To solve these two problems, we employ superblock log writing. It uses two
>>> zones as a circular buffer to write updated superblocks. Once the first
>>> zone is filled up, start writing into the second buffer and reset the first
>>> one. We can determine the position of the latest superblock by reading write
>>> pointer information from a device.
>>>
>>> The following zones are reserved as the circular buffer on HMZONED btrfs.
>>>
>>> - The primary superblock: zones 0 and 1
>>> - The first copy: zones 16 and 17
>>> - The second copy: zones 1024 or zone at 256GB which is minimum, and next
>>>     to it
>>>
>>
>> So the series of events for writing is
>>
>> -> get wp
>> -> write super block
>> -> advance wp
>>     -> if wp == end of the zone, reset the wp
> 
> In your example, the reset is for the other zone, leaving the zone that
> was just filled as is. The sequence would in fact be more like this for
> zones 0 & 1:
> 
> -> Get wp zone 0, if zone is full, reset it
> -> write super block in zone 0
> -> advance wp zone 0. If zone is full, switch to zone 1 for next update
> 
> This would come after the sequence:
> -> Get wp zone 1
> -> write super block in zone 1
> -> advance wp zone 1. If zone is full, switch to zone 0 for next update
> 

Ah ok I missed that.  Alright you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

To this one, thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode
  2019-12-13  4:08 ` [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
@ 2019-12-17 19:19   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:19 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> On HMZONED drives, writes must always be sequential and directed at a block
> group zone write pointer position. Thus, block allocation in a block group
> must also be done sequentially using an allocation pointer equal to the
> block group zone write pointer plus the number of blocks allocated but not
> yet written.
> 
> The sequential allocation function find_free_extent_zoned() bypasses the
> checks in find_free_extent() and increases the reserved byte counter by
> itself. It is impossible to revert a once-allocated region in sequential
> allocation,
> since it might race with other allocations and leave an allocation hole,
> which breaks the sequential write rule.
> 
> Furthermore, this commit introduces two new variables to struct
> btrfs_block_group. "wp_broken" indicates that the write pointer is broken
> (e.g. not synced on a RAID1 block group) and marks that block group read only.
> "zone_unusable" keeps track of the size of once-allocated then freed regions
> in a block group. Such a region is never usable until resetting the underlying
> zones.
> 
> This commit also introduces "bytes_zone_unusable" to track such unusable
> bytes in a space_info. Pinned bytes are always reclaimed to
> "bytes_zone_unusable". They are not usable until they are reset first.
> 

Please separate this out into its own patch; these things are a bear to review
as it is, and it doesn't help that I need to keep track of two different things
per patch.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-12-13  4:08 ` [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
@ 2019-12-17 19:25   ` Josef Bacik
  2019-12-18  7:35     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:25 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> If the btrfs volume has mirrored block groups, it unconditionally makes
> un-mirrored block groups read only. When we have mirrored block groups, but
> don't have writable block groups, this will drop all writable block groups.
> So, check if we have at least one writable mirrored block group before
> setting un-mirrored block groups read only.
> 
> This change is necessary to handle e.g. xfstests btrfs/124 case.
> 
> When we mount degraded RAID1 FS and write to it, and then re-mount with
> full device, the write pointers of corresponding zones of written block
> group differ. We mark such block group as "wp_broken" and make it read
> only. In this situation, we only have read only RAID1 block groups because
> of "wp_broken" and un-mirrored block groups are also marked read only,
> because we have RAID1 block groups. As a result, all the block groups are
> now read only, so that we cannot even start the rebalance to fix the
> situation.

I'm not sure I understand.  In degraded mode we're writing to just one mirror of 
a RAID1 block group, correct?  And this messes up the WP for the broken side, so 
it gets marked with wp_broken and thus RO.  How does this patch help?  The block 
groups are still marked RAID1 right?  Or are new block groups allocated with 
SINGLE or RAID0?  I'm confused.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED
  2019-12-13  4:08 ` [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
@ 2019-12-17 19:32   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:32 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:08 PM, Naohiro Aota wrote:
> On/After degraded mount, we might have no writable metadata block group due
> to broken write pointers. If you e.g. balance the FS before writing any
> data, alloc_tree_block_no_bg_flush() (called from insert_balance_item())
> fails to allocate a tree block for it, due to global reservation failure.
> We can reproduce this situation with xfstests btrfs/124.
> 
> While we can workaround the failure if we write some data and, as a result
> of writing, let a new metadata block group allocated, it's a bad practice
> to apply.
> 
> This commit avoids such failures by ensuring that read-write mounted volume
> has non-zero metadata space. If metadata space is empty, it forces new
> metadata block group allocation.
> 

Ick, I hate this, especially since it doesn't take into account if we're mounted 
read only.  No, instead add something to btrfs_start_transaction() or something 
similar that does this check to allocate a chunk.  And 
alloc_tree_block_no_bg_flush() only means we won't create the pending bg's in 
that path, we're still able to allocate chunks.  So I'm not super sure what you 
are actually hitting here, but this is the wrong way to go about fixing it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 13/28] btrfs: reset zones of unused block groups
  2019-12-13  4:09 ` [PATCH v6 13/28] btrfs: reset zones of unused block groups Naohiro Aota
@ 2019-12-17 19:33   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:33 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> For an HMZONED volume, a block group maps to a zone of the device. For
> deleted unused block groups, the zone of the block group can be reset to
> rewind the zone write pointer at the start of the zone.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

But Dennis's async discard stuff is going in, so you may need to rebase onto 
that and see how that affects this patch.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode Naohiro Aota
@ 2019-12-17 19:41   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:41 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> Tree manipulating operations like merging nodes often release
> once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
> node are not uselessly written out. On HMZONED drives, however, such
> optimization blocks the following IOs as the cancellation of the write out
> of the freed blocks breaks the sequential write sequence expected by the
> device.
> 
> This patch introduces a list of clean and unwritten extent buffers that
> have been released in a transaction. Btrfs redirties the buffers so that
> btree_write_cache_pages() can send proper bios to the devices.
> 
> Besides it clears the entire content of the extent buffer not to confuse
> raw block scanners e.g. btrfsck. By clearing the content,
> csum_dirty_buffer() complains about bytenr mismatch, so avoid the checking
> and checksum using newly introduced buffer flag EXTENT_BUFFER_NO_CHECK.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
  2019-12-13  4:09 ` [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs Naohiro Aota
@ 2019-12-17 19:49   ` Josef Bacik
  2019-12-19  6:54     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:49 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> To preserve sequential write pattern on the drives, we must serialize
> allocation and submit_bio. This commit adds a per-block group mutex
> "zone_io_lock", and find_free_extent_zoned() holds the lock. The lock is kept
> even after returning from find_free_extent(). It is released when submitting
> the IOs corresponding to the allocation is completed.
> 
> Implementing such behavior under __extent_writepage_io() is almost
> impossible because once pages are unlocked we are not sure whether submitting
> IOs for an allocated region has finished or not. Instead, this commit adds
> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
> extent_write_locked_range(). After the write, we can call
> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
> allocation.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Have you actually tested these patches with lock debugging on?  The 
submit_compressed_extents stuff is async, so the unlocker will not be the 
lock owner, and that'll make all sorts of things blow up.  This is just straight 
up broken.

I would really rather see an hmzoned block scheduler that just doesn't submit the 
bio's until they are aligned with the WP; that way this intelligence doesn't 
have to be dealt with at the file system layer.  I get allocating in line with 
the WP, but this whole forcing us to allocate and submit the bio in lock step is 
just nuts, and broken in your subsequent patches.  This whole approach needs to 
be reworked.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 19/28] btrfs: wait existing extents before truncating
  2019-12-13  4:09 ` [PATCH v6 19/28] btrfs: wait existing extents before truncating Naohiro Aota
@ 2019-12-17 19:53   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:53 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> When truncating a file, file buffers which have already been allocated but
> not yet written may be truncated.  Truncating these buffers could cause
> breakage of a sequential write pattern in a block group if the truncated
> blocks are for example followed by blocks allocated to another file. To
> avoid this problem, always wait for write out of all unwritten buffers
> before proceeding with the truncate execution.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 21/28] btrfs: disallow mixed-bg in HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 21/28] btrfs: disallow mixed-bg in " Naohiro Aota
@ 2019-12-17 19:56   ` Josef Bacik
  2019-12-18  8:03     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:56 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> Placing both data and metadata in a block group is impossible in HMZONED
> mode. For data, we can allocate a space for it and write it immediately
> after the allocation. For metadata, however, we cannot do so, because the
> logical addresses are recorded in other metadata buffers to build up the
> trees. As a result, a data buffer can be placed after a metadata buffer,
> which is not written yet. Writing out the data buffer will break the
> sequential write rule.
> 
> This commit checks for and disallows MIXED_BG with HMZONED mode.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

I would prefer it if you did all of the weird disallows early on so it's clear 
as I go through that I don't have to think about certain cases.  I remembered 
from a previous look through that mixed_bg's were disallowed, but I had to go 
look for some other cases.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 22/28] btrfs: disallow inode_cache in HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 22/28] btrfs: disallow inode_cache " Naohiro Aota
@ 2019-12-17 19:56   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 19:56 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> inode_cache uses pre-allocation to write its cache data. However,
> pre-allocation is completely disabled in HMZONED mode.
> 
> We can technically enable inode_cache in the same way as relocation.
> However, inode_cache is rarely used and the man page discourages using it.
> So, let's just disable it for now.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Same comment as the mixed_bg's comment

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 23/28] btrfs: support dev-replace in HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 23/28] btrfs: support dev-replace " Naohiro Aota
@ 2019-12-17 21:05   ` Josef Bacik
  2019-12-18  6:00     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 21:05 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> We have two types of I/Os during the device-replace process. One is an I/O to
> "copy" (by the scrub functions) all the device extents on the source device
> to the destination device.  The other one is an I/O to "clone" (by
> handle_ops_on_dev_replace()) new incoming write I/Os from users to the
> source device into the target device.
> 
> Cloning incoming I/Os can break the sequential write rule in the target
> device. When a write is mapped to the middle of a block group, that I/O is
> directed at the middle of a zone of the target device, which breaks the
> sequential write rule.
> 
> However, the cloning function cannot be simply disabled since incoming I/Os
> targeting already copied device extents must be cloned so that the I/O is
> executed on the target device.
> 
> We cannot use dev_replace->cursor_{left,right} to determine whether bio is
> going to not yet copied region.  Since we have time gap between finishing
> btrfs_scrub_dev() and rewriting the mapping tree in
> btrfs_dev_replace_finishing(), we can have newly allocated device extent
> which is never cloned nor copied.
> 
> So the point is to copy only already existing device extents. This patch
> introduces mark_block_group_to_copy() to mark existing block group as a
> target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
> check the flag to do their job.
> 
> The device-replace process in HMZONED mode must copy or clone all the extents
> in the source device exactly once.  So, we need to ensure that allocations
> started just before the dev-replace process have their corresponding
> extent information in the B-trees. finish_extent_writes_for_hmzoned()
> implements that functionality, which basically is the removed code in the
> commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after
> error during device replace").
> 
> This patch also handles empty region between used extents. Since
> dev-replace is smart to copy only used extents on source device, we have to
> fill the gap to honor the sequential write rule in the target device.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Can you split up the copying part and the cloning part into different patches?
This is a bear to review.  Also I don't quite understand the zeroout behavior. 
It _looks_ like for cloning you are doing a zeroout for the gap between the last 
wp position and the current cloned bio, which makes sense, but doesn't this gap 
exist because copying is ongoing?  Can you copy into a zero'ed out position?  Or 
am I missing something here?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 24/28] btrfs: enable relocation in HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 24/28] btrfs: enable relocation " Naohiro Aota
@ 2019-12-17 21:32   ` Josef Bacik
  2019-12-18 10:49     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 21:32 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> To serialize allocation and submit_bio, we introduced a mutex around them. As
> a result, preallocation must be completely disabled to avoid a deadlock.
> 
> Since the current relocation process relies on preallocation to move file data
> extents, it must be handled in another way. In HMZONED mode, we just
> truncate the inode to the size that we wanted to pre-allocate. Then, we
> flush dirty pages on the file before finishing relocation process.
> run_delalloc_hmzoned() will handle all the allocation and submit IOs to
> the underlying layers.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>   1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..2d17b7566df4 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>   	if (ret)
>   		goto out;
>   
> +	/*
> +	 * In HMZONED, we cannot preallocate the file region. Instead,
> +	 * we dirty and fiemap_write the region.
> +	 */
> +
> +	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
> +		struct btrfs_root *root = BTRFS_I(inode)->root;
> +		struct btrfs_trans_handle *trans;
> +
> +		end = cluster->end - offset + 1;
> +		trans = btrfs_start_transaction(root, 1);
> +		if (IS_ERR(trans))
> +			return PTR_ERR(trans);
> +
> +		inode->i_ctime = current_time(inode);
> +		i_size_write(inode, end);
> +		btrfs_ordered_update_i_size(inode, end, NULL);
> +		ret = btrfs_update_inode(trans, root, inode);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, ret);
> +			btrfs_end_transaction(trans);
> +			return ret;
> +		}
> +		ret = btrfs_end_transaction(trans);
> +
> +		goto out;
> +	}
> +

Why are we arbitrarily extending the i_size here?  If we don't need prealloc we 
don't need to jack up the i_size either.

>   	cur_offset = prealloc_start;
>   	while (nr < cluster->nr) {
>   		start = cluster->boundary[nr] - offset;
> @@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
>   		btrfs_throttle(fs_info);
>   	}
>   	WARN_ON(nr != cluster->nr);
> +	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
> +		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> +		WARN_ON(ret);

Do not WARN_ON() when this could happen due to IO errors.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED
  2019-12-13  4:09 ` [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
@ 2019-12-17 22:04   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 22:04 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> When btrfs finds a checksum error and the file system has a mirror of the
> damaged data, btrfs reads the correct data from the mirror and writes the
> data to the damaged blocks. This repairing, however, is against the
> sequential write rule.
> 
> We can consider three methods to repair an IO failure in HMZONED mode:
> (1) Reset and rewrite the damaged zone
> (2) Allocate new device extent and replace the damaged device extent to the
>      new extent
> (3) Relocate the corresponding block group
> 
> Method (1) is most similar to a behavior done with regular devices.
> However, it also wipes non-damaged data in the same device extent, and so
> it unnecessarily degrades non-damaged data.
> 
> Method (2) is much like device replacing but done in the same device. It is
> safe because it keeps the device extent until the replacing finishes.
> However, extending device replacing is non-trivial. It assumes
> "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing
> function should be extended to support replacing device extent position in
> one device.
> 
> Method (3) invokes relocation of the damaged block group, so it is
> straightforward to implement. It relocates all the mirrored device extents,
> so it is, potentially, a more costly operation than method (1) or (2). But
> it relocates only used extents, which reduces the total IO size.
> 
> Let's apply method (3) for now. In the future, we can extend device-replace
> and apply method (2).
> 
> To prevent a block group from being relocated multiple times due to multiple
> IO errors, this commit introduces a "relocating_repair" bit to show it's now
> relocating to repair IO failures. Also it uses a new kthread
> "btrfs-relocating-repair", not to block IO path with relocating process.
> 
> This commit also supports repairing in the scrub process.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode
  2019-12-13  4:09 ` [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode Naohiro Aota
@ 2019-12-17 22:08   ` Josef Bacik
  2019-12-18  9:35     ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 22:08 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> The tree-log feature does not work on HMZONED mode as is. Blocks for a
> tree-log tree are allocated mixed with other metadata blocks, and btrfs
> writes and syncs the tree-log blocks to devices at the time of fsync(),
> which is different timing than a global transaction commit. As a result,
> both writing tree-log blocks and writing other metadata blocks become
> non-sequential writes which HMZONED mode must avoid.
> 
> Also, since more than one log transaction can be started per subvolume at
> the same time, nodes from multiple transactions can be allocated
> interleaved. Such mixed allocation results in non-sequential writes at the
> time of log transaction commit. The nodes of the global log root tree
> (fs_info->log_root_tree), also have the same mixed allocation problem.
> 
> This patch assigns a dedicated block group for tree-log blocks to separate
> two metadata writing streams (for tree-log blocks and other metadata
> blocks). As a result, each write stream can now be written to devices
> separately. "fs_info->treelog_bg" tracks the dedicated block group and
> btrfs assign "treelog_bg" on-demand on tree-log block allocation time.
> 
> Then, this patch serializes log transactions by waiting for a committing
> transaction when someone tries to start a new transaction, to avoid the
> mixed allocation problem. We must also wait for running log transactions
> from another subvolume, but there is no easy way to detect which subvolume
> root is running a log transaction. So, this patch forbids starting a new
> log transaction when the global log root tree is already allocated by other
> subvolumes.
> 
> Furthermore, this patch aligns the allocation order of nodes of
> "fs_info->log_root_tree" and nodes of "root->log_root" with the writing
> order of the nodes, by delaying allocation of the root node of
> "fs_info->log_root_tree," so that, the node buffers can go out sequentially
> to devices.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/block-group.c |  7 +++++
>   fs/btrfs/ctree.h       |  2 ++
>   fs/btrfs/disk-io.c     |  8 ++---
>   fs/btrfs/extent-tree.c | 71 +++++++++++++++++++++++++++++++++++++-----
>   fs/btrfs/tree-log.c    | 49 ++++++++++++++++++++++++-----
>   5 files changed, 116 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6f7d29171adf..93e6c617d68e 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -910,6 +910,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>   	btrfs_return_cluster_to_free_space(block_group, cluster);
>   	spin_unlock(&cluster->refill_lock);
>   
> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +		spin_lock(&fs_info->treelog_bg_lock);
> +		if (fs_info->treelog_bg == block_group->start)
> +			fs_info->treelog_bg = 0;
> +		spin_unlock(&fs_info->treelog_bg_lock);
> +	}
> +
>   	path = btrfs_alloc_path();
>   	if (!path) {
>   		ret = -ENOMEM;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 18d2d0581e68..cba8a169002c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -907,6 +907,8 @@ struct btrfs_fs_info {
>   #endif
>   
>   	struct mutex hmzoned_meta_io_lock;
> +	spinlock_t treelog_bg_lock;
> +	u64 treelog_bg;
>   };
>   
>   static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 914c517d26b0..9c2b2fbf0cdb 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1360,16 +1360,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
>   			     struct btrfs_fs_info *fs_info)
>   {
>   	struct btrfs_root *log_root;
> -	int ret;
>   
>   	log_root = alloc_log_tree(trans, fs_info);
>   	if (IS_ERR(log_root))
>   		return PTR_ERR(log_root);
> -	ret = btrfs_alloc_log_tree_node(trans, log_root);
> -	if (ret) {
> -		kfree(log_root);
> -		return ret;
> -	}
>   	WARN_ON(fs_info->log_root_tree);
>   	fs_info->log_root_tree = log_root;
>   	return 0;
> @@ -2841,6 +2835,8 @@ int __cold open_ctree(struct super_block *sb,
>   
>   	fs_info->send_in_progress = 0;
>   
> +	spin_lock_init(&fs_info->treelog_bg_lock);
> +
>   	ret = btrfs_alloc_stripe_hash_table(fs_info);
>   	if (ret) {
>   		err = ret;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 69c4ce8ec83e..9b9608097f7f 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3704,8 +3704,10 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
>    */
>   
>   static int find_free_extent_zoned(struct btrfs_block_group *cache,
> -				  struct find_free_extent_ctl *ffe_ctl)
> +				  struct find_free_extent_ctl *ffe_ctl,
> +				  bool for_treelog)
>   {
> +	struct btrfs_fs_info *fs_info = cache->fs_info;
>   	struct btrfs_space_info *space_info = cache->space_info;
>   	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
>   	u64 start = cache->start;
> @@ -3718,12 +3720,26 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>   	btrfs_hmzoned_data_io_lock(cache);
>   	spin_lock(&space_info->lock);
>   	spin_lock(&cache->lock);
> +	spin_lock(&fs_info->treelog_bg_lock);
> +
> +	ASSERT(!for_treelog || cache->start == fs_info->treelog_bg ||
> +	       fs_info->treelog_bg == 0);
>   
>   	if (cache->ro) {
>   		ret = -EAGAIN;
>   		goto out;
>   	}
>   
> +	/*
> +	 * Do not allow currently using block group to be tree-log
> +	 * dedicated block group.
> +	 */
> +	if (for_treelog && !fs_info->treelog_bg &&
> +	    (cache->used || cache->reserved)) {
> +		ret = 1;
> +		goto out;
> +	}
> +
>   	avail = cache->length - cache->alloc_offset;
>   	if (avail < num_bytes) {
>   		ffe_ctl->max_extent_size = avail;
> @@ -3731,6 +3747,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>   		goto out;
>   	}
>   
> +	if (for_treelog && !fs_info->treelog_bg)
> +		fs_info->treelog_bg = cache->start;
> +
>   	ffe_ctl->found_offset = start + cache->alloc_offset;
>   	cache->alloc_offset += num_bytes;
>   	spin_lock(&ctl->tree_lock);
> @@ -3738,12 +3757,15 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>   	spin_unlock(&ctl->tree_lock);
>   
>   	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
> -			  cache->fs_info->stripesize));
> +			  fs_info->stripesize));
>   	ffe_ctl->search_start = ffe_ctl->found_offset;
>   	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
>   				   ffe_ctl->delalloc);
>   
>   out:
> +	if (ret && for_treelog)
> +		fs_info->treelog_bg = 0;
> +	spin_unlock(&fs_info->treelog_bg_lock);
>   	spin_unlock(&cache->lock);
>   	spin_unlock(&space_info->lock);
>   	/* if succeeds, unlock after submit_bio */
> @@ -3891,7 +3913,7 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>   static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>   				u64 ram_bytes, u64 num_bytes, u64 empty_size,
>   				u64 hint_byte, struct btrfs_key *ins,
> -				u64 flags, int delalloc)
> +				u64 flags, int delalloc, bool for_treelog)
>   {
>   	int ret = 0;
>   	struct btrfs_free_cluster *last_ptr = NULL;
> @@ -3970,6 +3992,13 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>   		spin_unlock(&last_ptr->lock);
>   	}
>   
> +	if (hmzoned && for_treelog) {
> +		spin_lock(&fs_info->treelog_bg_lock);
> +		if (fs_info->treelog_bg)
> +			hint_byte = fs_info->treelog_bg;
> +		spin_unlock(&fs_info->treelog_bg_lock);
> +	}
> +
>   	ffe_ctl.search_start = max(ffe_ctl.search_start,
>   				   first_logical_byte(fs_info, 0));
>   	ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte);
> @@ -4015,8 +4044,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>   	list_for_each_entry(block_group,
>   			    &space_info->block_groups[ffe_ctl.index], list) {
>   		/* If the block group is read-only, we can skip it entirely. */
> -		if (unlikely(block_group->ro))
> +		if (unlikely(block_group->ro)) {
> +			if (hmzoned && for_treelog) {
> +				spin_lock(&fs_info->treelog_bg_lock);
> +				if (block_group->start == fs_info->treelog_bg)
> +					fs_info->treelog_bg = 0;
> +				spin_unlock(&fs_info->treelog_bg_lock);
> +			}
>   			continue;
> +		}
>   
>   		btrfs_grab_block_group(block_group, delalloc);
>   		ffe_ctl.search_start = block_group->start;
> @@ -4062,7 +4098,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>   			goto loop;
>   
>   		if (hmzoned) {
> -			ret = find_free_extent_zoned(block_group, &ffe_ctl);
> +			u64 bytenr = block_group->start;
> +			u64 log_bytenr;
> +			bool skip;
> +
> +			/*
> +			 * Do not allow non-tree-log blocks in the
> +			 * dedicated tree-log block group, and vice versa.
> +			 */
> +			spin_lock(&fs_info->treelog_bg_lock);
> +			log_bytenr = fs_info->treelog_bg;
> +			skip = log_bytenr &&
> +				((for_treelog && bytenr != log_bytenr) ||
> +				 (!for_treelog && bytenr == log_bytenr));
> +			spin_unlock(&fs_info->treelog_bg_lock);
> +			if (skip)
> +				goto loop;
> +
> +			ret = find_free_extent_zoned(block_group, &ffe_ctl,
> +						     for_treelog);
>   			if (ret)
>   				goto loop;
>   			/*
> @@ -4222,12 +4276,13 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>   	bool final_tried = num_bytes == min_alloc_size;
>   	u64 flags;
>   	int ret;
> +	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
>   
>   	flags = get_alloc_profile_by_root(root, is_data);
>   again:
>   	WARN_ON(num_bytes < fs_info->sectorsize);
>   	ret = find_free_extent(fs_info, ram_bytes, num_bytes, empty_size,
> -			       hint_byte, ins, flags, delalloc);
> +			       hint_byte, ins, flags, delalloc, for_treelog);
>   	if (!ret && !is_data) {
>   		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
>   	} else if (ret == -ENOSPC) {
> @@ -4245,8 +4300,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>   
>   			sinfo = btrfs_find_space_info(fs_info, flags);
>   			btrfs_err(fs_info,
> -				  "allocation failed flags %llu, wanted %llu",
> -				  flags, num_bytes);
> +			"allocation failed flags %llu, wanted %llu treelog %d",
> +				  flags, num_bytes, for_treelog);
>   			if (sinfo)
>   				btrfs_dump_space_info(fs_info, sinfo,
>   						      num_bytes, 1);
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 6f757361db53..e155418f24ba 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -18,6 +18,7 @@
>   #include "compression.h"
>   #include "qgroup.h"
>   #include "inode-map.h"
> +#include "hmzoned.h"
>   
>   /* magic values for the inode_only field in btrfs_log_inode:
>    *
> @@ -105,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>   				       struct btrfs_root *log,
>   				       struct btrfs_path *path,
>   				       u64 dirid, int del_all);
> +static void wait_log_commit(struct btrfs_root *root, int transid);
>   
>   /*
>    * tree logging is a special write ahead log used to make sure that
> @@ -139,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>   			   struct btrfs_log_ctx *ctx)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
> +	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
>   	int ret = 0;
>   
>   	mutex_lock(&root->log_mutex);
>   
> +again:
>   	if (root->log_root) {
> +		int index = (root->log_transid + 1) % 2;
> +
>   		if (btrfs_need_log_full_commit(trans)) {
>   			ret = -EAGAIN;
>   			goto out;
>   		}
>   
> +		if (hmzoned && atomic_read(&root->log_commit[index])) {
> +			wait_log_commit(root, root->log_transid - 1);
> +			goto again;
> +		}
> +
>   		if (!root->log_start_pid) {
>   			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>   			root->log_start_pid = current->pid;
> @@ -157,8 +168,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>   		}
>   	} else {
>   		mutex_lock(&fs_info->tree_log_mutex);
> -		if (!fs_info->log_root_tree)
> +		if (hmzoned && fs_info->log_root_tree) {
> +			ret = -EAGAIN;
> +			mutex_unlock(&fs_info->tree_log_mutex);
> +			goto out;
> +		} else if (!fs_info->log_root_tree) {
>   			ret = btrfs_init_log_root_tree(trans, fs_info);
> +		}
>   		mutex_unlock(&fs_info->tree_log_mutex);
>   		if (ret)
>   			goto out;
> @@ -191,11 +207,19 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>    */
>   static int join_running_log_trans(struct btrfs_root *root)
>   {
> +	bool hmzoned = btrfs_fs_incompat(root->fs_info, HMZONED);
>   	int ret = -ENOENT;
>   
>   	mutex_lock(&root->log_mutex);
> +again:
>   	if (root->log_root) {
> +		int index = (root->log_transid + 1) % 2;
> +
>   		ret = 0;
> +		if (hmzoned && atomic_read(&root->log_commit[index])) {
> +			wait_log_commit(root, root->log_transid - 1);
> +			goto again;
> +		}
>   		atomic_inc(&root->log_writers);
>   	}
>   	mutex_unlock(&root->log_mutex);
> @@ -2724,6 +2748,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
>   					btrfs_clean_tree_block(next);
>   					btrfs_wait_tree_block_writeback(next);
>   					btrfs_tree_unlock(next);
> +					btrfs_redirty_list_add(
> +						trans->transaction, next);

This is separate from the rest of the work here and needs to be in a separate 
patch.  In fact I'd like to see separate patches for the allocation part, the 
waiting part, and whatever this is.  As it stands it's all kind of wonky and 
really ends up deeply in the generic stuff which will make it all harder to read 
and maintain.  If I'm going to review this it needs to be in smaller chunks. 
Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag
  2019-12-13  4:09 ` [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
@ 2019-12-17 22:09   ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-17 22:09 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On 12/12/19 11:09 PM, Naohiro Aota wrote:
> This final patch adds the HMZONED incompat flag to
> BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount HMZONED flagged file
> system.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 02/28] btrfs: Get zone information of zoned block devices
  2019-12-13 16:18   ` Josef Bacik
@ 2019-12-18  2:29     ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  2:29 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Fri, Dec 13, 2019 at 11:18:53AM -0500, Josef Bacik wrote:
>On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>If a zoned block device is found, get its zone information (number of zones
>>and zone size) using the new helper function btrfs_get_dev_zone_info().  To
>>avoid costly run-time zone report commands to test the device zones type
>>during block allocation, attach the seq_zones bitmap to the device
>>structure to indicate if a zone is sequential or accept random writes. Also
>>it attaches the empty_zones bitmap to indicate if a zone is empty or not.
>>
>>This patch also introduces the helper function btrfs_dev_is_sequential() to
>>test if the zone storing a block is a sequential write required zone and
>>btrfs_dev_is_empty_zone() to test if the zone is an empty zone.
>>
>>Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>---
>>  fs/btrfs/Makefile  |   1 +
>>  fs/btrfs/hmzoned.c | 168 +++++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/hmzoned.h |  92 +++++++++++++++++++++++++
>>  fs/btrfs/volumes.c |  18 ++++-
>>  fs/btrfs/volumes.h |   4 ++
>>  5 files changed, 281 insertions(+), 2 deletions(-)
>>  create mode 100644 fs/btrfs/hmzoned.c
>>  create mode 100644 fs/btrfs/hmzoned.h
>>
>>diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>>index 82200dbca5ac..64aaeed397a4 100644
>>--- a/fs/btrfs/Makefile
>>+++ b/fs/btrfs/Makefile
>>@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
>>+btrfs-$(CONFIG_BLK_DEV_ZONED) += hmzoned.o
>>  btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
>>  	tests/extent-buffer-tests.o tests/btrfs-tests.o \
>>diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>>new file mode 100644
>>index 000000000000..6a13763d2916
>>--- /dev/null
>>+++ b/fs/btrfs/hmzoned.c
>>@@ -0,0 +1,168 @@
>>+// SPDX-License-Identifier: GPL-2.0
>>+/*
>>+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
>>+ * Authors:
>>+ *	Naohiro Aota	<naohiro.aota@wdc.com>
>>+ *	Damien Le Moal	<damien.lemoal@wdc.com>
>>+ */
>>+
>>+#include <linux/slab.h>
>>+#include <linux/blkdev.h>
>>+#include "ctree.h"
>>+#include "volumes.h"
>>+#include "hmzoned.h"
>>+#include "rcu-string.h"
>>+
>>+/* Maximum number of zones to report per blkdev_report_zones() call */
>>+#define BTRFS_REPORT_NR_ZONES   4096
>>+
>>+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
>>+			       struct blk_zone *zones, unsigned int *nr_zones)
>>+{
>>+	int ret;
>>+
>>+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, zones,
>>+				  nr_zones);
>>+	if (ret != 0) {
>>+		btrfs_err_in_rcu(device->fs_info,
>>+				 "get zone at %llu on %s failed %d", pos,
>>+				 rcu_str_deref(device->name), ret);
>>+		return ret;
>>+	}
>>+	if (!*nr_zones)
>>+		return -EIO;
>>+
>>+	return 0;
>>+}
>>+
>>+int btrfs_get_dev_zone_info(struct btrfs_device *device)
>>+{
>>+	struct btrfs_zoned_device_info *zone_info = NULL;
>>+	struct block_device *bdev = device->bdev;
>>+	sector_t nr_sectors = bdev->bd_part->nr_sects;
>>+	sector_t sector = 0;
>>+	struct blk_zone *zones = NULL;
>>+	unsigned int i, nreported = 0, nr_zones;
>>+	unsigned int zone_sectors;
>>+	int ret;
>>+	char devstr[sizeof(device->fs_info->sb->s_id) +
>>+		    sizeof(" (device )") - 1];
>>+
>>+	if (!bdev_is_zoned(bdev))
>>+		return 0;
>>+
>>+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
>>+	if (!zone_info)
>>+		return -ENOMEM;
>>+
>>+	zone_sectors = bdev_zone_sectors(bdev);
>>+	ASSERT(is_power_of_2(zone_sectors));
>>+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
>>+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
>>+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
>>+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
>>+		zone_info->nr_zones++;
>>+
>>+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
>>+	if (!zone_info->seq_zones) {
>>+		ret = -ENOMEM;
>>+		goto free_zone_info;
>>+	}
>>+
>>+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
>>+	if (!zone_info->empty_zones) {
>>+		ret = -ENOMEM;
>>+		goto free_seq_zones;
>>+	}
>>+
>>+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
>>+			sizeof(struct blk_zone), GFP_KERNEL);
>>+	if (!zones) {
>>+		ret = -ENOMEM;
>>+		goto free_empty_zones;
>>+	}
>>+
>>+	/* Get zones type */
>>+	while (sector < nr_sectors) {
>>+		nr_zones = BTRFS_REPORT_NR_ZONES;
>>+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
>>+					  &nr_zones);
>>+		if (ret)
>>+			goto free_zones;
>>+
>>+		for (i = 0; i < nr_zones; i++) {
>>+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
>>+				set_bit(nreported, zone_info->seq_zones);
>>+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
>>+				set_bit(nreported, zone_info->empty_zones);
>>+			nreported++;
>>+		}
>>+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
>>+	}
>>+
>>+	if (nreported != zone_info->nr_zones) {
>>+		btrfs_err_in_rcu(device->fs_info,
>>+				 "inconsistent number of zones on %s (%u / %u)",
>>+				 rcu_str_deref(device->name), nreported,
>>+				 zone_info->nr_zones);
>>+		ret = -EIO;
>>+		goto free_zones;
>>+	}
>>+
>>+	kfree(zones);
>>+
>>+	device->zone_info = zone_info;
>>+
>>+	devstr[0] = 0;
>>+	if (device->fs_info)
>>+		snprintf(devstr, sizeof(devstr), " (device %s)",
>>+			 device->fs_info->sb->s_id);
>>+
>>+	rcu_read_lock();
>>+	pr_info(
>>+"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
>>+		devstr,
>>+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
>>+		rcu_str_deref(device->name), zone_info->nr_zones,
>>+		zone_info->zone_size >> SECTOR_SHIFT);
>>+	rcu_read_unlock();
>>+
>>+	return 0;
>>+
>>+free_zones:
>>+	kfree(zones);
>>+free_empty_zones:
>>+	bitmap_free(zone_info->empty_zones);
>>+free_seq_zones:
>>+	bitmap_free(zone_info->seq_zones);
>>+free_zone_info:
>
>bitmap_free is just a kfree which handles NULL pointers properly, so 
>you only need one goto here for cleaning up the zone_info.  Once 
>that's fixed you can add
>
>Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
>Josef

Ah, then, I think I can simplify the code to use one "out" label and
kfree/bitmap_free both zones and zone_info.
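
Just to sketch the idea (illustrative only, not the actual next version of
the patch), the cleanup could collapse to a single label, since kfree() and
bitmap_free() both accept NULL:

static int btrfs_get_dev_zone_info(struct btrfs_device *device)
{
        struct btrfs_zoned_device_info *zone_info;
        struct blk_zone *zones = NULL;
        int ret = 0;

        zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
        if (!zone_info)
                return -ENOMEM;

        /* ... fill zone_info->zone_size and ->nr_zones from the bdev ... */

        zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
        zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
        zones = kcalloc(BTRFS_REPORT_NR_ZONES, sizeof(struct blk_zone),
                        GFP_KERNEL);
        if (!zone_info->seq_zones || !zone_info->empty_zones || !zones) {
                ret = -ENOMEM;
                goto out;
        }

        /* ... report the zones, fill the bitmaps, set ret and goto out on error ... */

        kfree(zones);
        device->zone_info = zone_info;
        return 0;

out:
        /* every helper here handles NULL, so one label is enough */
        kfree(zones);
        bitmap_free(zone_info->empty_zones);
        bitmap_free(zone_info->seq_zones);
        kfree(zone_info);
        return ret;
}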

Thanks,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 03/28] btrfs: Check and enable HMZONED mode
  2019-12-13 16:21   ` Josef Bacik
@ 2019-12-18  4:17     ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  4:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Fri, Dec 13, 2019 at 11:21:07AM -0500, Josef Bacik wrote:
>On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>HMZONED mode cannot be used together with the RAID5/6 profile for now.
>>Introduce the function btrfs_check_hmzoned_mode() to check this. This
>>function will also check if HMZONED flag is enabled on the file system and
>>if the file system consists of zoned devices with equal zone size.
>>
>>Additionally, as updates to the space cache are in-place, the space cache
>>cannot be located over sequential zones and there are no guarantees that the
>>device will have enough conventional zones to store this cache. Resolve
>>this problem by completely disabling the space cache.  This does not
>>introduce any problems in HMZONED mode: all the free space is located after
>>the allocation pointer and no free space is located before the pointer.
>>There is no need to have such cache.
>>
>>For the same reason, NODATACOW is also disabled.
>>
>>INODE_MAP_CACHE is also disabled to avoid preallocation in the
>>INODE_MAP_CACHE inode.
>>
>>In summary, HMZONED will disable:
>>
>>| Disabled features | Reason                                              |
>>|-------------------+-----------------------------------------------------|
>>| RAID5/6           | 1) Non-full stripe write cause overwriting of       |
>>|                   | parity block                                        |
>>|                   | 2) Rebuilding on high capacity volume (usually with |
>>|                   | SMR) can lead to higher failure rate                |
>>|-------------------+-----------------------------------------------------|
>>| space_cache (v1)  | In-place updating                                   |
>>| NODATACOW         | In-place updating                                   |
>>|-------------------+-----------------------------------------------------|
>>| fallocate         | Reserved extent will be a write hole                |
>>| INODE_MAP_CACHE   | Need pre-allocation. (and will be deprecated?)      |
>>|-------------------+-----------------------------------------------------|
>>| MIXED_BG          | Allocated metadata region will be write holes for   |
>>|                   | data writes                                         |
>>| async checksum    | Not to mix up bios by multiple workers              |
>>
>>Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>I assume the progs will be updated to account for these limitations as well?
>
>Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
>Thanks,
>
>Josef

Oops, while mkfs.btrfs already errors out in these cases, I forgot to add
the early checks for RAID56 and MIXED_BG here. I'll add the checks in the
next series.
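
For the record, the early checks would look roughly like the sketch below
(illustrative only; the exact messages and their placement inside
btrfs_check_hmzoned_mode() may differ in the next series):

        if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
                btrfs_err(fs_info,
                          "HMZONED mode is not allowed with mixed block groups");
                return -EOPNOTSUPP;
        }

        if (btrfs_fs_incompat(fs_info, RAID56)) {
                btrfs_err(fs_info,
                          "HMZONED mode does not support RAID5/6 profiles");
                return -EOPNOTSUPP;
        }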

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 05/28] btrfs: disallow space_cache in HMZONED mode
  2019-12-13 16:24   ` Josef Bacik
@ 2019-12-18  4:28     ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  4:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Fri, Dec 13, 2019 at 11:24:10AM -0500, Josef Bacik wrote:
>On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>As updates to the space cache v1 are in-place, the space cache cannot be
>>located over sequential zones and there are no guarantees that the device
>>will have enough conventional zones to store this cache. Resolve this
>>problem by completely disabling the space cache v1.  This does not
>>introduce any problems with sequential block groups: all the free space is
>>located after the allocation pointer and no free space before the pointer.
>>There is no need to have such cache.
>>
>>Note: we can technically use the free-space-tree (space cache v2) in HMZONED
>>mode. But, since HMZONED mode now always allocates extents in a block group
>>sequentially regardless of the underlying device zone type, there is no use
>>in enabling and maintaining the tree.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>---
>>  fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
>>  fs/btrfs/hmzoned.h |  5 +++++
>>  fs/btrfs/super.c   | 11 +++++++++--
>>  3 files changed, 32 insertions(+), 2 deletions(-)
>>
>>diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
>>index 1b24facd46b8..d62f11652973 100644
>>--- a/fs/btrfs/hmzoned.c
>>+++ b/fs/btrfs/hmzoned.c
>>@@ -250,3 +250,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>>  out:
>>  	return ret;
>>  }
>>+
>>+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
>>+{
>>+	if (!btrfs_fs_incompat(info, HMZONED))
>>+		return 0;
>>+
>>+	/*
>>+	 * SPACE CACHE writing is not CoWed. Disable that to avoid write
>>+	 * errors in sequential zones.
>>+	 */
>>+	if (btrfs_test_opt(info, SPACE_CACHE)) {
>>+		btrfs_err(info,
>>+			  "space cache v1 not supportted in HMZONED mode");
>>+		return -EOPNOTSUPP;
>>+	}
>>+
>>+	return 0;
>>+}
>>diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
>>index 8e17f64ff986..d9ebe11afdf5 100644
>>--- a/fs/btrfs/hmzoned.h
>>+++ b/fs/btrfs/hmzoned.h
>>@@ -29,6 +29,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>  int btrfs_get_dev_zone_info(struct btrfs_device *device);
>>  void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
>>  int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
>>+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
>>  #else /* CONFIG_BLK_DEV_ZONED */
>>  static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>>  				     struct blk_zone *zone)
>>@@ -48,6 +49,10 @@ static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>>  	btrfs_err(fs_info, "Zoned block devices support is not enabled");
>>  	return -EOPNOTSUPP;
>>  }
>>+static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
>>+{
>>+	return 0;
>>+}
>>  #endif
>>  static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>>diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>index 616f5abec267..1424c3c6e3cf 100644
>>--- a/fs/btrfs/super.c
>>+++ b/fs/btrfs/super.c
>>@@ -442,8 +442,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>  	cache_gen = btrfs_super_cache_generation(info->super_copy);
>>  	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>>  		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>>-	else if (cache_gen)
>>-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>>+	else if (cache_gen) {
>>+		if (btrfs_fs_incompat(info, HMZONED))
>>+			btrfs_info(info,
>>+			"ignoring existing space cache in HMZONED mode");
>
>It would be good to clear the cache gen in this case.  I assume this 
>can happen if we add a hmzoned device to an existing fs with space 
>cache already?  I'd hate for weird corner cases to pop up if we 
>removed it later and still had a valid cache gen in place.  Thanks,
>
>Josef

We at least currently prohibit adding a zoned device to a non-zoned
file system. So, the case of adding a zoned device to an existing fs
with a space cache usually won't happen. This condition deals with
device corruption or a bug in userland tools. Anyway, I will clear the
cache in this case.
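
One possible way to do that (a sketch of the idea only, not the final
patch) is to also drop the stale generation from the in-memory super copy
at the point where we decide to ignore the cache:

        else if (cache_gen) {
                if (btrfs_fs_incompat(info, HMZONED)) {
                        btrfs_info(info,
                        "ignoring existing space cache in HMZONED mode");
                        /* do not trust the stale generation later on */
                        btrfs_set_super_cache_generation(info->super_copy, 0);
                } else {
                        btrfs_set_opt(info->mount_opt, SPACE_CACHE);
                }
        }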

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 23/28] btrfs: support dev-replace in HMZONED mode
  2019-12-17 21:05   ` Josef Bacik
@ 2019-12-18  6:00     ` Naohiro Aota
  2019-12-18 14:58       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  6:00 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 04:05:25PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>We have two types of I/Os during the device-replace process. One is an I/O to
>>"copy" (by the scrub functions) all the device extents on the source device
>>to the destination device.  The other is an I/O to "clone" (by
>>handle_ops_on_dev_replace()) new incoming write I/Os from users to the
>>source device into the target device.
>>
>>Cloning incoming I/Os can break the sequential write rule in the target
>>device. When a write is mapped in the middle of a block group, that I/O is
>>directed to the middle of a zone on the target device, which breaks the
>>sequential write rule.
>>
>>However, the cloning function cannot be simply disabled since incoming I/Os
>>targeting already copied device extents must be cloned so that the I/O is
>>executed on the target device.
>>
>>We cannot use dev_replace->cursor_{left,right} to determine whether a bio is
>>going to a not-yet-copied region.  Since there is a time gap between finishing
>>btrfs_scrub_dev() and rewriting the mapping tree in
>>btrfs_dev_replace_finishing(), we can have a newly allocated device extent
>>which is never cloned nor copied.
>>
>>So the point is to copy only already existing device extents. This patch
>>introduces mark_block_group_to_copy() to mark existing block group as a
>>target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
>>check the flag to do their job.
>>
>>The device-replace process in HMZONED mode must copy or clone all the extents
>>in the source device exactly once.  So, we need to ensure that allocations
>>started just before the dev-replace process have their corresponding
>>extent information in the B-trees. finish_extent_writes_for_hmzoned()
>>implements that functionality, which basically is the removed code in the
>>commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after
>>error during device replace").
>>
>>This patch also handles the empty regions between used extents. Since
>>dev-replace is smart enough to copy only the used extents on the source
>>device, we have to fill the gaps to honor the sequential write rule on the
>>target device.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Can you split up the copying part and the cloning part into different 
>patches, this is a bear to review.  Also I don't quite understand the 
>zeroout behavior. It _looks_ like for cloning you are doing a zeroout 
>for the gap between the last wp position and the current cloned bio, 
>which makes sense, but doesn't this gap exist because copying is 
>ongoing?  Can you copy into a zero'ed out position?  Or am I missing 
>something here?  Thanks,
>
>Josef

OK, I will split this in the next version (though it's mostly the "copying"
part).

Let me clarify first that I am using "copying" for copying existing
extents to the new device and "cloning" for cloning a new incoming BIO
to the new device.

For zeroout, it is for "copying", which is done with the scrub code to
copy existing extents on the source device to the destination
device. Since copying (scrub) only scans for live extents, there
can be a gap between two live extents. So, we need to fill the gap
with zeroout to make the write stream sequential.

And "cloning" is only done for new block groups or already fully
copied block groups. So there is no gaps for them because the
allocator and the IO locks ensures the sequential allocation and
submit.
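
To illustrate the gap filling on the "copying" side (the variable names
here are only for the example, not the actual code): if the next extent to
copy lands ahead of the destination zone's write pointer, the gap is padded
with zeroes first so the write stream stays sequential:

        /*
         * dst_physical: where the next copied extent must land
         * wp:           current write pointer position in the destination zone
         */
        if (dst_physical > wp) {
                ret = blkdev_issue_zeroout(tgt_dev->bdev,
                                           wp >> SECTOR_SHIFT,
                                           (dst_physical - wp) >> SECTOR_SHIFT,
                                           GFP_NOFS, 0);
                if (ret)
                        return ret;
                wp = dst_physical;
        }
        /* the copy write now starts exactly at the write pointer */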

Thanks,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-12-17 19:25   ` Josef Bacik
@ 2019-12-18  7:35     ` Naohiro Aota
  2019-12-18 14:54       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  7:35 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 02:25:37PM -0500, Josef Bacik wrote:
>On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>If the btrfs volume has mirrored block groups, it unconditionally makes
>>un-mirrored block groups read only. When we have mirrored block groups, but
>>don't have writable block groups, this will drop all writable block groups.
>>So, check if we have at least one writable mirrored block group before
>>setting un-mirrored block groups read only.
>>
>>This change is necessary to handle e.g. xfstests btrfs/124 case.
>>
>>When we mount degraded RAID1 FS and write to it, and then re-mount with
>>full device, the write pointers of corresponding zones of written block
>>group differ. We mark such block group as "wp_broken" and make it read
>>only. In this situation, we only have read only RAID1 block groups because
>>of "wp_broken" and un-mirrored block groups are also marked read only,
>>because we have RAID1 block groups. As a result, all the block groups are
>>now read only, so that we cannot even start the rebalance to fix the
>>situation.
>
>I'm not sure I understand.  In degraded mode we're writing to just one 
>mirror of a RAID1 block group, correct?  And this messes up the WP for 
>the broken side, so it gets marked with wp_broken and thus RO.  How 
>does this patch help?  The block groups are still marked RAID1 right?  
>Or are new block groups allocated with SINGLE or RAID0?  I'm confused.  
>Thanks,
>
>Josef

First of all, I found that some recent change (maybe commit
112974d4067b ("btrfs: volumes: Remove ENOSPC-prone
btrfs_can_relocate()")?) solved the issue, so we no longer need patch
11 and 12. So, I will drop these two in the next version.

So, I think you may no longer be interested in the answer, but just
for the record... the situation was like this:

* before degrading
   - All block groups are RAID1, working fine.
  
* degraded mount
   - Block groups allocated before degrading are RAID1. Writes go
     into the RAID1 block groups and break the write pointers.
   - Newly allocated block groups are SINGLE, since we only have one
     available device.

* mount with the both drive again
   - RAID1 block groups are marked RO because of the broken write pointers
   - SINGLE block groups are also marked RO because we have RAID1 block
     groups

and at this point, btrfs was somehow unable to allocate a new block
group or to start balancing.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 21/28] btrfs: disallow mixed-bg in HMZONED mode
  2019-12-17 19:56   ` Josef Bacik
@ 2019-12-18  8:03     ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  8:03 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 02:56:20PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>Placing both data and metadata in a block group is impossible in HMZONED
>>mode. For data, we can allocate a space for it and write it immediately
>>after the allocation. For metadata, however, we cannot do so, because the
>>logical addresses are recorded in other metadata buffers to build up the
>>trees. As a result, a data buffer can be placed after a metadata buffer,
>>which is not written yet. Writing out the data buffer will break the
>>sequential write rule.
>>
>>This commit checks for and disallows MIXED_BG in HMZONED mode.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>I would prefer it if you did all of the weird disallows early on so 
>it's clear as I go through that I don't have to think about certain 
>cases.  I remembered from a previous look through that mixed_bg's were 
>disallowed, but I had to go look for some other cases.
>
>Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
>Thanks,
>
>Josef

Sure, I will reorder these patches to come after the other disallow patches.

Thanks,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode
  2019-12-17 22:08   ` Josef Bacik
@ 2019-12-18  9:35     ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18  9:35 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 05:08:44PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>The tree-log feature does not work on HMZONED mode as is. Blocks for a
>>tree-log tree are allocated mixed with other metadata blocks, and btrfs
>>writes and syncs the tree-log blocks to devices at the time of fsync(),
>>which is different timing than a global transaction commit. As a result,
>>both writing tree-log blocks and writing other metadata blocks become
>>non-sequential writes which HMZONED mode must avoid.
>>
>>Also, since more than one log transaction can be started per subvolume at
>>the same time, nodes from multiple transactions can be allocated
>>interleaved. Such mixed allocation results in non-sequential writes at the
>>time of log transaction commit. The nodes of the global log root tree
>>(fs_info->log_root_tree), also have the same mixed allocation problem.
>>
>>This patch assigns a dedicated block group for tree-log blocks to separate
>>two metadata writing streams (for tree-log blocks and other metadata
>>blocks). As a result, each write stream can now be written to devices
>>separately. "fs_info->treelog_bg" tracks the dedicated block group and
>>btrfs assign "treelog_bg" on-demand on tree-log block allocation time.
>>
>>Then, this patch serializes log transactions by waiting for a committing
>>transaction when someone tries to start a new transaction, to avoid the
>>mixed allocation problem. We must also wait for running log transactions
>>from another subvolume, but there is no easy way to detect which subvolume
>>root is running a log transaction. So, this patch forbids starting a new
>>log transaction when the global log root tree is already allocated by other
>>subvolumes.
>>
>>Furthermore, this patch aligns the allocation order of nodes of
>>"fs_info->log_root_tree" and nodes of "root->log_root" with the writing
>>order of the nodes, by delaying allocation of the root node of
>>"fs_info->log_root_tree," so that, the node buffers can go out sequentially
>>to devices.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>---
>>  fs/btrfs/block-group.c |  7 +++++
>>  fs/btrfs/ctree.h       |  2 ++
>>  fs/btrfs/disk-io.c     |  8 ++---
>>  fs/btrfs/extent-tree.c | 71 +++++++++++++++++++++++++++++++++++++-----
>>  fs/btrfs/tree-log.c    | 49 ++++++++++++++++++++++++-----
>>  5 files changed, 116 insertions(+), 21 deletions(-)
>>
>>diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>>index 6f7d29171adf..93e6c617d68e 100644
>>--- a/fs/btrfs/block-group.c
>>+++ b/fs/btrfs/block-group.c
>>@@ -910,6 +910,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>>  	btrfs_return_cluster_to_free_space(block_group, cluster);
>>  	spin_unlock(&cluster->refill_lock);
>>+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>>+		spin_lock(&fs_info->treelog_bg_lock);
>>+		if (fs_info->treelog_bg == block_group->start)
>>+			fs_info->treelog_bg = 0;
>>+		spin_unlock(&fs_info->treelog_bg_lock);
>>+	}
>>+
>>  	path = btrfs_alloc_path();
>>  	if (!path) {
>>  		ret = -ENOMEM;
>>diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>index 18d2d0581e68..cba8a169002c 100644
>>--- a/fs/btrfs/ctree.h
>>+++ b/fs/btrfs/ctree.h
>>@@ -907,6 +907,8 @@ struct btrfs_fs_info {
>>  #endif
>>  	struct mutex hmzoned_meta_io_lock;
>>+	spinlock_t treelog_bg_lock;
>>+	u64 treelog_bg;
>>  };
>>  static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
>>diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>index 914c517d26b0..9c2b2fbf0cdb 100644
>>--- a/fs/btrfs/disk-io.c
>>+++ b/fs/btrfs/disk-io.c
>>@@ -1360,16 +1360,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
>>  			     struct btrfs_fs_info *fs_info)
>>  {
>>  	struct btrfs_root *log_root;
>>-	int ret;
>>  	log_root = alloc_log_tree(trans, fs_info);
>>  	if (IS_ERR(log_root))
>>  		return PTR_ERR(log_root);
>>-	ret = btrfs_alloc_log_tree_node(trans, log_root);
>>-	if (ret) {
>>-		kfree(log_root);
>>-		return ret;
>>-	}
>>  	WARN_ON(fs_info->log_root_tree);
>>  	fs_info->log_root_tree = log_root;
>>  	return 0;
>>@@ -2841,6 +2835,8 @@ int __cold open_ctree(struct super_block *sb,
>>  	fs_info->send_in_progress = 0;
>>+	spin_lock_init(&fs_info->treelog_bg_lock);
>>+
>>  	ret = btrfs_alloc_stripe_hash_table(fs_info);
>>  	if (ret) {
>>  		err = ret;
>>diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>index 69c4ce8ec83e..9b9608097f7f 100644
>>--- a/fs/btrfs/extent-tree.c
>>+++ b/fs/btrfs/extent-tree.c
>>@@ -3704,8 +3704,10 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
>>   */
>>  static int find_free_extent_zoned(struct btrfs_block_group *cache,
>>-				  struct find_free_extent_ctl *ffe_ctl)
>>+				  struct find_free_extent_ctl *ffe_ctl,
>>+				  bool for_treelog)
>>  {
>>+	struct btrfs_fs_info *fs_info = cache->fs_info;
>>  	struct btrfs_space_info *space_info = cache->space_info;
>>  	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
>>  	u64 start = cache->start;
>>@@ -3718,12 +3720,26 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>>  	btrfs_hmzoned_data_io_lock(cache);
>>  	spin_lock(&space_info->lock);
>>  	spin_lock(&cache->lock);
>>+	spin_lock(&fs_info->treelog_bg_lock);
>>+
>>+	ASSERT(!for_treelog || cache->start == fs_info->treelog_bg ||
>>+	       fs_info->treelog_bg == 0);
>>  	if (cache->ro) {
>>  		ret = -EAGAIN;
>>  		goto out;
>>  	}
>>+	/*
>>+	 * Do not allow currently using block group to be tree-log
>>+	 * dedicated block group.
>>+	 */
>>+	if (for_treelog && !fs_info->treelog_bg &&
>>+	    (cache->used || cache->reserved)) {
>>+		ret = 1;
>>+		goto out;
>>+	}
>>+
>>  	avail = cache->length - cache->alloc_offset;
>>  	if (avail < num_bytes) {
>>  		ffe_ctl->max_extent_size = avail;
>>@@ -3731,6 +3747,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>>  		goto out;
>>  	}
>>+	if (for_treelog && !fs_info->treelog_bg)
>>+		fs_info->treelog_bg = cache->start;
>>+
>>  	ffe_ctl->found_offset = start + cache->alloc_offset;
>>  	cache->alloc_offset += num_bytes;
>>  	spin_lock(&ctl->tree_lock);
>>@@ -3738,12 +3757,15 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
>>  	spin_unlock(&ctl->tree_lock);
>>  	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
>>-			  cache->fs_info->stripesize));
>>+			  fs_info->stripesize));
>>  	ffe_ctl->search_start = ffe_ctl->found_offset;
>>  	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
>>  				   ffe_ctl->delalloc);
>>  out:
>>+	if (ret && for_treelog)
>>+		fs_info->treelog_bg = 0;
>>+	spin_unlock(&fs_info->treelog_bg_lock);
>>  	spin_unlock(&cache->lock);
>>  	spin_unlock(&space_info->lock);
>>  	/* if succeeds, unlock after submit_bio */
>>@@ -3891,7 +3913,7 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
>>  static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>  				u64 ram_bytes, u64 num_bytes, u64 empty_size,
>>  				u64 hint_byte, struct btrfs_key *ins,
>>-				u64 flags, int delalloc)
>>+				u64 flags, int delalloc, bool for_treelog)
>>  {
>>  	int ret = 0;
>>  	struct btrfs_free_cluster *last_ptr = NULL;
>>@@ -3970,6 +3992,13 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>  		spin_unlock(&last_ptr->lock);
>>  	}
>>+	if (hmzoned && for_treelog) {
>>+		spin_lock(&fs_info->treelog_bg_lock);
>>+		if (fs_info->treelog_bg)
>>+			hint_byte = fs_info->treelog_bg;
>>+		spin_unlock(&fs_info->treelog_bg_lock);
>>+	}
>>+
>>  	ffe_ctl.search_start = max(ffe_ctl.search_start,
>>  				   first_logical_byte(fs_info, 0));
>>  	ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte);
>>@@ -4015,8 +4044,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>  	list_for_each_entry(block_group,
>>  			    &space_info->block_groups[ffe_ctl.index], list) {
>>  		/* If the block group is read-only, we can skip it entirely. */
>>-		if (unlikely(block_group->ro))
>>+		if (unlikely(block_group->ro)) {
>>+			if (hmzoned && for_treelog) {
>>+				spin_lock(&fs_info->treelog_bg_lock);
>>+				if (block_group->start == fs_info->treelog_bg)
>>+					fs_info->treelog_bg = 0;
>>+				spin_unlock(&fs_info->treelog_bg_lock);
>>+			}
>>  			continue;
>>+		}
>>  		btrfs_grab_block_group(block_group, delalloc);
>>  		ffe_ctl.search_start = block_group->start;
>>@@ -4062,7 +4098,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>  			goto loop;
>>  		if (hmzoned) {
>>-			ret = find_free_extent_zoned(block_group, &ffe_ctl);
>>+			u64 bytenr = block_group->start;
>>+			u64 log_bytenr;
>>+			bool skip;
>>+
>>+			/*
>>+			 * Do not allow non-tree-log blocks in the
>>+			 * dedicated tree-log block group, and vice versa.
>>+			 */
>>+			spin_lock(&fs_info->treelog_bg_lock);
>>+			log_bytenr = fs_info->treelog_bg;
>>+			skip = log_bytenr &&
>>+				((for_treelog && bytenr != log_bytenr) ||
>>+				 (!for_treelog && bytenr == log_bytenr));
>>+			spin_unlock(&fs_info->treelog_bg_lock);
>>+			if (skip)
>>+				goto loop;
>>+
>>+			ret = find_free_extent_zoned(block_group, &ffe_ctl,
>>+						     for_treelog);
>>  			if (ret)
>>  				goto loop;
>>  			/*
>>@@ -4222,12 +4276,13 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>>  	bool final_tried = num_bytes == min_alloc_size;
>>  	u64 flags;
>>  	int ret;
>>+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
>>  	flags = get_alloc_profile_by_root(root, is_data);
>>  again:
>>  	WARN_ON(num_bytes < fs_info->sectorsize);
>>  	ret = find_free_extent(fs_info, ram_bytes, num_bytes, empty_size,
>>-			       hint_byte, ins, flags, delalloc);
>>+			       hint_byte, ins, flags, delalloc, for_treelog);
>>  	if (!ret && !is_data) {
>>  		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
>>  	} else if (ret == -ENOSPC) {
>>@@ -4245,8 +4300,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
>>  			sinfo = btrfs_find_space_info(fs_info, flags);
>>  			btrfs_err(fs_info,
>>-				  "allocation failed flags %llu, wanted %llu",
>>-				  flags, num_bytes);
>>+			"allocation failed flags %llu, wanted %llu treelog %d",
>>+				  flags, num_bytes, for_treelog);
>>  			if (sinfo)
>>  				btrfs_dump_space_info(fs_info, sinfo,
>>  						      num_bytes, 1);
>>diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
>>index 6f757361db53..e155418f24ba 100644
>>--- a/fs/btrfs/tree-log.c
>>+++ b/fs/btrfs/tree-log.c
>>@@ -18,6 +18,7 @@
>>  #include "compression.h"
>>  #include "qgroup.h"
>>  #include "inode-map.h"
>>+#include "hmzoned.h"
>>  /* magic values for the inode_only field in btrfs_log_inode:
>>   *
>>@@ -105,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>>  				       struct btrfs_root *log,
>>  				       struct btrfs_path *path,
>>  				       u64 dirid, int del_all);
>>+static void wait_log_commit(struct btrfs_root *root, int transid);
>>  /*
>>   * tree logging is a special write ahead log used to make sure that
>>@@ -139,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>>  			   struct btrfs_log_ctx *ctx)
>>  {
>>  	struct btrfs_fs_info *fs_info = root->fs_info;
>>+	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
>>  	int ret = 0;
>>  	mutex_lock(&root->log_mutex);
>>+again:
>>  	if (root->log_root) {
>>+		int index = (root->log_transid + 1) % 2;
>>+
>>  		if (btrfs_need_log_full_commit(trans)) {
>>  			ret = -EAGAIN;
>>  			goto out;
>>  		}
>>+		if (hmzoned && atomic_read(&root->log_commit[index])) {
>>+			wait_log_commit(root, root->log_transid - 1);
>>+			goto again;
>>+		}
>>+
>>  		if (!root->log_start_pid) {
>>  			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
>>  			root->log_start_pid = current->pid;
>>@@ -157,8 +168,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>>  		}
>>  	} else {
>>  		mutex_lock(&fs_info->tree_log_mutex);
>>-		if (!fs_info->log_root_tree)
>>+		if (hmzoned && fs_info->log_root_tree) {
>>+			ret = -EAGAIN;
>>+			mutex_unlock(&fs_info->tree_log_mutex);
>>+			goto out;
>>+		} else if (!fs_info->log_root_tree) {
>>  			ret = btrfs_init_log_root_tree(trans, fs_info);
>>+		}
>>  		mutex_unlock(&fs_info->tree_log_mutex);
>>  		if (ret)
>>  			goto out;
>>@@ -191,11 +207,19 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
>>   */
>>  static int join_running_log_trans(struct btrfs_root *root)
>>  {
>>+	bool hmzoned = btrfs_fs_incompat(root->fs_info, HMZONED);
>>  	int ret = -ENOENT;
>>  	mutex_lock(&root->log_mutex);
>>+again:
>>  	if (root->log_root) {
>>+		int index = (root->log_transid + 1) % 2;
>>+
>>  		ret = 0;
>>+		if (hmzoned && atomic_read(&root->log_commit[index])) {
>>+			wait_log_commit(root, root->log_transid - 1);
>>+			goto again;
>>+		}
>>  		atomic_inc(&root->log_writers);
>>  	}
>>  	mutex_unlock(&root->log_mutex);
>>@@ -2724,6 +2748,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
>>  					btrfs_clean_tree_block(next);
>>  					btrfs_wait_tree_block_writeback(next);
>>  					btrfs_tree_unlock(next);
>>+					btrfs_redirty_list_add(
>>+						trans->transaction, next);
>
>This is separate from the rest of the work here and needs to be in a 
>separate patch.  In fact I'd like to see separate patches for the 
>allocation part, the waiting part, and whatever this is.  As it stands 
>it's all kind of wonky and really ends up deeply in the generic stuff 
>which will make it all harder to read and maintain.  If I'm going to 
>review this it needs to be in smaller chunks. Thanks,
>
>Josef

All right, I will split this into three patches or so and make them
small.

Thanks,

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 24/28] btrfs: enable relocation in HMZONED mode
  2019-12-17 21:32   ` Josef Bacik
@ 2019-12-18 10:49     ` Naohiro Aota
  2019-12-18 15:01       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-18 10:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 04:32:04PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To serialize allocation and submit_bio, we introduced a mutex around them. As
>>a result, preallocation must be completely disabled to avoid a deadlock.
>>
>>Since the current relocation process relies on preallocation to move file data
>>extents, it must be handled in another way. In HMZONED mode, we just
>>truncate the inode to the size that we wanted to preallocate. Then, we
>>flush the dirty pages of the file before finishing the relocation process.
>>run_delalloc_hmzoned() will handle all the allocation and submit IOs to
>>the underlying layers.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>---
>>  fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 37 insertions(+), 2 deletions(-)
>>
>>diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>index d897a8e5e430..2d17b7566df4 100644
>>--- a/fs/btrfs/relocation.c
>>+++ b/fs/btrfs/relocation.c
>>@@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>>  	if (ret)
>>  		goto out;
>>+	/*
>>+	 * In HMZONED, we cannot preallocate the file region. Instead,
>>+	 * we dirty and fiemap_write the region.
>>+	 */
>>+
>>+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
>>+		struct btrfs_root *root = BTRFS_I(inode)->root;
>>+		struct btrfs_trans_handle *trans;
>>+
>>+		end = cluster->end - offset + 1;
>>+		trans = btrfs_start_transaction(root, 1);
>>+		if (IS_ERR(trans))
>>+			return PTR_ERR(trans);
>>+
>>+		inode->i_ctime = current_time(inode);
>>+		i_size_write(inode, end);
>>+		btrfs_ordered_update_i_size(inode, end, NULL);
>>+		ret = btrfs_update_inode(trans, root, inode);
>>+		if (ret) {
>>+			btrfs_abort_transaction(trans, ret);
>>+			btrfs_end_transaction(trans);
>>+			return ret;
>>+		}
>>+		ret = btrfs_end_transaction(trans);
>>+
>>+		goto out;
>>+	}
>>+
>
>Why are we arbitrarily extending the i_size here?  If we don't need 
>prealloc we don't need to jack up the i_size either.

We need to extend i_size to read data from the relocating block
group. Otherwise, btrfs_readpage() in relocate_file_extent_cluster()
always reads a zero-filled page because the read position is beyond the
file size.
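
Conceptually (this is not the actual btrfs read path, just the idea), a
page whose offset is at or beyond i_size never hits the disk and comes back
zero-filled:

        if (page_offset(page) >= i_size_read(inode)) {
                /* nothing inside EOF to read: return a zeroed, uptodate page */
                zero_user(page, 0, PAGE_SIZE);
                SetPageUptodate(page);
                unlock_page(page);
                return 0;
        }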

>>  	cur_offset = prealloc_start;
>>  	while (nr < cluster->nr) {
>>  		start = cluster->boundary[nr] - offset;
>>@@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
>>  		btrfs_throttle(fs_info);
>>  	}
>>  	WARN_ON(nr != cluster->nr);
>>+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
>>+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
>>+		WARN_ON(ret);
>
>Do not WARN_ON() when this could happen due to IO errors.  Thanks,
>
>Josef

Sure. We can just drop it.
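
i.e. roughly (a sketch of the intended change):

        if (btrfs_fs_incompat(fs_info, HMZONED) && !ret)
                ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);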

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-12-18  7:35     ` Naohiro Aota
@ 2019-12-18 14:54       ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-18 14:54 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On 12/18/19 2:35 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:25:37PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:08 PM, Naohiro Aota wrote:
>>> If the btrfs volume has mirrored block groups, it unconditionally makes
>>> un-mirrored block groups read only. When we have mirrored block groups, but
>>> don't have writable block groups, this will drop all writable block groups.
>>> So, check if we have at least one writable mirrored block group before
>>> setting un-mirrored block groups read only.
>>>
>>> This change is necessary to handle e.g. xfstests btrfs/124 case.
>>>
>>> When we mount a degraded RAID1 FS and write to it, and then re-mount it
>>> with all devices present, the write pointers of the corresponding zones of
>>> a written block group differ. We mark such a block group "wp_broken" and
>>> make it read-only. In this situation, all RAID1 block groups are read-only
>>> because of "wp_broken", and un-mirrored block groups are also marked
>>> read-only because RAID1 block groups exist. As a result, all the block
>>> groups are now read-only, so we cannot even start the rebalance needed to
>>> fix the situation.
>>
>> I'm not sure I understand.  In degraded mode we're writing to just one mirror 
>> of a RAID1 block group, correct?  And this messes up the WP for the broken 
>> side, so it gets marked with wp_broken and thus RO.  How does this patch 
>> help?  The block groups are still marked RAID1 right? Or are new block groups 
>> allocated with SINGLE or RAID0?  I'm confused. Thanks,
>>
>> Josef
> 
> First of all, I found that some recent change (maybe commit
> 112974d4067b ("btrfs: volumes: Remove ENOSPC-prone
> btrfs_can_relocate()")?) solved the issue, so we no longer need patch
> 11 and 12. So, I will drop these two in the next version.
> 
> So, I think you may no longer be interested in the answer, but just
> for the record... The situation was like this:
> 
> * before degrading
>    - All block groups are RAID1, working fine.
> 
> * degraded mount
>    - Block groups allocated before degrading are RAID1. Writes go
>      into the RAID1 block groups and break the write pointers.
>    - Newly allocated block groups are SINGLE, since we only have one
>      available device.
> 
> * mount with both drives again
>    - RAID1 block groups are marked RO because of the broken write pointers
>    - SINGLE block groups are also marked RO because we have RAID1 block
>      groups
> 
> and at this point, btrfs was somehow unable to allocate a new block
> group or to start balancing.

Oooh ok I see, I had it in my head we would still allocate RAID1 chunks, but we 
allocate SINGLE, so that makes sense.  Go ahead and drop those patches, and 
thanks for the explanation.

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 23/28] btrfs: support dev-replace in HMZONED mode
  2019-12-18  6:00     ` Naohiro Aota
@ 2019-12-18 14:58       ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-18 14:58 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On 12/18/19 1:00 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 04:05:25PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> We have two types of I/O during the device-replace process. One is I/O that
>>> "copies" (via the scrub functions) all the device extents on the source device
>>> to the destination device.  The other is I/O that "clones" (via
>>> handle_ops_on_dev_replace()) new incoming write I/Os from users to the
>>> source device into the target device.
>>>
>>> Cloning incoming I/Os can break the sequential write rule in the target
>>> device. When a write is mapped into the middle of a block group, that I/O is
>>> directed into the middle of a zone of the target device, which breaks the
>>> sequential write rule.
>>>
>>> However, the cloning function cannot be simply disabled since incoming I/Os
>>> targeting already copied device extents must be cloned so that the I/O is
>>> executed on the target device.
>>>
>>> We cannot use dev_replace->cursor_{left,right} to determine whether a bio is
>>> going to a not-yet-copied region.  Since there is a time gap between finishing
>>> btrfs_scrub_dev() and rewriting the mapping tree in
>>> btrfs_dev_replace_finishing(), we can have a newly allocated device extent
>>> which is never cloned nor copied.
>>>
>>> So the point is to copy only already existing device extents. This patch
>>> introduces mark_block_group_to_copy() to mark an existing block group as a
>>> target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
>>> check the flag to do their job.
>>>
>>> The device-replace process in HMZONED mode must copy or clone all the extents
>>> in the source device exactly once.  So, we need to ensure that allocations
>>> started just before the dev-replace process have their corresponding
>>> extent information in the B-trees. finish_extent_writes_for_hmzoned()
>>> implements that functionality; it is basically the code removed in
>>> commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after
>>> error during device replace").
>>>
>>> This patch also handles empty regions between used extents. Since
>>> dev-replace is smart enough to copy only the used extents on the source
>>> device, we have to fill the gaps to honor the sequential write rule on the
>>> target device.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Can you split up the copying part and the cloning part into different patches?
>> This is a bear to review.  Also I don't quite understand the zeroout behavior. 
>> It _looks_ like for cloning you are doing a zeroout for the gap between the 
>> last wp position and the current cloned bio, which makes sense, but doesn't 
>> this gap exist because copying is ongoing?  Can you copy into a zero'ed out 
>> position?  Or am I missing something here?  Thanks,
>>
>> Josef
> 
> OK, I will split this in the next version (but it's mostly the "copying"
> part).
> 
> Let me clarify first that I am using "copying" for copying existing
> extents to the new device and "cloning" for cloning a new incoming BIO
> to the new device.
> 
> For zeroout, it is for "copying", which is done with the scrub code to
> copy existing extents on the source device to the destination
> device. Since copying or scrub only scans for living extents, there
> can be gaps between two living extents. So, we need to fill the gaps
> with zeroout to make the writing stream sequential.
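> 
> For example, the copy path does roughly the following before writing an
> extent whose physical address on the target device is past the zone's
> current write pointer (a sketch; variable names and the write-pointer
> bookkeeping are simplified):
> 
> 	if (physical > wp) {
> 		/* zero-fill the unused gap so the stream stays sequential */
> 		ret = blkdev_issue_zeroout(tgt_device->bdev,
> 					   wp >> SECTOR_SHIFT,
> 					   (physical - wp) >> SECTOR_SHIFT,
> 					   GFP_NOFS, 0);
> 		if (ret)
> 			return ret;
> 		wp = physical;
> 	}
> 	/* then copy the extent data at "physical" and advance wp */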
> 
> And "cloning" is only done for new block groups or already fully
> copied block groups. So there are no gaps for them, because the
> allocator and the IO locks ensure sequential allocation and
> submission.
> 

Got it, thanks.  It looked like the cloning part was using the zeroout behavior 
which was what was confusing me.

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 24/28] btrfs: enable relocation in HMZONED mode
  2019-12-18 10:49     ` Naohiro Aota
@ 2019-12-18 15:01       ` Josef Bacik
  0 siblings, 0 replies; 69+ messages in thread
From: Josef Bacik @ 2019-12-18 15:01 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On 12/18/19 5:49 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 04:32:04PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To serialize allocation and submit_bio, we introduced a mutex around them. As
>>> a result, preallocation must be completely disabled to avoid a deadlock.
>>>
>>> Since the current relocation process relies on preallocation to move file data
>>> extents, it must be handled in another way. In HMZONED mode, we just
>>> truncate the inode to the size that we wanted to pre-allocate. Then, we
>>> flush dirty pages on the file before finishing the relocation process.
>>> run_delalloc_hmzoned() will handle all the allocation and submit the IOs
>>> to the underlying layers.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>  fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 37 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index d897a8e5e430..2d17b7566df4 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
>>>      if (ret)
>>>          goto out;
>>> +    /*
>>> +     * In HMZONED, we cannot preallocate the file region. Instead,
>>> +     * we dirty and fiemap_write the region.
>>> +     */
>>> +
>>> +    if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
>>> +        struct btrfs_root *root = BTRFS_I(inode)->root;
>>> +        struct btrfs_trans_handle *trans;
>>> +
>>> +        end = cluster->end - offset + 1;
>>> +        trans = btrfs_start_transaction(root, 1);
>>> +        if (IS_ERR(trans))
>>> +            return PTR_ERR(trans);
>>> +
>>> +        inode->i_ctime = current_time(inode);
>>> +        i_size_write(inode, end);
>>> +        btrfs_ordered_update_i_size(inode, end, NULL);
>>> +        ret = btrfs_update_inode(trans, root, inode);
>>> +        if (ret) {
>>> +            btrfs_abort_transaction(trans, ret);
>>> +            btrfs_end_transaction(trans);
>>> +            return ret;
>>> +        }
>>> +        ret = btrfs_end_transaction(trans);
>>> +
>>> +        goto out;
>>> +    }
>>> +
>>
>> Why are we arbitrarily extending the i_size here?  If we don't need prealloc 
>> we don't need to jack up the i_size either.
> 
> We need to extend i_size to read data from the relocating block
> group. If not, btrfs_readpage() in relocate_file_extent_cluster()
> always reads a zero-filled page because the read position is beyond the
> file size.

Right, but the finish_ordered_io stuff will do the btrfs_ordered_update_i_size() 
once the IO is complete.  So all you really need is the i_size_write and the 
btrfs_update_inode.  If this crashes you'll have an inode that has an i_size with 
no extents up to i_size.  This is fine for NO_HOLES but not fine for !NO_HOLES. 
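
Roughly, i.e. the same hunk with the ordered i_size update dropped (an
untested sketch, not the posted patch):

	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
		struct btrfs_root *root = BTRFS_I(inode)->root;
		struct btrfs_trans_handle *trans;

		end = cluster->end - offset + 1;
		trans = btrfs_start_transaction(root, 1);
		if (IS_ERR(trans))
			return PTR_ERR(trans);

		inode->i_ctime = current_time(inode);
		i_size_write(inode, end);
		/*
		 * No btrfs_ordered_update_i_size() here: the ordered IO
		 * completion path updates the on-disk i_size once the
		 * writeback triggered later actually finishes.
		 */
		ret = btrfs_update_inode(trans, root, inode);
		if (ret) {
			btrfs_abort_transaction(trans, ret);
			btrfs_end_transaction(trans);
			return ret;
		}
		ret = btrfs_end_transaction(trans);

		goto out;
	}
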
Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
  2019-12-17 19:49   ` Josef Bacik
@ 2019-12-19  6:54     ` Naohiro Aota
  2019-12-19 14:01       ` Josef Bacik
  0 siblings, 1 reply; 69+ messages in thread
From: Naohiro Aota @ 2019-12-19  6:54 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>To preserve the sequential write pattern on the drives, we must serialize
>>allocation and submit_bio. This commit adds a per-block group mutex
>>"zone_io_lock", and find_free_extent_zoned() holds the lock. The lock is kept
>>even after returning from find_free_extent(). It is released when submission
>>of the IOs corresponding to the allocation is completed.
>>
>>Implementing such behavior under __extent_writepage_io() is almost
>>impossible because once pages are unlocked we cannot tell when submitting
>>IOs for an allocated region has finished. Instead, this commit adds
>>run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>extent_write_locked_range(). After the write, we can call
>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>allocation.
>>
>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>
>Have you actually tested these patches with lock debugging on?  The 
>submit_compressed_extents stuff is async, so the unlocker owner will 
>not be the lock owner, and that'll make all sorts of things blow up.  
>This is just straight up broken.

Yes, I have run xfstests on this patch series with lockdep and
KASAN. There was no problem with that.

For non-compressed writes, both allocation and submission are done in
run_delalloc_zoned(). Allocation is done in cow_file_range() and
submission in extent_write_locked_range(), so both happen in the same
context, and locking and unlocking are done by the same execution
context.
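
Schematically, the non-compressed path looks like this (signatures and
arguments are abbreviated and error handling is omitted, so take it as a
sketch of the flow rather than the exact code in the patch):

static int run_delalloc_zoned(struct inode *inode, struct page *locked_page,
			      u64 start, u64 end /* , ... */)
{
	int ret;

	/*
	 * Allocation: find_free_extent_zoned() takes the block group's
	 * zone_io_lock, and the lock is still held when cow_file_range()
	 * returns.
	 */
	ret = cow_file_range(inode, locked_page, start, end /* , ... */);

	/*
	 * Submission: if allocation succeeded, write out the just-allocated
	 * range right away, still under the zone_io_lock taken above.
	 */
	if (!ret)
		ret = extent_write_locked_range(inode, start, end, WB_SYNC_ALL);

	/* The same task that locked above does the unlock. */
	btrfs_hmzoned_data_io_unlock(/* the block group */);
	return ret;
}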

For compressed writes, again, allocation/locking is done under
cow_file_range(), submission is done in extent_write_locked_range(), and
everything is unlocked in submit_compressed_extents() (which is called
after compression), so it all runs in the same context and the lock
owner does the unlock.

>I would really rather see a hmzoned block scheduler that just doesn't 
>submit the bio's until they are aligned with the WP, that way this 
>intelligence doesn't have to be dealt with at the file system layer.  
>I get allocating in line with the WP, but this whole forcing us to 
>allocate and submit the bio in lock step is just nuts, and broken in 
>your subsequent patches.  This whole approach needs to be reworked.  
>Thanks,
>
>Josef

We tried this approach by modifying mq-deadline to wait if the first
queued request is not aligned with the write pointer of a zone. However,
running btrfs without the allocate+submit lock on top of this modified IO
scheduler did not work well at all. With write-intensive workloads, we
observed that a very long wait time was very often necessary to get a
fully sequential stream of requests starting at the write pointer of a
zone. The wait time we observed was sometimes larger than 60 seconds,
at which point we gave up.

While we did not extensively dig into the fundamental root cause,
these potentially long wait times can have a large number of
causes: page cache writeback behavior, kernel process scheduling,
device IO congestion and writeback throttling, sync, btrfs transaction
commit, and cgroup use could make everything even worse. In
the worst case scenario, a number of out-of-order requests could get
stuck in the IO scheduler, preventing forward progress in the case of
memory reclaim writeback and causing the OOM killer to start happily
killing application processes. Furthermore, IO error handling becomes
a nightmare, as the block layer scheduler would need to issue report
zones commands to re-sync the zone wp in case of a write error. That
is in addition to having to track other zone commands that change
a zone wp, such as reset zone and finish zone.

Considering all this, handling the sequential write constraint at the
file system layer by ensuring that write BIOs are issued in the correct
order starting from a zone WP is far simpler and removes dependencies on
other features such as cgroup, congestion control and other throttling
mechanisms. The IO scheduler can always dispatch to the device the
requests it received without any waiting time, ensuring forward progress.

The mq-deadline IO scheduler supports not only regular block devices but
also zoned block devices, and it is the default scheduler for them; other
schedulers that are not zone-compliant cannot be selected (one cannot
change to kyber nor bfq). This ensures that the default system behavior
will be correct as long as the user (the FS) respects the sequential
write rule.

The previous approach I proposed, using a btrfs request reordering stage,
was indeed very invasive, and the block layer scheduler changes could
similarly cause problems with cgroups etc. The new approach of this
patch, using locking to make allocation and bio issuing atomic, results in
per-zone sequential write patterns no matter what happens around it. It
is less invasive and relies on the sequential allocation of blocks for the
ordering of write IOs, so there is no explicit reordering and no
additional overhead. The f2fs implementation has used a similar approach
since kernel 4.10 and has proven to be very solid.

In light of these arguments and explanations, do you still think the
zone allocation locking approach is not acceptable?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
  2019-12-19  6:54     ` Naohiro Aota
@ 2019-12-19 14:01       ` Josef Bacik
  2020-01-21  6:54         ` Naohiro Aota
  0 siblings, 1 reply; 69+ messages in thread
From: Josef Bacik @ 2019-12-19 14:01 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel

On 12/19/19 1:54 AM, Naohiro Aota wrote:
> On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>> On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>> To preserve the sequential write pattern on the drives, we must serialize
>>> allocation and submit_bio. This commit adds a per-block group mutex
>>> "zone_io_lock", and find_free_extent_zoned() holds the lock. The lock is kept
>>> even after returning from find_free_extent(). It is released when submission
>>> of the IOs corresponding to the allocation is completed.
>>>
>>> Implementing such behavior under __extent_writepage_io() is almost
>>> impossible because once pages are unlocked we cannot tell when submitting
>>> IOs for an allocated region has finished. Instead, this commit adds
>>> run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>> extent_write_locked_range(). After the write, we can call
>>> btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>> allocation.
>>>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>
>> Have you actually tested these patches with lock debugging on?  The 
>> submit_compressed_extents stuff is async, so the unlocker owner will not be 
>> the lock owner, and that'll make all sorts of things blow up. This is just 
>> straight up broken.
> 
> Yes, I have run xfstests on this patch series with lockdep and
> KASAN. There was no problem with that.
> 
> For non-compressed writes, both allocation and submission are done in
> run_delalloc_zoned(). Allocation is done in cow_file_range() and
> submission in extent_write_locked_range(), so both happen in the same
> context, and locking and unlocking are done by the same execution
> context.
> 
> For compressed writes, again, allocation/locking is done under
> cow_file_range(), submission is done in extent_write_locked_range(), and
> everything is unlocked in submit_compressed_extents() (which is called
> after compression), so it all runs in the same context and the lock
> owner does the unlock.
> 
>> I would really rather see a hmzoned block scheduler that just doesn't submit 
>> the bio's until they are aligned with the WP, that way this intelligence 
>> doesn't have to be dealt with at the file system layer. I get allocating in 
>> line with the WP, but this whole forcing us to allocate and submit the bio in 
>> lock step is just nuts, and broken in your subsequent patches.  This whole 
>> approach needs to be reworked. Thanks,
>>
>> Josef
> 
> We tried this approach by modifying mq-deadline to wait if the first
> queued request is not aligned with the write pointer of a zone. However,
> running btrfs without the allocate+submit lock on top of this modified IO
> scheduler did not work well at all. With write-intensive workloads, we
> observed that a very long wait time was very often necessary to get a
> fully sequential stream of requests starting at the write pointer of a
> zone. The wait time we observed was sometimes larger than 60 seconds,
> at which point we gave up.

This is because we will only write out the pages we've been handed but do 
cow_file_range() for a possibly larger delalloc range, so as you say there can 
be a large gap in time between writing one part of the range and writing the 
next part.

You actually solve this with your patch, by doing the cow_file_range and then 
following it up with the extent_write_locked_range() for the range you just cow'ed.

There is no need for the locking in this case, you could simply do that and then 
have a modified block scheduler that keeps the bio's in the correct order.  I 
imagine if you just did this with your original block layer approach it would 
work fine.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 00/28] btrfs: zoned block device support
  2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
                   ` (28 preceding siblings ...)
  2019-12-13  4:15 ` [PATCH RFC v2] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
@ 2019-12-19 20:19 ` David Sterba
  29 siblings, 0 replies; 69+ messages in thread
From: David Sterba @ 2019-12-19 20:19 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Nikolay Borisov, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Anand Jain, linux-fsdevel

On Fri, Dec 13, 2019 at 01:08:47PM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> Changes:
>  - Changed -EINVAL to -EOPNOTSUPP to reject incompatible features
>    within HMZONED mode (David)
>  - Use bitmap helpers (Johannes)
>  - Fix calculation of a string length
>  - Code cleanup
> 
> Userland series is unchaged with the last version:
> https://lore.kernel.org/linux-btrfs/20191204082513.857320-1-naohiro.aota@wdc.com/T/
> 
> * Patch series description
> 
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring
> that writes be issued in LBA order from each zone write pointer
> position. This patch series ensures that the sequential write
> constraint of sequential zones is respected while fundamentally not
> changing BtrFS block and I/O management for block stored in
> conventional zones.

One more high-level comment: let's please call it 'zone' mode, without
the 'host managed' part. That term is not relevant for a filesystem. The
zone allocator, or zone append-only allocator or similar describe what
happens on the filesystem layer.

The constraint posed by the device is to never overwrite in place; that's
fine for a COW design and that's what should be kept in mind while adding
the limitations (no nocow/raid56/...) or exceptions into the code.

In some cases it's not possible to fold the zoned support into existing
helpers, but we should do that wherever we can. While reading the code,
the number of if (HMZONED) checks felt too intrusive. This needs to be
adjusted, but I think it's mostly cosmetic or basic refactoring, not
changing the core of the implementation.

So in particular: remove 'hm' everywhere, in filenames and identifiers. For
short I'd call it 'zone' mode, but the full description would be something
like 'zone aware append-only allocation mode'.

I'll do another review pass to point out what I think can be refactored,
but I hope the above gives enough of a hint.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs
  2019-12-19 14:01       ` Josef Bacik
@ 2020-01-21  6:54         ` Naohiro Aota
  0 siblings, 0 replies; 69+ messages in thread
From: Naohiro Aota @ 2020-01-21  6:54 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Nikolay Borisov,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
	linux-fsdevel, linux-block

[-- Attachment #1: Type: text/plain, Size: 15299 bytes --]

On Thu, Dec 19, 2019 at 09:01:35AM -0500, Josef Bacik wrote:
>On 12/19/19 1:54 AM, Naohiro Aota wrote:
>>On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
>>>On 12/12/19 11:09 PM, Naohiro Aota wrote:
>>>>To preserve the sequential write pattern on the drives, we must serialize
>>>>allocation and submit_bio. This commit adds a per-block group mutex
>>>>"zone_io_lock", and find_free_extent_zoned() holds the lock. The lock is kept
>>>>even after returning from find_free_extent(). It is released when submission
>>>>of the IOs corresponding to the allocation is completed.
>>>>
>>>>Implementing such behavior under __extent_writepage_io() is almost
>>>>impossible because once pages are unlocked we cannot tell when submitting
>>>>IOs for an allocated region has finished. Instead, this commit adds
>>>>run_delalloc_hmzoned() to write out non-compressed data IOs at once using
>>>>extent_write_locked_range(). After the write, we can call
>>>>btrfs_hmzoned_data_io_unlock() to unlock the block group for new
>>>>allocation.
>>>>
>>>>Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>>Have you actually tested these patches with lock debugging on?  
>>>The submit_compressed_extents stuff is async, so the unlocker 
>>>owner will not be the lock owner, and that'll make all sorts of 
>>>things blow up. This is just straight up broken.
>>
>>Yes, I have run xfstests on this patch series with lockdep and
>>KASAN. There was no problem with that.
>>
>>For non-compressed writes, both allocation and submission are done in
>>run_delalloc_zoned(). Allocation is done in cow_file_range() and
>>submission in extent_write_locked_range(), so both happen in the same
>>context, and locking and unlocking are done by the same execution
>>context.
>>
>>For compressed writes, again, allocation/locking is done under
>>cow_file_range(), submission is done in extent_write_locked_range(), and
>>everything is unlocked in submit_compressed_extents() (which is called
>>after compression), so it all runs in the same context and the lock
>>owner does the unlock.
>>
>>>I would really rather see a hmzoned block scheduler that just 
>>>doesn't submit the bio's until they are aligned with the WP, that 
>>>way this intelligence doesn't have to be dealt with at the file 
>>>system layer. I get allocating in line with the WP, but this whole 
>>>forcing us to allocate and submit the bio in lock step is just 
>>>nuts, and broken in your subsequent patches.  This whole approach 
>>>needs to be reworked. Thanks,
>>>
>>>Josef
>>
>>We tried this approach by modifying mq-deadline to wait if the first
>>queued request is not aligned with the write pointer of a zone. However,
>>running btrfs without the allocate+submit lock on top of this modified IO
>>scheduler did not work well at all. With write-intensive workloads, we
>>observed that a very long wait time was very often necessary to get a
>>fully sequential stream of requests starting at the write pointer of a
>>zone. The wait time we observed was sometimes larger than 60 seconds,
>>at which point we gave up.
>
>This is because we will only write out the pages we've been handed but 
>do cow_file_range() for a possibly larger delalloc range, so as you 
>say there can be a large gap in time between writing one part of the 
>range and writing the next part.
>
>You actually solve this with your patch, by doing the cow_file_range 
>and then following it up with the extent_write_locked_range() for the 
>range you just cow'ed.
>
>There is no need for the locking in this case, you could simply do 
>that and then have a modified block scheduler that keeps the bio's in 
>the correct order.  I imagine if you just did this with your original 
>block layer approach it would work fine.  Thanks,
>
>Josef

We have once again tried the btrfs SMR (Zoned Block Device) support
series without the locking around extent allocation and bio issuing,
with a modified version of mq-deadline as the scheduler for the block
layer. As you already know, mq-deadline will order read and write
requests separately in increasing sector order, which is essential for
SMR sequential writing. However, mq-deadline does not provide
guarantees regarding the completeness of a sequential write stream. If
there are missing requests ("holes" in the write stream), mq-deadline
will still dispatch the next write request in order, leading to write
errors on SMR drives.

The modification we added to mq-deadline is the addition of a wait
time when a hole in a sequential write stream is discovered. This is
somewhat reminiscent of the old anticipatory scheduler. The wait time
is limited, so if a hole is not filled up by newly inserted requests
before a timeout elapses, write requests are issued as-is (and errors
happen on SMR). The default timeout we used initially was set to the
value of "/sys/block/<dev>/queue/iosched/write_expire", which is 5
seconds.
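
In pseudocode, the dispatch-side check we added was roughly like this
(identifiers such as the per-zone write pointer cache zone_wp[] are
illustrative only, not the actual mq-deadline code):

/* Decide whether a candidate write request may be dispatched now. */
static bool deadline_zoned_write_ok(struct deadline_data *dd,
				    struct request *rq)
{
	unsigned int zno = blk_rq_zone_no(rq);

	/* Aligned with the cached zone write pointer: dispatch it. */
	if (blk_rq_pos(rq) == dd->zone_wp[zno])
		return true;

	/*
	 * There is a hole before this request.  Hold it back and wait
	 * for the missing request(s), but only until the deadline.
	 */
	if (time_after(jiffies, (unsigned long)rq->fifo_time))
		return true;	/* give up; the drive will return an error */

	return false;
}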

With this, tests show that unaligned write errors happen with a simple
workload of 48 threads, each simultaneously doing write() to its own
dedicated file followed by fdatasync() (the code of the application
doing this is attached to this email).

Despite the wait time of 5 seconds, the holes in a zone's sequential
write stream are not filled up by issued BIOs because of "buffer
bloat." First, a bio whose LBA is not aligned with the write pointer
reaches the IO scheduler (call it bio#1). To proceed with bio#1,
the IO scheduler must wait for a hole-filling bio aligned with the
write pointer (call it bio#0).  If bio#1 is large, the scheduler
needs to split it into a large number of requests. Each request
must first obtain a scheduler tag to be inserted into the scheduler
queue. Since the number of scheduler tags is limited and tags are
freed only when queued and in-flight requests complete, the requests
from bio#1 can use up all the tags. This is not a problem as long as
forward progress is made (i.e., requests are dispatched to the disk),
but if all tag-holding requests in the scheduler come from bio#1 and
subsequent writes in the sequence, they are all waiting for bio#0 to
be issued. We thus end up with a soft deadlock in request issuing and
no possibility of progress. That makes the timeout trigger, no matter
how large we set it, and results in unaligned write errors. Large bios
needing lots of requests for processing will trigger this problem all
the time.

In addition to unaligned write errors, we also observed hung_task
timeouts when using a larger scheduler timeout. The reason is the same
as above: writing threads get stuck in blk_mq_get_tag() trying to
acquire their scheduler tags. Increasing the timeout makes us hit
hung_task more often than unaligned writes.

Jan 07 11:17:11 naota-devel kernel: INFO: task multi-proc-writ:2202 blocked for more than 122 seconds.
Jan 07 11:17:11 naota-devel kernel:       Not tainted 5.4.0-rc8-BTRFS-ZNS+ #165
Jan 07 11:17:11 naota-devel kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 07 11:17:11 naota-devel kernel: multi-proc-writ D    0  2202   2168 0x00004000
Jan 07 11:17:11 naota-devel kernel: Call Trace:
Jan 07 11:17:11 naota-devel kernel:  __schedule+0x8ab/0x1db0
Jan 07 11:17:11 naota-devel kernel:  ? pci_mmcfg_check_reserved+0x130/0x130
Jan 07 11:17:11 naota-devel kernel:  ? blk_insert_cloned_request+0x3e0/0x3e0
Jan 07 11:17:11 naota-devel kernel:  schedule+0xdb/0x260
Jan 07 11:17:11 naota-devel kernel:  io_schedule+0x21/0x70
Jan 07 11:17:11 naota-devel kernel:  blk_mq_get_tag+0x3b6/0x940
Jan 07 11:17:11 naota-devel kernel:  ? __blk_mq_tag_idle+0x80/0x80
Jan 07 11:17:11 naota-devel kernel:  ? finish_wait+0x270/0x270
Jan 07 11:17:11 naota-devel kernel:  blk_mq_get_request+0x340/0x1750
Jan 07 11:17:11 naota-devel kernel:  blk_mq_make_request+0x339/0x1bd0
Jan 07 11:17:11 naota-devel kernel:  ? blk_queue_enter+0x8a4/0xa30
Jan 07 11:17:11 naota-devel kernel:  ? blk_mq_try_issue_directly+0x150/0x150
Jan 07 11:17:11 naota-devel kernel:  generic_make_request+0x20c/0xa70
Jan 07 11:17:11 naota-devel kernel:  ? blk_queue_enter+0xa30/0xa30
Jan 07 11:17:11 naota-devel kernel:  ? find_held_lock+0x35/0x130
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel:  ? submit_bio+0xd5/0x3c0
Jan 07 11:17:11 naota-devel kernel:  ? generic_make_request+0xa70/0xa70
Jan 07 11:17:11 naota-devel kernel:  btrfs_map_bio+0x5f5/0xfb0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? btrfs_rmap_block+0x820/0x820 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? unlock_page+0x9f/0x110
Jan 07 11:17:11 naota-devel kernel:  ? __extent_writepage+0x5aa/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  btrfs_submit_bio_hook+0x336/0x600 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? btrfs_fiemap+0x50/0x50 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  submit_one_bio+0xba/0x130 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  extent_write_locked_range+0x2f9/0x3e0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? extent_write_full_page+0x1f0/0x1f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  ? account_page_redirty+0x2bb/0x490
Jan 07 11:17:11 naota-devel kernel:  run_delalloc_zoned+0x108/0x2f0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  btrfs_run_delalloc_range+0xc4b/0x1170 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? test_range_bit+0x360/0x360 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? find_get_pages_range_tag+0x6f8/0x9d0
Jan 07 11:17:11 naota-devel kernel:  ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel:  ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel:  writepage_delalloc+0x11e/0x270 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? find_lock_delalloc_range+0x400/0x400 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? rcu_read_lock_sched_held+0xa1/0xd0
Jan 07 11:17:11 naota-devel kernel:  ? rcu_read_lock_bh_held+0xb0/0xb0
Jan 07 11:17:11 naota-devel kernel:  __extent_writepage+0x3a2/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:11 naota-devel kernel:  ? __do_readpage+0x13a0/0x13a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? clear_page_dirty_for_io+0x32a/0x6e0
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  extent_write_cache_pages+0x61c/0xaf0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __extent_writepage+0x800/0x800 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  ? mark_lock+0xc0/0x1160
Jan 07 11:17:11 naota-devel kernel:  ? sched_clock_cpu+0x1b/0x170
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  extent_writepages+0xf8/0x1a0 [btrfs]
Jan 07 11:17:11 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:11 naota-devel kernel:  ? extent_write_locked_range+0x3e0/0x3e0 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? find_held_lock+0x35/0x130
Jan 07 11:17:12 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel:  btrfs_writepages+0xe/0x10 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  do_writepages+0xe0/0x270
Jan 07 11:17:12 naota-devel kernel:  ? lock_downgrade+0x770/0x770
Jan 07 11:17:12 naota-devel kernel:  ? page_writeback_cpu_online+0x20/0x20
Jan 07 11:17:12 naota-devel kernel:  ? __kasan_check_read+0x11/0x20
Jan 07 11:17:12 naota-devel kernel:  ? do_raw_spin_unlock+0x59/0x250
Jan 07 11:17:12 naota-devel kernel:  ? _raw_spin_unlock+0x28/0x40
Jan 07 11:17:12 naota-devel kernel:  ? wbc_attach_and_unlock_inode+0x432/0x840
Jan 07 11:17:12 naota-devel kernel:  __filemap_fdatawrite_range+0x264/0x340
Jan 07 11:17:12 naota-devel kernel:  ? tty_ldisc_deref+0x35/0x40
Jan 07 11:17:12 naota-devel kernel:  ? delete_from_page_cache_batch+0xab0/0xab0
Jan 07 11:17:12 naota-devel kernel:  filemap_fdatawrite_range+0x13/0x20
Jan 07 11:17:12 naota-devel kernel:  btrfs_fdatawrite_range+0x4d/0xf0 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  btrfs_sync_file+0x235/0xb30 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? rcu_read_lock_sched_held+0xd0/0xd0
Jan 07 11:17:12 naota-devel kernel:  ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  ? do_dup2+0x440/0x440
Jan 07 11:17:12 naota-devel kernel:  ? __x64_sys_futex+0x29b/0x3f0
Jan 07 11:17:12 naota-devel kernel:  ? ksys_write+0x1c3/0x220
Jan 07 11:17:12 naota-devel kernel:  ? btrfs_file_write_iter+0x1430/0x1430 [btrfs]
Jan 07 11:17:12 naota-devel kernel:  vfs_fsync_range+0xf6/0x220
Jan 07 11:17:12 naota-devel kernel:  ? __fget_light+0x184/0x1f0
Jan 07 11:17:12 naota-devel kernel:  do_fsync+0x3d/0x70
Jan 07 11:17:12 naota-devel kernel:  ? trace_hardirqs_on+0x28/0x190
Jan 07 11:17:12 naota-devel kernel:  __x64_sys_fdatasync+0x36/0x50
Jan 07 11:17:12 naota-devel kernel:  do_syscall_64+0xa4/0x4b0
Jan 07 11:17:12 naota-devel kernel:  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Jan 07 11:17:12 naota-devel kernel: RIP: 0033:0x7f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: Code: Bad RIP value.
Jan 07 11:17:12 naota-devel kernel: RSP: 002b:00007f7ba385de80 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jan 07 11:17:12 naota-devel kernel: RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f7ba395f9bf
Jan 07 11:17:12 naota-devel kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: 0000000000000003
Jan 07 11:17:12 naota-devel kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000404198
Jan 07 11:17:12 naota-devel kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000100000
Jan 07 11:17:12 naota-devel kernel: R13: 0000000000000000 R14: 00007f7ba2f5d010 R15: 00000000008592a0

Considering the above cases, I do not think it is possible to
implement a "waiting IO scheduler" that would allow removing the
mutex around block allocation and bio issuing. Such a method would
require an intermediate bio reordering layer, either using a device
mapper target or, as was implemented initially, directly in btrfs
(but that is now a layering violation, so we do not want that).

Entirely relying on the block layer for achieving a perfect sequential
write request sequence is fragile. The current block layer interface
semantic for zoned block devices is: "If BIOs are issued sequentially,
they will be dispatched to the drive in the same order, sequentially."
That directly reflects the drive constraint, so this is compatible
with other regular block devices in the sense that no intelligence is
added to try to create sequential streams of requests when the
issuer is not issuing the said requests in perfect order. Trying to
change this interface to something like "OK, I can accept some
out-of-order writes, but you must fill the holes in the stream
quickly" cannot be implemented directly in the block layer. Device
mapper should be used for that, but if we do so, then one could argue
that all SMR support can simply rely on dm-zoned, which is really
sub-optimal from a performance perspective. We can do much better than
dm-zoned with direct support in btrfs, but that support requires
guarantees of sequential write BIO issuing. The current implementation
relies on a mutex for that, which, considering the complexity of
dm-zoned, is a *very* simple and clean solution.

[-- Attachment #2: multi-proc-write.c --]
[-- Type: text/x-c, Size: 3946 bytes --]

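/*
 * multi-proc-write.c - reproducer referenced in the mail above.
 *
 * Starts NUM_THREAD writer threads (48 in the runs described above),
 * each writing its own file (sizes from 4 KiB up to 128 MiB depending
 * on the thread index) in up to 1 MiB chunks.  After a full pass over
 * its file, every thread waits on a condition variable until all
 * threads have finished their pass, then they all call fdatasync() at
 * roughly the same time and start the next pass.  Threads are pinned
 * round-robin onto NUM_CPU CPUs.
 *
 * Build and run, for example:
 *   cc -O2 -pthread multi-proc-write.c -o multi-proc-writ
 *   cd /mnt/test && /path/to/multi-proc-writ 48
 */
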
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <sched.h>	/* cpu_set_t, CPU_ZERO, CPU_SET (needs _GNU_SOURCE) */
#include <pthread.h>
#include <assert.h>
#include <stdbool.h>

#define NUM_BASE_THREAD 16
#define NUM_CPU 8

int NUM_THREAD;

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
bool exit_run = false;
int num_running;
int num_waiting;
int cnt = 0;

struct thread_data {
	int fd;
	char name[16];
	size_t size;
	pthread_barrier_t *barrier;
	pthread_t id;
};

#define min_t(type, x, y) ({			\
	type __min1 = (x);			\
	type __min2 = (y);			\
	__min1 < __min2 ? __min1: __min2; })

static void* write_file(void *arg)
{
	struct thread_data *data = arg;
	char *buf;
	ssize_t written;
	size_t left;
	struct timespec ts;
	ssize_t bufsize = 1 << 20;

	buf = malloc(bufsize);
	if (!buf) {
		fprintf(stderr, "failed to alloc %zd byte buffer: %s\n", bufsize, strerror(errno));
		return NULL;
	}
	memset(buf, 0, bufsize);

	pthread_barrier_wait(data->barrier);

	for (;;) {
		/* clock_gettime(CLOCK_MONOTONIC, &ts); */
		/* printf("%10ld.%09ld writing file %s\n", ts.tv_sec, ts.tv_nsec, data->name); */
		left = data->size;
		while (left) {
			ssize_t write_size = min_t(ssize_t, bufsize, left);

			written = write(data->fd, buf, write_size);
			if (written < 0) {
				clock_gettime(CLOCK_MONOTONIC, &ts);
				fprintf(stderr, "%10ld.%09ld failed to write file %s %s\n",
					ts.tv_sec, ts.tv_nsec, data->name, strerror(errno));
				goto out;
			}
			if (written < write_size) {
				clock_gettime(CLOCK_MONOTONIC, &ts);
				fprintf(stderr, "%10ld.%09ld failed to write file %s %ld < %ld\n",
					ts.tv_sec, ts.tv_nsec, data->name, written, data->size);
				goto out;
			}
			left -= write_size;
		}

		pthread_mutex_lock(&lock);
		if (num_running < NUM_THREAD) {
			pthread_mutex_unlock(&lock);
			goto out;
		}
		num_waiting++;
		assert(num_waiting <= num_running);
		if (num_waiting == num_running) {
			pthread_cond_broadcast(&cond);
			num_waiting--;
			printf(".");
			fflush(stdout);
			cnt++;
			if (cnt == 80) {
				printf("\n");
				cnt = 0;
			}
		} else {
			pthread_cond_wait(&cond, &lock);
			num_waiting--;
		}
		pthread_mutex_unlock(&lock);

		/* clock_gettime(CLOCK_MONOTONIC, &ts); */
		/* printf("%10ld.%09ld fdatasync file %s\n", ts.tv_sec, ts.tv_nsec, data->name); */
		if (fdatasync(data->fd)) {
			fprintf(stderr, "failed to fdatasync file %s: %s\n", data->name, strerror(errno));
			goto out;
		}
		/* clock_gettime(CLOCK_MONOTONIC, &ts); */
		/* printf("%10ld.%09ld fdatasync done file %s\n", ts.tv_sec, ts.tv_nsec, data->name); */
	}

out:
	free(buf);

	pthread_mutex_lock(&lock);
	num_running--;
	assert(num_waiting <= num_running);
	if (num_waiting == num_running)
		pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);

	return NULL;
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		return 1;
	NUM_THREAD = atoi(argv[1]);

	pthread_barrier_t barrier;
	if (pthread_barrier_init(&barrier, NULL, NUM_THREAD)) {
		perror("failed to initialize barrier");
		exit(1);
	}

	num_running = NUM_THREAD;
	num_waiting = 0;

	struct thread_data *tds;

	tds = calloc(NUM_THREAD, sizeof(*tds));
	if (!tds) {
		perror("failed to allocate thread data");
		exit(1);
	}

	cpu_set_t cpuset;

	for (int i = 0; i < NUM_THREAD; i++) {
		sprintf(tds[i].name, "%03d", i);
		tds[i].fd = open(tds[i].name, O_RDWR | O_CREAT, 0644);
		if (tds[i].fd < 0) {
			perror("failed to open file");
			exit(1);
		}
		int shift = 20 - 1 - (i % NUM_BASE_THREAD);
		assert(shift >= 4);
		tds[i].size = 256 << shift;
		tds[i].barrier = &barrier;
		CPU_ZERO(&cpuset);
		CPU_SET(i % NUM_CPU, &cpuset);
		pthread_create(&tds[i].id, NULL, write_file, &tds[i]);
		pthread_setaffinity_np(tds[i].id, sizeof(cpu_set_t), &cpuset);
	}

	for (int i = 0; i < NUM_THREAD; i++) {
		pthread_join(tds[i].id, NULL);
		close(tds[i].fd);
	}

	pthread_barrier_destroy(&barrier);
}


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2020-01-21  6:54 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-13  4:08 [PATCH v6 00/28] btrfs: zoned block device support Naohiro Aota
2019-12-13  4:08 ` [PATCH v6 01/28] btrfs: introduce HMZONED feature flag Naohiro Aota
2019-12-13  4:08 ` [PATCH v6 02/28] btrfs: Get zone information of zoned block devices Naohiro Aota
2019-12-13 16:18   ` Josef Bacik
2019-12-18  2:29     ` Naohiro Aota
2019-12-13  4:08 ` [PATCH v6 03/28] btrfs: Check and enable HMZONED mode Naohiro Aota
2019-12-13 16:21   ` Josef Bacik
2019-12-18  4:17     ` Naohiro Aota
2019-12-13  4:08 ` [PATCH v6 04/28] btrfs: disallow RAID5/6 in " Naohiro Aota
2019-12-13 16:21   ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 05/28] btrfs: disallow space_cache " Naohiro Aota
2019-12-13 16:24   ` Josef Bacik
2019-12-18  4:28     ` Naohiro Aota
2019-12-13  4:08 ` [PATCH v6 06/28] btrfs: disallow NODATACOW " Naohiro Aota
2019-12-13 16:25   ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 07/28] btrfs: disable fallocate " Naohiro Aota
2019-12-13 16:26   ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 08/28] btrfs: implement log-structured superblock for " Naohiro Aota
2019-12-13 16:38   ` Josef Bacik
2019-12-13 21:58     ` Damien Le Moal
2019-12-17 19:17       ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary Naohiro Aota
2019-12-13 16:52   ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
2019-12-17 19:19   ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
2019-12-17 19:25   ` Josef Bacik
2019-12-18  7:35     ` Naohiro Aota
2019-12-18 14:54       ` Josef Bacik
2019-12-13  4:08 ` [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED Naohiro Aota
2019-12-17 19:32   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 13/28] btrfs: reset zones of unused block groups Naohiro Aota
2019-12-17 19:33   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode Naohiro Aota
2019-12-17 19:41   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs Naohiro Aota
2019-12-17 19:49   ` Josef Bacik
2019-12-19  6:54     ` Naohiro Aota
2019-12-19 14:01       ` Josef Bacik
2020-01-21  6:54         ` Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 16/28] btrfs: implement atomic compressed IO submission Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 17/28] btrfs: support direct write IO in HMZONED Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 18/28] btrfs: serialize meta IOs on HMZONED mode Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 19/28] btrfs: wait existing extents before truncating Naohiro Aota
2019-12-17 19:53   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 20/28] btrfs: avoid async checksum on HMZONED mode Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 21/28] btrfs: disallow mixed-bg in " Naohiro Aota
2019-12-17 19:56   ` Josef Bacik
2019-12-18  8:03     ` Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 22/28] btrfs: disallow inode_cache " Naohiro Aota
2019-12-17 19:56   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 23/28] btrfs: support dev-replace " Naohiro Aota
2019-12-17 21:05   ` Josef Bacik
2019-12-18  6:00     ` Naohiro Aota
2019-12-18 14:58       ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 24/28] btrfs: enable relocation " Naohiro Aota
2019-12-17 21:32   ` Josef Bacik
2019-12-18 10:49     ` Naohiro Aota
2019-12-18 15:01       ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED Naohiro Aota
2019-12-17 22:04   ` Josef Bacik
2019-12-13  4:09 ` [PATCH v6 26/28] btrfs: split alloc_log_tree() Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode Naohiro Aota
2019-12-17 22:08   ` Josef Bacik
2019-12-18  9:35     ` Naohiro Aota
2019-12-13  4:09 ` [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
2019-12-17 22:09   ` Josef Bacik
2019-12-13  4:15 ` [PATCH RFC v2] libblkid: implement zone-aware probing for HMZONED btrfs Naohiro Aota
2019-12-19 20:19 ` [PATCH v6 00/28] btrfs: zoned block device support David Sterba
