Linux-BTRFS Archive on lore.kernel.org
* [PATCH v2 00/19] btrfs zoned block device support
@ 2019-06-07 13:10 Naohiro Aota
  2019-06-07 13:10 ` [PATCH 01/19] btrfs: introduce HMZONED feature flag Naohiro Aota
                   ` (20 more replies)
  0 siblings, 21 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

btrfs zoned block device support

This series adds zoned block device support to btrfs.

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that writes
be issued in LBA order starting from each zone's write pointer position.
This patch series ensures that the sequential write constraint of
sequential zones is respected while fundamentally not changing btrfs block
and I/O management for blocks stored in conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned block
devices so that chunks are always aligned to zone boundaries. Allocation of
blocks within a chunk is changed so that the allocation is always
sequential from the beginning of the chunk. To do so, an allocation pointer
is added to block groups and used as the allocation hint. The allocation
changes also ensure that blocks freed below the allocation pointer are
ignored, resulting in sequential block allocation regardless of the chunk
usage.
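
The allocation-pointer scheme can be modeled in a few lines. This is a
userspace sketch with hypothetical names, not the kernel code from this
series: the pointer only moves forward, so a free below it never creates a
reusable hole.

```c
#include <stdint.h>

/*
 * Minimal model of the sequential allocator described above (sketch,
 * not the kernel implementation; struct and field names are
 * illustrative). A block group keeps a single allocation pointer:
 * blocks are handed out in order, and space freed below the pointer is
 * ignored until the whole block group is unused and its zone is reset.
 */
struct block_group {
	uint64_t start;        /* chunk start (zone aligned) */
	uint64_t len;          /* chunk length */
	uint64_t alloc_offset; /* allocation pointer, relative to start */
};

/* Return the logical address of the allocation, or UINT64_MAX on ENOSPC. */
static uint64_t bg_alloc(struct block_group *bg, uint64_t num_bytes)
{
	if (bg->alloc_offset + num_bytes > bg->len)
		return UINT64_MAX; /* full: only a zone reset reclaims space */

	uint64_t ret = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	return ret;
}
```

Note that "freeing" a block in this model is a no-op: usable space only
comes back when the whole chunk is emptied and its zone reset.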

While the introduction of the allocation pointer ensures that blocks will
be allocated sequentially, I/Os to write out newly allocated blocks may be
issued out of order, causing errors when writing to sequential zones. This
problem is solved by introducing a submit_buffer() function and changes to
the internal I/O scheduler to ensure in-order issuing of write I/Os for
each chunk, matching the block allocation order within the chunk.
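
The ordering requirement can be sketched as a small reorder buffer. The
names and fixed-size arrays below are illustrative assumptions, not the
series' submit_buffer() implementation: a write that does not land exactly
at the zone write pointer is held back, and each in-order submission drains
any held writes that became contiguous.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Toy per-chunk submit buffer (all writes assumed equal size). */
struct submit_buffer {
	uint64_t wp;                   /* next offset the zone will accept */
	uint64_t pending[MAX_PENDING]; /* held-back out-of-order writes */
	size_t npending;
	uint64_t issued[MAX_PENDING];  /* dispatch log, in issue order */
	size_t nissued;
};

static void issue(struct submit_buffer *b, uint64_t off, uint64_t len)
{
	b->issued[b->nissued++] = off;
	b->wp = off + len;
}

static void submit(struct submit_buffer *b, uint64_t off, uint64_t len)
{
	if (off != b->wp) {
		/* issuing now would violate the zone's LBA order: hold it */
		b->pending[b->npending++] = off;
		return;
	}
	issue(b, off, len);
	/* drain held writes that are now at the write pointer */
	for (size_t i = 0; i < b->npending; ) {
		if (b->pending[i] == b->wp) {
			issue(b, b->pending[i], len);
			b->pending[i] = b->pending[--b->npending];
			i = 0; /* restart scan: wp advanced */
		} else {
			i++;
		}
	}
}
```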

The zone of a chunk is reset to allow reuse of the zone only when the block
group is being freed, that is, when all the chunks of the block group are
unused.

For btrfs volumes composed of multiple zoned disks, restrictions are added
to ensure that all disks have the same zone size. This matches the existing
constraint that all chunks in a block group must have the same size.

As discussed with Chris Mason at LSFMM, device replace is now enabled in
HMZONED mode, but fallocate remains disabled for now.

Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones of
the device (zone type and write pointer position).

Patches 4 and 5 disable features which are not compatible with the
sequential write constraints of zoned block devices. This includes
fallocate and direct I/O support.

Patches 6 and 7 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patch 8 marks a block group read-only when the write pointers of the
devices composing the block group (e.g. a RAID1 block group) do not match.

Patch 9 restricts the possible locations of super blocks to conventional
zones to preserve the existing update in-place mechanism for the super
blocks.

Patches 10 to 12 implement the new submit buffer I/O path to ensure
sequential write I/O delivery to the device zones.

Patches 13 to 17 modify several parts of btrfs to handle free blocks
without breaking the sequential block allocation and sequential write order
as well as zone reset for unused chunks.

Patch 18 adds support for device replace.

Finally, patch 19 adds the HMZONED feature to the list of supported
features.

This series applies on kdave/for-5.2-rc2.

Changelog
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles a bio instead of a btrfs_bio.
 -- Mark an unmirrored block group read-only only when there are
    writable mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker now uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes()
 - Fixes
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (19):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: disable direct IO in HMZONED mode
  btrfs: align dev extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirrored BGs readonly only if we have at least one
    writable BG
  btrfs: limit super block locations in HMZONED mode
  btrfs: rename btrfs_map_bio()
  btrfs: introduce submit buffer
  btrfs: expire submit buffer on timeout
  btrfs: avoid sync IO prioritization on checksum in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: reset zones of unused block groups
  btrfs: wait existing extents before truncating
  btrfs: shrink delayed allocation size in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/ctree.h             |  47 ++-
 fs/btrfs/dev-replace.c       | 103 ++++++
 fs/btrfs/disk-io.c           |  49 ++-
 fs/btrfs/disk-io.h           |   1 +
 fs/btrfs/extent-tree.c       | 479 +++++++++++++++++++++++-
 fs/btrfs/extent_io.c         |  28 ++
 fs/btrfs/extent_io.h         |   2 +
 fs/btrfs/file.c              |   4 +
 fs/btrfs/free-space-cache.c  |  33 ++
 fs/btrfs/free-space-cache.h  |   5 +
 fs/btrfs/inode.c             |  14 +
 fs/btrfs/scrub.c             | 171 +++++++++
 fs/btrfs/super.c             |  30 +-
 fs/btrfs/sysfs.c             |   2 +
 fs/btrfs/transaction.c       |  35 ++
 fs/btrfs/transaction.h       |   3 +
 fs/btrfs/volumes.c           | 684 ++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h           |  37 ++
 include/trace/events/btrfs.h |  43 +++
 include/uapi/linux/btrfs.h   |   1 +
 20 files changed, 1734 insertions(+), 37 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 01/19] btrfs: introduce HMZONED feature flag
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch introduces the HMZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 2f078b77fe14..ccb3d732e7d2 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -192,6 +192,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -206,6 +207,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
+	BTRFS_FEAT_ATTR_PTR(hmzoned),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	NULL
 };
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index c195896d478f..2d5e8f801135 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.21.0



* [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
  2019-06-07 13:10 ` [PATCH 01/19] btrfs: introduce HMZONED feature flag Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 13:58   ` Josef Bacik
                     ` (2 more replies)
  2019-06-07 13:10 ` [PATCH 03/19] btrfs: Check and enable HMZONED mode Naohiro Aota
                   ` (18 subsequent siblings)
  20 siblings, 3 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

If a zoned block device is found, get its zone information (number of zones
and zone size) using the new helper function btrfs_get_dev_zonetypes(). To
avoid costly run-time zone report commands to test the device zone type
during block allocation, attach a seq_zones bitmap to the device structure
to indicate whether each zone is sequential or accepts random writes.

This patch also introduces the helper function btrfs_dev_is_sequential() to
test if the zone storing a block is a sequential write required zone.
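
The bitmap makes the per-block check O(1): the byte position is shifted
down to a zone number and tested against the bitmap. A userspace sketch
(field names mirror the patch, but the bit test below is a simplified
stand-in for the kernel's test_bit()):

```c
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified model of the per-device zone information in this patch. */
struct dev_zone_info {
	uint8_t zone_size_shift;        /* log2 of the zone size in bytes */
	const unsigned long *seq_zones; /* bit n set => zone n is sequential */
};

static int dev_is_sequential(const struct dev_zone_info *d, uint64_t pos)
{
	uint64_t zno = pos >> d->zone_size_shift;

	if (!d->seq_zones)
		return 1; /* mirrors the patch when no bitmap is allocated */
	return (d->seq_zones[zno / BITS_PER_LONG] >> (zno % BITS_PER_LONG)) & 1;
}
```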

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  33 +++++++++++
 2 files changed, 176 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..b673178718e3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path,
 	return ret;
 }
 
+static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+				 struct blk_zone **zones,
+				 unsigned int *nr_zones, gfp_t gfp_mask)
+{
+	struct blk_zone *z = *zones;
+	int ret;
+
+	if (!z) {
+		z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+		if (!z)
+			return -ENOMEM;
+	}
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
+				  z, nr_zones, gfp_mask);
+	if (ret != 0) {
+		btrfs_err(device->fs_info, "Get zone at %llu failed %d\n",
+			  pos, ret);
+		return ret;
+	}
+
+	*zones = z;
+
+	return 0;
+}
+
+static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device)
+{
+	kfree(device->seq_zones);
+	kfree(device->empty_zones);
+	device->seq_zones = NULL;
+	device->empty_zones = NULL;
+	device->nr_zones = 0;
+	device->zone_size = 0;
+	device->zone_size_shift = 0;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zonetypes(struct btrfs_device *device)
+{
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, n = 0, nr_zones;
+	int ret;
+
+	device->zone_size = 0;
+	device->zone_size_shift = 0;
+	device->nr_zones = 0;
+	device->seq_zones = NULL;
+	device->empty_zones = NULL;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+	device->zone_size_shift = ilog2(device->zone_size);
+	device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+		device->nr_zones++;
+
+	device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
+				    sizeof(*device->seq_zones), GFP_KERNEL);
+	if (!device->seq_zones)
+		return -ENOMEM;
+
+	device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
+				      sizeof(*device->empty_zones), GFP_KERNEL);
+	if (!device->empty_zones)
+		return -ENOMEM;
+
+#define BTRFS_REPORT_NR_ZONES   4096
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
+					    &zones, &nr_zones, GFP_KERNEL);
+		if (ret != 0 || !nr_zones) {
+			if (!ret)
+				ret = -EIO;
+			goto out;
+		}
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(n, device->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(n, device->empty_zones);
+			sector = zones[i].start + zones[i].len;
+			n++;
+		}
+	}
+
+	if (n != device->nr_zones) {
+		btrfs_err(device->fs_info,
+			  "Inconsistent number of zones (%u / %u)\n", n,
+			  device->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	btrfs_info(device->fs_info,
+		   "host-%s zoned block device, %u zones of %llu sectors\n",
+		   bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		   device->nr_zones, device->zone_size >> SECTOR_SHIFT);
+
+out:
+	kfree(zones);
+
+	if (ret)
+		btrfs_destroy_dev_zonetypes(device);
+
+	return ret;
+}
+
 static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 			struct btrfs_device *device, fmode_t flags,
 			void *holder)
@@ -842,6 +971,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zonetypes(device);
+	if (ret != 0)
+		goto error_brelse;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -1243,6 +1377,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
 	}
 
 	blkdev_put(device->bdev, device->mode);
+	btrfs_destroy_dev_zonetypes(device);
 }
 
 static void btrfs_close_one_device(struct btrfs_device *device)
@@ -2664,6 +2799,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	mutex_unlock(&fs_info->chunk_mutex);
 	mutex_unlock(&fs_devices->device_list_mutex);
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zonetypes(device);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto error_sysfs;
+	}
+
 	if (seeding_dev) {
 		mutex_lock(&fs_info->chunk_mutex);
 		ret = init_first_rw_device(trans);
@@ -2729,6 +2871,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 
 error_sysfs:
+	btrfs_destroy_dev_zonetypes(device);
 	btrfs_sysfs_rm_device_link(fs_devices, device);
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	mutex_lock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b8a0e8d0672d..1599641e216c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -62,6 +62,16 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
@@ -476,6 +486,28 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
 struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
 				       u64 logical, u64 length);
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask);
+
+static inline int btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	unsigned int zno = pos >> device->zone_size_shift;
+
+	if (!device->seq_zones)
+		return 1;
+
+	return test_bit(zno, device->seq_zones);
+}
+
+static inline int btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	unsigned int zno = pos >> device->zone_size_shift;
+
+	if (!device->empty_zones)
+		return 0;
+
+	return test_bit(zno, device->empty_zones);
+}
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 				      int index)
@@ -568,5 +600,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 
 int btrfs_bg_type_to_factor(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_get_dev_zonetypes(struct btrfs_device *device);
 
 #endif
-- 
2.21.0



* [PATCH 03/19] btrfs: Check and enable HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
  2019-06-07 13:10 ` [PATCH 01/19] btrfs: introduce HMZONED feature flag Naohiro Aota
  2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 13:57   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 04/19] btrfs: disable fallocate in " Naohiro Aota
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check if the HMZONED flag is enabled on the file system
and if the file system consists of zoned devices with equal zone size.

Additionally, as updates to the space cache are done in place, the space
cache cannot be located in sequential zones and there is no guarantee that
the device will have enough conventional zones to store this cache. Resolve
this problem by completely disabling the space cache. This does not
introduce any problems with sequential block groups: all the free space is
located after the allocation pointer and there is no free space before the
pointer, so there is no need for such a cache.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  7 +++
 fs/btrfs/disk-io.c     |  7 +++
 fs/btrfs/super.c       | 12 ++---
 fs/btrfs/volumes.c     | 99 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 6 files changed, 124 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b81c331b28fa..6c00101407e4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -806,6 +806,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
 
+	/* Zone size when in HMZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index ee0989c7e3a9..fbe5ea2a04ed 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
+	if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+	     !btrfs_fs_incompat(fs_info, HMZONED)) ||
+	    (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	filemap_write_and_wait(bdev->bd_inode->i_mapping);
 
 	devices = &fs_info->fs_devices->devices;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 663efce22d98..7c1404c76768 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
 
 	btrfs_free_extra_devids(fs_devices, 1);
 
+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+				ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2c66d9ea6a3b..740a701f16c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	bool saved_compress_force;
 	int no_compress = 0;
 
-	cache_gen = btrfs_super_cache_generation(info->super_copy);
-	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
-		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	if (!btrfs_fs_incompat(info, HMZONED)) {
+		cache_gen = btrfs_super_cache_generation(info->super_copy);
+		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
+			btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
+		else if (cache_gen)
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b673178718e3..b6f367d19dc9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1524,6 +1524,83 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 	return ret;
 }
 
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		if (!device->bdev)
+			continue;
+		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
+		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
+		     incompat_hmzoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_size;
+			} else if (device->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && incompat_hmzoned) {
+		/* No zoned block device, disable HMZONED */
+		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!hmzoned_devices && !incompat_hmzoned)
+		goto out;
+
+	fs_info->zone_size = zone_size;
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* RAID56 is not allowed */
+	if (btrfs_fs_incompat(fs_info, RAID56)) {
+		btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * SPACE CACHE writing is not cowed. Disable that to avoid
+	 * write errors in sequential zones.
+	 */
+	if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
+		btrfs_info(fs_info,
+			   "disabling disk space caching with HMZONED mode");
+		btrfs_clear_opt(fs_info->mount_opt, SPACE_CACHE);
+	}
+
+	btrfs_set_and_info(fs_info, NOTREELOG,
+			   "disabling tree log with HMZONED mode");
+
+	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+
+out:
+
+	return ret;
+}
+
 static void btrfs_release_disk_super(struct page *page)
 {
 	kunmap(page);
@@ -2695,6 +2772,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
+	if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+	     !btrfs_fs_incompat(fs_info, HMZONED)) ||
+	    (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);
@@ -2816,6 +2900,21 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		}
 	}
 
+	/* Get zone type information of zoned block devices */
+	if (bdev_is_zoned(bdev)) {
+		ret = btrfs_get_dev_zonetypes(device);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			goto error_sysfs;
+		}
+	}
+
+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_abort_transaction(trans, ret);
+		goto error_sysfs;
+	}
+
 	ret = btrfs_add_dev_item(trans, device);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1599641e216c..f66755e43669 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -432,6 +432,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path,
 					   fmode_t flags, void *holder);
 int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
 				     struct btrfs_device *this_dev);
-- 
2.21.0



* [PATCH 04/19] btrfs: disable fallocate in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (2 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 03/19] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-06-07 13:10 ` " Naohiro Aota
  2019-06-07 13:10 ` [PATCH 05/19] btrfs: disable direct IO " Naohiro Aota
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

fallocate() is implemented by allocating actual extents instead of merely
reserving space. This can result in exposing the sequential write
constraint of host-managed zoned block devices to the application, which
would break the POSIX semantics for the fallocated file. To avoid this,
report fallocate() as not supported when in HMZONED mode for now.

In the future, we may be able to implement an "in-memory" fallocate() in
HMZONED mode by utilizing space_info->bytes_may_use or similar.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 89f5be2bfb43..e664b5363697 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3027,6 +3027,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in HMZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
-- 
2.21.0



* [PATCH 05/19] btrfs: disable direct IO in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (3 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 04/19] btrfs: disable fallocate in " Naohiro Aota
@ 2019-06-07 13:10 ` " Naohiro Aota
  2019-06-13 14:00   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 06/19] btrfs: align dev extent allocation to zone boundary Naohiro Aota
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Direct write I/Os can be directed at existing extents that have already
been written. Such write requests are prohibited on host-managed zoned
block devices, so disable direct I/O support for a volume with HMZONED
mode enabled.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6bebc0ca751d..89542c19d09e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8520,6 +8520,9 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	unsigned int blocksize_mask = fs_info->sectorsize - 1;
 	ssize_t retval = -EINVAL;
 
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		goto out;
+
 	if (offset & blocksize_mask)
 		goto out;
 
-- 
2.21.0



* [PATCH 06/19] btrfs: align dev extent allocation to zone boundary
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (4 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 05/19] btrfs: disable direct IO " Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-07 13:10 ` [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

In HMZONED mode, align the device extents to zone boundaries so that a zone
reset affects only the device extent being reset and does not change the
state of blocks in neighboring device extents. Also, check that a region
allocation always covers empty zones of the same type.
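
Zone-boundary alignment is plain round-up arithmetic. A sketch of the
helper this patch adds (the kernel version uses the ALIGN() macro; zone
sizes on zoned block devices are powers of two, which is what makes the
mask form below equivalent):

```c
#include <stdint.h>

/*
 * Round pos up to the next zone boundary. A zone_size of 0 means the
 * device is not zoned and no alignment is applied. zone_size must be a
 * power of two for the mask trick to be valid.
 */
static uint64_t dev_zone_align(uint64_t pos, uint64_t zone_size)
{
	if (!zone_size)
		return pos;
	return (pos + zone_size - 1) & ~(zone_size - 1);
}
```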

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c |   6 +++
 fs/btrfs/volumes.c     | 100 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1aee51a9f3bf..363db58f56b8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9884,6 +9884,12 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free = div64_u64(min_free, dev_min);
 	}
 
+	/* We cannot allocate size less than zone_size anyway */
+	if (index == BTRFS_RAID_DUP)
+		min_free = max_t(u64, min_free, 2 * fs_info->zone_size);
+	else
+		min_free = max_t(u64, min_free, fs_info->zone_size);
+
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 dev_offset;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b6f367d19dc9..c1ed3b6e3cfd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1737,6 +1737,46 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
 	return false;
 }
 
+static u64 dev_zone_align(struct btrfs_device *device, u64 pos)
+{
+	if (device->zone_size)
+		return ALIGN(pos, device->zone_size);
+	return pos;
+}
+
+/*
+ * is_allocatable_region - check if the specified region is suitable for allocation
+ * @device:	the device to allocate a region
+ * @pos:	the position of the region
+ * @num_bytes:	the size of the region
+ *
+ * In non-ZONED device, anywhere is suitable for allocation. In ZONED
+ * device, check if the region is not on non-empty zones. Also, check if
+ * all zones in the region have the same zone type.
+ */
+static bool is_allocatable_region(struct btrfs_device *device, u64 pos,
+				  u64 num_bytes)
+{
+	int is_sequential;
+
+	if (device->zone_size == 0)
+		return true;
+
+	WARN_ON(!IS_ALIGNED(pos, device->zone_size));
+	WARN_ON(!IS_ALIGNED(num_bytes, device->zone_size));
+
+	is_sequential = btrfs_dev_is_sequential(device, pos);
+
+	while (num_bytes > 0) {
+		if (!btrfs_dev_is_empty_zone(device, pos) ||
+		    (is_sequential != btrfs_dev_is_sequential(device, pos)))
+			return false;
+		pos += device->zone_size;
+		num_bytes -= device->zone_size;
+	}
+
+	return true;
+}
 
 /*
  * find_free_dev_extent_start - find free space in the specified device
@@ -1779,9 +1819,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	/*
 	 * We don't want to overwrite the superblock on the drive nor any area
 	 * used by the boot loader (grub for example), so we make sure to start
-	 * at an offset of at least 1MB.
+	 * at an offset of at least 1MB on a regular disk. For a zoned block
+	 * device, skip the first zone of the device entirely.
 	 */
-	search_start = max_t(u64, search_start, SZ_1M);
+	if (device->zone_size)
+		search_start = max_t(u64, dev_zone_align(device, search_start),
+				     device->zone_size);
+	else
+		search_start = max_t(u64, search_start, SZ_1M);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -1846,12 +1891,22 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 			 */
 			if (contains_pending_extent(device, &search_start,
 						    hole_size)) {
+				search_start = dev_zone_align(device,
+							      search_start);
 				if (key.offset >= search_start)
 					hole_size = key.offset - search_start;
 				else
 					hole_size = 0;
 			}
 
+			if (!is_allocatable_region(device, search_start,
+						   num_bytes)) {
+				search_start = dev_zone_align(device,
+							      search_start+1);
+				btrfs_release_path(path);
+				goto again;
+			}
+
 			if (hole_size > max_hole_size) {
 				max_hole_start = search_start;
 				max_hole_size = hole_size;
@@ -1876,7 +1931,7 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 		extent_end = key.offset + btrfs_dev_extent_length(l,
 								  dev_extent);
 		if (extent_end > search_start)
-			search_start = extent_end;
+			search_start = dev_zone_align(device, extent_end);
 next:
 		path->slots[0]++;
 		cond_resched();
@@ -1891,6 +1946,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 		hole_size = search_end - search_start;
 
 		if (contains_pending_extent(device, &search_start, hole_size)) {
+			search_start = dev_zone_align(device,
+						      search_start);
+			btrfs_release_path(path);
+			goto again;
+		}
+
+		if (!is_allocatable_region(device, search_start, num_bytes)) {
+			search_start = dev_zone_align(device, search_start+1);
 			btrfs_release_path(path);
 			goto again;
 		}
@@ -5177,6 +5240,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	int hmzoned = btrfs_fs_incompat(info, HMZONED);
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -5221,10 +5285,20 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		BUG();
 	}
 
+	if (hmzoned) {
+		max_stripe_size = info->zone_size;
+		max_chunk_size = round_down(max_chunk_size, info->zone_size);
+	}
+
 	/* We don't want a chunk larger than 10% of writable space */
 	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
 			     max_chunk_size);
 
+	if (hmzoned)
+		max_chunk_size = max(round_down(max_chunk_size,
+						info->zone_size),
+				     info->zone_size);
+
 	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
 			       GFP_NOFS);
 	if (!devices_info)
@@ -5259,6 +5333,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		if (total_avail == 0)
 			continue;
 
+		if (hmzoned && total_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		ret = find_free_dev_extent(device,
 					   max_stripe_size * dev_stripes,
 					   &dev_offset, &max_avail);
@@ -5277,6 +5354,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
+		if (hmzoned && max_avail < max_stripe_size * dev_stripes)
+			continue;
+
 		if (ndevs == fs_devices->rw_devices) {
 			WARN(1, "%s: found more than %llu devices\n",
 			     __func__, fs_devices->rw_devices);
@@ -5310,6 +5390,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 
 	ndevs = min(ndevs, devs_max);
 
+again:
 	/*
 	 * The primary goal is to maximize the number of stripes, so use as
 	 * many devices as possible, even if the stripes are not maximum sized.
@@ -5333,6 +5414,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * we try to reduce stripe_size.
 	 */
 	if (stripe_size * data_stripes > max_chunk_size) {
+		if (hmzoned) {
+			/*
+			 * stripe_size is fixed in HMZONED. Reduce ndevs
+			 * instead.
+			 */
+			WARN_ON(nparity != 0);
+			ndevs = div_u64(max_chunk_size * ncopies,
+					stripe_size * dev_stripes);
+			goto again;
+		}
+
 		/*
 		 * Reduce stripe_size, round it up to a 16MB boundary again and
 		 * then use it, unless it ends up being even bigger than the
@@ -5346,6 +5438,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	/* align to BTRFS_STRIPE_LEN */
 	stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN);
 
+	WARN_ON(hmzoned && stripe_size != info->zone_size);
+
 	map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
 	if (!map) {
 		ret = -ENOMEM;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (5 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 06/19] btrfs: align dev extent allocation to zone boundary Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:07   ` Josef Bacik
  2019-06-17 22:30   ` David Sterba
  2019-06-07 13:10 ` [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
                   ` (13 subsequent siblings)
  20 siblings, 2 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

On HMZONED drives, writes must always be sequential and directed at a block
group zone write pointer position. Thus, block allocation in a block group
must also be done sequentially using an allocation pointer equal to the
block group zone write pointer plus the number of blocks allocated but not
yet written.

The sequential allocation function find_free_extent_seq() bypasses the
checks in find_free_extent() and increases the reserved byte counter by
itself. An allocated region cannot be reverted in the sequential allocation
scheme, since reverting might race with other allocations and leave an
allocation hole, which would break the sequential write rule.

Furthermore, this commit introduces two new variables to struct
btrfs_block_group_cache. "wp_broken" indicates that the write pointer is
broken (e.g. not synced on a RAID1 block group) and marks that block group
read-only. "unusable" keeps track of the size of regions that were
allocated and then freed. Such regions are never usable until the
underlying zones are reset.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h            |  24 +++
 fs/btrfs/extent-tree.c      | 378 ++++++++++++++++++++++++++++++++++--
 fs/btrfs/free-space-cache.c |  33 ++++
 fs/btrfs/free-space-cache.h |   5 +
 4 files changed, 426 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6c00101407e4..f4bcd2a6ec12 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -582,6 +582,20 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/* Block group allocation types */
+enum btrfs_alloc_type {
+
+	/* Regular first fit allocation */
+	BTRFS_ALLOC_FIT		= 0,
+
+	/*
+	 * Sequential allocation: this is for HMZONED mode and
+	 * will result in ignoring free space before a block
+	 * group allocation offset.
+	 */
+	BTRFS_ALLOC_SEQ		= 1,
+};
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -592,6 +606,7 @@ struct btrfs_block_group_cache {
 	u64 reserved;
 	u64 delalloc_bytes;
 	u64 bytes_super;
+	u64 unusable;
 	u64 flags;
 	u64 cache_generation;
 
@@ -621,6 +636,7 @@ struct btrfs_block_group_cache {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int wp_broken:1;
 
 	int disk_cache_state;
 
@@ -694,6 +710,14 @@ struct btrfs_block_group_cache {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	/*
+	 * Allocation offset for the block group to implement sequential
+	 * allocation. This is used only with HMZONED mode enabled and if
+	 * the block group resides on a sequential zone.
+	 */
+	enum btrfs_alloc_type alloc_type;
+	u64 alloc_offset;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 363db58f56b8..ebd0d6eae038 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -28,6 +28,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "ref-verify.h"
+#include "rcu-string.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -590,6 +591,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	struct btrfs_caching_control *caching_ctl;
 	int ret = 0;
 
+	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
+
 	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
 	if (!caching_ctl)
 		return -ENOMEM;
@@ -6555,6 +6558,19 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
 }
 
+static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
+				       u64 ram_bytes, u64 num_bytes,
+				       int delalloc)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+
+	cache->reserved += num_bytes;
+	space_info->bytes_reserved += num_bytes;
+	update_bytes_may_use(space_info, -ram_bytes);
+	if (delalloc)
+		cache->delalloc_bytes += num_bytes;
+}
+
 /**
  * btrfs_add_reserved_bytes - update the block_group and space info counters
  * @cache:	The cache we are manipulating
@@ -6573,17 +6589,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 	struct btrfs_space_info *space_info = cache->space_info;
 	int ret = 0;
 
+	/* Should be handled by find_free_extent_seq() */
+	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
+
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
-	if (cache->ro) {
+	if (cache->ro)
 		ret = -EAGAIN;
-	} else {
-		cache->reserved += num_bytes;
-		space_info->bytes_reserved += num_bytes;
-		update_bytes_may_use(space_info, -ram_bytes);
-		if (delalloc)
-			cache->delalloc_bytes += num_bytes;
-	}
+	else
+		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
+					   delalloc);
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -6701,9 +6716,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 			cache = btrfs_lookup_block_group(fs_info, start);
 			BUG_ON(!cache); /* Logic error */
 
-			cluster = fetch_cluster_info(fs_info,
-						     cache->space_info,
-						     &empty_cluster);
+			if (cache->alloc_type == BTRFS_ALLOC_FIT)
+				cluster = fetch_cluster_info(fs_info,
+							     cache->space_info,
+							     &empty_cluster);
+			else
+				cluster = NULL;
+
 			empty_cluster <<= 1;
 		}
 
@@ -6743,7 +6762,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
 		space_info->max_extent_size = 0;
 		percpu_counter_add_batch(&space_info->total_bytes_pinned,
 			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
-		if (cache->ro) {
+		if (cache->ro || cache->alloc_type == BTRFS_ALLOC_SEQ) {
+			/* An ALLOC_SEQ BG needs a zone reset before reuse */
 			space_info->bytes_readonly += len;
 			readonly = true;
 		}
@@ -7588,6 +7608,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg,
 	return 0;
 }
 
+/*
+ * Simple allocator for sequential-only block groups. It allows only
+ * sequential allocation, so there is no need to play with trees. This
+ * function also reserves the bytes as in btrfs_add_reserved_bytes.
+ */
+
+static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
+				struct find_free_extent_ctl *ffe_ctl)
+{
+	struct btrfs_space_info *space_info = cache->space_info;
+	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
+	u64 start = cache->key.objectid;
+	u64 num_bytes = ffe_ctl->num_bytes;
+	u64 avail;
+	int ret = 0;
+
+	/* Sanity check */
+	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return 1;
+
+	spin_lock(&space_info->lock);
+	spin_lock(&cache->lock);
+
+	if (cache->ro) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	spin_lock(&ctl->tree_lock);
+	avail = cache->key.offset - cache->alloc_offset;
+	if (avail < num_bytes) {
+		ffe_ctl->max_extent_size = avail;
+		spin_unlock(&ctl->tree_lock);
+		ret = 1;
+		goto out;
+	}
+
+	ffe_ctl->found_offset = start + cache->alloc_offset;
+	cache->alloc_offset += num_bytes;
+	ctl->free_space -= num_bytes;
+	spin_unlock(&ctl->tree_lock);
+
+	BUG_ON(!IS_ALIGNED(ffe_ctl->found_offset,
+			   cache->fs_info->stripesize));
+	ffe_ctl->search_start = ffe_ctl->found_offset;
+	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
+				   ffe_ctl->delalloc);
+
+out:
+	spin_unlock(&cache->lock);
+	spin_unlock(&space_info->lock);
+	return ret;
+}
+
 /*
  * Return >0 means caller needs to re-search for free extent
  * Return 0 means we have the needed free extent.
@@ -7889,6 +7963,16 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
 			goto loop;
 
+		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+			ret = find_free_extent_seq(block_group, &ffe_ctl);
+			if (ret)
+				goto loop;
+			/* find_free_extent_seq() should ensure that
+			 * everything is OK and reserve the extent.
+			 */
+			goto nocheck;
+		}
+
 		/*
 		 * Ok we want to try and use the cluster allocator, so
 		 * lets look there
@@ -7944,6 +8028,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 					     num_bytes);
 			goto loop;
 		}
+nocheck:
 		btrfs_inc_block_group_reservations(block_group);
 
 		/* we are all good, lets return */
@@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 	}
 
 	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
-		    cache->bytes_super - btrfs_block_group_used(&cache->item);
+		    cache->bytes_super - cache->unusable -
+		    btrfs_block_group_used(&cache->item);
 	sinfo_used = btrfs_space_info_used(sinfo, true);
 
 	if (sinfo_used + num_bytes + min_allocable_bytes <=
@@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
 	if (!--cache->ro) {
 		num_bytes = cache->key.offset - cache->reserved -
 			    cache->pinned - cache->bytes_super -
+			    cache->unusable -
 			    btrfs_block_group_used(&cache->item);
 		sinfo->bytes_readonly -= num_bytes;
 		list_del_init(&cache->ro_list);
@@ -10200,11 +10287,240 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
 	}
 }
 
+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	struct btrfs_device *device;
+	u64 logical = cache->key.objectid;
+	u64 length = cache->key.offset;
+	u64 physical = 0;
+	int ret, alloc_type;
+	int i, j;
+	u64 *alloc_offsets = NULL;
+
+#define WP_MISSING_DEV ((u64)-1)
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, fs_info->zone_size)) {
+		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
+			  logical, length);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	em_tree = &fs_info->mapping_tree.map_tree;
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, logical, length);
+	read_unlock(&em_tree->lock);
+
+	if (!em)
+		return -EINVAL;
+
+	map = em->map_lookup;
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	alloc_type = -1;
+	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
+				GFP_NOFS);
+	if (!alloc_offsets) {
+		free_extent_map(em);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < map->num_stripes; i++) {
+		int is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		if (device->bdev == NULL) {
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		}
+
+		is_sequential = btrfs_dev_is_sequential(device, physical);
+		if (alloc_type == -1)
+			alloc_type = is_sequential ?
+					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
+
+		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
+		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
+			btrfs_err(fs_info, "found block group of mixed zone types");
+			ret = -EIO;
+			goto out;
+		}
+
+		if (!is_sequential)
+			continue;
+
+		/* This zone will be used for allocation, so mark the
+		 * zone non-empty.
+		 */
+		clear_bit(physical >> device->zone_size_shift,
+			  device->empty_zones);
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
+		if (ret == -EIO || ret == -EOPNOTSUPP) {
+			ret = 0;
+			alloc_offsets[i] = WP_MISSING_DEV;
+			continue;
+		} else if (ret) {
+			goto out;
+		}
+
+
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			btrfs_err(fs_info, "Offline/readonly zone %llu",
+				  physical >> device->zone_size_shift);
+			alloc_offsets[i] = WP_MISSING_DEV;
+			break;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = fs_info->zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] =
+				((zone.wp - zone.start) << SECTOR_SHIFT);
+			break;
+		}
+	}
+
+	if (alloc_type == BTRFS_ALLOC_FIT)
+		goto out;
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+		cache->alloc_offset = WP_MISSING_DEV;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV)
+				continue;
+			if (cache->alloc_offset == WP_MISSING_DEV)
+				cache->alloc_offset = alloc_offsets[i];
+			if (alloc_offsets[i] == cache->alloc_offset)
+				continue;
+
+			btrfs_err(fs_info,
+				  "write pointer mismatch: block group %llu",
+				  logical);
+			cache->wp_broken = 1;
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID0:
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] == WP_MISSING_DEV) {
+				btrfs_err(fs_info,
+					  "cannot recover write pointer: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[0] < alloc_offsets[i]) {
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			cache->alloc_offset += alloc_offsets[i];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+		/*
+		 * Pass1: check write pointer of RAID1 level: each pointer
+		 * should be equal.
+		 */
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i*map->sub_stripes;
+			u64 offset = WP_MISSING_DEV;
+
+			for (j = 0; j < map->sub_stripes; j++) {
+				if (alloc_offsets[base+j] == WP_MISSING_DEV)
+					continue;
+				if (offset == WP_MISSING_DEV)
+					offset = alloc_offsets[base+j];
+				if (alloc_offsets[base+j] == offset)
+					continue;
+
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+			}
+			for (j = 0; j < map->sub_stripes; j++)
+				alloc_offsets[base+j] = offset;
+		}
+
+		/* Pass2: check write pointer of RAID1 level */
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int base = i*map->sub_stripes;
+
+			if (alloc_offsets[base] == WP_MISSING_DEV) {
+				btrfs_err(fs_info,
+					  "cannot recover write pointer: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			if (alloc_offsets[0] < alloc_offsets[base]) {
+				btrfs_err(fs_info,
+					  "write pointer mismatch: block group %llu",
+					  logical);
+				cache->wp_broken = 1;
+				continue;
+			}
+
+			cache->alloc_offset += alloc_offsets[base];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* RAID5/6 is not supported yet */
+	default:
+		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
+			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	cache->alloc_type = alloc_type;
+	kfree(alloc_offsets);
+	free_extent_map(em);
+
+	return ret;
+}
+
 static struct btrfs_block_group_cache *
 btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 			       u64 start, u64 size)
 {
 	struct btrfs_block_group_cache *cache;
+	int ret;
 
 	cache = kzalloc(sizeof(*cache), GFP_NOFS);
 	if (!cache)
@@ -10238,6 +10554,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
+	cache->alloc_type = BTRFS_ALLOC_FIT;
+	cache->alloc_offset = 0;
+
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		ret = btrfs_get_block_group_alloc_offset(cache);
+		if (ret) {
+			kfree(cache);
+			return NULL;
+		}
+	}
 
 	return cache;
 }
@@ -10310,6 +10636,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	int need_clear = 0;
 	u64 cache_gen;
 	u64 feature;
+	u64 unusable;
 	int mixed;
 
 	feature = btrfs_super_incompat_flags(info->super_copy);
@@ -10415,6 +10742,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			free_excluded_extents(cache);
 		}
 
+		switch (cache->alloc_type) {
+		case BTRFS_ALLOC_FIT:
+			unusable = cache->bytes_super;
+			break;
+		case BTRFS_ALLOC_SEQ:
+			WARN_ON(cache->bytes_super != 0);
+			unusable = cache->alloc_offset -
+				btrfs_block_group_used(&cache->item);
+			/* we only need ->free_space in ALLOC_SEQ BGs */
+			cache->last_byte_to_unpin = (u64)-1;
+			cache->cached = BTRFS_CACHE_FINISHED;
+			cache->free_space_ctl->free_space =
+				cache->key.offset - cache->alloc_offset;
+			cache->unusable = unusable;
+			free_excluded_extents(cache);
+			break;
+		default:
+			BUG();
+		}
+
 		ret = btrfs_add_block_group_cache(info, cache);
 		if (ret) {
 			btrfs_remove_free_space_cache(cache);
@@ -10425,7 +10772,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		trace_btrfs_add_block_group(info, cache, 0);
 		update_space_info(info, cache->flags, found_key.offset,
 				  btrfs_block_group_used(&cache->item),
-				  cache->bytes_super, &space_info);
+				  unusable, &space_info);
 
 		cache->space_info = space_info;
 
@@ -10438,6 +10785,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			ASSERT(list_empty(&cache->bg_list));
 			btrfs_mark_bg_unused(cache);
 		}
+
+		if (cache->wp_broken)
+			inc_block_group_ro(cache, 1);
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index f74dc259307b..cc69dc71f4c1 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   u64 offset, u64 bytes)
 {
 	struct btrfs_free_space *info;
+	struct btrfs_block_group_cache *block_group = ctl->private;
 	int ret = 0;
 
+	WARN_ON(block_group && block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
 	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
 	if (!info)
 		return -ENOMEM;
@@ -2376,6 +2379,28 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
+			       u64 bytenr, u64 size)
+{
+	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+	u64 offset = bytenr - block_group->key.objectid;
+	u64 to_free, to_unusable;
+
+	spin_lock(&ctl->tree_lock);
+	if (offset >= block_group->alloc_offset)
+		to_free = size;
+	else if (offset + size <= block_group->alloc_offset)
+		to_free = 0;
+	else
+		to_free = offset + size - block_group->alloc_offset;
+	to_unusable = size - to_free;
+	ctl->free_space += to_free;
+	block_group->unusable += to_unusable;
+	spin_unlock(&ctl->tree_lock);
+	return 0;
+
+}
+
 int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 			    u64 offset, u64 bytes)
 {
@@ -2384,6 +2409,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
 	int ret;
 	bool re_search = false;
 
+	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
 	spin_lock(&ctl->tree_lock);
 
 again:
@@ -2619,6 +2646,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
 	u64 align_gap = 0;
 	u64 align_gap_len = 0;
 
+	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
 	spin_lock(&ctl->tree_lock);
 	entry = find_free_space(ctl, &offset, &bytes_search,
 				block_group->full_stripe_len, max_extent_size);
@@ -2738,6 +2767,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
 	struct rb_node *node;
 	u64 ret = 0;
 
+	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
 	spin_lock(&cluster->lock);
 	if (bytes > cluster->max_size)
 		goto out;
@@ -3384,6 +3415,8 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
 {
 	int ret;
 
+	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
 	*trimmed = 0;
 
 	spin_lock(&block_group->lock);
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 8760acb55ffd..d30667784f73 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -73,10 +73,15 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group);
 int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
 			   struct btrfs_free_space_ctl *ctl,
 			   u64 bytenr, u64 size);
+int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
+			       u64 bytenr, u64 size);
 static inline int
 btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
 		     u64 bytenr, u64 size)
 {
+	if (block_group->alloc_type == BTRFS_ALLOC_SEQ)
+		return __btrfs_add_free_space_seq(block_group, bytenr, size);
+
 	return __btrfs_add_free_space(block_group->fs_info,
 				      block_group->free_space_ctl,
 				      bytenr, size);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (6 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:09   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

If a btrfs volume has mirrored block groups, btrfs unconditionally makes
un-mirrored block groups read-only. When mirrored block groups exist but
none of them is writable, this drops all writable block groups. So, check
that at least one mirrored block group is writable before making
un-mirrored block groups read-only.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ebd0d6eae038..3d41d840fe5c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10791,6 +10791,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	}
 
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
+		bool has_rw = false;
+		int i;
+
 		if (!(get_alloc_profile(info, space_info->flags) &
 		      (BTRFS_BLOCK_GROUP_RAID10 |
 		       BTRFS_BLOCK_GROUP_RAID1 |
@@ -10798,6 +10801,25 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		       BTRFS_BLOCK_GROUP_RAID6 |
 		       BTRFS_BLOCK_GROUP_DUP)))
 			continue;
+
+		/* Check that we have at least one writable mirrored block group */
+		for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+			if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE)
+				continue;
+			list_for_each_entry(cache, &space_info->block_groups[i],
+					    list) {
+				if (!cache->ro) {
+					has_rw = true;
+					break;
+				}
+			}
+			if (has_rw)
+				break;
+		}
+
+		if (!has_rw)
+			continue;
+
 		/*
 		 * avoid allocating from un-mirrored block group if there are
 		 * mirrored block groups.
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (7 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:12   ` Josef Bacik
                     ` (2 more replies)
  2019-06-07 13:10 ` [PATCH 10/19] btrfs: rename btrfs_map_bio() Naohiro Aota
                   ` (11 subsequent siblings)
  20 siblings, 3 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

When in HMZONED mode, make sure that device super blocks are located in
randomly writable zones of zoned block devices. That is, do not write super
blocks in the sequential write required zones of host-managed zoned block
devices, as in-place updates would not be possible there.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     | 11 +++++++++++
 fs/btrfs/disk-io.h     |  1 +
 fs/btrfs/extent-tree.c |  4 ++++
 fs/btrfs/scrub.c       |  2 ++
 4 files changed, 18 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7c1404c76768..ddbb02906042 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
 	return latest;
 }
 
+int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
+{
+	/* Any address is good on a regular (zone_size == 0) device. */
+	/* Only non-SEQUENTIAL WRITE REQUIRED zones are usable on a zoned device. */
+	return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
+}
+
 /*
  * Write superblock @sb to the @device. Do not wait for completion, all the
  * buffer heads we write are pinned.
@@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(device, bytenr))
+			continue;
 
 		btrfs_set_super_bytenr(sb, bytenr);
 
@@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
 		    device->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(device, bytenr))
+			continue;
 
 		bh = __find_get_block(device->bdev,
 				      bytenr / BTRFS_BDEV_BLOCKSIZE,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index a0161aa1ea0b..70e97cd6fa76 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
 		struct page *page, size_t pg_offset, u64 start, u64 len,
 		int create);
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
+int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
 int __init btrfs_end_io_wq_init(void);
 void __cold btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d41d840fe5c..ae2c895d08c4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
 			return ret;
 	}
 
+	/* we won't have super stripes in sequential zones */
+	if (cache->alloc_type == BTRFS_ALLOC_SEQ)
+		return 0;
+
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
 		ret = btrfs_rmap_block(fs_info, cache->key.objectid,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f7b29f9db5e2..36ad4fad7eaf 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 		if (bytenr + BTRFS_SUPER_INFO_SIZE >
 		    scrub_dev->commit_total_bytes)
 			break;
+		if (!btrfs_check_super_location(scrub_dev, bytenr))
+			continue;
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 10/19] btrfs: rename btrfs_map_bio()
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (8 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-07 13:10 ` [PATCH 11/19] btrfs: introduce submit buffer Naohiro Aota
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch renames btrfs_map_bio() to __btrfs_map_bio() to prepare for
using __btrfs_map_bio() as a helper function.

NOTE: maybe squash with the next patch?

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/volumes.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c1ed3b6e3cfd..52d0d458c0fd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6808,8 +6808,9 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 	}
 }
 
-blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num, int async_submit)
+static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
+				    struct bio *bio, int mirror_num,
+				    int async_submit)
 {
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
@@ -6884,6 +6885,12 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	return BLK_STS_OK;
 }
 
+blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+			   int mirror_num, int async_submit)
+{
+	return __btrfs_map_bio(fs_info, bio, mirror_num, async_submit);
+}
+
 /*
  * Find a device specified by @devid or @uuid in the list of @fs_devices, or
  * return NULL.
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (9 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 10/19] btrfs: rename btrfs_map_bio() Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:14   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 12/19] btrfs: expire submit buffer on timeout Naohiro Aota
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Sequential allocation is not enough to guarantee sequential delivery of
write I/Os to the device. Various btrfs features (async compress, async
checksum, ...) can reorder the I/Os. This patch introduces a submit buffer
that collects WRITE bios belonging to a block group and dispatches them in
increasing block address order, so that __btrfs_map_bio() issues sequential
write streams.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h             |   3 +
 fs/btrfs/extent-tree.c       |   5 ++
 fs/btrfs/volumes.c           | 165 +++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h           |   3 +
 include/trace/events/btrfs.h |  41 +++++++++
 5 files changed, 212 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f4bcd2a6ec12..ade6d8243962 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -718,6 +718,9 @@ struct btrfs_block_group_cache {
 	 */
 	enum btrfs_alloc_type alloc_type;
 	u64 alloc_offset;
+	struct mutex submit_lock;
+	u64 submit_offset;
+	struct bio_list submit_buffer;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ae2c895d08c4..ebdc7a6dbe01 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -124,6 +124,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 	if (atomic_dec_and_test(&cache->count)) {
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
+		WARN_ON(!bio_list_empty(&cache->submit_buffer));
 
 		/*
 		 * If not empty, someone is still holding mutex of
@@ -10511,6 +10512,8 @@ btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
 		goto out;
 	}
 
+	cache->submit_offset = logical + cache->alloc_offset;
+
 out:
 	cache->alloc_type = alloc_type;
 	kfree(alloc_offsets);
@@ -10547,6 +10550,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
+	mutex_init(&cache->submit_lock);
 	init_rwsem(&cache->data_rwsem);
 	INIT_LIST_HEAD(&cache->list);
 	INIT_LIST_HEAD(&cache->cluster_list);
@@ -10554,6 +10558,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	INIT_LIST_HEAD(&cache->ro_list);
 	INIT_LIST_HEAD(&cache->dirty_list);
 	INIT_LIST_HEAD(&cache->io_list);
+	bio_list_init(&cache->submit_buffer);
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 52d0d458c0fd..26a64a53032f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -29,6 +29,11 @@
 #include "sysfs.h"
 #include "tree-checker.h"
 
+struct map_bio_data {
+	void *orig_bi_private;
+	int mirror_num;
+};
+
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
 		.sub_stripes	= 2,
@@ -523,6 +528,7 @@ static void requeue_list(struct btrfs_pending_bios *pending_bios,
 		pending_bios->tail = tail;
 }
 
+
 /*
  * we try to collect pending bios for a device so we don't get a large
  * number of procs sending bios down to the same device.  This greatly
@@ -606,6 +612,8 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 	spin_unlock(&device->io_lock);
 
 	while (pending) {
+		struct btrfs_bio *bbio;
+		struct completion *sent = NULL;
 
 		rmb();
 		/* we want to work on both lists, but do more bios on the
@@ -643,7 +651,12 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 			sync_pending = 0;
 		}
 
+		bbio = cur->bi_private;
+		if (bbio)
+			sent = bbio->sent;
 		btrfsic_submit_bio(cur);
+		if (sent)
+			complete(sent);
 		num_run++;
 		batch_run++;
 
@@ -5916,6 +5929,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 
 	atomic_set(&bbio->error, 0);
 	refcount_set(&bbio->refs, 1);
+	INIT_LIST_HEAD(&bbio->list);
 
 	return bbio;
 }
@@ -6730,7 +6744,7 @@ static void btrfs_end_bio(struct bio *bio)
  * the work struct is scheduled.
  */
 static noinline void btrfs_schedule_bio(struct btrfs_device *device,
-					struct bio *bio)
+					struct bio *bio, int need_seqwrite)
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
 	int should_queue = 1;
@@ -6738,7 +6752,12 @@ static noinline void btrfs_schedule_bio(struct btrfs_device *device,
 
 	/* don't bother with additional async steps for reads, right now */
 	if (bio_op(bio) == REQ_OP_READ) {
+		struct btrfs_bio *bbio = bio->bi_private;
+		struct completion *sent = bbio->sent;
+
 		btrfsic_submit_bio(bio);
+		if (sent)
+			complete(sent);
 		return;
 	}
 
@@ -6746,7 +6765,7 @@ static noinline void btrfs_schedule_bio(struct btrfs_device *device,
 	bio->bi_next = NULL;
 
 	spin_lock(&device->io_lock);
-	if (op_is_sync(bio->bi_opf))
+	if (op_is_sync(bio->bi_opf) && need_seqwrite == 0)
 		pending_bios = &device->pending_sync_bios;
 	else
 		pending_bios = &device->pending_bios;
@@ -6785,8 +6804,21 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 
 	btrfs_bio_counter_inc_noblocked(fs_info);
 
+	/* queue all bios into scheduler if sequential write is required */
+	if (bbio->need_seqwrite) {
+		if (!async) {
+			DECLARE_COMPLETION_ONSTACK(sent);
+
+			bbio->sent = &sent;
+			btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+			wait_for_completion_io(&sent);
+		} else {
+			btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+		}
+		return;
+	}
 	if (async)
-		btrfs_schedule_bio(dev, bio);
+		btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
 	else
 		btrfsic_submit_bio(bio);
 }
@@ -6808,9 +6840,10 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 	}
 }
 
+
 static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 				    struct bio *bio, int mirror_num,
-				    int async_submit)
+				    int async_submit, int need_seqwrite)
 {
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
@@ -6838,6 +6871,7 @@ static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 	bbio->private = first_bio->bi_private;
 	bbio->end_io = first_bio->bi_end_io;
 	bbio->fs_info = fs_info;
+	bbio->need_seqwrite = need_seqwrite;
 	atomic_set(&bbio->stripes_pending, bbio->num_stripes);
 
 	if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) &&
@@ -6885,10 +6919,131 @@ static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 	return BLK_STS_OK;
 }
 
+static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
+					  struct bio *cur_bio, int mirror_num,
+					  int async_submit)
+{
+	u64 logical = (u64)cur_bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 length = cur_bio->bi_iter.bi_size;
+	struct bio *bio;
+	struct bio *next;
+	struct bio_list submit_list;
+	struct btrfs_block_group_cache *cache = NULL;
+	struct map_bio_data *map_private;
+	int sent;
+	blk_status_t ret;
+
+	WARN_ON(bio_op(cur_bio) != REQ_OP_WRITE);
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache || cache->alloc_type != BTRFS_ALLOC_SEQ) {
+		if (cache)
+			btrfs_put_block_group(cache);
+		return __btrfs_map_bio(fs_info, cur_bio, mirror_num,
+				       async_submit, 0);
+	}
+
+	mutex_lock(&cache->submit_lock);
+	if (cache->submit_offset == logical)
+		goto send_bios;
+
+	if (cache->submit_offset > logical) {
+		trace_btrfs_bio_before_write_pointer(cache, cur_bio);
+		mutex_unlock(&cache->submit_lock);
+		btrfs_put_block_group(cache);
+		WARN_ON_ONCE(1);
+		return BLK_STS_IOERR;
+	}
+
+	/* this bio is ahead of the write pointer; buffer it */
+	map_private = kmalloc(sizeof(*map_private), GFP_NOFS);
+	if (!map_private) {
+		mutex_unlock(&cache->submit_lock);
+		return errno_to_blk_status(-ENOMEM);
+	}
+
+	map_private->orig_bi_private = cur_bio->bi_private;
+	map_private->mirror_num = mirror_num;
+	cur_bio->bi_private = map_private;
+
+	bio_list_add(&cache->submit_buffer, cur_bio);
+	mutex_unlock(&cache->submit_lock);
+	btrfs_put_block_group(cache);
+
+	/* pretend success; the buffered bio is submitted in order later */
+	return BLK_STS_OK;
+
+send_bios:
+	mutex_unlock(&cache->submit_lock);
+	/* send this bio */
+	ret = __btrfs_map_bio(fs_info, cur_bio, mirror_num, 1, 1);
+	if (ret != BLK_STS_OK) {
+		/* TODO kill buffered bios */
+		return ret;
+	}
+
+loop:
+	/* and send previously buffered following bios */
+	mutex_lock(&cache->submit_lock);
+	cache->submit_offset += length;
+	length = 0;
+	bio_list_init(&submit_list);
+
+	/* collect sequential bios into submit_list */
+	do {
+		sent = 0;
+		bio = bio_list_get(&cache->submit_buffer);
+		while (bio) {
+			u64 logical =
+				(u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+			struct bio_list *target;
+
+			next = bio->bi_next;
+			bio->bi_next = NULL;
+
+			if (logical == cache->submit_offset + length) {
+				sent = 1;
+				length += bio->bi_iter.bi_size;
+				target = &submit_list;
+			} else {
+				target = &cache->submit_buffer;
+			}
+			bio_list_add(target, bio);
+
+			bio = next;
+		}
+	} while (sent);
+	mutex_unlock(&cache->submit_lock);
+
+	/* send the collected bios */
+	while ((bio = bio_list_pop(&submit_list)) != NULL) {
+		map_private = (struct map_bio_data *)bio->bi_private;
+		mirror_num = map_private->mirror_num;
+		bio->bi_private = map_private->orig_bi_private;
+		kfree(map_private);
+
+		ret = __btrfs_map_bio(fs_info, bio, mirror_num, 1, 1);
+		if (ret) {
+			bio->bi_status = ret;
+			bio_endio(bio);
+		}
+	}
+
+	if (length)
+		goto loop;
+	btrfs_put_block_group(cache);
+
+	return BLK_STS_OK;
+}
+
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			   int mirror_num, int async_submit)
 {
-	return __btrfs_map_bio(fs_info, bio, mirror_num, async_submit);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && bio_op(bio) == REQ_OP_WRITE)
+		return __btrfs_map_bio_zoned(fs_info, bio, mirror_num,
+					     async_submit);
+
+	return __btrfs_map_bio(fs_info, bio, mirror_num, async_submit, 0);
 }
 
 /*
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f66755e43669..e97d13cb1627 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -329,6 +329,9 @@ struct btrfs_bio {
 	int mirror_num;
 	int num_tgtdevs;
 	int *tgtdev_map;
+	int need_seqwrite;
+	struct list_head list;
+	struct completion *sent;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index fe4d268028ee..2b4cd791bf24 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2091,6 +2091,47 @@ DEFINE_BTRFS_LOCK_EVENT(btrfs_try_tree_read_lock);
 DEFINE_BTRFS_LOCK_EVENT(btrfs_try_tree_write_lock);
 DEFINE_BTRFS_LOCK_EVENT(btrfs_tree_read_lock_atomic);
 
+DECLARE_EVENT_CLASS(btrfs_hmzoned_bio_buffer_events,
+	TP_PROTO(const struct btrfs_block_group_cache *cache,
+		 const struct bio *bio),
+
+	TP_ARGS(cache, bio),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	block_group	)
+		__field(	u64,	flags		)
+		__field(	u64,	submit_pos	)
+		__field(	u64,	logical	)
+		__field(	u64,	length		)
+	),
+
+	TP_fast_assign_btrfs(cache->fs_info,
+		__entry->block_group = cache->key.objectid;
+		__entry->flags = cache->flags;
+		__entry->submit_pos = cache->submit_offset;
+		__entry->logical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+		__entry->length = bio->bi_iter.bi_size;
+	),
+
+	TP_printk_btrfs(
+		"block_group=%llu(%s) submit_pos=%llu logical=%llu length=%llu",
+		__entry->block_group,
+		__print_flags((unsigned long)__entry->flags, "|",
+			      BTRFS_GROUP_FLAGS),
+		__entry->submit_pos, __entry->logical,
+		__entry->length)
+);
+
+#define DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(name)			\
+DEFINE_EVENT(btrfs_hmzoned_bio_buffer_events, name,			\
+	     TP_PROTO(const struct btrfs_block_group_cache *cache,	\
+		      const struct bio *bio),				\
+									\
+	     TP_ARGS(cache, bio)					\
+)
+
+DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(btrfs_bio_before_write_pointer);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 12/19] btrfs: expire submit buffer on timeout
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (10 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 11/19] btrfs: introduce submit buffer Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:15   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode Naohiro Aota
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

It is possible for bios to stall in the submit buffer due to a bug or a
device problem. In such a situation, btrfs stops making progress while
waiting for the buffered bios to complete. To avoid such a hang, add a
worker that cancels the stalled bios after a timeout.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h             |  13 ++++
 fs/btrfs/disk-io.c           |   2 +
 fs/btrfs/extent-tree.c       |  16 +++-
 fs/btrfs/super.c             |  18 +++++
 fs/btrfs/volumes.c           | 146 ++++++++++++++++++++++++++++++++++-
 include/trace/events/btrfs.h |   2 +
 6 files changed, 193 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ade6d8243962..dad8ea5c3b99 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -596,6 +596,8 @@ enum btrfs_alloc_type {
 	BTRFS_ALLOC_SEQ		= 1,
 };
 
+struct expire_work;
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -721,6 +723,14 @@ struct btrfs_block_group_cache {
 	struct mutex submit_lock;
 	u64 submit_offset;
 	struct bio_list submit_buffer;
+	struct expire_work *expire_work;
+	unsigned int expired:1;
+};
+
+struct expire_work {
+	struct list_head list;
+	struct delayed_work work;
+	struct btrfs_block_group_cache *block_group;
 };
 
 /* delayed seq elem */
@@ -1194,6 +1204,9 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	struct list_head expire_work_list;
+	struct mutex expire_work_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ddbb02906042..56a416902ce7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2717,6 +2717,8 @@ int open_ctree(struct super_block *sb,
 	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	spin_lock_init(&fs_info->reada_lock);
 	btrfs_init_ref_verify(fs_info);
+	INIT_LIST_HEAD(&fs_info->expire_work_list);
+	mutex_init(&fs_info->expire_work_lock);
 
 	fs_info->thread_pool_size = min_t(unsigned long,
 					  num_online_cpus() + 2, 8);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ebdc7a6dbe01..cb29a96c226b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -125,6 +125,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
 		WARN_ON(!bio_list_empty(&cache->submit_buffer));
+		WARN_ON(cache->expire_work);
 
 		/*
 		 * If not empty, someone is still holding mutex of
@@ -10180,6 +10181,13 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		    block_group->cached == BTRFS_CACHE_ERROR)
 			free_excluded_extents(block_group);
 
+		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+			mutex_lock(&block_group->submit_lock);
+			WARN_ON(!bio_list_empty(&block_group->submit_buffer));
+			WARN_ON(block_group->expire_work != NULL);
+			mutex_unlock(&block_group->submit_lock);
+		}
+
 		btrfs_remove_free_space_cache(block_group);
 		ASSERT(block_group->cached != BTRFS_CACHE_STARTED);
 		ASSERT(list_empty(&block_group->dirty_list));
@@ -10513,6 +10521,7 @@ btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
 	}
 
 	cache->submit_offset = logical + cache->alloc_offset;
+	cache->expired = 0;
 
 out:
 	cache->alloc_type = alloc_type;
@@ -10565,6 +10574,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 	cache->alloc_type = BTRFS_ALLOC_FIT;
 	cache->alloc_offset = 0;
+	cache->expire_work = NULL;
 
 	if (btrfs_fs_incompat(fs_info, HMZONED)) {
 		ret = btrfs_get_block_group_alloc_offset(cache);
@@ -11329,11 +11339,13 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
+		mutex_lock(&block_group->submit_lock);
 		spin_lock(&block_group->lock);
 		if (block_group->reserved || block_group->pinned ||
 		    btrfs_block_group_used(&block_group->item) ||
 		    block_group->ro ||
-		    list_is_singular(&block_group->list)) {
+		    list_is_singular(&block_group->list) ||
+		    !bio_list_empty(&block_group->submit_buffer)) {
 			/*
 			 * We want to bail if we made new allocations or have
 			 * outstanding allocations in this block group.  We do
@@ -11342,10 +11354,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 			 */
 			trace_btrfs_skip_unused_block_group(block_group);
 			spin_unlock(&block_group->lock);
+			mutex_unlock(&block_group->submit_lock);
 			up_write(&space_info->groups_sem);
 			goto next;
 		}
 		spin_unlock(&block_group->lock);
+		mutex_unlock(&block_group->submit_lock);
 
 		/* We don't want to force the issue, only flip if it's ok. */
 		ret = inc_block_group_ro(block_group, 0);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 740a701f16c5..343c26537999 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -154,6 +154,24 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
 	 * completes. The next time when the filesystem is mounted writable
 	 * again, the device replace operation continues.
 	 */
+
+	/* expire pending bios in submit buffer */
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		struct expire_work *work;
+		struct btrfs_block_group_cache *block_group;
+
+		mutex_lock(&fs_info->expire_work_lock);
+		list_for_each_entry(work, &fs_info->expire_work_list, list) {
+			block_group = work->block_group;
+			mutex_lock(&block_group->submit_lock);
+			if (block_group->expire_work)
+				mod_delayed_work(
+					system_unbound_wq,
+					&block_group->expire_work->work, 0);
+			mutex_unlock(&block_group->submit_lock);
+		}
+		mutex_unlock(&fs_info->expire_work_lock);
+	}
 }
 
 #ifdef CONFIG_PRINTK
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 26a64a53032f..a04379e440fb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6840,6 +6840,124 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 	}
 }
 
+static void expire_bios_fn(struct work_struct *work)
+{
+	struct expire_work *ework;
+	struct btrfs_block_group_cache *cache;
+	struct bio *bio, *next;
+
+	ework = container_of(work, struct expire_work, work.work);
+	cache = ework->block_group;
+
+	mutex_lock(&cache->fs_info->expire_work_lock);
+	mutex_lock(&cache->submit_lock);
+	list_del(&cache->expire_work->list);
+
+	if (btrfs_fs_closing(cache->fs_info)) {
+		WARN_ON(!bio_list_empty(&cache->submit_buffer));
+		goto end;
+	}
+
+	if (bio_list_empty(&cache->submit_buffer))
+		goto end;
+
+	bio = bio_list_get(&cache->submit_buffer);
+	cache->expired = 1;
+	mutex_unlock(&cache->submit_lock);
+
+	btrfs_handle_fs_error(cache->fs_info, -EIO,
+			      "bio submit buffer expired");
+	btrfs_err(cache->fs_info, "block group %llu submit pos %llu",
+		  cache->key.objectid, cache->submit_offset);
+
+	while (bio) {
+		struct map_bio_data *map_private =
+			(struct map_bio_data *)bio->bi_private;
+
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		bio->bi_private = map_private->orig_bi_private;
+		kfree(map_private);
+
+		trace_btrfs_expire_bio(cache, bio);
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+
+		bio = next;
+	}
+
+end:
+	kfree(cache->expire_work);
+	cache->expire_work = NULL;
+	mutex_unlock(&cache->submit_lock);
+	mutex_unlock(&cache->fs_info->expire_work_lock);
+	btrfs_put_block_group(cache);
+}
+
+static int schedule_expire_work(struct btrfs_block_group_cache *cache)
+{
+	const unsigned long delay = 90 * HZ;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct expire_work *work;
+	int ret = 0;
+
+	mutex_lock(&fs_info->expire_work_lock);
+	mutex_lock(&cache->submit_lock);
+	if (cache->expire_work) {
+		mod_delayed_work(system_unbound_wq, &cache->expire_work->work,
+				 delay);
+		goto end;
+	}
+
+	work = kmalloc(sizeof(*work), GFP_NOFS);
+	if (!work) {
+		ret = -ENOMEM;
+		goto end;
+	}
+	work->block_group = cache;
+	INIT_LIST_HEAD(&work->list);
+	INIT_DELAYED_WORK(&work->work, expire_bios_fn);
+	cache->expire_work = work;
+
+	list_add(&work->list, &fs_info->expire_work_list);
+	btrfs_get_block_group(cache);
+	mod_delayed_work(system_unbound_wq, &cache->expire_work->work, delay);
+
+end:
+	mutex_unlock(&cache->submit_lock);
+	mutex_unlock(&cache->fs_info->expire_work_lock);
+	return ret;
+}
+
+static bool cancel_expire_work(struct btrfs_block_group_cache *cache)
+{
+	struct expire_work *work;
+	bool ret = true;
+
+	mutex_lock(&cache->fs_info->expire_work_lock);
+	mutex_lock(&cache->submit_lock);
+	work = cache->expire_work;
+	if (!work)
+		goto end;
+	cache->expire_work = NULL;
+
+	ret = cancel_delayed_work(&work->work);
+	/*
+	 * if cancel failed, expire_work is freed by the
+	 * expire worker thread
+	 */
+	if (!ret)
+		goto end;
+
+	list_del(&work->list);
+	kfree(work);
+	btrfs_put_block_group(cache);
+
+end:
+	mutex_unlock(&cache->submit_lock);
+	mutex_unlock(&cache->fs_info->expire_work_lock);
+	return ret;
+}
 
 static blk_status_t __btrfs_map_bio(struct btrfs_fs_info *fs_info,
 				    struct bio *bio, int mirror_num,
@@ -6931,7 +7049,9 @@ static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_group_cache *cache = NULL;
 	struct map_bio_data *map_private;
 	int sent;
+	bool should_queue;
 	blk_status_t ret;
+	int ret2;
 
 	WARN_ON(bio_op(cur_bio) != REQ_OP_WRITE);
 
@@ -6944,8 +7064,20 @@ static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
 	}
 
 	mutex_lock(&cache->submit_lock);
-	if (cache->submit_offset == logical)
+
+	if (cache->expired) {
+		trace_btrfs_bio_in_expired_block_group(cache, cur_bio);
+		mutex_unlock(&cache->submit_lock);
+		btrfs_put_block_group(cache);
+		WARN_ON_ONCE(1);
+		return BLK_STS_IOERR;
+	}
+
+	if (cache->submit_offset == logical) {
+		mutex_unlock(&cache->submit_lock);
+		cancel_expire_work(cache);
 		goto send_bios;
+	}
 
 	if (cache->submit_offset > logical) {
 		trace_btrfs_bio_before_write_pointer(cache, cur_bio);
@@ -6968,13 +7100,18 @@ static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
 
 	bio_list_add(&cache->submit_buffer, cur_bio);
 	mutex_unlock(&cache->submit_lock);
+
+	ret2 = schedule_expire_work(cache);
+	if (ret2) {
+		btrfs_put_block_group(cache);
+		return errno_to_blk_status(ret2);
+	}
 	btrfs_put_block_group(cache);
 
 	/* mimic a good result ... */
 	return BLK_STS_OK;
 
 send_bios:
-	mutex_unlock(&cache->submit_lock);
 	/* send this bio */
 	ret = __btrfs_map_bio(fs_info, cur_bio, mirror_num, 1, 1);
 	if (ret != BLK_STS_OK) {
@@ -7013,6 +7150,7 @@ static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
 			bio = next;
 		}
 	} while (sent);
+	should_queue = !bio_list_empty(&cache->submit_buffer);
 	mutex_unlock(&cache->submit_lock);
 
 	/* send the collected bios */
@@ -7031,8 +7169,10 @@ static blk_status_t __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info,
 
 	if (length)
 		goto loop;
-	btrfs_put_block_group(cache);
 
+	if (should_queue)
+		WARN_ON(schedule_expire_work(cache));
+	btrfs_put_block_group(cache);
 	return BLK_STS_OK;
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 2b4cd791bf24..0ffb0b330b6c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2131,6 +2131,8 @@ DEFINE_EVENT(btrfs_hmzoned_bio_buffer_events, name,			\
 )
 
 DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(btrfs_bio_before_write_pointer);
+DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(btrfs_expire_bio);
+DEFINE_BTRFS_HMZONED_BIO_BUF_EVENT(btrfs_bio_in_expired_block_group);
 
 #endif /* _TRACE_BTRFS_H */
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (11 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 12/19] btrfs: expire submit buffer on timeout Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:17   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Btrfs prioritizes sync I/Os so that the async checksum workers handle them
earlier. As a result, checksumming of a sync I/O to a larger logical
extent address can finish before checksumming of a non-sync I/O to a
smaller logical extent address.

Since the number of checksum workers is limited, a sync I/O can then wait
forever for the checksumming of I/Os to smaller addresses, which never
gets started.

This situation can be reproduced with e.g. fstests btrfs/073.

To avoid such reordering, disable sync I/O prioritization for now. Note
that in HMZONED mode sync I/Os must wait for the I/Os to smaller addresses
to finish anyway, so the prioritization has no benefit there.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 56a416902ce7..6651986da470 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -838,7 +838,7 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 
 	async->status = 0;
 
-	if (op_is_sync(bio->bi_opf))
+	if (op_is_sync(bio->bi_opf) && !btrfs_fs_incompat(fs_info, HMZONED))
 		btrfs_set_work_high_priority(&async->work);
 
 	btrfs_queue_work(fs_info->workers, &async->work);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (12 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:24   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 15/19] btrfs: reset zones of unused block groups Naohiro Aota
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On HMZONED drives, however, this
optimization blocks the following I/Os, as cancelling the write out of the
freed blocks breaks the sequential write sequence expected by the device.

This patch introduces a list of clean extent buffers that have been
released in a transaction. Btrfs consults the list before writing out and
waiting for the I/Os, and redirties a buffer if 1) it is in a sequential
BG, 2) it is in the not-yet-submitted range, and 3) it is not under I/O.
Such buffers are then marked for I/O in btrfs_write_and_wait_transaction()
so that proper bios are sent to the disk.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/disk-io.c     | 27 ++++++++++++++++++++++++---
 fs/btrfs/extent_io.c   |  1 +
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/transaction.c | 35 +++++++++++++++++++++++++++++++++++
 fs/btrfs/transaction.h |  3 +++
 5 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6651986da470..c6147fce648f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -535,7 +535,9 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 	if (csum_tree_block(eb, result))
 		return -EINVAL;
 
-	if (btrfs_header_level(eb))
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags))
+		ret = 0;
+	else if (btrfs_header_level(eb))
 		ret = btrfs_check_node(eb);
 	else
 		ret = btrfs_check_leaf_full(eb);
@@ -1115,10 +1117,20 @@ struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
 void btrfs_clean_tree_block(struct extent_buffer *buf)
 {
 	struct btrfs_fs_info *fs_info = buf->fs_info;
-	if (btrfs_header_generation(buf) ==
-	    fs_info->running_transaction->transid) {
+	struct btrfs_transaction *cur_trans = fs_info->running_transaction;
+
+	if (btrfs_header_generation(buf) == cur_trans->transid) {
 		btrfs_assert_tree_locked(buf);
 
+		if (btrfs_fs_incompat(fs_info, HMZONED) &&
+		    list_empty(&buf->release_list)) {
+			atomic_inc(&buf->refs);
+			spin_lock(&cur_trans->releasing_ebs_lock);
+			list_add_tail(&buf->release_list,
+				      &cur_trans->releasing_ebs);
+			spin_unlock(&cur_trans->releasing_ebs_lock);
+		}
+
 		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
 			percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 						 -buf->len,
@@ -4533,6 +4545,15 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(fs_info,
 				    fs_info->pinned_extents);
 
+	while (!list_empty(&cur_trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&cur_trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+
 	cur_trans->state =TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 13fca7bfc1f2..c73c69e2bef4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4816,6 +4816,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index aa18a16a6ed7..2987a01f84f9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -58,6 +58,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -186,6 +187,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	atomic_t spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3f6811cdf803..ded40ad75419 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -236,6 +236,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2219,7 +2221,31 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	wake_up(&fs_info->transaction_wait);
 
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		struct extent_buffer *eb;
+
+		list_for_each_entry(eb, &cur_trans->releasing_ebs,
+				    release_list) {
+			struct btrfs_block_group_cache *cache;
+
+			cache = btrfs_lookup_block_group(fs_info, eb->start);
+			if (!cache)
+				continue;
+			mutex_lock(&cache->submit_lock);
+			if (cache->alloc_type == BTRFS_ALLOC_SEQ &&
+			    cache->submit_offset <= eb->start &&
+			    !extent_buffer_under_io(eb)) {
+				set_extent_buffer_dirty(eb);
+				cache->space_info->bytes_readonly += eb->len;
+				set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+			}
+			mutex_unlock(&cache->submit_lock);
+			btrfs_put_block_group(cache);
+		}
+	}
+
 	ret = btrfs_write_and_wait_transaction(trans);
+
 	if (ret) {
 		btrfs_handle_fs_error(fs_info, ret,
 				      "Error while writing out transaction");
@@ -2227,6 +2253,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	while (!list_empty(&cur_trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&cur_trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 78c446c222b7..7984a7f01dd8 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -85,6 +85,9 @@ struct btrfs_transaction {
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	struct btrfs_fs_info *fs_info;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 15/19] btrfs: reset zones of unused block groups
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (13 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-07 13:10 ` [PATCH 16/19] btrfs: wait existing extents before truncating Naohiro Aota
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

On an HMZONED volume, a block group maps to a zone of the device. When an
unused block group is deleted, the zone backing the block group can be
reset, rewinding the zone write pointer back to the start of the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cb29a96c226b..ff4d55d6ef04 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2018,6 +2018,26 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
+			if (btrfs_dev_is_sequential(stripe->dev,
+						    stripe->physical) &&
+			    stripe->length == stripe->dev->zone_size) {
+				ret = blkdev_reset_zones(stripe->dev->bdev,
+							 stripe->physical >>
+								 SECTOR_SHIFT,
+							 stripe->length >>
+								 SECTOR_SHIFT,
+							 GFP_NOFS);
+				if (!ret)
+					discarded_bytes += stripe->length;
+				else
+					break;
+				set_bit(stripe->physical >>
+						stripe->dev->zone_size_shift,
+					stripe->dev->empty_zones);
+				continue;
+			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
 			if (!blk_queue_discard(req_q))
 				continue;
@@ -11430,7 +11450,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&space_info->lock);
 
 		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD);
+		trimming = btrfs_test_opt(fs_info, DISCARD) ||
+				btrfs_fs_incompat(fs_info, HMZONED);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 16/19] btrfs: wait existing extents before truncating
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (14 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 15/19] btrfs: reset zones of unused block groups Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:25   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode Naohiro Aota
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could break the
sequential write pattern in a block group if the truncated blocks are, for
example, followed by blocks allocated to another file. To avoid this
problem, always wait for the write out of all unwritten buffers before
proceeding with the truncation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 89542c19d09e..4e8c7921462f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5137,6 +5137,17 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_end_write_no_snapshotting(root);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			u64 sectormask = fs_info->sectorsize - 1;
+
+			ret = btrfs_wait_ordered_range(inode,
+						       newsize & (~sectormask),
+						       (u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (15 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 16/19] btrfs: wait existing extents before truncating Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-13 14:27   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 18/19] btrfs: support dev-replace " Naohiro Aota
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

In a write heavy workload, the following scenario can occur:

1. mark page #0 to page #2 (and their corresponding extent region) as dirty
   and candidate for delayed allocation

pages    0 1 2 3 4
dirty    o o o - -
towrite  - - - - -
delayed  o o o - -
alloc

2. extent_write_cache_pages() marks the dirty pages as TOWRITE

pages    0 1 2 3 4
dirty    o o o - -
towrite  o o o - -
delayed  o o o - -
alloc

3. Meanwhile, another write dirties page #3 and page #4

pages    0 1 2 3 4
dirty    o o o o o
towrite  o o o - -
delayed  o o o o o
alloc

4. find_lock_delalloc_range() decides to allocate a region covering page
   #0 to page #4
5. but extent_write_cache_pages() initiates writes only for the TOWRITE
   tagged pages (#0 to #2)

So the above process leaves page #3 and page #4 behind. Usually, the
periodic dirty flush kicks off write I/Os for pages #3 and #4. However, if
we try to mount a subvolume at this point, the mount process takes the
s_umount write lock, which blocks the periodic flush from coming in.

To deal with the problem, shrink the delayed allocation region so that it
contains only the pages that are expected to be written.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c73c69e2bef4..ea582ff85c73 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
+
+		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
+		    ((delalloc_start >> PAGE_SHIFT) <
+		     (delalloc_end >> PAGE_SHIFT))) {
+			unsigned long i;
+			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+
+			for (i = delalloc_start >> PAGE_SHIFT;
+			     i <= end_index; i++)
+				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
+						 PAGECACHE_TAG_TOWRITE))
+					break;
+
+			if (i <= end_index) {
+				u64 unlock_start = (u64)i << PAGE_SHIFT;
+
+				if (i == delalloc_start >> PAGE_SHIFT)
+					unlock_start += PAGE_SIZE;
+
+				unlock_extent(tree, unlock_start, delalloc_end);
+				__unlock_for_delalloc(inode, page, unlock_start,
+						      delalloc_end);
+				delalloc_end = unlock_start - 1;
+			}
+		}
+
 		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
 				delalloc_end, &page_started, nr_written, wbc);
 		/* File system has been set read-only */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 18/19] btrfs: support dev-replace in HMZONED mode
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (16 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode Naohiro Aota
@ 2019-06-07 13:10 ` " Naohiro Aota
  2019-06-13 14:33   ` Josef Bacik
  2019-06-07 13:10 ` [PATCH 19/19] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Currently, dev-replace copies all the device extents on the source device
to the target device, and it also clones new incoming write I/Os aimed at
the source device into the target device.

Cloning incoming I/Os can break the sequential write rule on the target
device: when a write is mapped into the middle of a block group, the
cloned I/O lands in the middle of a zone of the target device, violating
the rule.

However, the cloning function cannot simply be disabled, since incoming
I/Os targeting already copied device extents must be cloned so that the
I/O is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Because of the time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned (by handle_ops_on_dev_replace()) nor copied (by the
dev-replace process).

So the point is to copy only the already existing device extents. This
patch introduces mark_block_group_to_copy() to mark the existing block
groups as targets of copying. handle_ops_on_dev_replace() and dev-replace
can then check the flag to do their job.

This patch also handles the empty regions between used extents. Since
dev-replace is smart enough to copy only the used extents on the source
device, we have to fill the gaps to honor the sequential write rule on
the target device.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/dev-replace.c |  96 +++++++++++++++++++++++
 fs/btrfs/extent-tree.c |  32 +++++++-
 fs/btrfs/scrub.c       | 169 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c     |  27 ++++++-
 5 files changed, 319 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dad8ea5c3b99..a0be2b96117a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -639,6 +639,7 @@ struct btrfs_block_group_cache {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fbe5ea2a04ed..5011b5ce0e75 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -263,6 +263,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->dev_stats_valid = 1;
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
+	if (bdev_is_zoned(bdev)) {
+		ret = btrfs_get_dev_zonetypes(device);
+		if (ret) {
+			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+			goto error;
+		}
+	}
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
 	fs_info->fs_devices->open_devices++;
@@ -396,6 +403,88 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group_cache *cache;
+	struct extent_buffer *l;
+	int slot;
+	int ret;
+	u64 chunk_offset, length;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		cache->to_copy = 1;
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+
+	return ret;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -439,6 +528,13 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	}
 
 	need_unlock = true;
+
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ff4d55d6ef04..268365dd9a5d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -29,6 +29,7 @@
 #include "qgroup.h"
 #include "ref-verify.h"
 #include "rcu-string.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2022,7 +2023,31 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			if (btrfs_dev_is_sequential(stripe->dev,
 						    stripe->physical) &&
 			    stripe->length == stripe->dev->zone_size) {
-				ret = blkdev_reset_zones(stripe->dev->bdev,
+				struct btrfs_device *dev = stripe->dev;
+
+				ret = blkdev_reset_zones(dev->bdev,
+							 stripe->physical >>
+								 SECTOR_SHIFT,
+							 stripe->length >>
+								 SECTOR_SHIFT,
+							 GFP_NOFS);
+				if (!ret)
+					discarded_bytes += stripe->length;
+				else
+					break;
+				set_bit(stripe->physical >>
+					dev->zone_size_shift,
+					dev->empty_zones);
+
+				if (!btrfs_dev_replace_is_ongoing(
+					    &fs_info->dev_replace) ||
+				    stripe->dev != fs_info->dev_replace.srcdev)
+					continue;
+
+				/* send to target as well */
+				dev = fs_info->dev_replace.tgtdev;
+
+				ret = blkdev_reset_zones(dev->bdev,
 							 stripe->physical >>
 								 SECTOR_SHIFT,
 							 stripe->length >>
@@ -2033,8 +2058,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				else
 					break;
 				set_bit(stripe->physical >>
-						stripe->dev->zone_size_shift,
-					stripe->dev->empty_zones);
+					dev->zone_size_shift,
+					dev->empty_zones);
+
 				continue;
 			}
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 36ad4fad7eaf..7bfc19c50224 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -165,6 +165,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio        *wr_curr_bio;
 	struct mutex            wr_lock;
@@ -1646,6 +1647,19 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	sbio = sctx->wr_curr_bio;
 	if (sbio->page_count == 0) {
 		struct bio *bio;
+		u64 physical = spage->physical_for_dev_replace;
+
+		if (btrfs_fs_incompat(sctx->fs_info, HMZONED) &&
+		    sctx->write_pointer < physical) {
+			u64 length = physical - sctx->write_pointer;
+
+			ret = blkdev_issue_zeroout(
+				sctx->wr_tgtdev->bdev,
+				sctx->write_pointer >> SECTOR_SHIFT,
+				length >> SECTOR_SHIFT,
+				GFP_NOFS, 0);
+			sctx->write_pointer = physical;
+		}
 
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
@@ -1708,6 +1722,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -3030,6 +3048,43 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nmirrors = min(scrub_nr_raid_mirrors(bbio), BTRFS_MAX_MIRRORS);
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone, GFP_NOFS);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3161,6 +3216,15 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (btrfs_fs_incompat(fs_info, HMZONED) && sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+	}
+
+	sctx->flush_all_writes = true;
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3333,6 +3397,15 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			sctx->flush_all_writes = true;
+			scrub_submit(sctx);
+			mutex_lock(&sctx->wr_lock);
+			scrub_wr_submit(sctx);
+			mutex_unlock(&sctx->wr_lock);
+
+			wait_event(sctx->list_wait,
+				   atomic_read(&sctx->bios_in_flight) == 0);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3400,6 +3473,45 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (btrfs_fs_incompat(fs_info, HMZONED) && sctx->is_dev_replace &&
+	    ret >= 0) {
+		wait_event(sctx->list_wait,
+			   atomic_read(&sctx->bios_in_flight) == 0);
+
+		mutex_lock(&sctx->wr_lock);
+		if (sctx->write_pointer < physical_end &&
+		    btrfs_dev_is_sequential(sctx->wr_tgtdev,
+					    sctx->write_pointer)) {
+			struct blk_zone zone;
+			u64 wp;
+
+			ret = read_zone_info(fs_info, base + offset, &zone);
+			if (ret) {
+				btrfs_err(fs_info,
+					  "cannot recover write pointer");
+				goto out_zone_sync;
+			}
+
+			wp = map->stripes[num].physical +
+				((zone.wp - zone.start) << SECTOR_SHIFT);
+			if (sctx->write_pointer < wp) {
+				u64 length = wp - sctx->write_pointer;
+
+				ret = blkdev_issue_zeroout(
+					sctx->wr_tgtdev->bdev,
+					sctx->write_pointer >> SECTOR_SHIFT,
+					length >> SECTOR_SHIFT,
+					GFP_NOFS, 0);
+			}
+		}
+out_zone_sync:
+		mutex_unlock(&sctx->wr_lock);
+		clear_bit(map->stripes[num].physical >>
+			  sctx->wr_tgtdev->zone_size_shift,
+			  sctx->wr_tgtdev->empty_zones);
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
@@ -3468,11 +3580,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	int ret = 0;
 	int ro_set;
 	int slot;
+	int i, num_extents, cur_extent;
 	struct extent_buffer *l;
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	struct extent_map *em;
+	struct map_lookup *map;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -3487,6 +3602,23 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
 	while (1) {
+		if (btrfs_fs_incompat(fs_info, HMZONED) &&
+		    sctx->is_dev_replace) {
+			struct btrfs_trans_handle *trans;
+
+			scrub_pause_on(fs_info);
+			trans = btrfs_join_transaction(root);
+			if (IS_ERR(trans))
+				ret = PTR_ERR(trans);
+			else
+				ret = btrfs_commit_transaction(trans);
+			if (ret) {
+				scrub_pause_off(fs_info);
+				break;
+			}
+			scrub_pause_off(fs_info);
+		}
+
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0)
 			break;
@@ -3541,6 +3673,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		if (sctx->is_dev_replace && !cache->to_copy) {
+			ro_set = 0;
+			goto done;
+		}
+
 		/*
 		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
 		 * to avoid deadlock caused by:
@@ -3651,6 +3788,38 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace) {
+			em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+			BUG_ON(IS_ERR(em));
+			map = em->map_lookup;
+
+			num_extents = cur_extent = 0;
+			for (i = 0; i < map->num_stripes; i++) {
+				/* we have more device extent to copy */
+				if (dev_replace->srcdev != map->stripes[i].dev)
+					continue;
+
+				num_extents++;
+				if (found_key.offset ==
+				    map->stripes[i].physical)
+					cur_extent = i;
+			}
+
+			free_extent_map(em);
+
+			if (num_extents > 1) {
+				if (cur_extent == 0) {
+					btrfs_inc_block_group_ro(cache);
+				} else if (cur_extent == num_extents - 1) {
+					btrfs_dec_block_group_ro(cache);
+					cache->to_copy = 0;
+				}
+			} else {
+				cache->to_copy = 0;
+			}
+		}
+
+done:
 		down_write(&fs_info->dev_replace.rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a04379e440fb..e0a37466bb2d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1841,6 +1841,8 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	else
 		search_start = max_t(u64, search_start, SZ_1M);
 
+	WARN_ON(device->zone_size && !IS_ALIGNED(num_bytes, device->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -6180,6 +6182,7 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -6190,7 +6193,18 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	int i;
 
 	if (op == BTRFS_MAP_WRITE) {
+		struct btrfs_block_group_cache *cache;
+		struct btrfs_fs_info *fs_info = dev_replace->srcdev->fs_info;
 		int index_where_to_add;
+		int hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+
+		cache = btrfs_lookup_block_group(fs_info, logical);
+		BUG_ON(!cache);
+		if (hmzoned && cache->to_copy) {
+			btrfs_put_block_group(cache);
+			return;
+		}
+		btrfs_put_block_group(cache);
 
 		/*
 		 * duplicate the write operations while the dev replace
@@ -6215,10 +6229,17 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				new->physical = old->physical;
 				new->length = old->length;
 				new->dev = dev_replace->tgtdev;
-				bbio->tgtdev_map[i] = index_where_to_add;
+				bbio->tgtdev_map[i] =
+					index_where_to_add;
 				index_where_to_add++;
 				max_errors++;
 				tgtdev_indexes++;
+
+				/* mark this zone as non-empty */
+				if (hmzoned)
+					clear_bit(new->physical >>
+						  new->dev->zone_size_shift,
+						  new->dev->empty_zones);
 			}
 		}
 		num_stripes = index_where_to_add;
@@ -6551,8 +6572,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 19/19] btrfs: enable to mount HMZONED incompat flag
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (17 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 18/19] btrfs: support dev-replace " Naohiro Aota
@ 2019-06-07 13:10 ` Naohiro Aota
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
  2019-06-12 17:51 ` [PATCH v2 00/19] btrfs zoned block device support David Sterba
  20 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:10 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This final patch adds the HMZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount file systems that
have the HMZONED flag set.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0be2b96117a..b30af9bbf22f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -285,7 +285,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 01/12] btrfs-progs: build: Check zoned block device support
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (18 preceding siblings ...)
  2019-06-07 13:10 ` [PATCH 19/19] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
@ 2019-06-07 13:17 ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 02/12] btrfs-progs: utils: Introduce queue_param Naohiro Aota
                     ` (10 more replies)
  2019-06-12 17:51 ` [PATCH v2 00/19] btrfs zoned block device support David Sterba
  20 siblings, 11 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

If the kernel supports zoned block devices, the file
/usr/include/linux/blkzoned.h will be present. Check for this file and
define BTRFS_ZONED if it is present.

If the header is present, the HMZONED feature is enabled; if not, it is
disabled.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 configure.ac | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/configure.ac b/configure.ac
index cf792eb5488b..c637f72a8fe6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -206,6 +206,18 @@ else
 AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [0], [We did not define FIEMAP_EXTENT_SHARED])
 fi
 
+AC_CHECK_HEADER(linux/blkzoned.h, [blkzoned_found=yes], [blkzoned_found=no])
+AC_ARG_ENABLE([zoned],
+  AS_HELP_STRING([--disable-zoned], [disable zoned block device support]),
+  [], [enable_zoned=$blkzoned_found]
+)
+
+AS_IF([test "x$enable_zoned" = xyes], [
+	AC_CHECK_HEADER(linux/blkzoned.h, [],
+		[AC_MSG_ERROR([Couldn't find linux/blkzoned.h])])
+	AC_DEFINE([BTRFS_ZONED], [1], [enable zoned block device support])
+])
+
 dnl Define <NAME>_LIBS= and <NAME>_CFLAGS= by pkg-config
 dnl
 dnl The default PKG_CHECK_MODULES() action-if-not-found is end the
@@ -307,6 +319,7 @@ AC_MSG_RESULT([
 	btrfs-restore zstd: ${enable_zstd}
 	Python bindings:    ${enable_python}
 	Python interpreter: ${PYTHON}
+	zoned device:       ${enable_zoned}
 
 	Type 'make' to compile.
 ])
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 02/12] btrfs-progs: utils: Introduce queue_param
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 03/12] btrfs-progs: add new HMZONED feature flag Naohiro Aota
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Introduce the queue_param() function to read a device request queue
parameter from sysfs, and use this function to test whether the device is
an SSD in is_ssd().

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[Naohiro] fixed error return value
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/main.c | 40 ++--------------------------------------
 utils.c     | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 utils.h     |  1 +
 3 files changed, 49 insertions(+), 38 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index b442e6e40c37..93c0b71c864e 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -404,49 +404,13 @@ static int zero_output_file(int out_fd, u64 size)
 
 static int is_ssd(const char *file)
 {
-	blkid_probe probe;
-	char wholedisk[PATH_MAX];
-	char sysfs_path[PATH_MAX];
-	dev_t devno;
-	int fd;
 	char rotational;
 	int ret;
 
-	probe = blkid_new_probe_from_filename(file);
-	if (!probe)
+	ret = queue_param(file, "rotational", &rotational, 1);
+	if (ret < 1)
 		return 0;
 
-	/* Device number of this disk (possibly a partition) */
-	devno = blkid_probe_get_devno(probe);
-	if (!devno) {
-		blkid_free_probe(probe);
-		return 0;
-	}
-
-	/* Get whole disk name (not full path) for this devno */
-	ret = blkid_devno_to_wholedisk(devno,
-			wholedisk, sizeof(wholedisk), NULL);
-	if (ret) {
-		blkid_free_probe(probe);
-		return 0;
-	}
-
-	snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/rotational",
-		 wholedisk);
-
-	blkid_free_probe(probe);
-
-	fd = open(sysfs_path, O_RDONLY);
-	if (fd < 0) {
-		return 0;
-	}
-
-	if (read(fd, &rotational, 1) < 1) {
-		close(fd);
-		return 0;
-	}
-	close(fd);
-
 	return rotational == '0';
 }
 
diff --git a/utils.c b/utils.c
index c6cdc8f01dc1..7d5a1f3b7f8d 100644
--- a/utils.c
+++ b/utils.c
@@ -65,6 +65,52 @@ static unsigned short rand_seed[3];
 
 struct btrfs_config bconf;
 
+/*
+ * Get a device request queue parameter.
+ */
+int queue_param(const char *file, const char *param, char *buf, size_t len)
+{
+	blkid_probe probe;
+	char wholedisk[PATH_MAX];
+	char sysfs_path[PATH_MAX];
+	dev_t devno;
+	int fd;
+	int ret;
+
+	probe = blkid_new_probe_from_filename(file);
+	if (!probe)
+		return 0;
+
+	/* Device number of this disk (possibly a partition) */
+	devno = blkid_probe_get_devno(probe);
+	if (!devno) {
+		blkid_free_probe(probe);
+		return 0;
+	}
+
+	/* Get whole disk name (not full path) for this devno */
+	ret = blkid_devno_to_wholedisk(devno,
+			wholedisk, sizeof(wholedisk), NULL);
+	if (ret) {
+		blkid_free_probe(probe);
+		return 0;
+	}
+
+	snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/%s",
+		 wholedisk, param);
+
+	blkid_free_probe(probe);
+
+	fd = open(sysfs_path, O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	len = read(fd, buf, len);
+	close(fd);
+
+	return len;
+}
+
 /*
  * Discard the given range in one go
  */
diff --git a/utils.h b/utils.h
index 7c5eb798557d..47321f62c8e0 100644
--- a/utils.h
+++ b/utils.h
@@ -121,6 +121,7 @@ int get_label(const char *btrfs_dev, char *label);
 int set_label(const char *btrfs_dev, const char *label);
 
 char *__strncpy_null(char *dest, const char *src, size_t n);
+int queue_param(const char *file, const char *param, char *buf, size_t len);
 int is_block_device(const char *file);
 int is_mount_point(const char *file);
 int is_path_exist(const char *file);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 03/12] btrfs-progs: add new HMZONED feature flag
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
  2019-06-07 13:17   ` [PATCH 02/12] btrfs-progs: utils: Introduce queue_param Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 04/12] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

With this feature enabled, a zoned block device aware btrfs allocates block
groups aligned to the device zones and always writes in sequential zones at
the zone write pointer position.

Enabling this feature also force-disables conversion from ext4 volumes.

Note: this flag can be moved to COMPAT_RO, so that older kernels can read
but not write zoned block devices formatted with btrfs.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds-inspect-dump-super.c | 3 ++-
 ctree.h                   | 4 +++-
 fsfeatures.c              | 8 ++++++++
 fsfeatures.h              | 2 +-
 libbtrfsutil/btrfs.h      | 2 ++
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index d62f0932556c..ff3c0aa262c8 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -229,7 +229,8 @@ static struct readable_flag_entry incompat_flags_array[] = {
 	DEF_INCOMPAT_FLAG_ENTRY(RAID56),
 	DEF_INCOMPAT_FLAG_ENTRY(SKINNY_METADATA),
 	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
-	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID)
+	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID),
+	DEF_INCOMPAT_FLAG_ENTRY(HMZONED)
 };
 static const int incompat_flags_num = sizeof(incompat_flags_array) /
 				      sizeof(struct readable_flag_entry);
diff --git a/ctree.h b/ctree.h
index 76f52b1c9b08..9f79686690e0 100644
--- a/ctree.h
+++ b/ctree.h
@@ -492,6 +492,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID    (1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 11)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 
@@ -515,7 +516,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID |		\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)
 
 /*
  * A leaf is full of items. offset and size tell us where to find
diff --git a/fsfeatures.c b/fsfeatures.c
index 7f3ef03b8452..c4904ce8baf5 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -86,6 +86,14 @@ static const struct btrfs_fs_feature {
 		VERSION_TO_STRING2(4,0),
 		NULL, 0,
 		"no explicit hole extents for files" },
+#ifdef BTRFS_ZONED
+	{ "hmzoned", BTRFS_FEATURE_INCOMPAT_HMZONED,
+		"hmzoned",
+		NULL, 0,
+		NULL, 0,
+		NULL, 0,
+		"support Host-Managed Zoned devices" },
+#endif
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/fsfeatures.h b/fsfeatures.h
index 3cc9452a3327..0918ee1aa113 100644
--- a/fsfeatures.h
+++ b/fsfeatures.h
@@ -25,7 +25,7 @@
 		| BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 
 /*
- * Avoid multi-device features (RAID56) and mixed block groups
+ * Avoid multi-device features (RAID56), mixed block groups, and hmzoned device
  */
 #define BTRFS_CONVERT_ALLOWED_FEATURES				\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF			\
diff --git a/libbtrfsutil/btrfs.h b/libbtrfsutil/btrfs.h
index 944d50132456..5c415240f74c 100644
--- a/libbtrfsutil/btrfs.h
+++ b/libbtrfsutil/btrfs.h
@@ -268,6 +268,8 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+/* Missing */
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 04/12] btrfs-progs: Introduce zone block device helper functions
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
  2019-06-07 13:17   ` [PATCH 02/12] btrfs-progs: utils: Introduce queue_param Naohiro Aota
  2019-06-07 13:17   ` [PATCH 03/12] btrfs-progs: add new HMZONED feature flag Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 05/12] btrfs-progs: load and check zone information Naohiro Aota
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch introduces several zone-related functions: btrfs_get_zones() to
get zone information from the specified device and store it in zinfo, and
zone_is_random_write() to check whether a zone accepts random writes.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 utils.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 utils.h   |  16 +++++
 volumes.h |  28 ++++++++
 3 files changed, 238 insertions(+)

diff --git a/utils.c b/utils.c
index 7d5a1f3b7f8d..d50304b1be80 100644
--- a/utils.c
+++ b/utils.c
@@ -359,6 +359,200 @@ out:
 	return ret;
 }
 
+enum btrfs_zoned_model zoned_model(const char *file)
+{
+	char model[32];
+	int ret;
+
+	ret = queue_param(file, "zoned", model, sizeof(model));
+	if (ret <= 0)
+		return ZONED_NONE;
+
+	if (strncmp(model, "host-aware", 10) == 0)
+		return ZONED_HOST_AWARE;
+	if (strncmp(model, "host-managed", 12) == 0)
+		return ZONED_HOST_MANAGED;
+
+	return ZONED_NONE;
+}
+
+size_t zone_size(const char *file)
+{
+	char chunk[32];
+	int ret;
+
+	ret = queue_param(file, "chunk_sectors", chunk, sizeof(chunk));
+	if (ret <= 0)
+		return 0;
+
+	return strtoul((const char *)chunk, NULL, 10) << 9;
+}
+
+#ifdef BTRFS_ZONED
+int zone_is_random_write(struct btrfs_zone_info *zinfo, u64 bytenr)
+{
+	unsigned int zno;
+
+	if (zinfo->model == ZONED_NONE)
+		return 1;
+
+	zno = bytenr / zinfo->zone_size;
+
+	/*
+	 * Only sequential write required zones on host-managed
+	 * devices cannot be written randomly.
+	 */
+	return zinfo->zones[zno].type != BLK_ZONE_TYPE_SEQWRITE_REQ;
+}
+
+#define BTRFS_REPORT_NR_ZONES	8192
+
+static int btrfs_get_zones(int fd, const char *file, u64 block_count,
+			   struct btrfs_zone_info *zinfo)
+{
+	size_t zone_bytes = zone_size(file);
+	size_t rep_size;
+	u64 sector = 0;
+	struct blk_zone_report *rep;
+	struct blk_zone *zone;
+	unsigned int i, n = 0;
+	int ret;
+
+	/*
+	 * Zones are guaranteed (by the kernel) to be a power of 2 number of
+	 * sectors. Check this here and make sure that zones are not too
+	 * small.
+	 */
+	if (!zone_bytes || (zone_bytes & (zone_bytes - 1))) {
+		error("ERROR: Illegal zone size %zu (not a power of 2)\n",
+		      zone_bytes);
+		exit(1);
+	}
+	if (zone_bytes < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+		error("ERROR: Illegal zone size %zu (smaller than %d)\n",
+		      zone_bytes,
+		      BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+		exit(1);
+	}
+
+	/* Allocate the zone information array */
+	zinfo->zone_size = zone_bytes;
+	zinfo->nr_zones = block_count / zone_bytes;
+	if (block_count & (zone_bytes - 1))
+		zinfo->nr_zones++;
+	zinfo->zones = calloc(zinfo->nr_zones, sizeof(struct blk_zone));
+	if (!zinfo->zones) {
+		error("No memory for zone information\n");
+		exit(1);
+	}
+
+	/* Allocate a zone report */
+	rep_size = sizeof(struct blk_zone_report) +
+		sizeof(struct blk_zone) * BTRFS_REPORT_NR_ZONES;
+	rep = malloc(rep_size);
+	if (!rep) {
+		error("No memory for zones report\n");
+		exit(1);
+	}
+
+	/* Get zone information */
+	zone = (struct blk_zone *)(rep + 1);
+	while (n < zinfo->nr_zones) {
+
+		memset(rep, 0, rep_size);
+		rep->sector = sector;
+		rep->nr_zones = BTRFS_REPORT_NR_ZONES;
+
+		ret = ioctl(fd, BLKREPORTZONE, rep);
+		if (ret != 0) {
+			error("ioctl BLKREPORTZONE failed (%s)\n",
+			      strerror(errno));
+			exit(1);
+		}
+
+		if (!rep->nr_zones)
+			break;
+
+		for (i = 0; i < rep->nr_zones; i++) {
+			if (n >= zinfo->nr_zones)
+				break;
+			memcpy(&zinfo->zones[n], &zone[i],
+			       sizeof(struct blk_zone));
+			sector = zone[i].start + zone[i].len;
+			n++;
+		}
+
+	}
+
+	/*
+	 * We need at least one random write zone (a conventional zone or
+	 * a sequential write preferred zone on a host-aware device).
+	 */
+	if (!zone_is_random_write(zinfo, 0)) {
+		error("ERROR: No conventional zone at block 0\n");
+		exit(1);
+	}
+
+	zinfo->nr_zones = n;
+
+	free(rep);
+
+	return 0;
+}
+
+#endif
+
+int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
+			struct btrfs_zone_info *zinfo)
+{
+	struct stat st;
+	int ret;
+
+	memset(zinfo, 0, sizeof(struct btrfs_zone_info));
+
+	ret = fstat(fd, &st);
+	if (ret < 0) {
+		error("unable to stat %s\n", file);
+		return 1;
+	}
+
+	if (!S_ISBLK(st.st_mode))
+		return 0;
+
+	/* Check zone model */
+	zinfo->model = zoned_model(file);
+	if (zinfo->model == ZONED_NONE)
+		return 0;
+
+	if (zinfo->model == ZONED_HOST_MANAGED && !hmzoned) {
+		error("%s: host-managed zoned block device (enable zone block device support with -O hmzoned)\n",
+		      file);
+		return -1;
+	}
+
+	if (!hmzoned) {
+		/* Treat host-aware devices as regular devices */
+		zinfo->model = ZONED_NONE;
+		return 0;
+	}
+
+#ifdef BTRFS_ZONED
+	/* Get zone information */
+	ret = btrfs_get_zones(fd, file, btrfs_device_size(fd, &st), zinfo);
+	if (ret != 0)
+		return ret;
+#else
+	error("%s: Unsupported host-%s zoned block device\n",
+	      file, zinfo->model == ZONED_HOST_MANAGED ? "managed" : "aware");
+	if (zinfo->model == ZONED_HOST_MANAGED)
+		return -1;
+
+	printf("%s: handling host-aware block device as a regular disk\n",
+	       file);
+#endif
+	return 0;
+}
+
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags)
 {
diff --git a/utils.h b/utils.h
index 47321f62c8e0..f5e5c5bec66a 100644
--- a/utils.h
+++ b/utils.h
@@ -69,6 +69,7 @@ void units_set_base(unsigned *units, unsigned base);
 #define	PREP_DEVICE_ZERO_END	(1U << 0)
 #define	PREP_DEVICE_DISCARD	(1U << 1)
 #define	PREP_DEVICE_VERBOSE	(1U << 2)
+#define	PREP_DEVICE_HMZONED	(1U << 3)
 
 #define SEEN_FSID_HASH_SIZE 256
 struct seen_fsid {
@@ -78,10 +79,25 @@ struct seen_fsid {
 	int fd;
 };
 
+struct btrfs_zone_info;
+
+enum btrfs_zoned_model zoned_model(const char *file);
+size_t zone_size(const char *file);
 int btrfs_make_root_dir(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root, u64 objectid);
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags);
+int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
+			struct btrfs_zone_info *zinfo);
+#ifdef BTRFS_ZONED
+int zone_is_random_write(struct btrfs_zone_info *zinfo, u64 bytenr);
+#else
+static inline int zone_is_random_write(struct btrfs_zone_info *zinfo,
+				       u64 bytenr)
+{
+	return 1;
+}
+#endif
 int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root, int fd, const char *path,
 		      u64 block_count, u32 io_width, u32 io_align,
diff --git a/volumes.h b/volumes.h
index dbe9d3dea647..c9262ceaea93 100644
--- a/volumes.h
+++ b/volumes.h
@@ -22,12 +22,40 @@
 #include "kerncompat.h"
 #include "ctree.h"
 
+#ifdef BTRFS_ZONED
+#include <linux/blkzoned.h>
+#else
+struct blk_zone {
+	int dummy;
+};
+#endif
+
+/*
+ * Zoned block device models.
+ */
+enum btrfs_zoned_model {
+	ZONED_NONE = 0,
+	ZONED_HOST_AWARE,
+	ZONED_HOST_MANAGED,
+};
+
+/*
+ * Zone information for a zoned block device.
+ */
+struct btrfs_zone_info {
+	enum btrfs_zoned_model	model;
+	size_t			zone_size;
+	struct blk_zone		*zones;
+	unsigned int		nr_zones;
+};
+
 #define BTRFS_STRIPE_LEN	SZ_64K
 
 struct btrfs_device {
 	struct list_head dev_list;
 	struct btrfs_root *dev_root;
 	struct btrfs_fs_devices *fs_devices;
+	struct btrfs_zone_info zinfo;
 
 	u64 total_ios;
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 05/12] btrfs-progs: load and check zone information
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (2 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 04/12] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 06/12] btrfs-progs: avoid writing super block to sequential zones Naohiro Aota
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch checks whether a device added to btrfs is a zoned block device.
If it is, load the zone information and the zone size for the device.

For a btrfs volume composed of multiple zoned block devices, all devices
must have the same zone size.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 utils.c   | 10 ++++++++++
 volumes.c | 18 ++++++++++++++++++
 volumes.h |  3 +++
 3 files changed, 31 insertions(+)

diff --git a/utils.c b/utils.c
index d50304b1be80..a26fe7a5743c 100644
--- a/utils.c
+++ b/utils.c
@@ -250,6 +250,16 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	ret = btrfs_get_zone_info(fd, path, fs_info->fs_devices->hmzoned,
+				  &device->zinfo);
+	if (ret)
+		goto out;
+	if (device->zinfo.zone_size != fs_info->fs_devices->zone_size) {
+		error("Device zone sizes differ\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	disk_super = (struct btrfs_super_block *)buf;
 	dev_item = &disk_super->dev_item;
 
diff --git a/volumes.c b/volumes.c
index 3a91b43b378b..f6d1b1e9dc7f 100644
--- a/volumes.c
+++ b/volumes.c
@@ -168,6 +168,8 @@ static int device_list_add(const char *path,
 	u64 found_transid = btrfs_super_generation(disk_super);
 	bool metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
 		BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
+	int hmzoned = btrfs_super_incompat_flags(disk_super) &
+			BTRFS_FEATURE_INCOMPAT_HMZONED;
 
 	if (metadata_uuid)
 		fs_devices = find_fsid(disk_super->fsid,
@@ -257,6 +259,8 @@ static int device_list_add(const char *path,
 	if (fs_devices->lowest_devid > devid) {
 		fs_devices->lowest_devid = devid;
 	}
+	if (hmzoned)
+		fs_devices->hmzoned = 1;
 	*fs_devices_ret = fs_devices;
 	return 0;
 }
@@ -327,6 +331,8 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 	struct btrfs_device *device;
 	int ret;
 
+	fs_devices->zone_size = 0;
+
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (!device->name) {
 			printk("no name for device %llu, skip it now\n", device->devid);
@@ -350,6 +356,18 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
 		device->fd = fd;
 		if (flags & O_RDWR)
 			device->writeable = 1;
+
+		ret = btrfs_get_zone_info(fd, device->name, fs_devices->hmzoned,
+					  &device->zinfo);
+		if (ret != 0)
+			goto fail;
+		if (!fs_devices->zone_size) {
+			fs_devices->zone_size = device->zinfo.zone_size;
+		} else if (device->zinfo.zone_size != fs_devices->zone_size) {
+			fprintf(stderr, "Device zone sizes differ\n");
+			ret = -EINVAL;
+			goto fail;
+		}
 	}
 	return 0;
 fail:
diff --git a/volumes.h b/volumes.h
index c9262ceaea93..6ec83fe43cfe 100644
--- a/volumes.h
+++ b/volumes.h
@@ -115,6 +115,9 @@ struct btrfs_fs_devices {
 
 	int seeding;
 	struct btrfs_fs_devices *seed;
+
+	u64 zone_size;
+	unsigned int hmzoned:1;
 };
 
 struct btrfs_bio_stripe {
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 06/12] btrfs-progs: avoid writing super block to sequential zones
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (3 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 05/12] btrfs-progs: load and check zone information Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 07/12] btrfs-progs: support discarding zoned device Naohiro Aota
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

It is not possible to write super block copies in sequential write required
zones, as super blocks need in-place updates. This patch limits the
possible super block locations to zones accepting random writes. In
particular, the zone containing the first block of the device or partition
being formatted must accept random writes.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 disk-io.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/disk-io.c b/disk-io.c
index 151eb3b5a278..74a4346cbca7 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1609,6 +1609,7 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 			    struct btrfs_super_block *sb,
 			    struct btrfs_device *device)
 {
+	struct btrfs_zone_info *zinfo = &device->zinfo;
 	u64 bytenr;
 	u32 crc;
 	int i, ret;
@@ -1631,6 +1632,14 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 				      BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
 		btrfs_csum_final(crc, &sb->csum[0]);
 
+		if (!zone_is_random_write(zinfo, fs_info->super_bytenr)) {
+			errno = -EIO;
+			error(
+		"failed to write super block for devid %llu: require random write zone: %m",
+				device->devid);
+			return -EIO;
+		}
+
 		/*
 		 * super_copy is BTRFS_SUPER_INFO_SIZE bytes and is
 		 * zero filled, we can use it directly
@@ -1659,6 +1668,8 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 		bytenr = btrfs_sb_offset(i);
 		if (bytenr + BTRFS_SUPER_INFO_SIZE > device->total_bytes)
 			break;
+		if (!zone_is_random_write(zinfo, bytenr))
+			continue;
 
 		btrfs_set_super_bytenr(sb, bytenr);
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 07/12] btrfs-progs: support discarding zoned device
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (4 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 06/12] btrfs-progs: avoid writing super block to sequential zones Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 08/12] btrfs-progs: volume: align chunk allocation to zones Naohiro Aota
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

All zones of a zoned block device should be reset before writing. Support
this by treating a zone reset as a special case of block discard and block
zeroing. Note that only zones accepting random writes can be zeroed.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 utils.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 88 insertions(+), 6 deletions(-)

diff --git a/utils.c b/utils.c
index a26fe7a5743c..c375b32953f7 100644
--- a/utils.c
+++ b/utils.c
@@ -123,6 +123,37 @@ static int discard_range(int fd, u64 start, u64 len)
 	return 0;
 }
 
+/*
+ * Discard blocks in the zones of a zoned block device.
+ * Process this with zone size granularity so that blocks in
+ * conventional zones are discarded using discard_range and
+ * blocks in sequential zones are discarded though a zone reset.
+ */
+static int discard_zones(int fd, struct btrfs_zone_info *zinfo)
+{
+#ifdef BTRFS_ZONED
+	unsigned int i;
+
+	/* Zone size granularity */
+	for (i = 0; i < zinfo->nr_zones; i++) {
+		if (zinfo->zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+			discard_range(fd, zinfo->zones[i].start << 9,
+				      zinfo->zone_size);
+		} else if (zinfo->zones[i].cond != BLK_ZONE_COND_EMPTY) {
+			struct blk_zone_range range = {
+				zinfo->zones[i].start,
+				zinfo->zone_size >> 9 };
+			if (ioctl(fd, BLKRESETZONE, &range) < 0)
+				return errno;
+		}
+	}
+
+	return 0;
+#else
+	return -EIO;
+#endif
+}
+
 /*
  * Discard blocks in the given range in 1G chunks, the process is interruptible
  */
@@ -205,8 +236,38 @@ static int zero_blocks(int fd, off_t start, size_t len)
 
 #define ZERO_DEV_BYTES SZ_2M
 
+static int zero_zone_blocks(int fd, struct btrfs_zone_info *zinfo,
+			    off_t start, size_t len)
+{
+	size_t zone_len = zinfo->zone_size;
+	off_t ofst = start;
+	size_t count;
+	int ret;
+
+	/* Make sure that zero_blocks does not write sequential zones */
+	while (len > 0) {
+
+		/* Limit zero_blocks to a single zone */
+		count = min_t(size_t, len, zone_len);
+		if (count > zone_len - (ofst & (zone_len - 1)))
+			count = zone_len - (ofst & (zone_len - 1));
+
+		if (zone_is_random_write(zinfo, ofst)) {
+			ret = zero_blocks(fd, ofst, count);
+			if (ret != 0)
+				return ret;
+		}
+
+		len -= count;
+		ofst += count;
+	}
+
+	return 0;
+}
+
 /* don't write outside the device by clamping the region to the device size */
-static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
+static int zero_dev_clamped(int fd, struct btrfs_zone_info *zinfo,
+			    off_t start, ssize_t len, u64 dev_size)
 {
 	off_t end = max(start, start + len);
 
@@ -219,6 +280,9 @@ static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
 	start = min_t(u64, start, dev_size);
 	end = min_t(u64, end, dev_size);
 
+	if (zinfo->model != ZONED_NONE)
+		return zero_zone_blocks(fd, zinfo, start, end - start);
+
 	return zero_blocks(fd, start, end - start);
 }
 
@@ -566,6 +630,7 @@ int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
 int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		u64 max_block_count, unsigned opflags)
 {
+	struct btrfs_zone_info zinfo;
 	u64 block_count;
 	struct stat st;
 	int i, ret;
@@ -584,13 +649,30 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 	if (max_block_count)
 		block_count = min(block_count, max_block_count);
 
+	ret = btrfs_get_zone_info(fd, file, opflags & PREP_DEVICE_HMZONED,
+				  &zinfo);
+	if (ret < 0)
+		return 1;
+
 	if (opflags & PREP_DEVICE_DISCARD) {
 		/*
 		 * We intentionally ignore errors from the discard ioctl.  It
 		 * is not necessary for the mkfs functionality but just an
-		 * optimization.
+		 * optimization. However, we cannot ignore zone discard (reset)
+		 * errors for a zoned block device as this could result in the
+		 * inability to write to non-empty sequential zones of the
+		 * device.
 		 */
-		if (discard_range(fd, 0, 0) == 0) {
+		if (zinfo.model != ZONED_NONE) {
+			printf("Resetting device zones %s (%u zones) ...\n",
+				file, zinfo.nr_zones);
+			if (discard_zones(fd, &zinfo)) {
+				fprintf(stderr,
+					"ERROR: failed to reset device '%s' zones\n",
+					file);
+				return 1;
+			}
+		} else if (discard_range(fd, 0, 0) == 0) {
 			if (opflags & PREP_DEVICE_VERBOSE)
 				printf("Performing full device TRIM %s (%s) ...\n",
 						file, pretty_size(block_count));
@@ -598,12 +680,12 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		}
 	}
 
-	ret = zero_dev_clamped(fd, 0, ZERO_DEV_BYTES, block_count);
+	ret = zero_dev_clamped(fd, &zinfo, 0, ZERO_DEV_BYTES, block_count);
 	for (i = 0 ; !ret && i < BTRFS_SUPER_MIRROR_MAX; i++)
-		ret = zero_dev_clamped(fd, btrfs_sb_offset(i),
+		ret = zero_dev_clamped(fd, &zinfo, btrfs_sb_offset(i),
 				       BTRFS_SUPER_INFO_SIZE, block_count);
 	if (!ret && (opflags & PREP_DEVICE_ZERO_END))
-		ret = zero_dev_clamped(fd, block_count - ZERO_DEV_BYTES,
+		ret = zero_dev_clamped(fd, &zinfo, block_count - ZERO_DEV_BYTES,
 				       ZERO_DEV_BYTES, block_count);
 
 	if (ret < 0) {
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 08/12] btrfs-progs: volume: align chunk allocation to zones
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (5 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 07/12] btrfs-progs: support discarding zoned device Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 09/12] btrfs-progs: do sequential allocation Naohiro Aota
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

To facilitate zoned block device support in the extent buffer
allocation, chunks on a zoned block device are always aligned to a zone
of the device. With this, the zone write pointer location simply becomes
a hint for allocating new buffers.
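The alignment rule described above amounts to rounding offsets up to the
next zone boundary. A minimal sketch of that computation (hypothetical
helper name, assuming the zone size is a power of two as the series
requires):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Round val up to the next zone boundary. zone_size must be a power of
 * two; a zone size of zero means the device is not zoned, in which case
 * val is returned unchanged. */
static u64 zone_align_up(u64 val, u64 zone_size)
{
	if (!zone_size)
		return val;
	return (val + zone_size - 1) & ~(zone_size - 1);
}
```

With a 256MiB zone, an offset of 1 byte rounds up to 256MiB while an
already aligned offset is returned unchanged, which is how chunk start
offsets stay zone-aligned.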

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 volumes.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 4 deletions(-)

diff --git a/volumes.c b/volumes.c
index f6d1b1e9dc7f..64b42643390b 100644
--- a/volumes.c
+++ b/volumes.c
@@ -399,6 +399,34 @@ int btrfs_scan_one_device(int fd, const char *path,
 	return ret;
 }
 
+/* zone size is ensured to be power of 2 */
+static u64 btrfs_zone_align(struct btrfs_zone_info *zinfo, u64 val)
+{
+	if (zinfo && zinfo->zone_size)
+		return (val + zinfo->zone_size - 1) & ~(zinfo->zone_size - 1);
+	return val;
+}
+
+static bool check_dev_zone(struct btrfs_zone_info *zinfo, u64 physical,
+			   u64 num_bytes)
+{
+	u64 zone_size = zinfo->zone_size;
+	int zone_is_random;
+
+	WARN_ON(!IS_ALIGNED(num_bytes, zone_size));
+	zone_is_random = zone_is_random_write(zinfo, physical);
+
+	while (num_bytes) {
+		if (zone_is_random != zone_is_random_write(zinfo, physical))
+			return false;
+
+		physical += zone_size;
+		num_bytes -= zone_size;
+	}
+
+	return true;
+}
+
 /*
  * find_free_dev_extent_start - find free space in the specified device
  * @device:	  the device which we search the free space in
@@ -428,6 +456,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	struct btrfs_root *root = device->dev_root;
 	struct btrfs_dev_extent *dev_extent;
 	struct btrfs_path *path;
+	struct btrfs_zone_info *zinfo = &device->zinfo;
 	u64 hole_size;
 	u64 max_hole_start;
 	u64 max_hole_size;
@@ -445,6 +474,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 	 */
 	min_search_start = max(root->fs_info->alloc_start, (u64)SZ_1M);
 	search_start = max(search_start, min_search_start);
+	search_start = btrfs_zone_align(zinfo, search_start);
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -497,6 +527,18 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 			goto next;
 
 		if (key.offset > search_start) {
+			if (zinfo && zinfo->zone_size) {
+				while (key.offset > search_start) {
+					hole_size = key.offset - search_start;
+					if (hole_size < num_bytes)
+						break;
+					if (check_dev_zone(zinfo, search_start,
+							   num_bytes))
+						break;
+					search_start += zinfo->zone_size;
+				}
+			}
+
 			hole_size = key.offset - search_start;
 
 			/*
@@ -527,7 +569,8 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
 		extent_end = key.offset + btrfs_dev_extent_length(l,
 								  dev_extent);
 		if (extent_end > search_start)
-			search_start = extent_end;
+			search_start =  btrfs_zone_align(&device->zinfo,
+							 extent_end);
 next:
 		path->slots[0]++;
 		cond_resched();
@@ -539,6 +582,18 @@ next:
 	 * search_end may be smaller than search_start.
 	 */
 	if (search_end > search_start) {
+		if (zinfo && zinfo->zone_size) {
+			while (search_end > search_start) {
+				hole_size = search_end - search_start;
+				if (hole_size < num_bytes)
+					break;
+				if (check_dev_zone(zinfo, search_start,
+						   num_bytes))
+					break;
+				search_start += zinfo->zone_size;
+			}
+		}
+
 		hole_size = search_end - search_start;
 
 		if (hole_size > max_hole_size) {
@@ -582,6 +637,9 @@ int btrfs_insert_dev_extent(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
 
+	/* Align to zone for a zoned block device */
+	start = btrfs_zone_align(&device->zinfo, start);
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -1065,9 +1123,15 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 				    btrfs_super_stripesize(info->super_copy));
 	}
 
-	/* we don't want a chunk larger than 10% of the FS */
-	percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
-	max_chunk_size = min(percent_max, max_chunk_size);
+	if (info->fs_devices->hmzoned) {
+		/* Zoned mode uses zone aligned chunks */
+		calc_size = info->fs_devices->zone_size;
+		max_chunk_size = calc_size * num_stripes;
+	} else {
+		/* we don't want a chunk larger than 10% of the FS */
+		percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
+		max_chunk_size = min(percent_max, max_chunk_size);
+	}
 
 again:
 	if (chunk_bytes_by_type(type, calc_size, num_stripes, sub_stripes) >
@@ -1147,7 +1211,9 @@ again:
 	*num_bytes = chunk_bytes_by_type(type, calc_size,
 					 num_stripes, sub_stripes);
 	index = 0;
+	dev_offset = 0;
 	while(index < num_stripes) {
+		size_t zone_size = device->zinfo.zone_size;
 		struct btrfs_stripe *stripe;
 		BUG_ON(list_empty(&private_devs));
 		cur = private_devs.next;
@@ -1158,11 +1224,16 @@ again:
 		    (index == num_stripes - 1))
 			list_move_tail(&device->dev_list, dev_list);
 
+		if (device->zinfo.zone_size)
+			calc_size = device->zinfo.zone_size;
+
 		ret = btrfs_alloc_dev_extent(trans, device, key.offset,
 			     calc_size, &dev_offset);
 		if (ret < 0)
 			goto out_chunk_map;
 
+		WARN_ON(zone_size && !IS_ALIGNED(dev_offset, zone_size));
+
 		device->bytes_used += calc_size;
 		ret = btrfs_update_device(trans, device);
 		if (ret < 0)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 09/12] btrfs-progs: do sequential allocation
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (6 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 08/12] btrfs-progs: volume: align chunk allocation to zones Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 10/12] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

Ensure that block allocation in sequential write required zones is
always done sequentially, using an allocation pointer that is the zone
write pointer plus the number of blocks already allocated but not yet
written. For conventional zones, the legacy first-fit behavior is used.
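The allocation-pointer scheme can be sketched as follows. This is a
simplified model, not the actual btrfs-progs code; the field names
loosely follow struct btrfs_block_group_cache as extended by this patch:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Simplified view of a block group using sequential allocation. */
struct seq_block_group {
	u64 start;        /* logical start of the group (key.objectid) */
	u64 length;       /* size of the group (key.offset) */
	u64 alloc_offset; /* next allocatable byte, relative to start */
};

/* Allocate num bytes sequentially: free space below alloc_offset is
 * ignored, so returned addresses are strictly increasing. Returns 0 and
 * sets *ret on success, -1 when the group cannot fit the request (the
 * caller then tries another block group). */
static int seq_alloc(struct seq_block_group *bg, u64 num, u64 *ret)
{
	if (bg->length - bg->alloc_offset < num)
		return -1;
	*ret = bg->start + bg->alloc_offset;
	bg->alloc_offset += num;
	return 0;
}
```

Because freed space below the pointer is never reused, writes issued for
these allocations naturally follow the zone write pointer.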

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 ctree.h       |  17 +++++
 extent-tree.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++
 transaction.c |  16 +++++
 3 files changed, 219 insertions(+)

diff --git a/ctree.h b/ctree.h
index 9f79686690e0..2e828bf1250e 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1068,15 +1068,32 @@ struct btrfs_space_info {
 	struct list_head list;
 };
 
+/* Block group allocation types */
+enum btrfs_alloc_type {
+
+	/* Regular first fit allocation */
+	BTRFS_ALLOC_FIT		= 0,
+
+	/*
+	 * Sequential allocation: this is for HMZONED mode and
+	 * will result in ignoring free space before a block
+	 * group allocation offset.
+	 */
+	BTRFS_ALLOC_SEQ		= 1,
+};
+
 struct btrfs_block_group_cache {
 	struct cache_extent cache;
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
 	struct btrfs_space_info *space_info;
 	struct btrfs_free_space_ctl *free_space_ctl;
+	enum btrfs_alloc_type alloc_type;
 	u64 bytes_super;
 	u64 pinned;
 	u64 flags;
+	u64 alloc_offset;
+	u64 write_offset;
 	int cached;
 	int ro;
 	/*
diff --git a/extent-tree.c b/extent-tree.c
index e62ee8c2ba13..528c6875c8fb 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -251,6 +251,14 @@ again:
 	if (cache->ro || !block_group_bits(cache, data))
 		goto new_group;
 
+	if (cache->alloc_type == BTRFS_ALLOC_SEQ) {
+		if (cache->key.offset - cache->alloc_offset < num)
+			goto new_group;
+		*start_ret = cache->key.objectid + cache->alloc_offset;
+		cache->alloc_offset += num;
+		return 0;
+	}
+
 	while(1) {
 		ret = find_first_extent_bit(&root->fs_info->free_space_cache,
 					    last, &start, &end, EXTENT_DIRTY);
@@ -277,6 +285,7 @@ out:
 			(unsigned long long)search_start);
 		return -ENOENT;
 	}
+	printf("nospace\n");
 	return -ENOSPC;
 
 new_group:
@@ -3039,6 +3048,176 @@ error:
 	return ret;
 }
 
+#ifdef BTRFS_ZONED
+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_device *device;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct cache_extent *ce;
+	struct map_lookup *map;
+	u64 logical = cache->key.objectid;
+	u64 length = cache->key.offset;
+	u64 physical = 0;
+	int ret = 0;
+	int i;
+	u64 zone_size = fs_info->fs_devices->zone_size;
+	u64 *alloc_offsets = NULL;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(length, zone_size)) {
+		fprintf(stderr, "unaligned block group at %llu", logical);
+		return -EIO;
+	}
+
+	/* Get the chunk mapping */
+	ce = search_cache_extent(&map_tree->cache_tree, logical);
+	if (!ce) {
+		fprintf(stderr, "failed to find block group at %llu", logical);
+		return -ENOENT;
+	}
+	map = container_of(ce, struct map_lookup, ce);
+
+	/*
+	 * Get the zone type: if the group is mapped to a non-sequential zone,
+	 * there is no need for the allocation offset (fit allocation is OK).
+	 */
+	device = map->stripes[0].dev;
+	physical = map->stripes[0].physical;
+	if (!zone_is_random_write(&device->zinfo, physical))
+		cache->alloc_type = BTRFS_ALLOC_SEQ;
+
+	/* check block group mapping */
+	alloc_offsets = calloc(map->num_stripes, sizeof(*alloc_offsets));
+	for (i = 0; i < map->num_stripes; i++) {
+		int is_sequential;
+		struct blk_zone zone;
+
+		device = map->stripes[i].dev;
+		physical = map->stripes[i].physical;
+
+		is_sequential = !zone_is_random_write(&device->zinfo, physical);
+		if ((is_sequential && cache->alloc_type != BTRFS_ALLOC_SEQ) ||
+		    (!is_sequential && cache->alloc_type == BTRFS_ALLOC_SEQ)) {
+			fprintf(stderr,
+				"found block group of mixed zone types");
+			ret = -EIO;
+			goto out;
+		}
+
+		if (!is_sequential)
+			continue;
+
+		WARN_ON(!IS_ALIGNED(physical, zone_size));
+		zone = device->zinfo.zones[physical / zone_size];
+
+		/*
+		 * The group is mapped to a sequential zone. Get the zone write
+		 * pointer to determine the allocation offset within the zone.
+		 */
+		switch (zone.cond) {
+		case BLK_ZONE_COND_OFFLINE:
+		case BLK_ZONE_COND_READONLY:
+			fprintf(stderr, "Offline/readonly zone %llu",
+				physical / fs_info->fs_devices->zone_size);
+			ret = -EIO;
+			goto out;
+		case BLK_ZONE_COND_EMPTY:
+			alloc_offsets[i] = 0;
+			break;
+		case BLK_ZONE_COND_FULL:
+			alloc_offsets[i] = zone_size;
+			break;
+		default:
+			/* Partially used zone */
+			alloc_offsets[i] = ((zone.wp - zone.start) << 9);
+			break;
+		}
+	}
+
+	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+		goto out;
+
+	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+	case 0: /* single */
+	case BTRFS_BLOCK_GROUP_DUP:
+	case BTRFS_BLOCK_GROUP_RAID1:
+		for (i = 1; i < map->num_stripes; i++) {
+			if (alloc_offsets[i] != alloc_offsets[0]) {
+				fprintf(stderr,
+					"zones' write pointers mismatch\n");
+				ret = -EIO;
+				goto out;
+			}
+		}
+		cache->alloc_offset = alloc_offsets[0];
+		break;
+	case BTRFS_BLOCK_GROUP_RAID0:
+		cache->alloc_offset = alloc_offsets[0];
+		for (i = 1; i < map->num_stripes; i++) {
+			cache->alloc_offset += alloc_offsets[i];
+			if (alloc_offsets[0] < alloc_offsets[i]) {
+				fprintf(stderr,
+					"zones' write pointers mismatch\n");
+				ret = -EIO;
+				goto out;
+			}
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID10:
+		cache->alloc_offset = 0;
+		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+			int j;
+			int base;
+
+			base = i*map->sub_stripes;
+			for (j = 1; j < map->sub_stripes; j++) {
+				if (alloc_offsets[base] !=
+					alloc_offsets[base+j]) {
+					fprintf(stderr,
+						"zones' write pointer mismatch\n");
+					ret = -EIO;
+					goto out;
+				}
+			}
+
+			if (alloc_offsets[0] < alloc_offsets[base]) {
+				fprintf(stderr,
+					"zones' write pointer mismatch\n");
+				ret = -EIO;
+				goto out;
+			}
+			cache->alloc_offset += alloc_offsets[base];
+		}
+		break;
+	case BTRFS_BLOCK_GROUP_RAID5:
+	case BTRFS_BLOCK_GROUP_RAID6:
+		/* RAID5/6 is not supported yet */
+	default:
+		fprintf(stderr, "Unsupported profile %llu\n",
+			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+		ret = -EINVAL;
+		goto out;
+	}
+
+out:
+	cache->write_offset = cache->alloc_offset;
+	free(alloc_offsets);
+	return ret;
+}
+#else
+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *cache)
+{
+	return 0;
+}
+#endif
+
 int btrfs_read_block_groups(struct btrfs_root *root)
 {
 	struct btrfs_path *path;
@@ -3122,6 +3301,10 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 		BUG_ON(ret);
 		cache->space_info = space_info;
 
+		ret = btrfs_get_block_group_alloc_offset(info, cache);
+		if (ret)
+			goto error;
+
 		/* use EXTENT_LOCKED to prevent merging */
 		set_extent_bits(block_group_cache, found_key.objectid,
 				found_key.objectid + found_key.offset - 1,
@@ -3151,6 +3334,9 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
 	cache->key.objectid = chunk_offset;
 	cache->key.offset = size;
 
+	ret = btrfs_get_block_group_alloc_offset(fs_info, cache);
+	BUG_ON(ret);
+
 	cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	btrfs_set_block_group_used(&cache->item, bytes_used);
 	btrfs_set_block_group_chunk_objectid(&cache->item,
diff --git a/transaction.c b/transaction.c
index 138e10f0d6cc..39a52732bc71 100644
--- a/transaction.c
+++ b/transaction.c
@@ -129,16 +129,32 @@ int __commit_transaction(struct btrfs_trans_handle *trans,
 {
 	u64 start;
 	u64 end;
+	u64 next = 0;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *eb;
 	struct extent_io_tree *tree = &fs_info->extent_cache;
+	struct btrfs_block_group_cache *bg = NULL;
 	int ret;
 
 	while(1) {
+again:
 		ret = find_first_extent_bit(tree, 0, &start, &end,
 					    EXTENT_DIRTY);
 		if (ret)
 			break;
+		bg = btrfs_lookup_first_block_group(fs_info, start);
+		BUG_ON(!bg);
+		if (bg->alloc_type == BTRFS_ALLOC_SEQ &&
+		    bg->key.objectid + bg->write_offset < start) {
+			next = bg->key.objectid + bg->write_offset;
+			BUG_ON(next + fs_info->nodesize > start);
+			eb = btrfs_find_create_tree_block(fs_info, next);
+			btrfs_mark_buffer_dirty(eb);
+			free_extent_buffer(eb);
+			goto again;
+		}
+		if (bg->alloc_type == BTRFS_ALLOC_SEQ)
+			bg->write_offset += (end + 1 - start);
 		while(start <= end) {
 			eb = find_first_extent_buffer(tree, start);
 			BUG_ON(!eb || eb->start != start);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 10/12] btrfs-progs: mkfs: Zoned block device support
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (7 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 09/12] btrfs-progs: do sequential allocation Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 11/12] btrfs-progs: device-add: support HMZONED device Naohiro Aota
  2019-06-07 13:17   ` [PATCH 12/12] btrfs-progs: introduce support for dev-replace " Naohiro Aota
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch makes the size of the temporary system group chunk equal to
the device zone size. It also enables PREP_DEVICE_HMZONED if the user
enables the HMZONED feature.

Enabling the HMZONED feature is done using the option "-O hmzoned". For
now, this feature is incompatible with source directory setup.
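The size selection reduces to a one-line decision. A hedged sketch
(BTRFS_MKFS_SYSTEM_GROUP_SIZE is the existing 4MiB default in
btrfs-progs, redefined here only to keep the sketch self-contained):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* 4MiB, the default size of the temporary system chunk in btrfs-progs. */
#define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4ULL * 1024 * 1024)

/* On an HMZONED file system the temporary system chunk must span a
 * whole zone, so its size becomes the device zone size. */
static u64 system_group_size(int hmzoned, u64 zone_size)
{
	return hmzoned ? zone_size : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
}
```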

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/common.c | 12 +++++++-----
 mkfs/common.h |  1 +
 mkfs/main.c   | 45 +++++++++++++++++++++++++++++++++++++++------
 3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/mkfs/common.c b/mkfs/common.c
index f7e3badcf2b9..12af54c1d886 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -152,6 +152,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	int skinny_metadata = !!(cfg->features &
 				 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA);
 	u64 num_bytes;
+	u64 system_group_size;
 
 	buf = malloc(sizeof(*buf) + max(cfg->sectorsize, cfg->nodesize));
 	if (!buf)
@@ -312,12 +313,14 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
 	btrfs_set_item_size(buf, btrfs_item_nr(nritems), item_size);
 
+	system_group_size = (cfg->features & BTRFS_FEATURE_INCOMPAT_HMZONED) ?
+		cfg->zone_size : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+
 	dev_item = btrfs_item_ptr(buf, nritems, struct btrfs_dev_item);
 	btrfs_set_device_id(buf, dev_item, 1);
 	btrfs_set_device_generation(buf, dev_item, 0);
 	btrfs_set_device_total_bytes(buf, dev_item, num_bytes);
-	btrfs_set_device_bytes_used(buf, dev_item,
-				    BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_device_bytes_used(buf, dev_item, system_group_size);
 	btrfs_set_device_io_align(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_io_width(buf, dev_item, cfg->sectorsize);
 	btrfs_set_device_sector_size(buf, dev_item, cfg->sectorsize);
@@ -345,7 +348,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	btrfs_set_item_size(buf, btrfs_item_nr(nritems), item_size);
 
 	chunk = btrfs_item_ptr(buf, nritems, struct btrfs_chunk);
-	btrfs_set_chunk_length(buf, chunk, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_chunk_length(buf, chunk, system_group_size);
 	btrfs_set_chunk_owner(buf, chunk, BTRFS_EXTENT_TREE_OBJECTID);
 	btrfs_set_chunk_stripe_len(buf, chunk, BTRFS_STRIPE_LEN);
 	btrfs_set_chunk_type(buf, chunk, BTRFS_BLOCK_GROUP_SYSTEM);
@@ -411,8 +414,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 		    (unsigned long)btrfs_dev_extent_chunk_tree_uuid(dev_extent),
 		    BTRFS_UUID_SIZE);
 
-	btrfs_set_dev_extent_length(buf, dev_extent,
-				    BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+	btrfs_set_dev_extent_length(buf, dev_extent, system_group_size);
 	nritems++;
 
 	btrfs_set_header_bytenr(buf, cfg->blocks[MKFS_DEV_TREE]);
diff --git a/mkfs/common.h b/mkfs/common.h
index 28912906d0a9..d0e4c7b2c906 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -53,6 +53,7 @@ struct btrfs_mkfs_config {
 	u64 features;
 	/* Size of the filesystem in bytes */
 	u64 num_bytes;
+	u64 zone_size;
 
 	/* Output fields, set during creation */
 
diff --git a/mkfs/main.c b/mkfs/main.c
index 93c0b71c864e..cbfd45bee836 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -61,8 +61,12 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	u64 bytes_used;
 	u64 chunk_start = 0;
 	u64 chunk_size = 0;
+	u64 system_group_size = 0;
 	int ret;
 
+	system_group_size = fs_info->fs_devices->hmzoned ?
+		fs_info->fs_devices->zone_size : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+
 	trans = btrfs_start_transaction(root, 1);
 	BUG_ON(IS_ERR(trans));
 	bytes_used = btrfs_super_bytes_used(fs_info->super_copy);
@@ -75,8 +79,8 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
 	ret = btrfs_make_block_group(trans, fs_info, bytes_used,
 				     BTRFS_BLOCK_GROUP_SYSTEM,
 				     BTRFS_BLOCK_RESERVED_1M_FOR_SUPER,
-				     BTRFS_MKFS_SYSTEM_GROUP_SIZE);
-	allocation->system += BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+				     system_group_size);
+	allocation->system += system_group_size;
 	if (ret)
 		return ret;
 
@@ -761,6 +765,7 @@ int main(int argc, char **argv)
 	int metadata_profile_opt = 0;
 	int discard = 1;
 	int ssd = 0;
+	int hmzoned = 0;
 	int force_overwrite = 0;
 	char *source_dir = NULL;
 	bool source_dir_set = false;
@@ -774,6 +779,7 @@ int main(int argc, char **argv)
 	u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
 	struct mkfs_allocation allocation = { 0 };
 	struct btrfs_mkfs_config mkfs_cfg;
+	u64 system_group_size;
 
 	while(1) {
 		int c;
@@ -896,6 +902,8 @@ int main(int argc, char **argv)
 	if (dev_cnt == 0)
 		print_usage(1);
 
+	hmzoned = features & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	if (source_dir_set && dev_cnt > 1) {
 		error("the option -r is limited to a single device");
 		goto error;
@@ -905,6 +913,11 @@ int main(int argc, char **argv)
 		goto error;
 	}
 
+	if (source_dir_set && hmzoned) {
+		error("The -r and hmzoned feature are incompatible\n");
+		exit(1);
+	}
+
 	if (*fs_uuid) {
 		uuid_t dummy_uuid;
 
@@ -936,6 +949,16 @@ int main(int argc, char **argv)
 
 	file = argv[optind++];
 	ssd = is_ssd(file);
+	if (hmzoned) {
+		if (zoned_model(file) == ZONED_NONE) {
+			error("%s: not a zoned block device\n", file);
+			exit(1);
+		}
+		if (!zone_size(file)) {
+			error("%s: zone size undefined\n", file);
+			exit(1);
+		}
+	}
 
 	/*
 	* Set default profiles according to number of added devices.
@@ -1087,7 +1110,8 @@ int main(int argc, char **argv)
 	ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count,
 			(zero_end ? PREP_DEVICE_ZERO_END : 0) |
 			(discard ? PREP_DEVICE_DISCARD : 0) |
-			(verbose ? PREP_DEVICE_VERBOSE : 0));
+			(verbose ? PREP_DEVICE_VERBOSE : 0) |
+			(hmzoned ? PREP_DEVICE_HMZONED : 0));
 	if (ret)
 		goto error;
 	if (block_count && block_count > dev_block_count) {
@@ -1098,9 +1122,11 @@ int main(int argc, char **argv)
 	}
 
 	/* To create the first block group and chunk 0 in make_btrfs */
-	if (dev_block_count < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+	system_group_size = hmzoned ?
+		zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+	if (dev_block_count < system_group_size) {
 		error("device is too small to make filesystem, must be at least %llu",
-				(unsigned long long)BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+				(unsigned long long)system_group_size);
 		goto error;
 	}
 
@@ -1116,6 +1142,7 @@ int main(int argc, char **argv)
 	mkfs_cfg.sectorsize = sectorsize;
 	mkfs_cfg.stripesize = stripesize;
 	mkfs_cfg.features = features;
+	mkfs_cfg.zone_size = zone_size(file);
 
 	ret = make_btrfs(fd, &mkfs_cfg);
 	if (ret) {
@@ -1126,6 +1153,7 @@ int main(int argc, char **argv)
 
 	fs_info = open_ctree_fs_info(file, 0, 0, 0,
 			OPEN_CTREE_WRITES | OPEN_CTREE_TEMPORARY_SUPER);
+
 	if (!fs_info) {
 		error("open ctree failed");
 		goto error;
@@ -1199,7 +1227,8 @@ int main(int argc, char **argv)
 				block_count,
 				(verbose ? PREP_DEVICE_VERBOSE : 0) |
 				(zero_end ? PREP_DEVICE_ZERO_END : 0) |
-				(discard ? PREP_DEVICE_DISCARD : 0));
+				(discard ? PREP_DEVICE_DISCARD : 0) |
+				(hmzoned ? PREP_DEVICE_HMZONED : 0));
 		if (ret) {
 			goto error;
 		}
@@ -1296,6 +1325,10 @@ raid_groups:
 			btrfs_group_profile_str(metadata_profile),
 			pretty_size(allocation.system));
 		printf("SSD detected:       %s\n", ssd ? "yes" : "no");
+		printf("Zoned device:       %s\n", hmzoned ? "yes" : "no");
+		if (hmzoned)
+			printf("Zone size:          %s\n",
+			       pretty_size(fs_info->fs_devices->zone_size));
 		btrfs_parse_features_to_string(features_buf, features);
 		printf("Incompat features:  %s", features_buf);
 		printf("\n");
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 11/12] btrfs-progs: device-add: support HMZONED device
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (8 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 10/12] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
@ 2019-06-07 13:17   ` Naohiro Aota
  2019-06-07 13:17   ` [PATCH 12/12] btrfs-progs: introduce support for dev-replace " Naohiro Aota
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch checks if the target file system is flagged as HMZONED. If it
is, the device to be added is flagged with PREP_DEVICE_HMZONED. Also add
checks to prevent mixing non-zoned devices and zoned devices.
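The compatibility rule enforced here can be sketched as a small
predicate. This is an illustrative model only; the enum mirrors the
zoned_model() return values used elsewhere in the series, and the helper
name is hypothetical:

```c
#include <assert.h>

/* Zoned model of a block device, as reported by the kernel. */
enum zoned_model { ZONED_NONE, ZONED_HOST_AWARE, ZONED_HOST_MANAGED };

/* A non-zoned device cannot join an HMZONED file system, and a
 * host-managed device (which enforces sequential writes) cannot join a
 * regular one. Host-aware devices accept random writes, so they are
 * usable either way. */
static int device_compatible(int fs_hmzoned, enum zoned_model model)
{
	if (fs_hmzoned && model == ZONED_NONE)
		return 0;
	if (!fs_hmzoned && model == ZONED_HOST_MANAGED)
		return 0;
	return 1;
}
```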

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds-device.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index e3e30b6d5ded..86ffb1a2a5c2 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -57,6 +57,9 @@ static int cmd_device_add(int argc, char **argv)
 	int discard = 1;
 	int force = 0;
 	int last_dev;
+	int res;
+	int hmzoned;
+	struct btrfs_ioctl_feature_flags feature_flags;
 
 	optind = 0;
 	while (1) {
@@ -92,12 +95,33 @@ static int cmd_device_add(int argc, char **argv)
 	if (fdmnt < 0)
 		return 1;
 
+	res = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+	if (res) {
+		error("error getting feature flags '%s': %m", mntpnt);
+		return 1;
+	}
+	hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	for (i = optind; i < last_dev; i++){
 		struct btrfs_ioctl_vol_args ioctl_args;
-		int	devfd, res;
+		int	devfd;
 		u64 dev_block_count = 0;
 		char *path;
 
+		if (hmzoned && zoned_model(argv[i]) == ZONED_NONE) {
+			error("cannot add non-zoned device to HMZONED file system '%s'",
+			      argv[i]);
+			ret++;
+			continue;
+		}
+
+		if (!hmzoned && zoned_model(argv[i]) == ZONED_HOST_MANAGED) {
+			error("cannot add host managed zoned device to non-HMZONED file system '%s'",
+			      argv[i]);
+			ret++;
+			continue;
+		}
+
 		res = test_dev_for_mkfs(argv[i], force);
 		if (res) {
 			ret++;
@@ -113,7 +137,8 @@ static int cmd_device_add(int argc, char **argv)
 
 		res = btrfs_prepare_device(devfd, argv[i], &dev_block_count, 0,
 				PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
-				(discard ? PREP_DEVICE_DISCARD : 0));
+				(discard ? PREP_DEVICE_DISCARD : 0) |
+				(hmzoned ? PREP_DEVICE_HMZONED : 0));
 		close(devfd);
 		if (res) {
 			ret++;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH 12/12] btrfs-progs: introduce support for dev-replace HMZONED device
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
                     ` (9 preceding siblings ...)
  2019-06-07 13:17   ` [PATCH 11/12] btrfs-progs: device-add: support HMZONED device Naohiro Aota
@ 2019-06-07 13:17   ` " Naohiro Aota
  10 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-07 13:17 UTC (permalink / raw)
  To: linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche,
	Naohiro Aota

This patch checks if the target file system is flagged as HMZONED. If it
is, the replacement target device is prepared with PREP_DEVICE_HMZONED
(together with PREP_DEVICE_DISCARD, so that the target device zones are
reset before use).

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 cmds-replace.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/cmds-replace.c b/cmds-replace.c
index 713d200938d4..c752ceaadb77 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -116,6 +116,7 @@ static const char *const cmd_replace_start_usage[] = {
 
 static int cmd_replace_start(int argc, char **argv)
 {
+	struct btrfs_ioctl_feature_flags feature_flags;
 	struct btrfs_ioctl_dev_replace_args start_args = {0};
 	struct btrfs_ioctl_dev_replace_args status_args = {0};
 	int ret;
@@ -123,6 +124,7 @@ static int cmd_replace_start(int argc, char **argv)
 	int c;
 	int fdmnt = -1;
 	int fddstdev = -1;
+	int hmzoned;
 	char *path;
 	char *srcdev;
 	char *dstdev = NULL;
@@ -163,6 +165,13 @@ static int cmd_replace_start(int argc, char **argv)
 	if (fdmnt < 0)
 		goto leave_with_error;
 
+	ret = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+	if (ret) {
+		error("ioctl(GET_FEATURES) on '%s' returns error: %m", path);
+		goto leave_with_error;
+	}
+	hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
 	/* check for possible errors before backgrounding */
 	status_args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS;
 	status_args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
@@ -257,7 +266,8 @@ static int cmd_replace_start(int argc, char **argv)
 	strncpy((char *)start_args.start.tgtdev_name, dstdev,
 		BTRFS_DEVICE_PATH_NAME_MAX);
 	ret = btrfs_prepare_device(fddstdev, dstdev, &dstdev_block_count, 0,
-			PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE);
+			PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
+			(hmzoned ? PREP_DEVICE_HMZONED | PREP_DEVICE_DISCARD : 0));
 	if (ret)
 		goto leave_with_error;
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 00/19] btrfs zoned block device support
  2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
                   ` (19 preceding siblings ...)
  2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
@ 2019-06-12 17:51 ` David Sterba
  2019-06-13  4:59   ` Naohiro Aota
  20 siblings, 1 reply; 79+ messages in thread
From: David Sterba @ 2019-06-12 17:51 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> btrfs zoned block device support
> 
> This series adds zoned block device support to btrfs.

The overall design sounds ok.

I skimmed through the patches and the biggest task I see is how to make
the hmzoned adjustments and branches less visible, ie. there are too
many if (hmzoned) { do something } standing out. But that's merely a
matter of wrappers and maybe an abstraction here and there.

How can I test the zoned devices backed by files (or regular disks)? I
searched for some concrete example eg. for qemu or dm-zoned, but closest
match was a text description in libzbc README that it's possible to
implement. All other howtos expect a real zoned device.

Merge target is 5.3 or later, we'll see how things will go. I'm
expecting that we might need some time to get feedback about the
usability as there's no previous work widely used that we can build on
top of.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 00/19] btrfs zoned block device support
  2019-06-12 17:51 ` [PATCH v2 00/19] btrfs zoned block device support David Sterba
@ 2019-06-13  4:59   ` Naohiro Aota
  2019-06-13 13:46     ` David Sterba
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-13  4:59 UTC (permalink / raw)
  To: dsterba
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 2:50, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>> btrfs zoned block device support
>>
>> This series adds zoned block device support to btrfs.
> 
> The overall design sounds ok.
> 
> I skimmed through the patches and the biggest task I see is how to make
> the hmzoned adjustments and branches less visible, ie. there are too
> many if (hmzoned) { do something } standing out. But that's merely a
> matter of wrappers and maybe an abstraction here and there.

Sure. I'll add some more abstractions in the next version.

> How can I test the zoned devices backed by files (or regular disks)? I
> searched for some concrete example eg. for qemu or dm-zoned, but closest
> match was a text description in libzbc README that it's possible to
> implement. All other howtos expect a real zoned device.

You can use tcmu-runner [1] to create an emulated zoned device backed by
a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

> Merge target is 5.3 or later, we'll see how things will go. I'm
> expecting that we might need some time to get feedback about the
> usability as there's no previous work widely used that we can build on
> top of.
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 00/19] btrfs zoned block device support
  2019-06-13  4:59   ` Naohiro Aota
@ 2019-06-13 13:46     ` David Sterba
  2019-06-14  2:07       ` Naohiro Aota
  2019-06-17  2:44       ` Damien Le Moal
  0 siblings, 2 replies; 79+ messages in thread
From: David Sterba @ 2019-06-13 13:46 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
> On 2019/06/13 2:50, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> >> btrfs zoned block device support
> >>
> >> This series adds zoned block device support to btrfs.
> > 
> > The overall design sounds ok.
> > 
> > I skimmed through the patches and the biggest task I see is how to make
> > the hmzoned adjustments and branches less visible, ie. there are too
> > many if (hmzoned) { do something } standing out. But that's merely a
> > matter of wrappers and maybe an abstraction here and there.
> 
> Sure. I'll add some more abstractions in the next version.

Ok, I'll reply to the patches with specific things.

> > How can I test the zoned devices backed by files (or regular disks)? I
> > searched for some concrete example eg. for qemu or dm-zoned, but closest
> > match was a text description in libzbc README that it's possible to
> > implement. All other howtos expect a real zoned device.
> 
> You can use tcmu-runner [1] to create an emulated zoned device backed by
> a regular file. Here is a setup how-to:
> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

That looks great, thanks. I wonder why there's no easy way to find that;
all I got were dead links to linux-iscsi.org or targetcli tutorials that
were years old and not working.

Feeding the textual commands to targetcli is not exactly what I'd
expect for scripting, but at least it seems to work.

I tried to pass an emulated ZBC device on the host to a KVM guest (as a
scsi device) but lsscsi does not recognize it as a zoned device (just a
QEMU harddisk). So it seems the emulation must be done inside the VM.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/19] btrfs: Check and enable HMZONED mode
  2019-06-07 13:10 ` [PATCH 03/19] btrfs: Check and enable HMZONED mode Naohiro Aota
@ 2019-06-13 13:57   ` Josef Bacik
  2019-06-18  6:43     ` Naohiro Aota
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 13:57 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:09PM +0900, Naohiro Aota wrote:
> HMZONED mode cannot be used together with the RAID5/6 profile for now.
> Introduce the function btrfs_check_hmzoned_mode() to check this. This
> function will also check if the HMZONED flag is enabled on the file system
> and if the file system consists of zoned devices with equal zone size.
> 
> Additionally, as updates to the space cache are in-place, the space cache
> cannot be located over sequential zones and there is no guarantee that the
> device will have enough conventional zones to store this cache. Resolve
> this problem by completely disabling the space cache.  This does not
> introduce any problems with sequential block groups: all the free space is
> located after the allocation pointer and there is no free space before the
> pointer, so there is no need for such a cache.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/ctree.h       |  3 ++
>  fs/btrfs/dev-replace.c |  7 +++
>  fs/btrfs/disk-io.c     |  7 +++
>  fs/btrfs/super.c       | 12 ++---
>  fs/btrfs/volumes.c     | 99 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h     |  1 +
>  6 files changed, 124 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b81c331b28fa..6c00101407e4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -806,6 +806,9 @@ struct btrfs_fs_info {
>  	struct btrfs_root *uuid_root;
>  	struct btrfs_root *free_space_root;
>  
> +	/* Zone size when in HMZONED mode */
> +	u64 zone_size;
> +
>  	/* the log root tree is a directory of all the other log roots */
>  	struct btrfs_root *log_root_tree;
>  
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index ee0989c7e3a9..fbe5ea2a04ed 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>  		return PTR_ERR(bdev);
>  	}
>  
> +	if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
> +	     !btrfs_fs_incompat(fs_info, HMZONED)) ||
> +	    (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {

You do this in a few places, turn this into a helper please.
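
Such a helper might look like the following self-contained userspace
sketch; the name and boolean parameters are hypothetical, and an in-kernel
version would instead take fs_info and the bdev and derive the conditions
from bdev_zoned_model(), bdev_is_zoned() and btrfs_fs_incompat():

```c
#include <stdbool.h>

/*
 * Sketch of the suggested helper: a device is acceptable only if its
 * zone model matches the filesystem's HMZONED incompat flag.
 * Name and signature are illustrative, not from the patch.
 */
static bool btrfs_zone_model_matches(bool bdev_is_host_managed,
				     bool bdev_is_zoned, bool fs_hmzoned)
{
	/* Host-managed device, but the fs lacks the HMZONED flag: reject */
	if (bdev_is_host_managed && !fs_hmzoned)
		return false;
	/* HMZONED fs, but the device is not zoned at all: reject */
	if (fs_hmzoned && !bdev_is_zoned)
		return false;
	return true;
}
```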

> +		ret = -EINVAL;
> +		goto error;
> +	}
> +
>  	filemap_write_and_wait(bdev->bd_inode->i_mapping);
>  
>  	devices = &fs_info->fs_devices->devices;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 663efce22d98..7c1404c76768 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
>  
>  	btrfs_free_extra_devids(fs_devices, 1);
>  
> +	ret = btrfs_check_hmzoned_mode(fs_info);
> +	if (ret) {
> +		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
> +				ret);
> +		goto fail_block_groups;
> +	}
> +
>  	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>  	if (ret) {
>  		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2c66d9ea6a3b..740a701f16c5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  	bool saved_compress_force;
>  	int no_compress = 0;
>  
> -	cache_gen = btrfs_super_cache_generation(info->super_copy);
> -	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> -		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> -	else if (cache_gen)
> -		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> +	if (!btrfs_fs_incompat(info, HMZONED)) {
> +		cache_gen = btrfs_super_cache_generation(info->super_copy);
> +		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> +			btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> +		else if (cache_gen)
> +			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> +	}
>  

This disables the free space tree as well as the cache, sounds like you only
need to disable the free space cache?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
@ 2019-06-13 13:58   ` Josef Bacik
  2019-06-18  6:04     ` Naohiro Aota
  2019-06-13 13:58   ` Josef Bacik
  2019-06-17 18:57   ` David Sterba
  2 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 13:58 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
> avoid costly run-time zone report commands to test the device zone type
> during block allocation, attach the seqzones bitmap to the device structure
> to indicate if a zone is sequential or accepts random writes.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  33 +++++++++++
>  2 files changed, 176 insertions(+)
> 

We have enough problems with giant files already, please just add a separate
hmzoned.c or whatever and put all the zone specific code in there.  That'll save
me time when I go and break a bunch of stuff out.  Thanks,

Josef
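
As background, the btrfs_dev_is_sequential() helper described in the
commit message boils down to a bitmap lookup: shift the byte offset down
to a zone number and test that bit in seq_zones. A self-contained
userspace sketch (struct name abbreviated; the kernel version would use
test_bit()):

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Minimal stand-in for the zone fields the patch adds to
 * struct btrfs_device. Field names follow the patch.
 */
struct zone_dev {
	unsigned long *seq_zones;	/* 1 bit per zone */
	unsigned char zone_size_shift;	/* log2(zone size in bytes) */
};

/* Does the zone containing byte offset 'pos' require sequential writes? */
static bool dev_is_sequential(const struct zone_dev *dev,
			      unsigned long long pos)
{
	unsigned int zno = pos >> dev->zone_size_shift;
	unsigned int bits = sizeof(unsigned long) * CHAR_BIT;

	/* Open-coded test_bit(zno, dev->seq_zones) */
	return (dev->seq_zones[zno / bits] >> (zno % bits)) & 1;
}
```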

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
  2019-06-13 13:58   ` Josef Bacik
@ 2019-06-13 13:58   ` Josef Bacik
  2019-06-17 18:57   ` David Sterba
  2 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 13:58 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
> avoid costly run-time zone report commands to test the device zone type
> during block allocation, attach the seqzones bitmap to the device structure
> to indicate if a zone is sequential or accepts random writes.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  33 +++++++++++
>  2 files changed, 176 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1c2a6e4b39da..b673178718e3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path,
>  	return ret;
>  }
>  
> +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> +				 struct blk_zone **zones,
> +				 unsigned int *nr_zones, gfp_t gfp_mask)
> +{
> +	struct blk_zone *z = *zones;
> +	int ret;
> +
> +	if (!z) {
> +		z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
> +		if (!z)
> +			return -ENOMEM;
> +	}
> +
> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
> +				  z, nr_zones, gfp_mask);
> +	if (ret != 0) {
> +		btrfs_err(device->fs_info, "Get zone at %llu failed %d\n",
> +			  pos, ret);
> +		return ret;
> +	}
> +
> +	*zones = z;
> +
> +	return 0;
> +}
> +
> +static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device)
> +{
> +	kfree(device->seq_zones);
> +	kfree(device->empty_zones);
> +	device->seq_zones = NULL;
> +	device->empty_zones = NULL;
> +	device->nr_zones = 0;
> +	device->zone_size = 0;
> +	device->zone_size_shift = 0;
> +}
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +		       struct blk_zone *zone, gfp_t gfp_mask)
> +{
> +	unsigned int nr_zones = 1;
> +	int ret;
> +
> +	ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
> +	if (ret != 0 || !nr_zones)
> +		return ret ? ret : -EIO;
> +
> +	return 0;
> +}
> +
> +int btrfs_get_dev_zonetypes(struct btrfs_device *device)
> +{
> +	struct block_device *bdev = device->bdev;
> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> +	sector_t sector = 0;
> +	struct blk_zone *zones = NULL;
> +	unsigned int i, n = 0, nr_zones;
> +	int ret;
> +
> +	device->zone_size = 0;
> +	device->zone_size_shift = 0;
> +	device->nr_zones = 0;
> +	device->seq_zones = NULL;
> +	device->empty_zones = NULL;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> +	device->zone_size_shift = ilog2(device->zone_size);
> +	device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> +		device->nr_zones++;
> +
> +	device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> +				    sizeof(*device->seq_zones), GFP_KERNEL);
> +	if (!device->seq_zones)
> +		return -ENOMEM;
> +
> +	device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> +				      sizeof(*device->empty_zones), GFP_KERNEL);
> +	if (!device->empty_zones)
> +		return -ENOMEM;
> +
> +#define BTRFS_REPORT_NR_ZONES   4096
> +
> +	/* Get zones type */
> +	while (sector < nr_sectors) {
> +		nr_zones = BTRFS_REPORT_NR_ZONES;
> +		ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
> +					    &zones, &nr_zones, GFP_KERNEL);
> +		if (ret != 0 || !nr_zones) {
> +			if (!ret)
> +				ret = -EIO;
> +			goto out;
> +		}
> +
> +		for (i = 0; i < nr_zones; i++) {
> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> +				set_bit(n, device->seq_zones);
> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> +				set_bit(n, device->empty_zones);
> +			sector = zones[i].start + zones[i].len;
> +			n++;
> +		}
> +	}
> +
> +	if (n != device->nr_zones) {
> +		btrfs_err(device->fs_info,
> +			  "Inconsistent number of zones (%u / %u)\n", n,
> +			  device->nr_zones);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	btrfs_info(device->fs_info,
> +		   "host-%s zoned block device, %u zones of %llu sectors\n",
> +		   bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> +		   device->nr_zones, device->zone_size >> SECTOR_SHIFT);
> +
> +out:
> +	kfree(zones);
> +
> +	if (ret)
> +		btrfs_destroy_dev_zonetypes(device);
> +
> +	return ret;
> +}
> +
>  static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>  			struct btrfs_device *device, fmode_t flags,
>  			void *holder)
> @@ -842,6 +971,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>  	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>  	device->mode = flags;
>  
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zonetypes(device);
> +	if (ret != 0)
> +		goto error_brelse;
> +
>  	fs_devices->open_devices++;
>  	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>  	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
> @@ -1243,6 +1377,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
>  	}
>  
>  	blkdev_put(device->bdev, device->mode);
> +	btrfs_destroy_dev_zonetypes(device);
>  }
>  
>  static void btrfs_close_one_device(struct btrfs_device *device)
> @@ -2664,6 +2799,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	mutex_unlock(&fs_info->chunk_mutex);
>  	mutex_unlock(&fs_devices->device_list_mutex);
>  
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zonetypes(device);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto error_sysfs;
> +	}
> +
>  	if (seeding_dev) {
>  		mutex_lock(&fs_info->chunk_mutex);
>  		ret = init_first_rw_device(trans);
> @@ -2729,6 +2871,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	return ret;
>  
>  error_sysfs:
> +	btrfs_destroy_dev_zonetypes(device);
>  	btrfs_sysfs_rm_device_link(fs_devices, device);
>  	mutex_lock(&fs_info->fs_devices->device_list_mutex);
>  	mutex_lock(&fs_info->chunk_mutex);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b8a0e8d0672d..1599641e216c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -62,6 +62,16 @@ struct btrfs_device {
>  
>  	struct block_device *bdev;
>  
> +	/*
> +	 * Number of zones, zone size and types of zones if bdev is a
> +	 * zoned block device.
> +	 */
> +	u64 zone_size;
> +	u8  zone_size_shift;
> +	u32 nr_zones;
> +	unsigned long *seq_zones;
> +	unsigned long *empty_zones;
> +

Also make a struct btrfs_zone_info and have a pointer to one here if we have
zone devices.  Thanks,

Josef
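
A grouping along these lines might look like the following hypothetical
sketch, with the per-device zone fields gathered behind one struct (NULL
for non-zoned devices); nr_zones_for() mirrors the round-up computation
in the quoted patch, expressed in userspace form:

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;
typedef uint8_t u8;

/*
 * Hypothetical struct btrfs_zone_info as suggested in the review:
 * the zone fields the patch adds to struct btrfs_device, grouped.
 */
struct btrfs_zone_info {
	u64 zone_size;			/* zone size in bytes */
	u8 zone_size_shift;		/* ilog2(zone_size) */
	u32 nr_zones;			/* rounded up, see below */
	unsigned long *seq_zones;	/* bitmap: sequential-write zones */
	unsigned long *empty_zones;	/* bitmap: empty zones */
};

/*
 * nr_zones as computed in the patch: one extra zone when the device
 * size is not a whole multiple of the zone size.
 */
static u32 nr_zones_for(u64 nr_sectors, u64 zone_sectors)
{
	u32 n = nr_sectors / zone_sectors;

	if (nr_sectors % zone_sectors)
		n++;
	return n;
}
```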

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/19] btrfs: disable direct IO in HMZONED mode
  2019-06-07 13:10 ` [PATCH 05/19] btrfs: disable direct IO " Naohiro Aota
@ 2019-06-13 14:00   ` Josef Bacik
  2019-06-18  8:17     ` Naohiro Aota
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:00 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:11PM +0900, Naohiro Aota wrote:
> Direct write I/Os can be directed at existing extents that have already
> been written. Such write requests are prohibited on host-managed zoned
> block devices. So disable direct IO support for a volume with HMZONED mode
> enabled.
> 

That's only if we're nocow, so seems like you only need to disable DIO into
nocow regions with hmzoned?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-07 13:10 ` [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
@ 2019-06-13 14:07   ` Josef Bacik
  2019-06-18  8:28     ` Naohiro Aota
  2019-06-17 22:30   ` David Sterba
  1 sibling, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:07 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
> On HMZONED drives, writes must always be sequential and directed at a block
> group zone write pointer position. Thus, block allocation in a block group
> must also be done sequentially using an allocation pointer equal to the
> block group zone write pointer plus the number of blocks allocated but not
> yet written.
> 
> The sequential allocation function find_free_extent_seq() bypasses the
> checks in find_free_extent() and increases the reserved byte counter by
> itself. A region, once allocated, cannot be reverted in the sequential
> allocation, since reverting might race with other allocations and leave an
> allocation hole, which breaks the sequential write rule.
> 
> Furthermore, this commit introduces two new fields in struct
> btrfs_block_group_cache. "wp_broken" indicates that the write pointer is
> broken (e.g. not synced on a RAID1 block group) and marks that block group
> read-only. "unusable" keeps track of the size of regions that were once
> allocated and then freed. Such regions are never usable until the
> underlying zones are reset.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/ctree.h            |  24 +++
>  fs/btrfs/extent-tree.c      | 378 ++++++++++++++++++++++++++++++++++--
>  fs/btrfs/free-space-cache.c |  33 ++++
>  fs/btrfs/free-space-cache.h |   5 +
>  4 files changed, 426 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6c00101407e4..f4bcd2a6ec12 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -582,6 +582,20 @@ struct btrfs_full_stripe_locks_tree {
>  	struct mutex lock;
>  };
>  
> +/* Block group allocation types */
> +enum btrfs_alloc_type {
> +
> +	/* Regular first fit allocation */
> +	BTRFS_ALLOC_FIT		= 0,
> +
> +	/*
> +	 * Sequential allocation: this is for HMZONED mode and
> +	 * will result in ignoring free space before a block
> +	 * group allocation offset.
> +	 */
> +	BTRFS_ALLOC_SEQ		= 1,
> +};
> +
>  struct btrfs_block_group_cache {
>  	struct btrfs_key key;
>  	struct btrfs_block_group_item item;
> @@ -592,6 +606,7 @@ struct btrfs_block_group_cache {
>  	u64 reserved;
>  	u64 delalloc_bytes;
>  	u64 bytes_super;
> +	u64 unusable;
>  	u64 flags;
>  	u64 cache_generation;
>  
> @@ -621,6 +636,7 @@ struct btrfs_block_group_cache {
>  	unsigned int iref:1;
>  	unsigned int has_caching_ctl:1;
>  	unsigned int removed:1;
> +	unsigned int wp_broken:1;
>  
>  	int disk_cache_state;
>  
> @@ -694,6 +710,14 @@ struct btrfs_block_group_cache {
>  
>  	/* Record locked full stripes for RAID5/6 block group */
>  	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> +	/*
> +	 * Allocation offset for the block group to implement sequential
> +	 * allocation. This is used only with HMZONED mode enabled and if
> +	 * the block group resides on a sequential zone.
> +	 */
> +	enum btrfs_alloc_type alloc_type;
> +	u64 alloc_offset;
>  };
>  
>  /* delayed seq elem */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 363db58f56b8..ebd0d6eae038 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -28,6 +28,7 @@
>  #include "sysfs.h"
>  #include "qgroup.h"
>  #include "ref-verify.h"
> +#include "rcu-string.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -590,6 +591,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>  	struct btrfs_caching_control *caching_ctl;
>  	int ret = 0;
>  
> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
>  	if (!caching_ctl)
>  		return -ENOMEM;
> @@ -6555,6 +6558,19 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
>  	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
>  }
>  
> +static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
> +				       u64 ram_bytes, u64 num_bytes,
> +				       int delalloc)
> +{
> +	struct btrfs_space_info *space_info = cache->space_info;
> +
> +	cache->reserved += num_bytes;
> +	space_info->bytes_reserved += num_bytes;
> +	update_bytes_may_use(space_info, -ram_bytes);
> +	if (delalloc)
> +		cache->delalloc_bytes += num_bytes;
> +}
> +
>  /**
>   * btrfs_add_reserved_bytes - update the block_group and space info counters
>   * @cache:	The cache we are manipulating
> @@ -6573,17 +6589,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>  	struct btrfs_space_info *space_info = cache->space_info;
>  	int ret = 0;
>  
> +	/* should handled by find_free_extent_seq */
> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	spin_lock(&space_info->lock);
>  	spin_lock(&cache->lock);
> -	if (cache->ro) {
> +	if (cache->ro)
>  		ret = -EAGAIN;
> -	} else {
> -		cache->reserved += num_bytes;
> -		space_info->bytes_reserved += num_bytes;
> -		update_bytes_may_use(space_info, -ram_bytes);
> -		if (delalloc)
> -			cache->delalloc_bytes += num_bytes;
> -	}
> +	else
> +		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
> +					   delalloc);
>  	spin_unlock(&cache->lock);
>  	spin_unlock(&space_info->lock);
>  	return ret;
> @@ -6701,9 +6716,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  			cache = btrfs_lookup_block_group(fs_info, start);
>  			BUG_ON(!cache); /* Logic error */
>  
> -			cluster = fetch_cluster_info(fs_info,
> -						     cache->space_info,
> -						     &empty_cluster);
> +			if (cache->alloc_type == BTRFS_ALLOC_FIT)
> +				cluster = fetch_cluster_info(fs_info,
> +							     cache->space_info,
> +							     &empty_cluster);
> +			else
> +				cluster = NULL;
> +
>  			empty_cluster <<= 1;
>  		}
>  
> @@ -6743,7 +6762,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  		space_info->max_extent_size = 0;
>  		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>  			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -		if (cache->ro) {
> +		if (cache->ro || cache->alloc_type == BTRFS_ALLOC_SEQ) {
> +			/* need reset before reusing in ALLOC_SEQ BG */
>  			space_info->bytes_readonly += len;
>  			readonly = true;
>  		}
> @@ -7588,6 +7608,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg,
>  	return 0;
>  }
>  
> +/*
> + * Simple allocator for sequential only block group. It only allows
> + * sequential allocation. No need to play with trees. This function
> + * also reserve the bytes as in btrfs_add_reserved_bytes.
> + */
> +
> +static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
> +				struct find_free_extent_ctl *ffe_ctl)
> +{
> +	struct btrfs_space_info *space_info = cache->space_info;
> +	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
> +	u64 start = cache->key.objectid;
> +	u64 num_bytes = ffe_ctl->num_bytes;
> +	u64 avail;
> +	int ret = 0;
> +
> +	/* Sanity check */
> +	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
> +		return 1;
> +
> +	spin_lock(&space_info->lock);
> +	spin_lock(&cache->lock);
> +
> +	if (cache->ro) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
> +	spin_lock(&ctl->tree_lock);
> +	avail = cache->key.offset - cache->alloc_offset;
> +	if (avail < num_bytes) {
> +		ffe_ctl->max_extent_size = avail;
> +		spin_unlock(&ctl->tree_lock);
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	ffe_ctl->found_offset = start + cache->alloc_offset;
> +	cache->alloc_offset += num_bytes;
> +	ctl->free_space -= num_bytes;
> +	spin_unlock(&ctl->tree_lock);
> +
> +	BUG_ON(!IS_ALIGNED(ffe_ctl->found_offset,
> +			   cache->fs_info->stripesize));
> +	ffe_ctl->search_start = ffe_ctl->found_offset;
> +	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
> +				   ffe_ctl->delalloc);
> +
> +out:
> +	spin_unlock(&cache->lock);
> +	spin_unlock(&space_info->lock);
> +	return ret;
> +}
> +
>  /*
>   * Return >0 means caller needs to re-search for free extent
>   * Return 0 means we have the needed free extent.
> @@ -7889,6 +7963,16 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
>  			goto loop;
>  
> +		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
> +			ret = find_free_extent_seq(block_group, &ffe_ctl);
> +			if (ret)
> +				goto loop;
> +			/* btrfs_find_space_for_alloc_seq should ensure
> +			 * that everything is OK and reserve the extent.
> +			 */
> +			goto nocheck;
> +		}
> +
>  		/*
>  		 * Ok we want to try and use the cluster allocator, so
>  		 * lets look there
> @@ -7944,6 +8028,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  					     num_bytes);
>  			goto loop;
>  		}
> +nocheck:
>  		btrfs_inc_block_group_reservations(block_group);
>  
>  		/* we are all good, lets return */
> @@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
>  	}
>  
>  	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
> -		    cache->bytes_super - btrfs_block_group_used(&cache->item);
> +		    cache->bytes_super - cache->unusable -
> +		    btrfs_block_group_used(&cache->item);
>  	sinfo_used = btrfs_space_info_used(sinfo, true);
>  
>  	if (sinfo_used + num_bytes + min_allocable_bytes <=
> @@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
>  	if (!--cache->ro) {
>  		num_bytes = cache->key.offset - cache->reserved -
>  			    cache->pinned - cache->bytes_super -
> +			    cache->unusable -
>  			    btrfs_block_group_used(&cache->item);

You've done this in a few places, but not all the places, most notably
btrfs_space_info_used() which is used in the space reservation code a lot.

>  		sinfo->bytes_readonly -= num_bytes;
>  		list_del_init(&cache->ro_list);
> @@ -10200,11 +10287,240 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
>  	}
>  }
>  
> +static int
> +btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
> +{
> +	struct btrfs_fs_info *fs_info = cache->fs_info;
> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	struct btrfs_device *device;
> +	u64 logical = cache->key.objectid;
> +	u64 length = cache->key.offset;
> +	u64 physical = 0;
> +	int ret, alloc_type;
> +	int i, j;
> +	u64 *alloc_offsets = NULL;
> +
> +#define WP_MISSING_DEV ((u64)-1)
> +
> +	/* Sanity check */
> +	if (!IS_ALIGNED(length, fs_info->zone_size)) {
> +		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
> +			  logical, length);
> +		return -EIO;
> +	}
> +
> +	/* Get the chunk mapping */
> +	em_tree = &fs_info->mapping_tree.map_tree;
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, logical, length);
> +	read_unlock(&em_tree->lock);
> +
> +	if (!em)
> +		return -EINVAL;
> +
> +	map = em->map_lookup;
> +
> +	/*
> +	 * Get the zone type: if the group is mapped to a non-sequential zone,
> +	 * there is no need for the allocation offset (fit allocation is OK).
> +	 */
> +	alloc_type = -1;
> +	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
> +				GFP_NOFS);
> +	if (!alloc_offsets) {
> +		free_extent_map(em);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < map->num_stripes; i++) {
> +		int is_sequential;
> +		struct blk_zone zone;
> +
> +		device = map->stripes[i].dev;
> +		physical = map->stripes[i].physical;
> +
> +		if (device->bdev == NULL) {
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		}
> +
> +		is_sequential = btrfs_dev_is_sequential(device, physical);
> +		if (alloc_type == -1)
> +			alloc_type = is_sequential ?
> +					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
> +
> +		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
> +		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
> +			btrfs_err(fs_info, "found block group of mixed zone types");
> +			ret = -EIO;
> +			goto out;
> +		}
> +
> +		if (!is_sequential)
> +			continue;
> +
> +		/* this zone will be used for allocation, so mark this
> +		 * zone non-empty
> +		 */
> +		clear_bit(physical >> device->zone_size_shift,
> +			  device->empty_zones);
> +
> +		/*
> +		 * The group is mapped to a sequential zone. Get the zone write
> +		 * pointer to determine the allocation offset within the zone.
> +		 */
> +		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
> +		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
> +		if (ret == -EIO || ret == -EOPNOTSUPP) {
> +			ret = 0;
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		} else if (ret) {
> +			goto out;
> +		}
> +
> +
> +		switch (zone.cond) {
> +		case BLK_ZONE_COND_OFFLINE:
> +		case BLK_ZONE_COND_READONLY:
> +			btrfs_err(fs_info, "Offline/readonly zone %llu",
> +				  physical >> device->zone_size_shift);
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			break;
> +		case BLK_ZONE_COND_EMPTY:
> +			alloc_offsets[i] = 0;
> +			break;
> +		case BLK_ZONE_COND_FULL:
> +			alloc_offsets[i] = fs_info->zone_size;
> +			break;
> +		default:
> +			/* Partially used zone */
> +			alloc_offsets[i] =
> +				((zone.wp - zone.start) << SECTOR_SHIFT);
> +			break;
> +		}
> +	}
> +
> +	if (alloc_type == BTRFS_ALLOC_FIT)
> +		goto out;
> +
> +	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +	case 0: /* single */
> +	case BTRFS_BLOCK_GROUP_DUP:
> +	case BTRFS_BLOCK_GROUP_RAID1:
> +		cache->alloc_offset = WP_MISSING_DEV;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			if (alloc_offsets[i] == WP_MISSING_DEV)
> +				continue;
> +			if (cache->alloc_offset == WP_MISSING_DEV)
> +				cache->alloc_offset = alloc_offsets[i];
> +			if (alloc_offsets[i] == cache->alloc_offset)
> +				continue;
> +
> +			btrfs_err(fs_info,
> +				  "write pointer mismatch: block group %llu",
> +				  logical);
> +			cache->wp_broken = 1;
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID0:
> +		cache->alloc_offset = 0;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			if (alloc_offsets[i] == WP_MISSING_DEV) {
> +				btrfs_err(fs_info,
> +					  "cannot recover write pointer: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			if (alloc_offsets[0] < alloc_offsets[i]) {
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			cache->alloc_offset += alloc_offsets[i];
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID10:
> +		/*
> +		 * Pass1: check write pointer of RAID1 level: each pointer
> +		 * should be equal.
> +		 */
> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
> +			int base = i*map->sub_stripes;
> +			u64 offset = WP_MISSING_DEV;
> +
> +			for (j = 0; j < map->sub_stripes; j++) {
> +				if (alloc_offsets[base+j] == WP_MISSING_DEV)
> +					continue;
> +				if (offset == WP_MISSING_DEV)
> +					offset = alloc_offsets[base+j];
> +				if (alloc_offsets[base+j] == offset)
> +					continue;
> +
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +			}
> +			for (j = 0; j < map->sub_stripes; j++)
> +				alloc_offsets[base+j] = offset;
> +		}
> +
> +		/* Pass2: check write pointer of RAID1 level */
> +		cache->alloc_offset = 0;
> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
> +			int base = i*map->sub_stripes;
> +
> +			if (alloc_offsets[base] == WP_MISSING_DEV) {
> +				btrfs_err(fs_info,
> +					  "cannot recover write pointer: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			if (alloc_offsets[0] < alloc_offsets[base]) {
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			cache->alloc_offset += alloc_offsets[base];
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID5:
> +	case BTRFS_BLOCK_GROUP_RAID6:
> +		/* RAID5/6 is not supported yet */
> +	default:
> +		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
> +			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +out:
> +	cache->alloc_type = alloc_type;
> +	kfree(alloc_offsets);
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +

Move this to the zoned device file that you create.

>  static struct btrfs_block_group_cache *
>  btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>  			       u64 start, u64 size)
>  {
>  	struct btrfs_block_group_cache *cache;
> +	int ret;
>  
>  	cache = kzalloc(sizeof(*cache), GFP_NOFS);
>  	if (!cache)
> @@ -10238,6 +10554,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>  	atomic_set(&cache->trimming, 0);
>  	mutex_init(&cache->free_space_lock);
>  	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
> +	cache->alloc_type = BTRFS_ALLOC_FIT;
> +	cache->alloc_offset = 0;
> +
> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +		ret = btrfs_get_block_group_alloc_offset(cache);
> +		if (ret) {
> +			kfree(cache);
> +			return NULL;
> +		}
> +	}
>  
>  	return cache;
>  }
> @@ -10310,6 +10636,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  	int need_clear = 0;
>  	u64 cache_gen;
>  	u64 feature;
> +	u64 unusable;
>  	int mixed;
>  
>  	feature = btrfs_super_incompat_flags(info->super_copy);
> @@ -10415,6 +10742,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  			free_excluded_extents(cache);
>  		}
>  
> +		switch (cache->alloc_type) {
> +		case BTRFS_ALLOC_FIT:
> +			unusable = cache->bytes_super;
> +			break;
> +		case BTRFS_ALLOC_SEQ:
> +			WARN_ON(cache->bytes_super != 0);
> +			unusable = cache->alloc_offset -
> +				btrfs_block_group_used(&cache->item);
> +			/* we only need ->free_space in ALLOC_SEQ BGs */
> +			cache->last_byte_to_unpin = (u64)-1;
> +			cache->cached = BTRFS_CACHE_FINISHED;
> +			cache->free_space_ctl->free_space =
> +				cache->key.offset - cache->alloc_offset;
> +			cache->unusable = unusable;
> +			free_excluded_extents(cache);
> +			break;
> +		default:
> +			BUG();
> +		}
> +
>  		ret = btrfs_add_block_group_cache(info, cache);
>  		if (ret) {
>  			btrfs_remove_free_space_cache(cache);
> @@ -10425,7 +10772,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  		trace_btrfs_add_block_group(info, cache, 0);
>  		update_space_info(info, cache->flags, found_key.offset,
>  				  btrfs_block_group_used(&cache->item),
> -				  cache->bytes_super, &space_info);
> +				  unusable, &space_info);
>  
>  		cache->space_info = space_info;
>  
> @@ -10438,6 +10785,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  			ASSERT(list_empty(&cache->bg_list));
>  			btrfs_mark_bg_unused(cache);
>  		}
> +
> +		if (cache->wp_broken)
> +			inc_block_group_ro(cache, 1);
>  	}
>  
>  	list_for_each_entry_rcu(space_info, &info->space_info, list) {
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index f74dc259307b..cc69dc71f4c1 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  			   u64 offset, u64 bytes)
>  {
>  	struct btrfs_free_space *info;
> +	struct btrfs_block_group_cache *block_group = ctl->private;
>  	int ret = 0;
>  
> +	WARN_ON(block_group && block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
>  	if (!info)
>  		return -ENOMEM;
> @@ -2376,6 +2379,28 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
> +			       u64 bytenr, u64 size)
> +{
> +	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> +	u64 offset = bytenr - block_group->key.objectid;
> +	u64 to_free, to_unusable;
> +
> +	spin_lock(&ctl->tree_lock);
> +	if (offset >= block_group->alloc_offset)
> +		to_free = size;
> +	else if (offset + size <= block_group->alloc_offset)
> +		to_free = 0;
> +	else
> +		to_free = offset + size - block_group->alloc_offset;
> +	to_unusable = size - to_free;
> +	ctl->free_space += to_free;
> +	block_group->unusable += to_unusable;
> +	spin_unlock(&ctl->tree_lock);
> +	return 0;
> +
> +}
> +
>  int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>  			    u64 offset, u64 bytes)
>  {
> @@ -2384,6 +2409,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>  	int ret;
>  	bool re_search = false;
>  
> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +

These should probably be ASSERT() right?  Want to make sure the developers
really notice a problem when testing.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 08/19] btrfs: make unmirrored BGs readonly only if we have at least one writable BG
  2019-06-07 13:10 ` [PATCH 08/19] btrfs: make unmirrored BGs readonly only if we have at least one writable BG Naohiro Aota
@ 2019-06-13 14:09   ` Josef Bacik
  2019-06-18  7:42     ` Naohiro Aota
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:09 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:14PM +0900, Naohiro Aota wrote:
> If the btrfs volume has mirrored block groups, it unconditionally makes
> un-mirrored block groups read-only. When we have mirrored block groups but
> no writable block groups, this will drop all writable block groups. So,
> check that we have at least one writable mirrored block group before
> setting un-mirrored block groups read-only.
> 

I don't understand why you want this.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
@ 2019-06-13 14:12   ` Josef Bacik
  2019-06-18  8:51     ` Naohiro Aota
  2019-06-17 22:53   ` David Sterba
  2019-06-28  3:55   ` Anand Jain
  2 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:12 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in the sequential write required zones of host-managed zoned block
> devices, as in-place updates would not be possible.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/disk-io.c     | 11 +++++++++++
>  fs/btrfs/disk-io.h     |  1 +
>  fs/btrfs/extent-tree.c |  4 ++++
>  fs/btrfs/scrub.c       |  2 ++
>  4 files changed, 18 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7c1404c76768..ddbb02906042 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>  	return latest;
>  }
>  
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
> +{
> +	/* any address is good on a regular (zone_size == 0) device */
> +	/* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */

This is not how you do multi-line comments in the kernel.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-07 13:10 ` [PATCH 11/19] btrfs: introduce submit buffer Naohiro Aota
@ 2019-06-13 14:14   ` Josef Bacik
  2019-06-17  3:16     ` Damien Le Moal
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:14 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> Sequential allocation is not enough to maintain sequential delivery of
> write IOs to the device. Various features (async compress, async checksum,
> ...) of btrfs affect the ordering of the IOs. This patch introduces a
> submit buffer to collect the WRITE bios belonging to a block group and
> submit them sequentially, in increasing block address order, to achieve
> sequential write sequences with __btrfs_map_bio().
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

I hate everything about this.  Can't we just use the plugging infrastructure for
this and then make sure it re-orders the bios before submitting them?  Also
what's to prevent the block layer scheduler from re-arranging these IOs?
Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 12/19] btrfs: expire submit buffer on timeout
  2019-06-07 13:10 ` [PATCH 12/19] btrfs: expire submit buffer on timeout Naohiro Aota
@ 2019-06-13 14:15   ` Josef Bacik
  2019-06-17  3:19     ` Damien Le Moal
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:15 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:18PM +0900, Naohiro Aota wrote:
> It is possible to have bios stalled in the submit buffer due to some bug or
> device problem. In such a situation, btrfs stops working, waiting for the
> buffered bios to complete. To avoid such a hang, add a worker that cancels
> the stalled bios after a timeout.
> 

The block layer does this with its request timeouts right?  So it'll timeout
and we'll get an EIO?  If that's not working then we need to fix the block
layer.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode
  2019-06-07 13:10 ` [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode Naohiro Aota
@ 2019-06-13 14:17   ` Josef Bacik
  0 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:17 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:19PM +0900, Naohiro Aota wrote:
> Btrfs prioritizes sync I/Os so that the async checksum workers handle them
> earlier. As a result, checksumming of sync I/Os to a larger logical extent
> address can finish before checksumming of non-sync I/Os to a smaller
> logical extent address.
> 
> Since there is an upper limit on the number of checksum workers, it is
> possible for sync I/Os to wait forever on the never-started checksumming
> of I/Os to smaller addresses.
> 
> This situation can be reproduced by e.g. fstests btrfs/073.
> 
> To avoid such disordering, disable sync IO prioritization for now. Note
> that sync I/Os must wait for I/Os to smaller addresses to finish anyway,
> so the prioritization actually has no benefit in HMZONED mode.
> 

This stuff is going away once we finish the io.weight work anyway, I wouldn't
worry about this.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs
  2019-06-07 13:10 ` [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
@ 2019-06-13 14:24   ` Josef Bacik
  2019-06-18  9:09     ` Naohiro Aota
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:24 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:20PM +0900, Naohiro Aota wrote:
> Tree manipulating operations like merging nodes often release
> once-allocated tree nodes. Btrfs cleans such nodes so that the pages in the
> node are not uselessly written out. On HMZONED drives, however, such an
> optimization blocks the following IOs, as cancelling the write out of the
> freed blocks breaks the sequential write sequence expected by the device.
> 
> This patch introduces a list of clean extent buffers that have been
> released in a transaction. Btrfs consults the list before writing out and
> waiting for the IOs, and it redirties a buffer if 1) it is in a sequential
> BG, 2) it is in the not-yet-submitted range, and 3) it is not under IO.
> Thus, such buffers are marked for IO in btrfs_write_and_wait_transaction()
> to send proper bios to the disk.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/disk-io.c     | 27 ++++++++++++++++++++++++---
>  fs/btrfs/extent_io.c   |  1 +
>  fs/btrfs/extent_io.h   |  2 ++
>  fs/btrfs/transaction.c | 35 +++++++++++++++++++++++++++++++++++
>  fs/btrfs/transaction.h |  3 +++
>  5 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6651986da470..c6147fce648f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -535,7 +535,9 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>  	if (csum_tree_block(eb, result))
>  		return -EINVAL;
>  
> -	if (btrfs_header_level(eb))
> +	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags))
> +		ret = 0;
> +	else if (btrfs_header_level(eb))
>  		ret = btrfs_check_node(eb);
>  	else
>  		ret = btrfs_check_leaf_full(eb);
> @@ -1115,10 +1117,20 @@ struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
>  void btrfs_clean_tree_block(struct extent_buffer *buf)
>  {
>  	struct btrfs_fs_info *fs_info = buf->fs_info;
> -	if (btrfs_header_generation(buf) ==
> -	    fs_info->running_transaction->transid) {
> +	struct btrfs_transaction *cur_trans = fs_info->running_transaction;
> +
> +	if (btrfs_header_generation(buf) == cur_trans->transid) {
>  		btrfs_assert_tree_locked(buf);
>  
> +		if (btrfs_fs_incompat(fs_info, HMZONED) &&
> +		    list_empty(&buf->release_list)) {
> +			atomic_inc(&buf->refs);
> +			spin_lock(&cur_trans->releasing_ebs_lock);
> +			list_add_tail(&buf->release_list,
> +				      &cur_trans->releasing_ebs);
> +			spin_unlock(&cur_trans->releasing_ebs_lock);
> +		}
> +
>  		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
>  			percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
>  						 -buf->len,
> @@ -4533,6 +4545,15 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
>  	btrfs_destroy_pinned_extent(fs_info,
>  				    fs_info->pinned_extents);
>  
> +	while (!list_empty(&cur_trans->releasing_ebs)) {
> +		struct extent_buffer *eb;
> +
> +		eb = list_first_entry(&cur_trans->releasing_ebs,
> +				      struct extent_buffer, release_list);
> +		list_del_init(&eb->release_list);
> +		free_extent_buffer(eb);
> +	}
> +
>  	cur_trans->state =TRANS_STATE_COMPLETED;
>  	wake_up(&cur_trans->commit_wait);
>  }
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 13fca7bfc1f2..c73c69e2bef4 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4816,6 +4816,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
>  	init_waitqueue_head(&eb->read_lock_wq);
>  
>  	btrfs_leak_debug_add(&eb->leak_list, &buffers);
> +	INIT_LIST_HEAD(&eb->release_list);
>  
>  	spin_lock_init(&eb->refs_lock);
>  	atomic_set(&eb->refs, 1);
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index aa18a16a6ed7..2987a01f84f9 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -58,6 +58,7 @@ enum {
>  	EXTENT_BUFFER_IN_TREE,
>  	/* write IO error */
>  	EXTENT_BUFFER_WRITE_ERR,
> +	EXTENT_BUFFER_NO_CHECK,
>  };
>  
>  /* these are flags for __process_pages_contig */
> @@ -186,6 +187,7 @@ struct extent_buffer {
>  	 */
>  	wait_queue_head_t read_lock_wq;
>  	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
> +	struct list_head release_list;
>  #ifdef CONFIG_BTRFS_DEBUG
>  	atomic_t spinning_writers;
>  	atomic_t spinning_readers;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3f6811cdf803..ded40ad75419 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -236,6 +236,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>  	spin_lock_init(&cur_trans->dirty_bgs_lock);
>  	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
>  	spin_lock_init(&cur_trans->dropped_roots_lock);
> +	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
> +	spin_lock_init(&cur_trans->releasing_ebs_lock);
>  	list_add_tail(&cur_trans->list, &fs_info->trans_list);
>  	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
>  			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
> @@ -2219,7 +2221,31 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  
>  	wake_up(&fs_info->transaction_wait);
>  
> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +		struct extent_buffer *eb;
> +
> +		list_for_each_entry(eb, &cur_trans->releasing_ebs,
> +				    release_list) {
> +			struct btrfs_block_group_cache *cache;
> +
> +			cache = btrfs_lookup_block_group(fs_info, eb->start);
> +			if (!cache)
> +				continue;
> +			mutex_lock(&cache->submit_lock);
> +			if (cache->alloc_type == BTRFS_ALLOC_SEQ &&
> +			    cache->submit_offset <= eb->start &&
> +			    !extent_buffer_under_io(eb)) {
> +				set_extent_buffer_dirty(eb);
> +				cache->space_info->bytes_readonly += eb->len;

Huh?

> +				set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
> +			}
> +			mutex_unlock(&cache->submit_lock);
> +			btrfs_put_block_group(cache);
> +		}
> +	}
> +

Helper here please.
>  	ret = btrfs_write_and_wait_transaction(trans);
> +
>  	if (ret) {
>  		btrfs_handle_fs_error(fs_info, ret,
>  				      "Error while writing out transaction");
> @@ -2227,6 +2253,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>  		goto scrub_continue;
>  	}
>  
> +	while (!list_empty(&cur_trans->releasing_ebs)) {
> +		struct extent_buffer *eb;
> +
> +		eb = list_first_entry(&cur_trans->releasing_ebs,
> +				      struct extent_buffer, release_list);
> +		list_del_init(&eb->release_list);
> +		free_extent_buffer(eb);
> +	}
> +

Another helper, and also can't we release eb's above that we didn't need to
re-mark dirty?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 16/19] btrfs: wait existing extents before truncating
  2019-06-07 13:10 ` [PATCH 16/19] btrfs: wait existing extents before truncating Naohiro Aota
@ 2019-06-13 14:25   ` Josef Bacik
  0 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:25 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:22PM +0900, Naohiro Aota wrote:
> When truncating a file, file buffers which have already been allocated but
> not yet written may be truncated.  Truncating these buffers could cause
> breakage of a sequential write pattern in a block group if the truncated
> blocks are for example followed by blocks allocated to another file. To
> avoid this problem, always wait for write out of all unwritten buffers
> before proceeding with the truncate execution.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/inode.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 89542c19d09e..4e8c7921462f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5137,6 +5137,17 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>  		btrfs_end_write_no_snapshotting(root);
>  		btrfs_end_transaction(trans);
>  	} else {
> +		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> +		if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +			u64 sectormask = fs_info->sectorsize - 1;
> +
> +			ret = btrfs_wait_ordered_range(inode,
> +						       newsize & (~sectormask),
> +						       (u64)-1);

Use ALIGN().  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode
  2019-06-07 13:10 ` [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode Naohiro Aota
@ 2019-06-13 14:27   ` Josef Bacik
  0 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:27 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:
> In a write heavy workload, the following scenario can occur:
> 
> 1. mark page #0 to page #2 (and their corresponding extent region) as dirty
>    and candidate for delayed allocation
> 
> pages    0 1 2 3 4
> dirty    o o o - -
> towrite  - - - - -
> delayed  o o o - -
> alloc
> 
> 2. extent_write_cache_pages() marks the dirty pages as TOWRITE
> 
> pages    0 1 2 3 4
> dirty    o o o - -
> towrite  o o o - -
> delayed  o o o - -
> alloc
> 
> 3. Meanwhile, another write dirties page #3 and page #4
> 
> pages    0 1 2 3 4
> dirty    o o o o o
> towrite  o o o - -
> delayed  o o o o o
> alloc
> 
> 4. find_lock_delalloc_range() decides to allocate a region to write page #0
>    to page #4
> 5. but, extent_write_cache_pages() only initiates writes to TOWRITE-tagged
>    pages (#0 to #2)
> 
> So the above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks write IOs for page #3 and #4. However, if we try
> to mount a subvolume at this timing, the mount process takes the s_umount
> write lock, blocking the periodic flush from coming in.
> 
> To deal with the problem, shrink the delayed allocation region so that it
> only contains pages that are expected to be written.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> +
> +		if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> +		    ((delalloc_start >> PAGE_SHIFT) <
> +		     (delalloc_end >> PAGE_SHIFT))) {
> +			unsigned long i;
> +			unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> +			for (i = delalloc_start >> PAGE_SHIFT;
> +			     i <= end_index; i++)
> +				if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> +						 PAGECACHE_TAG_TOWRITE))
> +					break;
> +
> +			if (i <= end_index) {
> +				u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> +				if (i == delalloc_start >> PAGE_SHIFT)
> +					unlock_start += PAGE_SIZE;
> +
> +				unlock_extent(tree, unlock_start, delalloc_end);
> +				__unlock_for_delalloc(inode, page, unlock_start,
> +						      delalloc_end);
> +				delalloc_end = unlock_start - 1;
> +			}
> +		}
> +

Helper please.  Really for all this hmzoned stuff I want it segregated as much
as possible so when I'm debugging or cleaning other stuff up I want to easily be
able to say "oh this is for zoned devices, it doesn't matter."  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 18/19] btrfs: support dev-replace in HMZONED mode
  2019-06-07 13:10 ` [PATCH 18/19] btrfs: support dev-replace " Naohiro Aota
@ 2019-06-13 14:33   ` Josef Bacik
  2019-06-18  9:14     ` Naohiro Aota
  0 siblings, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-13 14:33 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:24PM +0900, Naohiro Aota wrote:
> Currently, dev-replace copies all the device extents on the source device
> to the target device, and it also clones new incoming write I/Os from users
> to the source device into the target device.
> 
> Cloning incoming IOs can break the sequential write rule on the target
> device. When a write is mapped to the middle of a block group, that I/O is
> directed to the middle of a zone on the target device, which breaks the
> sequential write rule.
> 
> However, the cloning function cannot simply be disabled, since incoming
> I/Os targeting already copied device extents must be cloned so that the I/O
> is executed on the target device.
> 
> We cannot use dev_replace->cursor_{left,right} to determine whether a bio
> is targeting a not yet copied region.  Since there is a time gap between
> finishing btrfs_scrub_dev() and rewriting the mapping tree in
> btrfs_dev_replace_finishing(), we can have a newly allocated device extent
> which is neither cloned (by handle_ops_on_dev_replace) nor copied (by the
> dev-replace process).
> 
> So the point is to copy only already existing device extents. This patch
> introduces mark_block_group_to_copy() to mark existing block groups as
> targets of copying. Then, handle_ops_on_dev_replace() and dev-replace can
> check the flag to do their job.
> 
> This patch also handles the empty regions between used extents. Since
> dev-replace is smart enough to copy only the used extents on the source
> device, we have to fill the gaps to honor the sequential write rule on the
> target device.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/ctree.h       |   1 +
>  fs/btrfs/dev-replace.c |  96 +++++++++++++++++++++++
>  fs/btrfs/extent-tree.c |  32 +++++++-
>  fs/btrfs/scrub.c       | 169 +++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c     |  27 ++++++-
>  5 files changed, 319 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index dad8ea5c3b99..a0be2b96117a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -639,6 +639,7 @@ struct btrfs_block_group_cache {
>  	unsigned int has_caching_ctl:1;
>  	unsigned int removed:1;
>  	unsigned int wp_broken:1;
> +	unsigned int to_copy:1;
>  
>  	int disk_cache_state;
>  
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index fbe5ea2a04ed..5011b5ce0e75 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -263,6 +263,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>  	device->dev_stats_valid = 1;
>  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>  	device->fs_devices = fs_info->fs_devices;
> +	if (bdev_is_zoned(bdev)) {
> +		ret = btrfs_get_dev_zonetypes(device);
> +		if (ret) {
> +			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +			goto error;
> +		}
> +	}
>  	list_add(&device->dev_list, &fs_info->fs_devices->devices);
>  	fs_info->fs_devices->num_devices++;
>  	fs_info->fs_devices->open_devices++;
> @@ -396,6 +403,88 @@ static char* btrfs_dev_name(struct btrfs_device *device)
>  		return rcu_str_deref(device->name);
>  }
>  
> +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
> +				    struct btrfs_device *src_dev)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_key found_key;
> +	struct btrfs_root *root = fs_info->dev_root;
> +	struct btrfs_dev_extent *dev_extent = NULL;
> +	struct btrfs_block_group_cache *cache;
> +	struct extent_buffer *l;
> +	int slot;
> +	int ret;
> +	u64 chunk_offset, length;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	path->reada = READA_FORWARD;
> +	path->search_commit_root = 1;
> +	path->skip_locking = 1;
> +
> +	key.objectid = src_dev->devid;
> +	key.offset = 0ull;
> +	key.type = BTRFS_DEV_EXTENT_KEY;
> +
> +	while (1) {
> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
> +		if (ret < 0)
> +			break;
> +		if (ret > 0) {
> +			if (path->slots[0] >=
> +			    btrfs_header_nritems(path->nodes[0])) {
> +				ret = btrfs_next_leaf(root, path);
> +				if (ret < 0)
> +					break;
> +				if (ret > 0) {
> +					ret = 0;
> +					break;
> +				}
> +			} else {
> +				ret = 0;
> +			}
> +		}
> +
> +		l = path->nodes[0];
> +		slot = path->slots[0];
> +
> +		btrfs_item_key_to_cpu(l, &found_key, slot);
> +
> +		if (found_key.objectid != src_dev->devid)
> +			break;
> +
> +		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
> +			break;
> +
> +		if (found_key.offset < key.offset)
> +			break;
> +
> +		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
> +		length = btrfs_dev_extent_length(l, dev_extent);
> +
> +		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
> +
> +		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
> +		if (!cache)
> +			goto skip;
> +
> +		cache->to_copy = 1;
> +
> +		btrfs_put_block_group(cache);
> +
> +skip:
> +		key.offset = found_key.offset + length;
> +		btrfs_release_path(path);
> +	}
> +
> +	btrfs_free_path(path);
> +
> +	return ret;
> +}
> +
>  static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>  		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
>  		int read_src)
> @@ -439,6 +528,13 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	need_unlock = true;
> +
> +	mutex_lock(&fs_info->chunk_mutex);
> +	ret = mark_block_group_to_copy(fs_info, src_device);
> +	mutex_unlock(&fs_info->chunk_mutex);
> +	if (ret)
> +		return ret;
> +
>  	down_write(&dev_replace->rwsem);
>  	switch (dev_replace->replace_state) {
>  	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ff4d55d6ef04..268365dd9a5d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -29,6 +29,7 @@
>  #include "qgroup.h"
>  #include "ref-verify.h"
>  #include "rcu-string.h"
> +#include "dev-replace.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -2022,7 +2023,31 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>  			if (btrfs_dev_is_sequential(stripe->dev,
>  						    stripe->physical) &&
>  			    stripe->length == stripe->dev->zone_size) {
> -				ret = blkdev_reset_zones(stripe->dev->bdev,
> +				struct btrfs_device *dev = stripe->dev;
> +
> +				ret = blkdev_reset_zones(dev->bdev,
> +							 stripe->physical >>
> +								 SECTOR_SHIFT,
> +							 stripe->length >>
> +								 SECTOR_SHIFT,
> +							 GFP_NOFS);
> +				if (!ret)
> +					discarded_bytes += stripe->length;
> +				else
> +					break;
> +				set_bit(stripe->physical >>
> +					dev->zone_size_shift,
> +					dev->empty_zones);
> +
> +				if (!btrfs_dev_replace_is_ongoing(
> +					    &fs_info->dev_replace) ||
> +				    stripe->dev != fs_info->dev_replace.srcdev)
> +					continue;
> +
> +				/* send to target as well */
> +				dev = fs_info->dev_replace.tgtdev;
> +
> +				ret = blkdev_reset_zones(dev->bdev,

This is unrelated to dev replace, isn't it?  Please make this its own
patch, and its own helper while you are at it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 00/19] btrfs zoned block device support
  2019-06-13 13:46     ` David Sterba
@ 2019-06-14  2:07       ` Naohiro Aota
  2019-06-17  2:44       ` Damien Le Moal
  1 sibling, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-14  2:07 UTC (permalink / raw)
  To: dsterba
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 22:45, David Sterba wrote:
> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runner [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
> 
> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.

Actually, this is quite a new site. ;-)

> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.

You can use "targetcli <directory> <command> [<args> ...]" format, so
you can call e.g.

targetcli /backstores/user:zbc create name=foo size=10G cfgstring=model-HM/zsize-256/conv-1@/mnt/nvme/disk0.raw

> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize it as a zoned device (just a
> QEMU harddisk). So it seems the emulation must be done inside the VM.

Oops, QEMU hides the details.

In this case, you can try exposing the ZBC device via iSCSI.

On the host:
(after creating the ZBC backstores)
# sudo targetcli /iscsi create
Created target iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
# TARGET="iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c"

(WARN: Allow any node to connect without any auth)
# targetcli /iscsi/${TARGET}/tpg1 set attribute generate_node_acls=1
Parameter generate_node_acls is now '1'.
( or you can explicitly allow an initiator)
# TCMU_INITIATOR=iqn.2018-07....
# targetcli /iscsi/${TARGET}/tpg1/acls create ${TCMU_INITIATOR}

(for each backend)
# targetcli /iscsi/${TARGET}/tpg1/luns create /backstores/user:zbc/foo
Created LUN 0.

Then, you can login to the iSCSI on the KVM guest like:

# iscsiadm -m discovery -t st -p $HOST_IP
127.0.0.1:3260,1 iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c
# iscsiadm -m node -l -T ${TARGET}
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260]
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260] successful.


* Re: [PATCH v2 00/19] btrfs zoned block device support
  2019-06-13 13:46     ` David Sterba
  2019-06-14  2:07       ` Naohiro Aota
@ 2019-06-17  2:44       ` Damien Le Moal
  1 sibling, 0 replies; 79+ messages in thread
From: Damien Le Moal @ 2019-06-17  2:44 UTC (permalink / raw)
  To: dsterba, Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

David,

On 2019/06/13 22:45, David Sterba wrote:
> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>>> btrfs zoned block device support
>>>>
>>>> This series adds zoned block device support to btrfs.
>>>
>>> The overall design sounds ok.
>>>
>>> I skimmed through the patches and the biggest task I see is how to make
>>> the hmzoned adjustments and branches less visible, ie. there are too
>>> many if (hmzoned) { do something } standing out. But that's merely a
>>> matter of wrappers and maybe an abstraction here and there.
>>
>> Sure. I'll add some more abstractions in the next version.
> 
> Ok, I'll reply to the patches with specific things.
> 
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runner [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
> 
> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.

The site went online 4 days ago :) We will advertise it whenever we can. This is
intended to document all things "zoned block device" including Btrfs support,
when we get it finished :)

> 
> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.

Yes, this is not exactly obvious, but that is how most automation with
Linux iSCSI is done.

> 
> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize it as a zoned device (just a
> QEMU harddisk). So it seems the emulation must be done inside the VM.
> 

What driver did you use for the drive? virtio block? I have not touched
that driver nor the qemu side, so zoned block device support is likely
missing. I will add it. That would be especially useful for testing with a
real drive. In the case
of tcmu runner, the initiator can be started in the guest directly and the
target emulation done either in the guest if loopback is used, or on the host
using iscsi connection. The former is what we use all the time and so is well
tested. I have to admit that testing with iscsi is lacking... Will add that to
the todo list.

Best regards.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-13 14:14   ` Josef Bacik
@ 2019-06-17  3:16     ` Damien Le Moal
  2019-06-18  0:00       ` David Sterba
  2019-06-18 13:33       ` Josef Bacik
  0 siblings, 2 replies; 79+ messages in thread
From: Damien Le Moal @ 2019-06-17  3:16 UTC (permalink / raw)
  To: Josef Bacik, Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

Josef,

On 2019/06/13 23:15, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
>> Sequential allocation is not enough to maintain sequential delivery of
>> write IOs to the device. Various features (async compress, async checksum,
>> ...) of btrfs affect the ordering of the IOs. This patch introduces a
>> submit buffer to sort the WRITE bios belonging to a block group and
>> submit them sequentially, in increasing block address order, to achieve
>> sequential write sequences with __btrfs_map_bio().
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> I hate everything about this.  Can't we just use the plugging infrastructure for
> this and then make sure it re-orders the bios before submitting them?  Also
> what's to prevent the block layer scheduler from re-arranging these io's?
> Thanks,

The block I/O scheduler reorders requests in LBA order, but that happens for a
newly inserted request against pending requests. If there are no pending
requests because all requests were already issued, no ordering happens,
and even worse, if the drive queue is not full yet (e.g. there are free
tags), then the newly inserted request will be dispatched almost
immediately, preventing any reordering against subsequent incoming write
requests.

The other problem is that the mq-deadline scheduler does not track zone WP
position. Write request issuing is done regardless of the current WP value,
solely based on LBA ordering. This means that mq-deadline will not prevent
out-of-order, or rather, unaligned write requests. These will not be detected
and will be dispatched whenever possible. The reasons for this are that:
1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
that management at the block IO scheduler level is inefficient.
2) Adding zone WP management at the block IO scheduler level would also need a
write error processing path to resync the WP value in case of failed writes. But
the user/FS also needs that anyway. Again, duplicated functionality.
3) The block layer will need a timeout to force issue or cancel pending
unaligned write requests. This is necessary in case the drive user stops issuing
writes (for whatever reasons) or the scheduler is being switched. This would
unnecessarily cause write I/O errors or cause deadlocks if the request queue
quiesce mode is entered at the wrong time (and I do not see a good way to deal
with that).
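
To make points 1) and 2) concrete, here is a minimal userspace sketch of the per-zone write pointer bookkeeping the filesystem has to do anyway (the struct and helper names are hypothetical, not actual btrfs code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-zone state, as a filesystem might cache it. */
struct zone_state {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone length in sectors */
	uint64_t wp;	/* next sector that may be written */
};

/* A write to a sequential zone is valid only if it starts at the WP. */
static bool zone_write_ok(const struct zone_state *z, uint64_t pos,
			  uint64_t nr)
{
	return pos == z->wp && pos + nr <= z->start + z->len;
}

/* Advance the cached WP after a successful write. */
static void zone_write_done(struct zone_state *z, uint64_t nr)
{
	z->wp += nr;
}

/*
 * Point 2): after a failed write the on-disk WP is unknown, so the
 * filesystem re-reads it (e.g. with a report zones command) and
 * resynchronizes its cached value.
 */
static void zone_wp_resync(struct zone_state *z, uint64_t reported_wp)
{
	z->wp = reported_wp;
}
```

Teaching the block IO scheduler about zones would mean mirroring exactly this bookkeeping, which is the duplication being argued against.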

blk-mq is already complicated enough. Adding this to the block IO scheduler will
unnecessarily complicate things further for no real benefits. I would like to
point out the dm-zoned device mapper and f2fs which are both already dealing
with write ordering and write error processing directly. Both are fairly
straightforward but completely different, each optimized for its own structure.

Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
integrate and handle write ordering "a la btrfs". Would creating a different
"hmzoned" btrfs IO scheduler help address your concerns?

Best regards.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 12/19] btrfs: expire submit buffer on timeout
  2019-06-13 14:15   ` Josef Bacik
@ 2019-06-17  3:19     ` Damien Le Moal
  0 siblings, 0 replies; 79+ messages in thread
From: Damien Le Moal @ 2019-06-17  3:19 UTC (permalink / raw)
  To: Josef Bacik, Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

On 2019/06/13 23:15, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:18PM +0900, Naohiro Aota wrote:
>> It is possible to have bios stalled in the submit buffer due to some bug or
>> device problem. In such a situation, btrfs stops working, waiting for
>> the buffered bios to complete. To avoid such a hang, add a worker that
>> will cancel the stalled bios after a timeout.
>>
> 
> The block layer does this with its request timeouts, right?  So it'll
> time out and we'll get an EIO?  If that's not working then we need to fix
> the block layer.  Thanks,

Josef,

The block layer timeout is started only when the request is dispatched. The
timeout is not started on BIO/request allocation and so will not trigger for
bios stalled inside the btrfs scheduler.
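
As a rough illustration (a userspace sketch with hypothetical names, not the actual patch), the filesystem-level expiry has to be keyed off the time the bio entered the submit buffer, because no block layer timer is running yet at that point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry of a filesystem-level submit buffer. */
struct buffered_bio {
	uint64_t queued_at;	/* tick when the bio entered the buffer */
	bool expired;		/* set when the expiry worker gives up */
};

/* Hypothetical expiry threshold, in ticks. */
#define SUBMIT_BUFFER_TIMEOUT 30

/*
 * The block layer arms its request timeout only at dispatch, so a bio
 * parked in the submit buffer is invisible to it; the filesystem's own
 * worker must detect the stall and fail the bio (e.g. with -EIO).
 */
static bool submit_buffer_check_expiry(struct buffered_bio *bb, uint64_t now)
{
	if (!bb->expired && now - bb->queued_at > SUBMIT_BUFFER_TIMEOUT)
		bb->expired = true;
	return bb->expired;
}
```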


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
  2019-06-13 13:58   ` Josef Bacik
  2019-06-13 13:58   ` Josef Bacik
@ 2019-06-17 18:57   ` David Sterba
  2019-06-18  6:42     ` Naohiro Aota
  2 siblings, 1 reply; 79+ messages in thread
From: David Sterba @ 2019-06-17 18:57 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
> avoid costly run-time zone report commands to test the device zones type
> during block allocation, attach the seqzones bitmap to the device structure
> to indicate if a zone is sequential or accepts random writes.
> 
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  33 +++++++++++
>  2 files changed, 176 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1c2a6e4b39da..b673178718e3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path,
>  	return ret;
>  }
>  
> +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,

Please drop __ from the name; the pattern where this naming makes sense
does not apply here. It's for cases where the function without underscores
does some extra stuff like locking and the underscored one does not, and
both are used in some context. I haven't found btrfs_get_dev_zones in this
or other patches.

> +				 struct blk_zone **zones,
> +				 unsigned int *nr_zones, gfp_t gfp_mask)
> +{
> +	struct blk_zone *z = *zones;

This may apply to more places; please don't use a single letter for
anything other than 'i' and similar. 'zone' would be suitable.

> +	int ret;
> +
> +	if (!z) {
> +		z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
> +		if (!z)
> +			return -ENOMEM;
> +	}
> +
> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
> +				  z, nr_zones, gfp_mask);
> +	if (ret != 0) {
> +		btrfs_err(device->fs_info, "Get zone at %llu failed %d\n",

No capital letter and no "\n" at the end of the message; that's added by
btrfs_err.

> +			  pos, ret);
> +		return ret;
> +	}
> +
> +	*zones = z;
> +
> +	return 0;
> +}
> +
> +static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device)
> +{
> +	kfree(device->seq_zones);
> +	kfree(device->empty_zones);
> +	device->seq_zones = NULL;
> +	device->empty_zones = NULL;
> +	device->nr_zones = 0;
> +	device->zone_size = 0;
> +	device->zone_size_shift = 0;
> +}
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +		       struct blk_zone *zone, gfp_t gfp_mask)
> +{
> +	unsigned int nr_zones = 1;
> +	int ret;
> +
> +	ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
> +	if (ret != 0 || !nr_zones)
> +		return ret ? ret : -EIO;
> +
> +	return 0;
> +}
> +
> +int btrfs_get_dev_zonetypes(struct btrfs_device *device)
> +{
> +	struct block_device *bdev = device->bdev;
> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
> +	sector_t sector = 0;
> +	struct blk_zone *zones = NULL;
> +	unsigned int i, n = 0, nr_zones;
> +	int ret;
> +
> +	device->zone_size = 0;
> +	device->zone_size_shift = 0;
> +	device->nr_zones = 0;
> +	device->seq_zones = NULL;
> +	device->empty_zones = NULL;
> +
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
> +	device->zone_size_shift = ilog2(device->zone_size);
> +	device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> +		device->nr_zones++;
> +
> +	device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> +				    sizeof(*device->seq_zones), GFP_KERNEL);

What's the expected range for the allocation size? There's one bit per
zone, so one 4KiB page can hold up to 32768 zones; with 1GiB zones that's
32TiB of space on the drive. Ok, that seems safe for now.
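
That back-of-the-envelope check can be reproduced with the same BITS_TO_LONGS() arithmetic the patch relies on (re-declared here as userspace stand-ins for a standalone sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's bitmap sizing macros. */
#define BITS_PER_LONG (8 * sizeof(long))
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Bytes allocated for a one-bit-per-zone bitmap such as seq_zones. */
static uint64_t zone_bitmap_bytes(uint32_t nr_zones)
{
	return BITS_TO_LONGS((uint64_t)nr_zones) * sizeof(long);
}
```

With 32768 zones the bitmap fits in a single 4KiB page; even 131072 zones (32TiB of 256MiB zones) only needs 16KiB.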

> +	if (!device->seq_zones)
> +		return -ENOMEM;
> +
> +	device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> +				      sizeof(*device->empty_zones), GFP_KERNEL);
> +	if (!device->empty_zones)
> +		return -ENOMEM;

This leaks device->seq_zones from the current context, though there are
calls to btrfs_destroy_dev_zonetypes that would clean it up eventually.
It'd be better to clean up here instead of relying on the caller.

> +
> +#define BTRFS_REPORT_NR_ZONES   4096

Please move this to the beginning of the file if it is just local to the
.c file, and put a short comment explaining its meaning.

> +
> +	/* Get zones type */
> +	while (sector < nr_sectors) {
> +		nr_zones = BTRFS_REPORT_NR_ZONES;
> +		ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
> +					    &zones, &nr_zones, GFP_KERNEL);
> +		if (ret != 0 || !nr_zones) {
> +			if (!ret)
> +				ret = -EIO;
> +			goto out;
> +		}
> +
> +		for (i = 0; i < nr_zones; i++) {
> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> +				set_bit(n, device->seq_zones);
> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> +				set_bit(n, device->empty_zones);
> +			sector = zones[i].start + zones[i].len;
> +			n++;
> +		}
> +	}
> +
> +	if (n != device->nr_zones) {
> +		btrfs_err(device->fs_info,
> +			  "Inconsistent number of zones (%u / %u)\n", n,

lowercase and no "\n"

> +			  device->nr_zones);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	btrfs_info(device->fs_info,
> +		   "host-%s zoned block device, %u zones of %llu sectors\n",
> +		   bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> +		   device->nr_zones, device->zone_size >> SECTOR_SHIFT);
> +
> +out:
> +	kfree(zones);
> +
> +	if (ret)
> +		btrfs_destroy_dev_zonetypes(device);
> +
> +	return ret;
> +}
> +
>  static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>  			struct btrfs_device *device, fmode_t flags,
>  			void *holder)
> @@ -842,6 +971,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>  	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>  	device->mode = flags;
>  
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zonetypes(device);
> +	if (ret != 0)
> +		goto error_brelse;
> +
>  	fs_devices->open_devices++;
>  	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>  	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
> @@ -1243,6 +1377,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
>  	}
>  
>  	blkdev_put(device->bdev, device->mode);
> +	btrfs_destroy_dev_zonetypes(device);
>  }
>  
>  static void btrfs_close_one_device(struct btrfs_device *device)
> @@ -2664,6 +2799,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	mutex_unlock(&fs_info->chunk_mutex);
>  	mutex_unlock(&fs_devices->device_list_mutex);
>  
> +	/* Get zone type information of zoned block devices */
> +	ret = btrfs_get_dev_zonetypes(device);
> +	if (ret) {
> +		btrfs_abort_transaction(trans, ret);
> +		goto error_sysfs;

Can this be moved before the locked section so that any failure does not
lead to transaction abort?

The function returns ENOMEM, which does not necessarily need to kill the
filesystem, and EIO, which means that some faulty device is being added to
the filesystem, but this again should fail early.

> +	}
> +
>  	if (seeding_dev) {
>  		mutex_lock(&fs_info->chunk_mutex);
>  		ret = init_first_rw_device(trans);
> @@ -2729,6 +2871,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	return ret;
>  
>  error_sysfs:
> +	btrfs_destroy_dev_zonetypes(device);
>  	btrfs_sysfs_rm_device_link(fs_devices, device);
>  	mutex_lock(&fs_info->fs_devices->device_list_mutex);
>  	mutex_lock(&fs_info->chunk_mutex);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b8a0e8d0672d..1599641e216c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -62,6 +62,16 @@ struct btrfs_device {
>  
>  	struct block_device *bdev;
>  
> +	/*
> +	 * Number of zones, zone size and types of zones if bdev is a
> +	 * zoned block device.
> +	 */
> +	u64 zone_size;
> +	u8  zone_size_shift;

So the zone_size is always a power of two? I may be missing something, but
I wonder if the calculations based on shifts are safe.

> +	u32 nr_zones;
> +	unsigned long *seq_zones;
> +	unsigned long *empty_zones;
> +
>  	/* the mode sent to blkdev_get */
>  	fmode_t mode;
>  
> @@ -476,6 +486,28 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>  int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>  struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
>  				       u64 logical, u64 length);
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> +		       struct blk_zone *zone, gfp_t gfp_mask);
> +
> +static inline int btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> +{
> +	unsigned int zno = pos >> device->zone_size_shift;

The types don't match here: pos is u64 and I'm not sure it's guaranteed
that the value after the shift will fit in unsigned int.
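
The truncation concern is easy to demonstrate in a standalone sketch (hypothetical helper names; in practice a position this large would also overflow the u32 nr_zones, which is presumably why it works today):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the questionable pattern: u64 position, 32-bit zone number. */
static uint32_t zone_number_u32(uint64_t pos, uint8_t zone_size_shift)
{
	return (uint32_t)(pos >> zone_size_shift);	/* may drop high bits */
}

/* The safe variant keeps the full width of the shifted value. */
static uint64_t zone_number_u64(uint64_t pos, uint8_t zone_size_shift)
{
	return pos >> zone_size_shift;
}
```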

> +
> +	if (!device->seq_zones)
> +		return 1;
> +
> +	return test_bit(zno, device->seq_zones);
> +}
> +
> +static inline int btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
> +{
> +	unsigned int zno = pos >> device->zone_size_shift;

Same.

> +
> +	if (!device->empty_zones)
> +		return 0;
> +
> +	return test_bit(zno, device->empty_zones);
> +}
>  
>  static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
>  				      int index)
> @@ -568,5 +600,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
>  
>  int btrfs_bg_type_to_factor(u64 flags);
>  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
> +int btrfs_get_dev_zonetypes(struct btrfs_device *device);
>  
>  #endif
> -- 
> 2.21.0


* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-07 13:10 ` [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
  2019-06-13 14:07   ` Josef Bacik
@ 2019-06-17 22:30   ` David Sterba
  2019-06-18  8:49     ` Naohiro Aota
  1 sibling, 1 reply; 79+ messages in thread
From: David Sterba @ 2019-06-17 22:30 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
> On HMZONED drives, writes must always be sequential and directed at a block
> group zone write pointer position. Thus, block allocation in a block group
> must also be done sequentially using an allocation pointer equal to the
> block group zone write pointer plus the number of blocks allocated but not
> yet written.
> 
> The sequential allocation function find_free_extent_seq() bypasses the
> checks in find_free_extent() and increases the reserved byte counter by
> itself. It is impossible to revert a once-allocated region in the
> sequential allocation, since it might race with other allocations and
> leave an allocation hole, which breaks the sequential write rule.
> 
> Furthermore, this commit introduces two new variables to struct
> btrfs_block_group_cache. "wp_broken" indicates that the write pointer is
> broken (e.g. not synced on a RAID1 block group) and marks that block
> group read only. "unusable" keeps track of the size of once allocated
> then freed regions. Such regions are never usable until resetting the
> underlying zones.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/ctree.h            |  24 +++
>  fs/btrfs/extent-tree.c      | 378 ++++++++++++++++++++++++++++++++++--
>  fs/btrfs/free-space-cache.c |  33 ++++
>  fs/btrfs/free-space-cache.h |   5 +
>  4 files changed, 426 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 6c00101407e4..f4bcd2a6ec12 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -582,6 +582,20 @@ struct btrfs_full_stripe_locks_tree {
>  	struct mutex lock;
>  };
>  
> +/* Block group allocation types */
> +enum btrfs_alloc_type {
> +
> +	/* Regular first fit allocation */
> +	BTRFS_ALLOC_FIT		= 0,
> +
> +	/*
> +	 * Sequential allocation: this is for HMZONED mode and
> +	 * will result in ignoring free space before a block
> +	 * group allocation offset.

Please format the comments to 80 columns

> +	 */
> +	BTRFS_ALLOC_SEQ		= 1,
> +};
> +
>  struct btrfs_block_group_cache {
>  	struct btrfs_key key;
>  	struct btrfs_block_group_item item;
> @@ -592,6 +606,7 @@ struct btrfs_block_group_cache {
>  	u64 reserved;
>  	u64 delalloc_bytes;
>  	u64 bytes_super;
> +	u64 unusable;

'unusable' is specific to the zones, so 'zone_unusable' would make it
clear. The terminology around space is already confusing (we have unused,
free, reserved, allocated, slack).

>  	u64 flags;
>  	u64 cache_generation;
>  
> @@ -621,6 +636,7 @@ struct btrfs_block_group_cache {
>  	unsigned int iref:1;
>  	unsigned int has_caching_ctl:1;
>  	unsigned int removed:1;
> +	unsigned int wp_broken:1;
>  
>  	int disk_cache_state;
>  
> @@ -694,6 +710,14 @@ struct btrfs_block_group_cache {
>  
>  	/* Record locked full stripes for RAID5/6 block group */
>  	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> +	/*
> +	 * Allocation offset for the block group to implement sequential
> +	 * allocation. This is used only with HMZONED mode enabled and if
> +	 * the block group resides on a sequential zone.
> +	 */
> +	enum btrfs_alloc_type alloc_type;
> +	u64 alloc_offset;
>  };
>  
>  /* delayed seq elem */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 363db58f56b8..ebd0d6eae038 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -28,6 +28,7 @@
>  #include "sysfs.h"
>  #include "qgroup.h"
>  #include "ref-verify.h"
> +#include "rcu-string.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -590,6 +591,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>  	struct btrfs_caching_control *caching_ctl;
>  	int ret = 0;
>  
> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
>  	if (!caching_ctl)
>  		return -ENOMEM;
> @@ -6555,6 +6558,19 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
>  	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
>  }
>  
> +static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
> +				       u64 ram_bytes, u64 num_bytes,
> +				       int delalloc)
> +{
> +	struct btrfs_space_info *space_info = cache->space_info;
> +
> +	cache->reserved += num_bytes;
> +	space_info->bytes_reserved += num_bytes;
> +	update_bytes_may_use(space_info, -ram_bytes);
> +	if (delalloc)
> +		cache->delalloc_bytes += num_bytes;
> +}
> +
>  /**
>   * btrfs_add_reserved_bytes - update the block_group and space info counters
>   * @cache:	The cache we are manipulating
> @@ -6573,17 +6589,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>  	struct btrfs_space_info *space_info = cache->space_info;
>  	int ret = 0;
>  
> +	/* should handled by find_free_extent_seq */
> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	spin_lock(&space_info->lock);
>  	spin_lock(&cache->lock);
> -	if (cache->ro) {
> +	if (cache->ro)
>  		ret = -EAGAIN;
> -	} else {
> -		cache->reserved += num_bytes;
> -		space_info->bytes_reserved += num_bytes;
> -		update_bytes_may_use(space_info, -ram_bytes);
> -		if (delalloc)
> -			cache->delalloc_bytes += num_bytes;
> -	}
> +	else
> +		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
> +					   delalloc);
>  	spin_unlock(&cache->lock);
>  	spin_unlock(&space_info->lock);
>  	return ret;
> @@ -6701,9 +6716,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  			cache = btrfs_lookup_block_group(fs_info, start);
>  			BUG_ON(!cache); /* Logic error */
>  
> -			cluster = fetch_cluster_info(fs_info,
> -						     cache->space_info,
> -						     &empty_cluster);
> +			if (cache->alloc_type == BTRFS_ALLOC_FIT)
> +				cluster = fetch_cluster_info(fs_info,
> +							     cache->space_info,
> +							     &empty_cluster);
> +			else
> +				cluster = NULL;
> +
>  			empty_cluster <<= 1;
>  		}
>  
> @@ -6743,7 +6762,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>  		space_info->max_extent_size = 0;
>  		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>  			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -		if (cache->ro) {
> +		if (cache->ro || cache->alloc_type == BTRFS_ALLOC_SEQ) {
> +			/* Needs reset before reuse in an ALLOC_SEQ block group */
>  			space_info->bytes_readonly += len;
>  			readonly = true;
>  		}
> @@ -7588,6 +7608,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg,
>  	return 0;
>  }
>  
> +/*
> + * Simple allocator for sequential-only block groups. It only allows
> + * sequential allocation. There is no need to play with trees. This
> + * function also reserves the bytes as in btrfs_add_reserved_bytes.
> + */
> +
> +static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
> +				struct find_free_extent_ctl *ffe_ctl)
> +{
> +	struct btrfs_space_info *space_info = cache->space_info;
> +	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
> +	u64 start = cache->key.objectid;
> +	u64 num_bytes = ffe_ctl->num_bytes;
> +	u64 avail;
> +	int ret = 0;
> +
> +	/* Sanity check */
> +	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
> +		return 1;
> +
> +	spin_lock(&space_info->lock);
> +	spin_lock(&cache->lock);
> +
> +	if (cache->ro) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}
> +
> +	spin_lock(&ctl->tree_lock);
> +	avail = cache->key.offset - cache->alloc_offset;
> +	if (avail < num_bytes) {
> +		ffe_ctl->max_extent_size = avail;
> +		spin_unlock(&ctl->tree_lock);
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	ffe_ctl->found_offset = start + cache->alloc_offset;
> +	cache->alloc_offset += num_bytes;
> +	ctl->free_space -= num_bytes;
> +	spin_unlock(&ctl->tree_lock);
> +
> +	BUG_ON(!IS_ALIGNED(ffe_ctl->found_offset,
> +			   cache->fs_info->stripesize));
> +	ffe_ctl->search_start = ffe_ctl->found_offset;
> +	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
> +				   ffe_ctl->delalloc);
> +
> +out:
> +	spin_unlock(&cache->lock);
> +	spin_unlock(&space_info->lock);
> +	return ret;
> +}
> +
>  /*
>   * Return >0 means caller needs to re-search for free extent
>   * Return 0 means we have the needed free extent.
> @@ -7889,6 +7963,16 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
>  			goto loop;
>  
> +		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
> +			ret = find_free_extent_seq(block_group, &ffe_ctl);
> +			if (ret)
> +				goto loop;
> +			/* btrfs_find_space_for_alloc_seq should ensure
> +			 * that everything is OK and reserve the extent.
> +			 */

Please use the

/*
 * comment
 */

style

> +			goto nocheck;
> +		}
> +
>  		/*
>  		 * Ok we want to try and use the cluster allocator, so
>  		 * lets look there
> @@ -7944,6 +8028,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  					     num_bytes);
>  			goto loop;
>  		}
> +nocheck:
>  		btrfs_inc_block_group_reservations(block_group);
>  
>  		/* we are all good, lets return */
> @@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
>  	}
>  
>  	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
> -		    cache->bytes_super - btrfs_block_group_used(&cache->item);
> +		    cache->bytes_super - cache->unusable -
> +		    btrfs_block_group_used(&cache->item);
>  	sinfo_used = btrfs_space_info_used(sinfo, true);
>  
>  	if (sinfo_used + num_bytes + min_allocable_bytes <=
> @@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
>  	if (!--cache->ro) {
>  		num_bytes = cache->key.offset - cache->reserved -
>  			    cache->pinned - cache->bytes_super -
> +			    cache->unusable -
>  			    btrfs_block_group_used(&cache->item);
>  		sinfo->bytes_readonly -= num_bytes;
>  		list_del_init(&cache->ro_list);
> @@ -10200,11 +10287,240 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
>  	}
>  }
>  
> +static int
> +btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
> +{
> +	struct btrfs_fs_info *fs_info = cache->fs_info;
> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	struct btrfs_device *device;
> +	u64 logical = cache->key.objectid;
> +	u64 length = cache->key.offset;
> +	u64 physical = 0;
> +	int ret, alloc_type;
> +	int i, j;
> +	u64 *alloc_offsets = NULL;
> +
> +#define WP_MISSING_DEV ((u64)-1)

Please move the definition to the beginning of the file

> +
> +	/* Sanity check */
> +	if (!IS_ALIGNED(length, fs_info->zone_size)) {
> +		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
> +			  logical, length);
> +		return -EIO;
> +	}
> +
> +	/* Get the chunk mapping */
> +	em_tree = &fs_info->mapping_tree.map_tree;
> +	read_lock(&em_tree->lock);
> +	em = lookup_extent_mapping(em_tree, logical, length);
> +	read_unlock(&em_tree->lock);
> +
> +	if (!em)
> +		return -EINVAL;
> +
> +	map = em->map_lookup;
> +
> +	/*
> +	 * Get the zone type: if the group is mapped to a non-sequential zone,
> +	 * there is no need for the allocation offset (fit allocation is OK).
> +	 */
> +	alloc_type = -1;
> +	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
> +				GFP_NOFS);
> +	if (!alloc_offsets) {
> +		free_extent_map(em);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < map->num_stripes; i++) {
> +		int is_sequential;

Please use bool instead of int

> +		struct blk_zone zone;
> +
> +		device = map->stripes[i].dev;
> +		physical = map->stripes[i].physical;
> +
> +		if (device->bdev == NULL) {
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		}
> +
> +		is_sequential = btrfs_dev_is_sequential(device, physical);
> +		if (alloc_type == -1)
> +			alloc_type = is_sequential ?
> +					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
> +
> +		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
> +		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
> +			btrfs_err(fs_info, "found block group of mixed zone types");
> +			ret = -EIO;
> +			goto out;
> +		}
> +
> +		if (!is_sequential)
> +			continue;
> +
> +		/* this zone will be used for allocation, so mark this
> +		 * zone non-empty
> +		 */
> +		clear_bit(physical >> device->zone_size_shift,
> +			  device->empty_zones);
> +
> +		/*
> +		 * The group is mapped to a sequential zone. Get the zone write
> +		 * pointer to determine the allocation offset within the zone.
> +		 */
> +		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
> +		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
> +		if (ret == -EIO || ret == -EOPNOTSUPP) {
> +			ret = 0;
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			continue;
> +		} else if (ret) {
> +			goto out;
> +		}
> +
> +		switch (zone.cond) {
> +		case BLK_ZONE_COND_OFFLINE:
> +		case BLK_ZONE_COND_READONLY:
> +			btrfs_err(fs_info, "Offline/readonly zone %llu",
> +				  physical >> device->zone_size_shift);
> +			alloc_offsets[i] = WP_MISSING_DEV;
> +			break;
> +		case BLK_ZONE_COND_EMPTY:
> +			alloc_offsets[i] = 0;
> +			break;
> +		case BLK_ZONE_COND_FULL:
> +			alloc_offsets[i] = fs_info->zone_size;
> +			break;
> +		default:
> +			/* Partially used zone */
> +			alloc_offsets[i] =
> +				((zone.wp - zone.start) << SECTOR_SHIFT);
> +			break;
> +		}
> +	}
> +
> +	if (alloc_type == BTRFS_ALLOC_FIT)
> +		goto out;
> +
> +	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +	case 0: /* single */
> +	case BTRFS_BLOCK_GROUP_DUP:
> +	case BTRFS_BLOCK_GROUP_RAID1:
> +		cache->alloc_offset = WP_MISSING_DEV;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			if (alloc_offsets[i] == WP_MISSING_DEV)
> +				continue;
> +			if (cache->alloc_offset == WP_MISSING_DEV)
> +				cache->alloc_offset = alloc_offsets[i];
> +			if (alloc_offsets[i] == cache->alloc_offset)
> +				continue;
> +
> +			btrfs_err(fs_info,
> +				  "write pointer mismatch: block group %llu",
> +				  logical);
> +			cache->wp_broken = 1;
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID0:
> +		cache->alloc_offset = 0;
> +		for (i = 0; i < map->num_stripes; i++) {
> +			if (alloc_offsets[i] == WP_MISSING_DEV) {
> +				btrfs_err(fs_info,
> +					  "cannot recover write pointer: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			if (alloc_offsets[0] < alloc_offsets[i]) {
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			cache->alloc_offset += alloc_offsets[i];
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID10:
> +		/*
> +		 * Pass1: check write pointer of RAID1 level: each pointer
> +		 * should be equal.
> +		 */
> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
> +			int base = i*map->sub_stripes;

spaces around binary operators

			int base = i * map->sub_stripes;

> +			u64 offset = WP_MISSING_DEV;
> +
> +			for (j = 0; j < map->sub_stripes; j++) {
> +				if (alloc_offsets[base+j] == WP_MISSING_DEV)

here and below

> +					continue;
> +				if (offset == WP_MISSING_DEV)
> +					offset = alloc_offsets[base+j];
> +				if (alloc_offsets[base+j] == offset)
> +					continue;
> +
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +			}
> +			for (j = 0; j < map->sub_stripes; j++)
> +				alloc_offsets[base+j] = offset;
> +		}
> +
> +		/* Pass2: check write pointer of RAID1 level */
> +		cache->alloc_offset = 0;
> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
> +			int base = i*map->sub_stripes;
> +
> +			if (alloc_offsets[base] == WP_MISSING_DEV) {
> +				btrfs_err(fs_info,
> +					  "cannot recover write pointer: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			if (alloc_offsets[0] < alloc_offsets[base]) {
> +				btrfs_err(fs_info,
> +					  "write pointer mismatch: block group %llu",
> +					  logical);
> +				cache->wp_broken = 1;
> +				continue;
> +			}
> +
> +			cache->alloc_offset += alloc_offsets[base];
> +		}
> +		break;
> +	case BTRFS_BLOCK_GROUP_RAID5:
> +	case BTRFS_BLOCK_GROUP_RAID6:
> +		/* RAID5/6 is not supported yet */
> +	default:
> +		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
> +			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +out:
> +	cache->alloc_type = alloc_type;
> +	kfree(alloc_offsets);
> +	free_extent_map(em);
> +
> +	return ret;
> +}
> +
>  static struct btrfs_block_group_cache *
>  btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>  			       u64 start, u64 size)
>  {
>  	struct btrfs_block_group_cache *cache;
> +	int ret;
>  
>  	cache = kzalloc(sizeof(*cache), GFP_NOFS);
>  	if (!cache)
> @@ -10238,6 +10554,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>  	atomic_set(&cache->trimming, 0);
>  	mutex_init(&cache->free_space_lock);
>  	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
> +	cache->alloc_type = BTRFS_ALLOC_FIT;
> +	cache->alloc_offset = 0;
> +
> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
> +		ret = btrfs_get_block_group_alloc_offset(cache);
> +		if (ret) {
> +			kfree(cache);
> +			return NULL;
> +		}
> +	}
>  
>  	return cache;
>  }
> @@ -10310,6 +10636,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  	int need_clear = 0;
>  	u64 cache_gen;
>  	u64 feature;
> +	u64 unusable;
>  	int mixed;
>  
>  	feature = btrfs_super_incompat_flags(info->super_copy);
> @@ -10415,6 +10742,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  			free_excluded_extents(cache);
>  		}
>  
> +		switch (cache->alloc_type) {
> +		case BTRFS_ALLOC_FIT:
> +			unusable = cache->bytes_super;
> +			break;
> +		case BTRFS_ALLOC_SEQ:
> +			WARN_ON(cache->bytes_super != 0);
> +			unusable = cache->alloc_offset -
> +				btrfs_block_group_used(&cache->item);
> +			/* we only need ->free_space in ALLOC_SEQ BGs */
> +			cache->last_byte_to_unpin = (u64)-1;
> +			cache->cached = BTRFS_CACHE_FINISHED;
> +			cache->free_space_ctl->free_space =
> +				cache->key.offset - cache->alloc_offset;
> +			cache->unusable = unusable;
> +			free_excluded_extents(cache);
> +			break;
> +		default:
> +			BUG();

An unexpected allocation type was found here; this needs a message and
proper error handling. btrfs_read_block_groups is called from the mount
path, so recovery should be possible.
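For illustration, a minimal userspace sketch of the suggested recovery path (not btrfs code; the constants and the error value are assumptions for the example):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical mirror of the patch's allocation types. */
enum { BTRFS_ALLOC_FIT, BTRFS_ALLOC_SEQ };

#define EXAMPLE_EINVAL 22	/* stand-in for -EINVAL */

/*
 * Instead of BUG() on an unexpected allocation type, log a message and
 * return an error so the mount path can unwind cleanly.
 */
static int check_alloc_type(int alloc_type)
{
	switch (alloc_type) {
	case BTRFS_ALLOC_FIT:
	case BTRFS_ALLOC_SEQ:
		return 0;
	default:
		fprintf(stderr, "unknown allocation type %d\n", alloc_type);
		return -EXAMPLE_EINVAL;
	}
}
```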

> +		}
> +
>  		ret = btrfs_add_block_group_cache(info, cache);
>  		if (ret) {
>  			btrfs_remove_free_space_cache(cache);
> @@ -10425,7 +10772,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  		trace_btrfs_add_block_group(info, cache, 0);
>  		update_space_info(info, cache->flags, found_key.offset,
>  				  btrfs_block_group_used(&cache->item),
> -				  cache->bytes_super, &space_info);
> +				  unusable, &space_info);
>  
>  		cache->space_info = space_info;
>  
> @@ -10438,6 +10785,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  			ASSERT(list_empty(&cache->bg_list));
>  			btrfs_mark_bg_unused(cache);
>  		}
> +
> +		if (cache->wp_broken)
> +			inc_block_group_ro(cache, 1);
>  	}
>  
>  	list_for_each_entry_rcu(space_info, &info->space_info, list) {
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index f74dc259307b..cc69dc71f4c1 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  			   u64 offset, u64 bytes)
>  {
>  	struct btrfs_free_space *info;
> +	struct btrfs_block_group_cache *block_group = ctl->private;
>  	int ret = 0;
>  
> +	WARN_ON(block_group && block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
>  	if (!info)
>  		return -ENOMEM;
> @@ -2376,6 +2379,28 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  	return ret;
>  }
>  
> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
> +			       u64 bytenr, u64 size)
> +{
> +	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
> +	u64 offset = bytenr - block_group->key.objectid;
> +	u64 to_free, to_unusable;
> +
> +	spin_lock(&ctl->tree_lock);
> +	if (offset >= block_group->alloc_offset)
> +		to_free = size;
> +	else if (offset + size <= block_group->alloc_offset)
> +		to_free = 0;
> +	else
> +		to_free = offset + size - block_group->alloc_offset;
> +	to_unusable = size - to_free;
> +	ctl->free_space += to_free;
> +	block_group->unusable += to_unusable;
> +	spin_unlock(&ctl->tree_lock);
> +	return 0;
> +}
> +
>  int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>  			    u64 offset, u64 bytes)
>  {
> @@ -2384,6 +2409,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>  	int ret;
>  	bool re_search = false;
>  
> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	spin_lock(&ctl->tree_lock);
>  
>  again:
> @@ -2619,6 +2646,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
>  	u64 align_gap = 0;
>  	u64 align_gap_len = 0;
>  
> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	spin_lock(&ctl->tree_lock);
>  	entry = find_free_space(ctl, &offset, &bytes_search,
>  				block_group->full_stripe_len, max_extent_size);
> @@ -2738,6 +2767,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
>  	struct rb_node *node;
>  	u64 ret = 0;
>  
> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	spin_lock(&cluster->lock);
>  	if (bytes > cluster->max_size)
>  		goto out;
> @@ -3384,6 +3415,8 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>  {
>  	int ret;
>  
> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
> +
>  	*trimmed = 0;
>  
>  	spin_lock(&block_group->lock);
> diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
> index 8760acb55ffd..d30667784f73 100644
> --- a/fs/btrfs/free-space-cache.h
> +++ b/fs/btrfs/free-space-cache.h
> @@ -73,10 +73,15 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group);
>  int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>  			   struct btrfs_free_space_ctl *ctl,
>  			   u64 bytenr, u64 size);
> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
> +			       u64 bytenr, u64 size);
>  static inline int
>  btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
>  		     u64 bytenr, u64 size)
>  {
> +	if (block_group->alloc_type == BTRFS_ALLOC_SEQ)
> +		return __btrfs_add_free_space_seq(block_group, bytenr, size);
> +
>  	return __btrfs_add_free_space(block_group->fs_info,
>  				      block_group->free_space_ctl,
>  				      bytenr, size);
> -- 
> 2.21.0

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
  2019-06-13 14:12   ` Josef Bacik
@ 2019-06-17 22:53   ` David Sterba
  2019-06-18  9:01     ` Naohiro Aota
  2019-06-28  3:55   ` Anand Jain
  2 siblings, 1 reply; 79+ messages in thread
From: David Sterba @ 2019-06-17 22:53 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.

This could be explained in more detail. My understanding is that the 1st
and 2nd copy superblocks are skipped at write time but the zones
containing the superblocks are not excluded from allocations. I.e. regular
data can appear in places where the superblocks would exist on a
non-hmzoned filesystem. Is that correct?

The other option is to completely exclude the zone that contains the
superblock copies.

primary sb			 64K
1st copy			 64M
2nd copy			256G

Depends on the drives, but I think the size of the random write zone
will very often cover primary and 1st copy. So there's at least some
backup copy.

The 2nd copy will be in the sequential-only zone, so the whole zone
needs to be excluded in exclude_super_stripes. But it's not, so this
means data can go there.  I think the zone should be left empty.
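To make the layout argument concrete, here is a small userspace sketch (the three fixed mirror offsets above and a single boundary where the random-write zones end are assumptions of the example; this is not btrfs code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fixed superblock mirror offsets discussed above. */
static const uint64_t sb_offsets[] = {
	64ULL * 1024,			/* primary:  64K */
	64ULL * 1024 * 1024,		/* 1st copy: 64M */
	256ULL * 1024 * 1024 * 1024,	/* 2nd copy: 256G */
};

/*
 * A mirror whose offset lies at or beyond the end of the random-write
 * region sits in a sequential-only zone: it must either be skipped at
 * write time or have its whole zone excluded from allocation.
 */
static bool sb_in_sequential_zone(int mirror, uint64_t rand_write_end)
{
	return sb_offsets[mirror] >= rand_write_end;
}
```

With, say, a 1 GiB random-write region, only the 2nd copy falls in a sequential zone, matching the observation that the primary and 1st copy are very often covered.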

* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-17  3:16     ` Damien Le Moal
@ 2019-06-18  0:00       ` David Sterba
  2019-06-18  4:04         ` Damien Le Moal
  2019-06-18 13:33       ` Josef Bacik
  1 sibling, 1 reply; 79+ messages in thread
From: David Sterba @ 2019-06-18  0:00 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Josef Bacik, Naohiro Aota, linux-btrfs, David Sterba,
	Chris Mason, Qu Wenruo, Nikolay Borisov, linux-kernel,
	Hannes Reinecke, linux-fsdevel, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
> Josef,
> 
> On 2019/06/13 23:15, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> >> Sequential allocation is not enough to maintain sequential delivery of
> >> write IOs to the device. Various features (async compress, async checksum,
> >> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> >> buffer to sort WRITE bios belonging to a block group and sort them out
> >> sequentially in increasing block address to achieve sequential write
> >> sequences with __btrfs_map_bio().
> >>
> >> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > 
> > I hate everything about this.  Can't we just use the plugging infrastructure for
> > this and then make sure it re-orders the bios before submitting them?  Also
> > what's to prevent the block layer scheduler from re-arranging these io's?
> > Thanks,
> 
> The block I/O scheduler reorders requests in LBA order, but that happens for a
> newly inserted request against pending requests. If there are no pending
> requests because all requests were already issued, no ordering happens, and even
> worse, if the drive queue is not full yet (e.g. there are free tags), then the
> newly inserted request will be dispatched almost immediately, preventing
> reordering with subsequent incoming write requests from happening.

This would be good to add to the changelog.
> 
> The other problem is that the mq-deadline scheduler does not track zone WP
> position. Write request issuing is done regardless of the current WP value,
> solely based on LBA ordering. This means that mq-deadline will not prevent
> out-of-order, or rather, unaligned write requests.

This seems to be the key point.

> These will not be detected
> and will be dispatched whenever possible. The reasons for this are that:
> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
> that management at the block IO scheduler level is inefficient.
> 2) Adding zone WP management at the block IO scheduler level would also need a
> write error processing path to resync the WP value in case of failed writes. But
> the user/FS also needs that anyway. Again duplicated functionalities.
> 3) The block layer will need a timeout to force issue or cancel pending
> unaligned write requests. This is necessary in case the drive user stops issuing
> writes (for whatever reasons) or the scheduler is being switched. This would
> unnecessarily cause write I/O errors or cause deadlocks if the request queue
> quiesce mode is entered at the wrong time (and I do not see a good way to deal
> with that).
> 
> blk-mq is already complicated enough. Adding this to the block IO scheduler will
> unnecessarily complicate things further for no real benefits. I would like to
> point out the dm-zoned device mapper and f2fs which are both already dealing
> with write ordering and write error processing directly. Both are fairly
> straightforward but completely different and each optimized for their own structure.

So the question is where on which layer the decision logic is. The
filesystem(s) or dm-zoned have enough information about the zones and
the writes can be pre-sorted. This is what the patch proposes.

From your explanation I get that the I/O scheduler can throw a wrench
in the sequential ordering, for various reasons depending on the state of
internal structures of device queues. This is my simplified
interpretation as I don't understand all the magic below filesystem
layer.

I assume there are some guarantees about the ordering, eg. within one
plug, that apply to all schedulers (maybe not the noop one). Something
like that should be the least common functionality that the filesystem
layer can rely on.
 
> Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
> integrate and handle write ordering "a la btrfs". Would creating a different
> "hmzoned" btrfs IO scheduler help address your concerns ?

IMHO this sounds both the same, all we care about is the sequential
ordering, which in some sense is "scheduling", but I would not call it
that way due to the simplicity.

As implemented, it's a list of bios, but I'd suggest using rb-tree or
xarray, the insertion is fast and submission is start to end traversal.
I'm not sure that the loop in __btrfs_map_bio_zoned after the label
send_bios: has reasonable complexity; it looks like O(N^2).

* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-18  0:00       ` David Sterba
@ 2019-06-18  4:04         ` Damien Le Moal
  0 siblings, 0 replies; 79+ messages in thread
From: Damien Le Moal @ 2019-06-18  4:04 UTC (permalink / raw)
  To: dsterba
  Cc: Josef Bacik, Naohiro Aota, linux-btrfs, David Sterba,
	Chris Mason, Qu Wenruo, Nikolay Borisov, linux-kernel,
	Hannes Reinecke, linux-fsdevel, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

David,

On 2019/06/18 8:59, David Sterba wrote:
> On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
>> Josef,
>>
>> On 2019/06/13 23:15, Josef Bacik wrote:
>>> On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
>>>> Sequential allocation is not enough to maintain sequential delivery of
>>>> write IOs to the device. Various features (async compress, async checksum,
>>>> ...) of btrfs affect ordering of the IOs. This patch introduces submit
>>>> buffer to sort WRITE bios belonging to a block group and sort them out
>>>> sequentially in increasing block address to achieve sequential write
>>>> sequences with __btrfs_map_bio().
>>>>
>>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>>
>>> I hate everything about this.  Can't we just use the plugging infrastructure for
>>> this and then make sure it re-orders the bios before submitting them?  Also
>>> what's to prevent the block layer scheduler from re-arranging these io's?
>>> Thanks,
>>
>> The block I/O scheduler reorders requests in LBA order, but that happens for a
>> newly inserted request against pending requests. If there are no pending
>> requests because all requests were already issued, no ordering happens, and even
>> worse, if the drive queue is not full yet (e.g. there are free tags), then the
>> newly inserted request will be dispatched almost immediately, preventing
>> reordering with subsequent incoming write requests from happening.
> 
> This would be good to add to the changelog.

Sure. No problem. We can add that explanation.

>> The other problem is that the mq-deadline scheduler does not track zone WP
>> position. Write request issuing is done regardless of the current WP value,
>> solely based on LBA ordering. This means that mq-deadline will not prevent
>> out-of-order, or rather, unaligned write requests.
> 
> This seems to be the key point.

Yes it is. We can also add this to the commit message explanation.

>> These will not be detected
>> and will be dispatched whenever possible. The reasons for this are that:
>> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
>> that management at the block IO scheduler level is inefficient.
>> 2) Adding zone WP management at the block IO scheduler level would also need a
>> write error processing path to resync the WP value in case of failed writes. But
>> the user/FS also needs that anyway. Again duplicated functionalities.
>> 3) The block layer will need a timeout to force issue or cancel pending
>> unaligned write requests. This is necessary in case the drive user stops issuing
>> writes (for whatever reasons) or the scheduler is being switched. This would
>> unnecessarily cause write I/O errors or cause deadlocks if the request queue
>> quiesce mode is entered at the wrong time (and I do not see a good way to deal
>> with that).
>>
>> blk-mq is already complicated enough. Adding this to the block IO scheduler will
>> unnecessarily complicate things further for no real benefits. I would like to
>> point out the dm-zoned device mapper and f2fs which are both already dealing
>> with write ordering and write error processing directly. Both are fairly
>> straightforward but completely different and each optimized for their own structure.
> 
> So the question is where on which layer the decision logic is. The
> filesystem(s) or dm-zoned have enough information about the zones and
> the writes can be pre-sorted. This is what the patch proposes.

Yes, exactly.

> From your explanation I get that the I/O scheduler can throw a wrench
> in the sequential ordering, for various reasons depending on the state of
> internal structures of device queues. This is my simplified
> interpretation as I don't understand all the magic below filesystem
> layer.

Not exactly "throw a wrench". mq-deadline will guarantee per-zone write order
to be exactly the order in which requests were inserted, that is, issued by the
FS. But mq-deadline will not "wait" if the write order is not purely sequential,
that is, there are holes/jumps in the LBA sequence for the zone. Order only is
guaranteed. The alignment to WP/contiguous sequential write issuing is the
responsibility of the issuer (FS or DM or application in the case of raw accesses).

> I assume there are some guarantees about the ordering, eg. within one
> plug, that apply to all schedulers (maybe not the noop one). Something
> like that should be the least common functionality that the filesystem
> layer can rely on.

The insertion side of the scheduler (upper level from FS to scheduler), which
include the per CPU software queues and plug control, will not reorder requests.
However, the dispatch side (lower level, from scheduler to HBA driver) can cause
reordering. This is what mq-deadline prevents using a per zone write lock to
avoid reordering of write requests per zone by allowing only a single write
request per zone to be dispatched to the device at any time. Overall order is
not guaranteed, nor is read request order. But per zone write requests will not
be reordered.

But again, this is only ordering. Nothing to do with trying to achieve a purely
sequential write stream per zone. This is the responsibility of the issuer to
deliver write request per zone without any gap, all requests sequential in LBA
within each zone. Overall, the stream of request does not have to be sequential,
e.g. if multiple zones are being written at the same time. But per zone, write
requests must be sequential.
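As a sketch of the issuer-side invariant described here (a userspace model, not kernel code): a write to a sequential zone is acceptable only if it starts exactly at the zone's current write pointer, and the pointer advances on success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct zone_model {
	uint64_t wp;	/* current write pointer (absolute LBA) */
};

/* Accept only writes that start exactly at the write pointer. */
static bool issue_write(struct zone_model *z, uint64_t lba, uint64_t len)
{
	if (lba != z->wp)
		return false;	/* gap or overlap: unaligned write */
	z->wp += len;
	return true;
}

/* Sequential writes succeed; a write leaving a gap is rejected. */
static bool demo(void)
{
	struct zone_model z = { .wp = 0 };

	return issue_write(&z, 0, 8) &&		/* at WP: accepted */
	       !issue_write(&z, 16, 8) &&	/* gap: rejected */
	       issue_write(&z, 8, 8);		/* continues sequentially */
}
```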

>> Naohiro changes to btrfs IO scheduler have the same intent, that is, efficiently
>> integrate and handle write ordering "a la btrfs". Would creating a different
>> "hmzoned" btrfs IO scheduler help address your concerns ?
> 
> IMHO this sounds both the same, all we care about is the sequential
> ordering, which in some sense is "scheduling", but I would not call it
> that way due to the simplicity.

OK. And yes, it is only ordering of writes per zone. For all other requests,
e.g. reads, order does not matter. And the overall interleaving of write
requests to different zones can also be anything. No constraints there.

> As implemented, it's a list of bios, but I'd suggest using rb-tree or
> xarray, the insertion is fast and submission is start to end traversal.
> I'm not sure that the loop in __btrfs_map_bio_zoned after the label
> send_bios: has reasonable complexity; it looks like O(N^2).

OK. We can change that. rbtree is simple enough to use. We can change the list
to that.

Thank you for your comments.

Best regards.


-- 
Damien Le Moal
Western Digital Research

* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-13 13:58   ` Josef Bacik
@ 2019-06-18  6:04     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  6:04 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 22:58, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
>> If a zoned block device is found, get its zone information (number of zones
>> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
>> avoid costly run-time zone report commands to test the device zones type
>> during block allocation, attach the seqzones bitmap to the device structure
>> to indicate if a zone is sequential or accept random writes.
>>
>> This patch also introduces the helper function btrfs_dev_is_sequential() to
>> test if the zone storing a block is a sequential write required zone.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  33 +++++++++++
>>   2 files changed, 176 insertions(+)
>>
> 
> We have enough problems with giant files already, please just add a separate
> hmzoned.c or whatever and put all the zone specific code in there.  That'll save
> me time when I go and break a bunch of stuff out.  Thanks,
> 
> Josef
> 

Thank you for the reviews.

I'll add hmzoned.c and move these things there (with more helpers/abstraction) in the next version.

Thanks.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-17 18:57   ` David Sterba
@ 2019-06-18  6:42     ` Naohiro Aota
  2019-06-27 15:11       ` David Sterba
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  6:42 UTC (permalink / raw)
  To: dsterba
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/18 3:56, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:08PM +0900, Naohiro Aota wrote:
>> If a zoned block device is found, get its zone information (number of zones
>> and zone size) using the new helper function btrfs_get_dev_zonetypes().  To
>> avoid costly run-time zone report commands to test the device zones type
>> during block allocation, attach the seqzones bitmap to the device structure
>> to indicate if a zone is sequential or accept random writes.
>>
>> This patch also introduces the helper function btrfs_dev_is_sequential() to
>> test if the zone storing a block is a sequential write required zone.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/volumes.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  33 +++++++++++
>>   2 files changed, 176 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 1c2a6e4b39da..b673178718e3 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -786,6 +786,135 @@ static int btrfs_free_stale_devices(const char *path,
>>   	return ret;
>>   }
>>   
>> +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> 
> Please drop __ from the name, the pattern where this naming makes sense
> does not apply here. It's for cases where the function without
> underscores does some extra stuff like locking and the underscored one
> does not, and is used in some contexts. I haven't found
> btrfs_get_dev_zones in this or other patches.
> 
>> +				 struct blk_zone **zones,
>> +				 unsigned int *nr_zones, gfp_t gfp_mask)
>> +{
>> +	struct blk_zone *z = *zones;
> 
> This may apply to more places, please don't use a single letter for
> anything other than 'i' and similar. 'zone' would be suitable.

Sure.

>> +	int ret;
>> +
>> +	if (!z) {
>> +		z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
>> +		if (!z)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
>> +				  z, nr_zones, gfp_mask);
>> +	if (ret != 0) {
>> +		btrfs_err(device->fs_info, "Get zone at %llu failed %d\n",
> 
> No capital letter and no "\n" at the end of the message, that's added by
> btrfs_err.

oops, I overlooked it...

>> +			  pos, ret);
>> +		return ret;
>> +	}
>> +
>> +	*zones = z;
>> +
>> +	return 0;
>> +}
>> +
>> +static void btrfs_destroy_dev_zonetypes(struct btrfs_device *device)
>> +{
>> +	kfree(device->seq_zones);
>> +	kfree(device->empty_zones);
>> +	device->seq_zones = NULL;
>> +	device->empty_zones = NULL;
>> +	device->nr_zones = 0;
>> +	device->zone_size = 0;
>> +	device->zone_size_shift = 0;
>> +}
>> +
>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>> +		       struct blk_zone *zone, gfp_t gfp_mask)
>> +{
>> +	unsigned int nr_zones = 1;
>> +	int ret;
>> +
>> +	ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
>> +	if (ret != 0 || !nr_zones)
>> +		return ret ? ret : -EIO;
>> +
>> +	return 0;
>> +}
>> +
>> +int btrfs_get_dev_zonetypes(struct btrfs_device *device)
>> +{
>> +	struct block_device *bdev = device->bdev;
>> +	sector_t nr_sectors = bdev->bd_part->nr_sects;
>> +	sector_t sector = 0;
>> +	struct blk_zone *zones = NULL;
>> +	unsigned int i, n = 0, nr_zones;
>> +	int ret;
>> +
>> +	device->zone_size = 0;
>> +	device->zone_size_shift = 0;
>> +	device->nr_zones = 0;
>> +	device->seq_zones = NULL;
>> +	device->empty_zones = NULL;
>> +
>> +	if (!bdev_is_zoned(bdev))
>> +		return 0;
>> +
>> +	device->zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
>> +	device->zone_size_shift = ilog2(device->zone_size);
>> +	device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
>> +	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
>> +		device->nr_zones++;
>> +
>> +	device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
>> +				    sizeof(*device->seq_zones), GFP_KERNEL);
> 
> What's the expected range for the allocation size? There's one bit per
> zone, so one 4KiB page can hold up to 32768 zones, with 1GiB it's 32TiB
> of space on the drive. Ok that seems safe for now.

Typically, the zone size is 256MB (the default value in tcmu-runner). On such a
device, we need one 4KB page per 8TB of disk space. Still, it's quite safe.

>> +	if (!device->seq_zones)
>> +		return -ENOMEM;
>> +
>> +	device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
>> +				      sizeof(*device->empty_zones), GFP_KERNEL);
>> +	if (!device->empty_zones)
>> +		return -ENOMEM;
> 
> This leaks device->seq_zones from the current context, though there are
> calls to btrfs_destroy_dev_zonetypes that would clean it up eventually.
> It'd be better to clean up here instead of relying on the caller.

Exactly.

>> +
>> +#define BTRFS_REPORT_NR_ZONES   4096
> 
> Please move this to the begining of the file if this is just local to
> the .c file and put a short comment explaining the meaning.
> 
>> +
>> +	/* Get zones type */
>> +	while (sector < nr_sectors) {
>> +		nr_zones = BTRFS_REPORT_NR_ZONES;
>> +		ret = __btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
>> +					    &zones, &nr_zones, GFP_KERNEL);
>> +		if (ret != 0 || !nr_zones) {
>> +			if (!ret)
>> +				ret = -EIO;
>> +			goto out;
>> +		}
>> +
>> +		for (i = 0; i < nr_zones; i++) {
>> +			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
>> +				set_bit(n, device->seq_zones);
>> +			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
>> +				set_bit(n, device->empty_zones);
>> +			sector = zones[i].start + zones[i].len;
>> +			n++;
>> +		}
>> +	}
>> +
>> +	if (n != device->nr_zones) {
>> +		btrfs_err(device->fs_info,
>> +			  "Inconsistent number of zones (%u / %u)\n", n,
> 
> lowercase and no "\n"
> 
>> +			  device->nr_zones);
>> +		ret = -EIO;
>> +		goto out;
>> +	}
>> +
>> +	btrfs_info(device->fs_info,
>> +		   "host-%s zoned block device, %u zones of %llu sectors\n",
>> +		   bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
>> +		   device->nr_zones, device->zone_size >> SECTOR_SHIFT);
>> +
>> +out:
>> +	kfree(zones);
>> +
>> +	if (ret)
>> +		btrfs_destroy_dev_zonetypes(device);
>> +
>> +	return ret;
>> +}
>> +
>>   static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>>   			struct btrfs_device *device, fmode_t flags,
>>   			void *holder)
>> @@ -842,6 +971,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
>>   	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
>>   	device->mode = flags;
>>   
>> +	/* Get zone type information of zoned block devices */
>> +	ret = btrfs_get_dev_zonetypes(device);
>> +	if (ret != 0)
>> +		goto error_brelse;
>> +
>>   	fs_devices->open_devices++;
>>   	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
>>   	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
>> @@ -1243,6 +1377,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
>>   	}
>>   
>>   	blkdev_put(device->bdev, device->mode);
>> +	btrfs_destroy_dev_zonetypes(device);
>>   }
>>   
>>   static void btrfs_close_one_device(struct btrfs_device *device)
>> @@ -2664,6 +2799,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	mutex_unlock(&fs_info->chunk_mutex);
>>   	mutex_unlock(&fs_devices->device_list_mutex);
>>   
>> +	/* Get zone type information of zoned block devices */
>> +	ret = btrfs_get_dev_zonetypes(device);
>> +	if (ret) {
>> +		btrfs_abort_transaction(trans, ret);
>> +		goto error_sysfs;
> 
> Can this be moved before the locked section so that any failure does not
> lead to transaction abort?
> 
> The function returns ENOMEM that does not necessarily need to kill the
> filesystem. And EIO which means that some faulty device is being added
> to the filesystem but this again should fail early.

OK. I can move that before the transaction starts.

>> +	}
>> +
>>   	if (seeding_dev) {
>>   		mutex_lock(&fs_info->chunk_mutex);
>>   		ret = init_first_rw_device(trans);
>> @@ -2729,6 +2871,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	return ret;
>>   
>>   error_sysfs:
>> +	btrfs_destroy_dev_zonetypes(device);
>>   	btrfs_sysfs_rm_device_link(fs_devices, device);
>>   	mutex_lock(&fs_info->fs_devices->device_list_mutex);
>>   	mutex_lock(&fs_info->chunk_mutex);
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index b8a0e8d0672d..1599641e216c 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -62,6 +62,16 @@ struct btrfs_device {
>>   
>>   	struct block_device *bdev;
>>   
>> +	/*
>> +	 * Number of zones, zone size and types of zones if bdev is a
>> +	 * zoned block device.
>> +	 */
>> +	u64 zone_size;
>> +	u8  zone_size_shift;
> 
> So the zone_size is always power of two? I may be missing something, but
> I wonder if the calculations based on shifts are safe.

The kernel ZBD support has the restriction that
"The zone size must also be equal to a power of 2 number of logical blocks."
http://zonedstorage.io/introduction/linux-support/#zbd-support-restrictions

So, the zone_size is guaranteed to be a power of two.

>> +	u32 nr_zones;
>> +	unsigned long *seq_zones;
>> +	unsigned long *empty_zones;
>> +
>>   	/* the mode sent to blkdev_get */
>>   	fmode_t mode;
>>   
>> @@ -476,6 +486,28 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>>   int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>>   struct extent_map *btrfs_get_chunk_map(struct btrfs_fs_info *fs_info,
>>   				       u64 logical, u64 length);
>> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>> +		       struct blk_zone *zone, gfp_t gfp_mask);
>> +
>> +static inline int btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>> +{
>> +	unsigned int zno = pos >> device->zone_size_shift;
> 
> The types don't match here, pos is u64 and I'm not sure if it's
> guaranteed that the value after the shift will fit in an unsigned int.
> 
>> +
>> +	if (!device->seq_zones)
>> +		return 1;
>> +
>> +	return test_bit(zno, device->seq_zones);
>> +}
>> +
>> +static inline int btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
>> +{
>> +	unsigned int zno = pos >> device->zone_size_shift;
> 
> Same.

I will fix.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 03/19] btrfs: Check and enable HMZONED mode
  2019-06-13 13:57   ` Josef Bacik
@ 2019-06-18  6:43     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  6:43 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 22:57, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:09PM +0900, Naohiro Aota wrote:
>> HMZONED mode cannot be used together with the RAID5/6 profile for now.
>> Introduce the function btrfs_check_hmzoned_mode() to check this. This
>> function will also check if HMZONED flag is enabled on the file system and
>> if the file system consists of zoned devices with equal zone size.
>>
>> Additionally, as updates to the space cache are in-place, the space cache
>> cannot be located over sequential zones and there is no guarantee that the
>> device will have enough conventional zones to store this cache. Resolve
>> this problem by completely disabling the space cache.  This does not
>> introduce any problems with sequential block groups: all the free space is
>> located after the allocation pointer and there is no free space before the
>> pointer. There is no need to have such a cache.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/ctree.h       |  3 ++
>>   fs/btrfs/dev-replace.c |  7 +++
>>   fs/btrfs/disk-io.c     |  7 +++
>>   fs/btrfs/super.c       | 12 ++---
>>   fs/btrfs/volumes.c     | 99 ++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h     |  1 +
>>   6 files changed, 124 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index b81c331b28fa..6c00101407e4 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -806,6 +806,9 @@ struct btrfs_fs_info {
>>   	struct btrfs_root *uuid_root;
>>   	struct btrfs_root *free_space_root;
>>   
>> +	/* Zone size when in HMZONED mode */
>> +	u64 zone_size;
>> +
>>   	/* the log root tree is a directory of all the other log roots */
>>   	struct btrfs_root *log_root_tree;
>>   
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index ee0989c7e3a9..fbe5ea2a04ed 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>   		return PTR_ERR(bdev);
>>   	}
>>   
>> +	if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
>> +	     !btrfs_fs_incompat(fs_info, HMZONED)) ||
>> +	    (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
> 
> You do this in a few places, turn this into a helper please.
> 
>> +		ret = -EINVAL;
>> +		goto error;
>> +	}
>> +
>>   	filemap_write_and_wait(bdev->bd_inode->i_mapping);
>>   
>>   	devices = &fs_info->fs_devices->devices;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 663efce22d98..7c1404c76768 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
>>   
>>   	btrfs_free_extra_devids(fs_devices, 1);
>>   
>> +	ret = btrfs_check_hmzoned_mode(fs_info);
>> +	if (ret) {
>> +		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
>> +				ret);
>> +		goto fail_block_groups;
>> +	}
>> +
>>   	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>>   	if (ret) {
>>   		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 2c66d9ea6a3b..740a701f16c5 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>   	bool saved_compress_force;
>>   	int no_compress = 0;
>>   
>> -	cache_gen = btrfs_super_cache_generation(info->super_copy);
>> -	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>> -		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>> -	else if (cache_gen)
>> -		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>> +	if (!btrfs_fs_incompat(info, HMZONED)) {
>> +		cache_gen = btrfs_super_cache_generation(info->super_copy);
>> +		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>> +			btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>> +		else if (cache_gen)
>> +			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>> +	}
>>   
> 
> This disables the free space tree as well as the cache, sounds like you only
> need to disable the free space cache?  Thanks,

Right. We can still use the free space tree on HMZONED. I'll fix it in the next version.
Thanks


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-06-13 14:09   ` Josef Bacik
@ 2019-06-18  7:42     ` Naohiro Aota
  2019-06-18 13:35       ` Josef Bacik
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  7:42 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:09, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:14PM +0900, Naohiro Aota wrote:
>> If the btrfs volume has mirrored block groups, it unconditionally makes
>> un-mirrored block groups read only. When we have mirrored block groups, but
>> don't have writable block groups, this will drop all writable block groups.
>> So, check if we have at least one writable mirrored block group before
>> setting un-mirrored block groups read only.
>>
> 
> I don't understand why you want this.  Thanks,
> 
> Josef
> 

This is necessary to handle e.g. the btrfs/124 case.

When we mount a degraded RAID1 FS, write to it, and then
re-mount with the full set of devices, the write pointers of the
corresponding zones of the written BG differ. Patch 07 marks such a
block group as "wp_broken" and makes it read-only. In this situation,
the only RAID1 BGs we have are read-only because of "wp_broken", and
the un-mirrored BGs are also marked read-only because RAID1 BGs exist.
As a result, all the BGs are now read-only, so we cannot even start a
rebalance to fix the situation.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 05/19] btrfs: disable direct IO in HMZONED mode
  2019-06-13 14:00   ` Josef Bacik
@ 2019-06-18  8:17     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  8:17 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:00, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:11PM +0900, Naohiro Aota wrote:
>> Direct write I/Os can be directed at existing extents that have already
>> been written. Such write requests are prohibited on host-managed zoned
>> block devices. So disable direct IO support for a volume with HMZONED mode
>> enabled.
>>
> 
> That's only if we're nocow, so seems like you only need to disable DIO into
> nocow regions with hmzoned?  Thanks,
> 
> Josef
> 

True. And actually, I had to disable or ignore BTRFS_INODE_NODATACOW in HMZONED mode.
I'll replace this patch with that one.
Thanks,

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-13 14:07   ` Josef Bacik
@ 2019-06-18  8:28     ` Naohiro Aota
  2019-06-18 13:37       ` Josef Bacik
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  8:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:07, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
>> @@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
>>   	}
>>   
>>   	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
>> -		    cache->bytes_super - btrfs_block_group_used(&cache->item);
>> +		    cache->bytes_super - cache->unusable -
>> +		    btrfs_block_group_used(&cache->item);
>>   	sinfo_used = btrfs_space_info_used(sinfo, true);
>>   
>>   	if (sinfo_used + num_bytes + min_allocable_bytes <=
>> @@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
>>   	if (!--cache->ro) {
>>   		num_bytes = cache->key.offset - cache->reserved -
>>   			    cache->pinned - cache->bytes_super -
>> +			    cache->unusable -
>>   			    btrfs_block_group_used(&cache->item);
> 
> You've done this in a few places, but not all the places, most notably
> btrfs_space_info_used() which is used in the space reservation code a lot.

I added "unusable" to struct btrfs_block_group_cache, but added
nothing to struct btrfs_space_info. Once an extent is allocated and
freed in an ALLOC_SEQ block group, that extent is never reused
until we remove the BG. I'm accounting the size of such regions
in "cache->unusable" and in "space_info->bytes_readonly", so
btrfs_space_info_used() does not need modification.

I admit it's confusing here. I can add "bytes_zone_unusable" to
struct btrfs_space_info if that's better.

>>   		sinfo->bytes_readonly -= num_bytes;
>>   		list_del_init(&cache->ro_list);
>> @@ -10200,11 +10287,240 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
>>   	}
>>   }
>>   
>> +static int
>> +btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
>> +{
>> +	struct btrfs_fs_info *fs_info = cache->fs_info;
>> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
>> +	struct extent_map *em;
>> +	struct map_lookup *map;
>> +	struct btrfs_device *device;
>> +	u64 logical = cache->key.objectid;
>> +	u64 length = cache->key.offset;
>> +	u64 physical = 0;
>> +	int ret, alloc_type;
>> +	int i, j;
>> +	u64 *alloc_offsets = NULL;
>> +
>> +#define WP_MISSING_DEV ((u64)-1)
>> +
>> +	/* Sanity check */
>> +	if (!IS_ALIGNED(length, fs_info->zone_size)) {
>> +		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
>> +			  logical, length);
>> +		return -EIO;
>> +	}
>> +
>> +	/* Get the chunk mapping */
>> +	em_tree = &fs_info->mapping_tree.map_tree;
>> +	read_lock(&em_tree->lock);
>> +	em = lookup_extent_mapping(em_tree, logical, length);
>> +	read_unlock(&em_tree->lock);
>> +
>> +	if (!em)
>> +		return -EINVAL;
>> +
>> +	map = em->map_lookup;
>> +
>> +	/*
>> +	 * Get the zone type: if the group is mapped to a non-sequential zone,
>> +	 * there is no need for the allocation offset (fit allocation is OK).
>> +	 */
>> +	alloc_type = -1;
>> +	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
>> +				GFP_NOFS);
>> +	if (!alloc_offsets) {
>> +		free_extent_map(em);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	for (i = 0; i < map->num_stripes; i++) {
>> +		int is_sequential;
>> +		struct blk_zone zone;
>> +
>> +		device = map->stripes[i].dev;
>> +		physical = map->stripes[i].physical;
>> +
>> +		if (device->bdev == NULL) {
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			continue;
>> +		}
>> +
>> +		is_sequential = btrfs_dev_is_sequential(device, physical);
>> +		if (alloc_type == -1)
>> +			alloc_type = is_sequential ?
>> +					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
>> +
>> +		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
>> +		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
>> +			btrfs_err(fs_info, "found block group of mixed zone types");
>> +			ret = -EIO;
>> +			goto out;
>> +		}
>> +
>> +		if (!is_sequential)
>> +			continue;
>> +
>> +		/* this zone will be used for allocation, so mark this
>> +		 * zone non-empty
>> +		 */
>> +		clear_bit(physical >> device->zone_size_shift,
>> +			  device->empty_zones);
>> +
>> +		/*
>> +		 * The group is mapped to a sequential zone. Get the zone write
>> +		 * pointer to determine the allocation offset within the zone.
>> +		 */
>> +		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
>> +		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
>> +		if (ret == -EIO || ret == -EOPNOTSUPP) {
>> +			ret = 0;
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			continue;
>> +		} else if (ret) {
>> +			goto out;
>> +		}
>> +
>> +
>> +		switch (zone.cond) {
>> +		case BLK_ZONE_COND_OFFLINE:
>> +		case BLK_ZONE_COND_READONLY:
>> +			btrfs_err(fs_info, "Offline/readonly zone %llu",
>> +				  physical >> device->zone_size_shift);
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			break;
>> +		case BLK_ZONE_COND_EMPTY:
>> +			alloc_offsets[i] = 0;
>> +			break;
>> +		case BLK_ZONE_COND_FULL:
>> +			alloc_offsets[i] = fs_info->zone_size;
>> +			break;
>> +		default:
>> +			/* Partially used zone */
>> +			alloc_offsets[i] =
>> +				((zone.wp - zone.start) << SECTOR_SHIFT);
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (alloc_type == BTRFS_ALLOC_FIT)
>> +		goto out;
>> +
>> +	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>> +	case 0: /* single */
>> +	case BTRFS_BLOCK_GROUP_DUP:
>> +	case BTRFS_BLOCK_GROUP_RAID1:
>> +		cache->alloc_offset = WP_MISSING_DEV;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			if (alloc_offsets[i] == WP_MISSING_DEV)
>> +				continue;
>> +			if (cache->alloc_offset == WP_MISSING_DEV)
>> +				cache->alloc_offset = alloc_offsets[i];
>> +			if (alloc_offsets[i] == cache->alloc_offset)
>> +				continue;
>> +
>> +			btrfs_err(fs_info,
>> +				  "write pointer mismatch: block group %llu",
>> +				  logical);
>> +			cache->wp_broken = 1;
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID0:
>> +		cache->alloc_offset = 0;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			if (alloc_offsets[i] == WP_MISSING_DEV) {
>> +				btrfs_err(fs_info,
>> +					  "cannot recover write pointer: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			if (alloc_offsets[0] < alloc_offsets[i]) {
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			cache->alloc_offset += alloc_offsets[i];
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID10:
>> +		/*
>> +		 * Pass1: check write pointer of RAID1 level: each pointer
>> +		 * should be equal.
>> +		 */
>> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
>> +			int base = i*map->sub_stripes;
>> +			u64 offset = WP_MISSING_DEV;
>> +
>> +			for (j = 0; j < map->sub_stripes; j++) {
>> +				if (alloc_offsets[base+j] == WP_MISSING_DEV)
>> +					continue;
>> +				if (offset == WP_MISSING_DEV)
>> +					offset = alloc_offsets[base+j];
>> +				if (alloc_offsets[base+j] == offset)
>> +					continue;
>> +
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +			}
>> +			for (j = 0; j < map->sub_stripes; j++)
>> +				alloc_offsets[base+j] = offset;
>> +		}
>> +
>> +		/* Pass2: check write pointer of RAID1 level */
>> +		cache->alloc_offset = 0;
>> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
>> +			int base = i*map->sub_stripes;
>> +
>> +			if (alloc_offsets[base] == WP_MISSING_DEV) {
>> +				btrfs_err(fs_info,
>> +					  "cannot recover write pointer: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			if (alloc_offsets[0] < alloc_offsets[base]) {
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			cache->alloc_offset += alloc_offsets[base];
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID5:
>> +	case BTRFS_BLOCK_GROUP_RAID6:
>> +		/* RAID5/6 is not supported yet */
>> +	default:
>> +		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
>> +			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +out:
>> +	cache->alloc_type = alloc_type;
>> +	kfree(alloc_offsets);
>> +	free_extent_map(em);
>> +
>> +	return ret;
>> +}
>> +
> 
> Move this to the zoned device file that you create.

Sure.

>>   static struct btrfs_block_group_cache *
>>   btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>   			       u64 start, u64 size)
>>   {
>>   	struct btrfs_block_group_cache *cache;
>> +	int ret;
>>   
>>   	cache = kzalloc(sizeof(*cache), GFP_NOFS);
>>   	if (!cache)
>> @@ -10238,6 +10554,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>   	atomic_set(&cache->trimming, 0);
>>   	mutex_init(&cache->free_space_lock);
>>   	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
>> +	cache->alloc_type = BTRFS_ALLOC_FIT;
>> +	cache->alloc_offset = 0;
>> +
>> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>> +		ret = btrfs_get_block_group_alloc_offset(cache);
>> +		if (ret) {
>> +			kfree(cache);
>> +			return NULL;
>> +		}
>> +	}
>>   
>>   	return cache;
>>   }
>> @@ -10310,6 +10636,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   	int need_clear = 0;
>>   	u64 cache_gen;
>>   	u64 feature;
>> +	u64 unusable;
>>   	int mixed;
>>   
>>   	feature = btrfs_super_incompat_flags(info->super_copy);
>> @@ -10415,6 +10742,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   			free_excluded_extents(cache);
>>   		}
>>   
>> +		switch (cache->alloc_type) {
>> +		case BTRFS_ALLOC_FIT:
>> +			unusable = cache->bytes_super;
>> +			break;
>> +		case BTRFS_ALLOC_SEQ:
>> +			WARN_ON(cache->bytes_super != 0);
>> +			unusable = cache->alloc_offset -
>> +				btrfs_block_group_used(&cache->item);
>> +			/* we only need ->free_space in ALLOC_SEQ BGs */
>> +			cache->last_byte_to_unpin = (u64)-1;
>> +			cache->cached = BTRFS_CACHE_FINISHED;
>> +			cache->free_space_ctl->free_space =
>> +				cache->key.offset - cache->alloc_offset;
>> +			cache->unusable = unusable;
>> +			free_excluded_extents(cache);
>> +			break;
>> +		default:
>> +			BUG();
>> +		}
>> +
>>   		ret = btrfs_add_block_group_cache(info, cache);
>>   		if (ret) {
>>   			btrfs_remove_free_space_cache(cache);
>> @@ -10425,7 +10772,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   		trace_btrfs_add_block_group(info, cache, 0);
>>   		update_space_info(info, cache->flags, found_key.offset,
>>   				  btrfs_block_group_used(&cache->item),
>> -				  cache->bytes_super, &space_info);
>> +				  unusable, &space_info);
>>   
>>   		cache->space_info = space_info;
>>   
>> @@ -10438,6 +10785,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   			ASSERT(list_empty(&cache->bg_list));
>>   			btrfs_mark_bg_unused(cache);
>>   		}
>> +
>> +		if (cache->wp_broken)
>> +			inc_block_group_ro(cache, 1);
>>   	}
>>   
>>   	list_for_each_entry_rcu(space_info, &info->space_info, list) {
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index f74dc259307b..cc69dc71f4c1 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>>   			   u64 offset, u64 bytes)
>>   {
>>   	struct btrfs_free_space *info;
>> +	struct btrfs_block_group_cache *block_group = ctl->private;
>>   	int ret = 0;
>>   
>> +	WARN_ON(block_group && block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
>>   	if (!info)
>>   		return -ENOMEM;
>> @@ -2376,6 +2379,28 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>>   	return ret;
>>   }
>>   
>> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
>> +			       u64 bytenr, u64 size)
>> +{
>> +	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>> +	u64 offset = bytenr - block_group->key.objectid;
>> +	u64 to_free, to_unusable;
>> +
>> +	spin_lock(&ctl->tree_lock);
>> +	if (offset >= block_group->alloc_offset)
>> +		to_free = size;
>> +	else if (offset + size <= block_group->alloc_offset)
>> +		to_free = 0;
>> +	else
>> +		to_free = offset + size - block_group->alloc_offset;
>> +	to_unusable = size - to_free;
>> +	ctl->free_space += to_free;
>> +	block_group->unusable += to_unusable;
>> +	spin_unlock(&ctl->tree_lock);
>> +	return 0;
>> +
>> +}
>> +
>>   int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>>   			    u64 offset, u64 bytes)
>>   {
>> @@ -2384,6 +2409,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>>   	int ret;
>>   	bool re_search = false;
>>   
>> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
> 
> These should probably be ASSERT() right?  Want to make sure the developers
> really notice a problem when testing.  Thanks,
> 
> Josef
> 

Agree. I will use ASSERT.
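For reference, the free/unusable split that the quoted __btrfs_add_free_space_seq() performs can be modeled in a small standalone sketch (hypothetical userspace code, not the kernel patch itself; the real version operates on btrfs_block_group_cache under ctl->tree_lock):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the sequential-BG free-space accounting.
 * Freed space below the allocation pointer cannot be reused until
 * the underlying zone is reset, so it is counted as "unusable".
 */
struct seq_bg {
	uint64_t objectid;     /* block group start (bytenr) */
	uint64_t alloc_offset; /* sequential allocation pointer */
	uint64_t free_space;
	uint64_t unusable;
};

static void add_free_space_seq(struct seq_bg *bg, uint64_t bytenr,
			       uint64_t size)
{
	uint64_t offset = bytenr - bg->objectid;
	uint64_t to_free;

	if (offset >= bg->alloc_offset)
		to_free = size;		/* entirely above the pointer */
	else if (offset + size <= bg->alloc_offset)
		to_free = 0;		/* entirely below: needs zone reset */
	else
		to_free = offset + size - bg->alloc_offset;

	bg->free_space += to_free;
	bg->unusable += size - to_free;
}
```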


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-17 22:30   ` David Sterba
@ 2019-06-18  8:49     ` Naohiro Aota
  2019-06-27 15:28       ` David Sterba
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  8:49 UTC (permalink / raw)
  To: dsterba
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/18 7:29, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
>> On HMZONED drives, writes must always be sequential and directed at a block
>> group zone write pointer position. Thus, block allocation in a block group
>> must also be done sequentially using an allocation pointer equal to the
>> block group zone write pointer plus the number of blocks allocated but not
>> yet written.
>>
>> The sequential allocation function find_free_extent_seq() bypasses the
>> checks in find_free_extent() and increases the reserved byte counter by
>> itself. A region, once allocated, cannot be reverted in the sequential
>> allocation, since reverting it might race with other allocations and
>> leave an allocation hole, which breaks the sequential write rule.
>>
>> Furthermore, this commit introduces two new variables in struct
>> btrfs_block_group_cache. "wp_broken" indicates that the write pointer is
>> broken (e.g. not synced on a RAID1 block group) and marks that block
>> group read-only. "unusable" keeps track of the size of regions that were
>> once allocated and then freed. Such regions are never usable until the
>> underlying zones are reset.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/ctree.h            |  24 +++
>>   fs/btrfs/extent-tree.c      | 378 ++++++++++++++++++++++++++++++++++--
>>   fs/btrfs/free-space-cache.c |  33 ++++
>>   fs/btrfs/free-space-cache.h |   5 +
>>   4 files changed, 426 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 6c00101407e4..f4bcd2a6ec12 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -582,6 +582,20 @@ struct btrfs_full_stripe_locks_tree {
>>   	struct mutex lock;
>>   };
>>   
>> +/* Block group allocation types */
>> +enum btrfs_alloc_type {
>> +
>> +	/* Regular first fit allocation */
>> +	BTRFS_ALLOC_FIT		= 0,
>> +
>> +	/*
>> +	 * Sequential allocation: this is for HMZONED mode and
>> +	 * will result in ignoring free space before a block
>> +	 * group allocation offset.
> 
> Please format the comments to 80 columns
> 
>> +	 */
>> +	BTRFS_ALLOC_SEQ		= 1,
>> +};
>> +
>>   struct btrfs_block_group_cache {
>>   	struct btrfs_key key;
>>   	struct btrfs_block_group_item item;
>> @@ -592,6 +606,7 @@ struct btrfs_block_group_cache {
>>   	u64 reserved;
>>   	u64 delalloc_bytes;
>>   	u64 bytes_super;
>> +	u64 unusable;
> 
> 'unusable' is specific to the zones, so 'zone_unusable' would make it
> clear. The terminilogy around space is confusing already (we have
> unused, free, reserved, allocated, slack).

Sure. I will change the name.

Or, would it be better to add a new struct "btrfs_seq_alloc_info" and move
all these variables there? Then I could just add one pointer to the struct
here.
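Something along these lines is what such a grouping could look like (a sketch only; the struct name follows the suggestion above, the field layout is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical grouping of the HMZONED-only fields. struct
 * btrfs_block_group_cache would then carry a single pointer to this,
 * left NULL for regular (ALLOC_FIT) block groups.
 */
struct btrfs_seq_alloc_info {
	uint64_t alloc_offset;	/* sequential allocation pointer */
	uint64_t unusable;	/* freed-but-not-reset bytes */
	unsigned int wp_broken:1; /* write pointers out of sync */
};
```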

>>   	u64 flags;
>>   	u64 cache_generation;
>>   
>> @@ -621,6 +636,7 @@ struct btrfs_block_group_cache {
>>   	unsigned int iref:1;
>>   	unsigned int has_caching_ctl:1;
>>   	unsigned int removed:1;
>> +	unsigned int wp_broken:1;
>>   
>>   	int disk_cache_state;
>>   
>> @@ -694,6 +710,14 @@ struct btrfs_block_group_cache {
>>   
>>   	/* Record locked full stripes for RAID5/6 block group */
>>   	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>> +
>> +	/*
>> +	 * Allocation offset for the block group to implement sequential
>> +	 * allocation. This is used only with HMZONED mode enabled and if
>> +	 * the block group resides on a sequential zone.
>> +	 */
>> +	enum btrfs_alloc_type alloc_type;
>> +	u64 alloc_offset;
>>   };
>>   
>>   /* delayed seq elem */
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 363db58f56b8..ebd0d6eae038 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -28,6 +28,7 @@
>>   #include "sysfs.h"
>>   #include "qgroup.h"
>>   #include "ref-verify.h"
>> +#include "rcu-string.h"
>>   
>>   #undef SCRAMBLE_DELAYED_REFS
>>   
>> @@ -590,6 +591,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>>   	struct btrfs_caching_control *caching_ctl;
>>   	int ret = 0;
>>   
>> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
>>   	if (!caching_ctl)
>>   		return -ENOMEM;
>> @@ -6555,6 +6558,19 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
>>   	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
>>   }
>>   
>> +static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>> +				       u64 ram_bytes, u64 num_bytes,
>> +				       int delalloc)
>> +{
>> +	struct btrfs_space_info *space_info = cache->space_info;
>> +
>> +	cache->reserved += num_bytes;
>> +	space_info->bytes_reserved += num_bytes;
>> +	update_bytes_may_use(space_info, -ram_bytes);
>> +	if (delalloc)
>> +		cache->delalloc_bytes += num_bytes;
>> +}
>> +
>>   /**
>>    * btrfs_add_reserved_bytes - update the block_group and space info counters
>>    * @cache:	The cache we are manipulating
>> @@ -6573,17 +6589,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
>>   	struct btrfs_space_info *space_info = cache->space_info;
>>   	int ret = 0;
>>   
>> +	/* should be handled by find_free_extent_seq */
>> +	WARN_ON(cache->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	spin_lock(&space_info->lock);
>>   	spin_lock(&cache->lock);
>> -	if (cache->ro) {
>> +	if (cache->ro)
>>   		ret = -EAGAIN;
>> -	} else {
>> -		cache->reserved += num_bytes;
>> -		space_info->bytes_reserved += num_bytes;
>> -		update_bytes_may_use(space_info, -ram_bytes);
>> -		if (delalloc)
>> -			cache->delalloc_bytes += num_bytes;
>> -	}
>> +	else
>> +		__btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes,
>> +					   delalloc);
>>   	spin_unlock(&cache->lock);
>>   	spin_unlock(&space_info->lock);
>>   	return ret;
>> @@ -6701,9 +6716,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>>   			cache = btrfs_lookup_block_group(fs_info, start);
>>   			BUG_ON(!cache); /* Logic error */
>>   
>> -			cluster = fetch_cluster_info(fs_info,
>> -						     cache->space_info,
>> -						     &empty_cluster);
>> +			if (cache->alloc_type == BTRFS_ALLOC_FIT)
>> +				cluster = fetch_cluster_info(fs_info,
>> +							     cache->space_info,
>> +							     &empty_cluster);
>> +			else
>> +				cluster = NULL;
>> +
>>   			empty_cluster <<= 1;
>>   		}
>>   
>> @@ -6743,7 +6762,8 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info,
>>   		space_info->max_extent_size = 0;
>>   		percpu_counter_add_batch(&space_info->total_bytes_pinned,
>>   			    -len, BTRFS_TOTAL_BYTES_PINNED_BATCH);
>> -		if (cache->ro) {
>> +		if (cache->ro || cache->alloc_type == BTRFS_ALLOC_SEQ) {
>> +			/* need reset before reusing in ALLOC_SEQ BG */
>>   			space_info->bytes_readonly += len;
>>   			readonly = true;
>>   		}
>> @@ -7588,6 +7608,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg,
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Simple allocator for sequential-only block groups. It only allows
>> + * sequential allocation. No need to play with trees. This function
>> + * also reserves the bytes as in btrfs_add_reserved_bytes.
>> + */
>> +
>> +static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
>> +				struct find_free_extent_ctl *ffe_ctl)
>> +{
>> +	struct btrfs_space_info *space_info = cache->space_info;
>> +	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
>> +	u64 start = cache->key.objectid;
>> +	u64 num_bytes = ffe_ctl->num_bytes;
>> +	u64 avail;
>> +	int ret = 0;
>> +
>> +	/* Sanity check */
>> +	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
>> +		return 1;
>> +
>> +	spin_lock(&space_info->lock);
>> +	spin_lock(&cache->lock);
>> +
>> +	if (cache->ro) {
>> +		ret = -EAGAIN;
>> +		goto out;
>> +	}
>> +
>> +	spin_lock(&ctl->tree_lock);
>> +	avail = cache->key.offset - cache->alloc_offset;
>> +	if (avail < num_bytes) {
>> +		ffe_ctl->max_extent_size = avail;
>> +		spin_unlock(&ctl->tree_lock);
>> +		ret = 1;
>> +		goto out;
>> +	}
>> +
>> +	ffe_ctl->found_offset = start + cache->alloc_offset;
>> +	cache->alloc_offset += num_bytes;
>> +	ctl->free_space -= num_bytes;
>> +	spin_unlock(&ctl->tree_lock);
>> +
>> +	BUG_ON(!IS_ALIGNED(ffe_ctl->found_offset,
>> +			   cache->fs_info->stripesize));
>> +	ffe_ctl->search_start = ffe_ctl->found_offset;
>> +	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
>> +				   ffe_ctl->delalloc);
>> +
>> +out:
>> +	spin_unlock(&cache->lock);
>> +	spin_unlock(&space_info->lock);
>> +	return ret;
>> +}
>> +
>>   /*
>>    * Return >0 means caller needs to re-search for free extent
>>    * Return 0 means we have the needed free extent.
>> @@ -7889,6 +7963,16 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>   		if (unlikely(block_group->cached == BTRFS_CACHE_ERROR))
>>   			goto loop;
>>   
>> +		if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
>> +			ret = find_free_extent_seq(block_group, &ffe_ctl);
>> +			if (ret)
>> +				goto loop;
>> +			/* btrfs_find_space_for_alloc_seq should ensure
>> +			 * that everything is OK and reserve the extent.
>> +			 */
> 
> Please use the
> 
> /*
>   * comment
>   */
> 
> style
> 
>> +			goto nocheck;
>> +		}
>> +
>>   		/*
>>   		 * Ok we want to try and use the cluster allocator, so
>>   		 * lets look there
>> @@ -7944,6 +8028,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>>   					     num_bytes);
>>   			goto loop;
>>   		}
>> +nocheck:
>>   		btrfs_inc_block_group_reservations(block_group);
>>   
>>   		/* we are all good, lets return */
>> @@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
>>   	}
>>   
>>   	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
>> -		    cache->bytes_super - btrfs_block_group_used(&cache->item);
>> +		    cache->bytes_super - cache->unusable -
>> +		    btrfs_block_group_used(&cache->item);
>>   	sinfo_used = btrfs_space_info_used(sinfo, true);
>>   
>>   	if (sinfo_used + num_bytes + min_allocable_bytes <=
>> @@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
>>   	if (!--cache->ro) {
>>   		num_bytes = cache->key.offset - cache->reserved -
>>   			    cache->pinned - cache->bytes_super -
>> +			    cache->unusable -
>>   			    btrfs_block_group_used(&cache->item);
>>   		sinfo->bytes_readonly -= num_bytes;
>>   		list_del_init(&cache->ro_list);
>> @@ -10200,11 +10287,240 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
>>   	}
>>   }
>>   
>> +static int
>> +btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
>> +{
>> +	struct btrfs_fs_info *fs_info = cache->fs_info;
>> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
>> +	struct extent_map *em;
>> +	struct map_lookup *map;
>> +	struct btrfs_device *device;
>> +	u64 logical = cache->key.objectid;
>> +	u64 length = cache->key.offset;
>> +	u64 physical = 0;
>> +	int ret, alloc_type;
>> +	int i, j;
>> +	u64 *alloc_offsets = NULL;
>> +
>> +#define WP_MISSING_DEV ((u64)-1)
> 
> Please move the definition to the beginning of the file
> 
>> +
>> +	/* Sanity check */
>> +	if (!IS_ALIGNED(length, fs_info->zone_size)) {
>> +		btrfs_err(fs_info, "unaligned block group at %llu + %llu",
>> +			  logical, length);
>> +		return -EIO;
>> +	}
>> +
>> +	/* Get the chunk mapping */
>> +	em_tree = &fs_info->mapping_tree.map_tree;
>> +	read_lock(&em_tree->lock);
>> +	em = lookup_extent_mapping(em_tree, logical, length);
>> +	read_unlock(&em_tree->lock);
>> +
>> +	if (!em)
>> +		return -EINVAL;
>> +
>> +	map = em->map_lookup;
>> +
>> +	/*
>> +	 * Get the zone type: if the group is mapped to a non-sequential zone,
>> +	 * there is no need for the allocation offset (fit allocation is OK).
>> +	 */
>> +	alloc_type = -1;
>> +	alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
>> +				GFP_NOFS);
>> +	if (!alloc_offsets) {
>> +		free_extent_map(em);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	for (i = 0; i < map->num_stripes; i++) {
>> +		int is_sequential;
> 
> Please use bool instead of int
> 
>> +		struct blk_zone zone;
>> +
>> +		device = map->stripes[i].dev;
>> +		physical = map->stripes[i].physical;
>> +
>> +		if (device->bdev == NULL) {
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			continue;
>> +		}
>> +
>> +		is_sequential = btrfs_dev_is_sequential(device, physical);
>> +		if (alloc_type == -1)
>> +			alloc_type = is_sequential ?
>> +					BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
>> +
>> +		if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
>> +		    (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
>> +			btrfs_err(fs_info, "found block group of mixed zone types");
>> +			ret = -EIO;
>> +			goto out;
>> +		}
>> +
>> +		if (!is_sequential)
>> +			continue;
>> +
>> +		/* this zone will be used for allocation, so mark this
>> +		 * zone non-empty
>> +		 */
>> +		clear_bit(physical >> device->zone_size_shift,
>> +			  device->empty_zones);
>> +
>> +		/*
>> +		 * The group is mapped to a sequential zone. Get the zone write
>> +		 * pointer to determine the allocation offset within the zone.
>> +		 */
>> +		WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
>> +		ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
>> +		if (ret == -EIO || ret == -EOPNOTSUPP) {
>> +			ret = 0;
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			continue;
>> +		} else if (ret) {
>> +			goto out;
>> +		}
>> +
>> +
>> +		switch (zone.cond) {
>> +		case BLK_ZONE_COND_OFFLINE:
>> +		case BLK_ZONE_COND_READONLY:
>> +			btrfs_err(fs_info, "Offline/readonly zone %llu",
>> +				  physical >> device->zone_size_shift);
>> +			alloc_offsets[i] = WP_MISSING_DEV;
>> +			break;
>> +		case BLK_ZONE_COND_EMPTY:
>> +			alloc_offsets[i] = 0;
>> +			break;
>> +		case BLK_ZONE_COND_FULL:
>> +			alloc_offsets[i] = fs_info->zone_size;
>> +			break;
>> +		default:
>> +			/* Partially used zone */
>> +			alloc_offsets[i] =
>> +				((zone.wp - zone.start) << SECTOR_SHIFT);
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (alloc_type == BTRFS_ALLOC_FIT)
>> +		goto out;
>> +
>> +	switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>> +	case 0: /* single */
>> +	case BTRFS_BLOCK_GROUP_DUP:
>> +	case BTRFS_BLOCK_GROUP_RAID1:
>> +		cache->alloc_offset = WP_MISSING_DEV;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			if (alloc_offsets[i] == WP_MISSING_DEV)
>> +				continue;
>> +			if (cache->alloc_offset == WP_MISSING_DEV)
>> +				cache->alloc_offset = alloc_offsets[i];
>> +			if (alloc_offsets[i] == cache->alloc_offset)
>> +				continue;
>> +
>> +			btrfs_err(fs_info,
>> +				  "write pointer mismatch: block group %llu",
>> +				  logical);
>> +			cache->wp_broken = 1;
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID0:
>> +		cache->alloc_offset = 0;
>> +		for (i = 0; i < map->num_stripes; i++) {
>> +			if (alloc_offsets[i] == WP_MISSING_DEV) {
>> +				btrfs_err(fs_info,
>> +					  "cannot recover write pointer: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			if (alloc_offsets[0] < alloc_offsets[i]) {
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			cache->alloc_offset += alloc_offsets[i];
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID10:
>> +		/*
>> +		 * Pass1: check write pointer of RAID1 level: each pointer
>> +		 * should be equal.
>> +		 */
>> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
>> +			int base = i*map->sub_stripes;
> 
> spaces around binary operators
> 
> 			int base = i * map->sub_stripes;
> 
>> +			u64 offset = WP_MISSING_DEV;
>> +
>> +			for (j = 0; j < map->sub_stripes; j++) {
>> +				if (alloc_offsets[base+j] == WP_MISSING_DEV)
> 
> here and below
> 
>> +					continue;
>> +				if (offset == WP_MISSING_DEV)
>> +					offset = alloc_offsets[base+j];
>> +				if (alloc_offsets[base+j] == offset)
>> +					continue;
>> +
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +			}
>> +			for (j = 0; j < map->sub_stripes; j++)
>> +				alloc_offsets[base+j] = offset;
>> +		}
>> +
>> +		/* Pass2: check write pointer of RAID1 level */
>> +		cache->alloc_offset = 0;
>> +		for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
>> +			int base = i*map->sub_stripes;
>> +
>> +			if (alloc_offsets[base] == WP_MISSING_DEV) {
>> +				btrfs_err(fs_info,
>> +					  "cannot recover write pointer: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			if (alloc_offsets[0] < alloc_offsets[base]) {
>> +				btrfs_err(fs_info,
>> +					  "write pointer mismatch: block group %llu",
>> +					  logical);
>> +				cache->wp_broken = 1;
>> +				continue;
>> +			}
>> +
>> +			cache->alloc_offset += alloc_offsets[base];
>> +		}
>> +		break;
>> +	case BTRFS_BLOCK_GROUP_RAID5:
>> +	case BTRFS_BLOCK_GROUP_RAID6:
>> +		/* RAID5/6 is not supported yet */
>> +	default:
>> +		btrfs_err(fs_info, "Unsupported profile on HMZONED %llu",
>> +			map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +out:
>> +	cache->alloc_type = alloc_type;
>> +	kfree(alloc_offsets);
>> +	free_extent_map(em);
>> +
>> +	return ret;
>> +}
>> +
>>   static struct btrfs_block_group_cache *
>>   btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>   			       u64 start, u64 size)
>>   {
>>   	struct btrfs_block_group_cache *cache;
>> +	int ret;
>>   
>>   	cache = kzalloc(sizeof(*cache), GFP_NOFS);
>>   	if (!cache)
>> @@ -10238,6 +10554,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>   	atomic_set(&cache->trimming, 0);
>>   	mutex_init(&cache->free_space_lock);
>>   	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
>> +	cache->alloc_type = BTRFS_ALLOC_FIT;
>> +	cache->alloc_offset = 0;
>> +
>> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>> +		ret = btrfs_get_block_group_alloc_offset(cache);
>> +		if (ret) {
>> +			kfree(cache);
>> +			return NULL;
>> +		}
>> +	}
>>   
>>   	return cache;
>>   }
>> @@ -10310,6 +10636,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   	int need_clear = 0;
>>   	u64 cache_gen;
>>   	u64 feature;
>> +	u64 unusable;
>>   	int mixed;
>>   
>>   	feature = btrfs_super_incompat_flags(info->super_copy);
>> @@ -10415,6 +10742,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   			free_excluded_extents(cache);
>>   		}
>>   
>> +		switch (cache->alloc_type) {
>> +		case BTRFS_ALLOC_FIT:
>> +			unusable = cache->bytes_super;
>> +			break;
>> +		case BTRFS_ALLOC_SEQ:
>> +			WARN_ON(cache->bytes_super != 0);
>> +			unusable = cache->alloc_offset -
>> +				btrfs_block_group_used(&cache->item);
>> +			/* we only need ->free_space in ALLOC_SEQ BGs */
>> +			cache->last_byte_to_unpin = (u64)-1;
>> +			cache->cached = BTRFS_CACHE_FINISHED;
>> +			cache->free_space_ctl->free_space =
>> +				cache->key.offset - cache->alloc_offset;
>> +			cache->unusable = unusable;
>> +			free_excluded_extents(cache);
>> +			break;
>> +		default:
>> +			BUG();
> 
> An unexpeced value of allocation is found, this needs a message and
> proper error handling, btrfs_read_block_groups is called from mount path
> so the recovery should be possible.

OK. I will handle this case.
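A minimal sketch of the direction (hypothetical standalone code, with the kernel's btrfs_err() call reduced to a comment): validate the allocation type and return an error the mount path can propagate, instead of BUG():

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel enum and errno value. */
enum alloc_type { ALLOC_FIT, ALLOC_SEQ };
#define EINVAL 22

static int check_alloc_type(int type)
{
	switch (type) {
	case ALLOC_FIT:
	case ALLOC_SEQ:
		return 0;
	default:
		/* in the kernel: btrfs_err(fs_info, "unexpected alloc type %d", type); */
		return -EINVAL;
	}
}
```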

> 
>> +		}
>> +
>>   		ret = btrfs_add_block_group_cache(info, cache);
>>   		if (ret) {
>>   			btrfs_remove_free_space_cache(cache);
>> @@ -10425,7 +10772,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   		trace_btrfs_add_block_group(info, cache, 0);
>>   		update_space_info(info, cache->flags, found_key.offset,
>>   				  btrfs_block_group_used(&cache->item),
>> -				  cache->bytes_super, &space_info);
>> +				  unusable, &space_info);
>>   
>>   		cache->space_info = space_info;
>>   
>> @@ -10438,6 +10785,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>>   			ASSERT(list_empty(&cache->bg_list));
>>   			btrfs_mark_bg_unused(cache);
>>   		}
>> +
>> +		if (cache->wp_broken)
>> +			inc_block_group_ro(cache, 1);
>>   	}
>>   
>>   	list_for_each_entry_rcu(space_info, &info->space_info, list) {
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index f74dc259307b..cc69dc71f4c1 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>>   			   u64 offset, u64 bytes)
>>   {
>>   	struct btrfs_free_space *info;
>> +	struct btrfs_block_group_cache *block_group = ctl->private;
>>   	int ret = 0;
>>   
>> +	WARN_ON(block_group && block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
>>   	if (!info)
>>   		return -ENOMEM;
>> @@ -2376,6 +2379,28 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>>   	return ret;
>>   }
>>   
>> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
>> +			       u64 bytenr, u64 size)
>> +{
>> +	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
>> +	u64 offset = bytenr - block_group->key.objectid;
>> +	u64 to_free, to_unusable;
>> +
>> +	spin_lock(&ctl->tree_lock);
>> +	if (offset >= block_group->alloc_offset)
>> +		to_free = size;
>> +	else if (offset + size <= block_group->alloc_offset)
>> +		to_free = 0;
>> +	else
>> +		to_free = offset + size - block_group->alloc_offset;
>> +	to_unusable = size - to_free;
>> +	ctl->free_space += to_free;
>> +	block_group->unusable += to_unusable;
>> +	spin_unlock(&ctl->tree_lock);
>> +	return 0;
>> +
>> +}
>> +
>>   int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>>   			    u64 offset, u64 bytes)
>>   {
>> @@ -2384,6 +2409,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
>>   	int ret;
>>   	bool re_search = false;
>>   
>> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	spin_lock(&ctl->tree_lock);
>>   
>>   again:
>> @@ -2619,6 +2646,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
>>   	u64 align_gap = 0;
>>   	u64 align_gap_len = 0;
>>   
>> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	spin_lock(&ctl->tree_lock);
>>   	entry = find_free_space(ctl, &offset, &bytes_search,
>>   				block_group->full_stripe_len, max_extent_size);
>> @@ -2738,6 +2767,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
>>   	struct rb_node *node;
>>   	u64 ret = 0;
>>   
>> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	spin_lock(&cluster->lock);
>>   	if (bytes > cluster->max_size)
>>   		goto out;
>> @@ -3384,6 +3415,8 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
>>   {
>>   	int ret;
>>   
>> +	WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
>> +
>>   	*trimmed = 0;
>>   
>>   	spin_lock(&block_group->lock);
>> diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
>> index 8760acb55ffd..d30667784f73 100644
>> --- a/fs/btrfs/free-space-cache.h
>> +++ b/fs/btrfs/free-space-cache.h
>> @@ -73,10 +73,15 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group);
>>   int __btrfs_add_free_space(struct btrfs_fs_info *fs_info,
>>   			   struct btrfs_free_space_ctl *ctl,
>>   			   u64 bytenr, u64 size);
>> +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group,
>> +			       u64 bytenr, u64 size);
>>   static inline int
>>   btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
>>   		     u64 bytenr, u64 size)
>>   {
>> +	if (block_group->alloc_type == BTRFS_ALLOC_SEQ)
>> +		return __btrfs_add_free_space_seq(block_group, bytenr, size);
>> +
>>   	return __btrfs_add_free_space(block_group->fs_info,
>>   				      block_group->free_space_ctl,
>>   				      bytenr, size);
>> -- 
>> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-13 14:12   ` Josef Bacik
@ 2019-06-18  8:51     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  8:51 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:13, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices as update would not be possible.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/disk-io.c     | 11 +++++++++++
>>   fs/btrfs/disk-io.h     |  1 +
>>   fs/btrfs/extent-tree.c |  4 ++++
>>   fs/btrfs/scrub.c       |  2 ++
>>   4 files changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7c1404c76768..ddbb02906042 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>   	return latest;
>>   }
>>   
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>> +{
>> +	/* any address is good on a regular (zone_size == 0) device */
>> +	/* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
> 
> This is not how you do multi-line comments in the kernel.  Thanks,
> 
> Josef
> 

Thanks. I'll fix the style.
# I thought the checkpatch was catching this ...

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-17 22:53   ` David Sterba
@ 2019-06-18  9:01     ` Naohiro Aota
  2019-06-27 15:35       ` David Sterba
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  9:01 UTC (permalink / raw)
  To: dsterba
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/18 7:53, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices, as updating them would not be possible.
> 
> This could be explained in more detail. My understanding is that the 1st
> and 2nd copy superblocks is skipped at write time but the zone
> containing the superblocks is not excluded from allocations. Ie. regular
> data can appear in place where the superblocks would exist on
> non-hmzoned filesystem. Is that correct?

Correct. You can see regular data stored at the usual SB locations on an HMZONED fs.

> The other option is to completely exclude the zone that contains the
> superblock copies.
> 
> primary sb			 64K
> 1st copy			 64M
> 2nd copy			256G
> 
> Depends on the drives, but I think the size of the random write zone
> will very often cover primary and 1st copy. So there's at least some
> backup copy.
> 
> The 2nd copy will be in the sequential-only zone, so the whole zone
> needs to be excluded in exclude_super_stripes. But it's not, so this
> means data can go there.  I think the zone should be left empty.
> 

I see. That's safer for older kernels/userland, right? By keeping that zone
empty, we can keep old ones from misinterpreting data as a SB.

Alright, I will change the code to do so.
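For illustration, the placement rule discussed here can be sketched as follows (hypothetical userspace code; the mirror offsets are the ones quoted above, and the conventional/sequential split is reduced to a "first sequential zone" index):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* btrfs superblock mirror offsets quoted in the review above. */
static const uint64_t sb_offsets[] = {
	64ULL << 10,	/* primary:  64K */
	64ULL << 20,	/* 1st copy: 64M */
	256ULL << 30,	/* 2nd copy: 256G */
};

/*
 * A superblock copy is only writable if its zone is conventional
 * (randomly writable). On a regular device (zone_size == 0) any
 * location is fine. first_seq_zone models the per-device zone map:
 * zones below it are conventional, zones at or above it sequential.
 */
static bool sb_location_ok(uint64_t pos, uint64_t zone_size,
			   uint64_t first_seq_zone)
{
	if (zone_size == 0)
		return true;
	return pos / zone_size < first_seq_zone;
}
```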

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs
  2019-06-13 14:24   ` Josef Bacik
@ 2019-06-18  9:09     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  9:09 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:24, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:20PM +0900, Naohiro Aota wrote:
>> Tree-manipulating operations like merging nodes often release
>> once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
>> node are not uselessly written out. On HMZONED drives, however, this
>> optimization blocks the following IOs, as cancelling the write out of
>> the freed blocks breaks the sequential write sequence expected by the
>> device.
>>
>> This patch introduces a list of clean extent buffers that have been
>> released in a transaction. Btrfs consult the list before writing out and
>> waiting for the IOs, and it redirties a buffer if 1) it's in sequential BG,
>> 2) it's in un-submit range, and 3) it's not under IO. Thus, such buffers
>> are marked for IO in btrfs_write_and_wait_transaction() to send proper bios
>> to the disk.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/disk-io.c     | 27 ++++++++++++++++++++++++---
>>   fs/btrfs/extent_io.c   |  1 +
>>   fs/btrfs/extent_io.h   |  2 ++
>>   fs/btrfs/transaction.c | 35 +++++++++++++++++++++++++++++++++++
>>   fs/btrfs/transaction.h |  3 +++
>>   5 files changed, 65 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 6651986da470..c6147fce648f 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -535,7 +535,9 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>>   	if (csum_tree_block(eb, result))
>>   		return -EINVAL;
>>   
>> -	if (btrfs_header_level(eb))
>> +	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags))
>> +		ret = 0;
>> +	else if (btrfs_header_level(eb))
>>   		ret = btrfs_check_node(eb);
>>   	else
>>   		ret = btrfs_check_leaf_full(eb);
>> @@ -1115,10 +1117,20 @@ struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
>>   void btrfs_clean_tree_block(struct extent_buffer *buf)
>>   {
>>   	struct btrfs_fs_info *fs_info = buf->fs_info;
>> -	if (btrfs_header_generation(buf) ==
>> -	    fs_info->running_transaction->transid) {
>> +	struct btrfs_transaction *cur_trans = fs_info->running_transaction;
>> +
>> +	if (btrfs_header_generation(buf) == cur_trans->transid) {
>>   		btrfs_assert_tree_locked(buf);
>>   
>> +		if (btrfs_fs_incompat(fs_info, HMZONED) &&
>> +		    list_empty(&buf->release_list)) {
>> +			atomic_inc(&buf->refs);
>> +			spin_lock(&cur_trans->releasing_ebs_lock);
>> +			list_add_tail(&buf->release_list,
>> +				      &cur_trans->releasing_ebs);
>> +			spin_unlock(&cur_trans->releasing_ebs_lock);
>> +		}
>> +
>>   		if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
>>   			percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
>>   						 -buf->len,
>> @@ -4533,6 +4545,15 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
>>   	btrfs_destroy_pinned_extent(fs_info,
>>   				    fs_info->pinned_extents);
>>   
>> +	while (!list_empty(&cur_trans->releasing_ebs)) {
>> +		struct extent_buffer *eb;
>> +
>> +		eb = list_first_entry(&cur_trans->releasing_ebs,
>> +				      struct extent_buffer, release_list);
>> +		list_del_init(&eb->release_list);
>> +		free_extent_buffer(eb);
>> +	}
>> +
>>   	cur_trans->state =TRANS_STATE_COMPLETED;
>>   	wake_up(&cur_trans->commit_wait);
>>   }
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 13fca7bfc1f2..c73c69e2bef4 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -4816,6 +4816,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
>>   	init_waitqueue_head(&eb->read_lock_wq);
>>   
>>   	btrfs_leak_debug_add(&eb->leak_list, &buffers);
>> +	INIT_LIST_HEAD(&eb->release_list);
>>   
>>   	spin_lock_init(&eb->refs_lock);
>>   	atomic_set(&eb->refs, 1);
>> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
>> index aa18a16a6ed7..2987a01f84f9 100644
>> --- a/fs/btrfs/extent_io.h
>> +++ b/fs/btrfs/extent_io.h
>> @@ -58,6 +58,7 @@ enum {
>>   	EXTENT_BUFFER_IN_TREE,
>>   	/* write IO error */
>>   	EXTENT_BUFFER_WRITE_ERR,
>> +	EXTENT_BUFFER_NO_CHECK,
>>   };
>>   
>>   /* these are flags for __process_pages_contig */
>> @@ -186,6 +187,7 @@ struct extent_buffer {
>>   	 */
>>   	wait_queue_head_t read_lock_wq;
>>   	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
>> +	struct list_head release_list;
>>   #ifdef CONFIG_BTRFS_DEBUG
>>   	atomic_t spinning_writers;
>>   	atomic_t spinning_readers;
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 3f6811cdf803..ded40ad75419 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -236,6 +236,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
>>   	spin_lock_init(&cur_trans->dirty_bgs_lock);
>>   	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
>>   	spin_lock_init(&cur_trans->dropped_roots_lock);
>> +	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
>> +	spin_lock_init(&cur_trans->releasing_ebs_lock);
>>   	list_add_tail(&cur_trans->list, &fs_info->trans_list);
>>   	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
>>   			IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
>> @@ -2219,7 +2221,31 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>   
>>   	wake_up(&fs_info->transaction_wait);
>>   
>> +	if (btrfs_fs_incompat(fs_info, HMZONED)) {
>> +		struct extent_buffer *eb;
>> +
>> +		list_for_each_entry(eb, &cur_trans->releasing_ebs,
>> +				    release_list) {
>> +			struct btrfs_block_group_cache *cache;
>> +
>> +			cache = btrfs_lookup_block_group(fs_info, eb->start);
>> +			if (!cache)
>> +				continue;
>> +			mutex_lock(&cache->submit_lock);
>> +			if (cache->alloc_type == BTRFS_ALLOC_SEQ &&
>> +			    cache->submit_offset <= eb->start &&
>> +			    !extent_buffer_under_io(eb)) {
>> +				set_extent_buffer_dirty(eb);
>> +				cache->space_info->bytes_readonly += eb->len;
> 
> Huh?
> 

I'm tracking once-allocated-then-freed regions in "space_info->bytes_readonly".
As I wrote in the other reply, I can add and use "space_info->bytes_zone_unavailable" instead.

>> +				set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
>> +			}
>> +			mutex_unlock(&cache->submit_lock);
>> +			btrfs_put_block_group(cache);
>> +		}
>> +	}
>> +
> 
> Helper here please.
>>   	ret = btrfs_write_and_wait_transaction(trans);
>> +
>>   	if (ret) {
>>   		btrfs_handle_fs_error(fs_info, ret,
>>   				      "Error while writing out transaction");
>> @@ -2227,6 +2253,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>   		goto scrub_continue;
>>   	}
>>   
>> +	while (!list_empty(&cur_trans->releasing_ebs)) {
>> +		struct extent_buffer *eb;
>> +
>> +		eb = list_first_entry(&cur_trans->releasing_ebs,
>> +				      struct extent_buffer, release_list);
>> +		list_del_init(&eb->release_list);
>> +		free_extent_buffer(eb);
>> +	}
>> +
> 
> Another helper, and also can't we release eb's above that we didn't need to
> re-mark dirty?  Thanks,
> 
> Josef
> 

hm, we can do so. I'll change the code in the next version.
Thanks,
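As a rough illustration of Josef's suggestion — dropping the buffers that were
not re-marked dirty right after the redirty pass, instead of holding the whole
list until after the transaction write-out — here is a minimal userspace
sketch. The names (sim_eb, release_clean_ebs) are illustrative only and do not
match the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Minimal stand-in for an extent buffer on the transaction's
 * releasing_ebs list.  Names are illustrative, not the kernel's. */
struct sim_eb {
	bool dirty;             /* was the buffer re-marked dirty? */
	struct sim_eb *next;
};

/* Walk the releasing list once: buffers that were NOT redirtied can
 * be dropped immediately; redirtied ones must survive until the
 * transaction write-out has run.  Returns the number dropped. */
static int release_clean_ebs(struct sim_eb **list)
{
	int released = 0;
	struct sim_eb **pp = list;

	while (*pp) {
		struct sim_eb *eb = *pp;

		if (!eb->dirty) {
			*pp = eb->next;  /* unlink and drop now */
			free(eb);
			released++;
		} else {
			pp = &eb->next;  /* keep for write-out */
		}
	}
	return released;
}

/* Push a buffer onto the head of the list (no error handling,
 * this is only a sketch). */
static struct sim_eb *push_eb(struct sim_eb *head, bool dirty)
{
	struct sim_eb *eb = malloc(sizeof(*eb));

	eb->dirty = dirty;
	eb->next = head;
	return eb;
}
```

The redirtied buffers would still be freed in a second pass after the
write-out, as in the patch.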

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 18/19] btrfs: support dev-replace in HMZONED mode
  2019-06-13 14:33   ` Josef Bacik
@ 2019-06-18  9:14     ` Naohiro Aota
  0 siblings, 0 replies; 79+ messages in thread
From: Naohiro Aota @ 2019-06-18  9:14 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On 2019/06/13 23:33, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:24PM +0900, Naohiro Aota wrote:
>> Currently, dev-replace copies all the device extents on the source
>> device to the target device, and it also clones new incoming write
>> I/Os from users to the source device into the target device.
>>
>> Cloning incoming I/Os can break the sequential write rule on the
>> target device: when a write is mapped into the middle of a block
>> group, that I/O is directed into the middle of a zone on the target
>> device, which breaks the sequential write rule.
>>
>> However, the cloning function cannot be simply disabled since incoming I/Os
>> targeting already copied device extents must be cloned so that the I/O is
>> executed on the target device.
>>
>> We cannot use dev_replace->cursor_{left,right} to determine whether a
>> bio targets a not-yet-copied region.  Since there is a time gap
>> between finishing btrfs_scrub_dev() and rewriting the mapping tree in
>> btrfs_dev_replace_finishing(), we can have a newly allocated device
>> extent which is never cloned (by handle_ops_on_dev_replace) nor
>> copied (by the dev-replace process).
>>
>> So the point is to copy only already existing device extents. This
>> patch introduces mark_block_group_to_copy() to mark existing block
>> groups as targets of copying. Then, handle_ops_on_dev_replace() and
>> dev-replace can check the flag to do their jobs.
>>
>> This patch also handles the empty regions between used extents. Since
>> dev-replace is smart enough to copy only the used extents on the
>> source device, we have to fill the gaps to honor the sequential write
>> rule on the target device.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   fs/btrfs/ctree.h       |   1 +
>>   fs/btrfs/dev-replace.c |  96 +++++++++++++++++++++++
>>   fs/btrfs/extent-tree.c |  32 +++++++-
>>   fs/btrfs/scrub.c       | 169 +++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.c     |  27 ++++++-
>>   5 files changed, 319 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index dad8ea5c3b99..a0be2b96117a 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -639,6 +639,7 @@ struct btrfs_block_group_cache {
>>   	unsigned int has_caching_ctl:1;
>>   	unsigned int removed:1;
>>   	unsigned int wp_broken:1;
>> +	unsigned int to_copy:1;
>>   
>>   	int disk_cache_state;
>>   
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index fbe5ea2a04ed..5011b5ce0e75 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -263,6 +263,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>   	device->dev_stats_valid = 1;
>>   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>>   	device->fs_devices = fs_info->fs_devices;
>> +	if (bdev_is_zoned(bdev)) {
>> +		ret = btrfs_get_dev_zonetypes(device);
>> +		if (ret) {
>> +			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>> +			goto error;
>> +		}
>> +	}
>>   	list_add(&device->dev_list, &fs_info->fs_devices->devices);
>>   	fs_info->fs_devices->num_devices++;
>>   	fs_info->fs_devices->open_devices++;
>> @@ -396,6 +403,88 @@ static char* btrfs_dev_name(struct btrfs_device *device)
>>   		return rcu_str_deref(device->name);
>>   }
>>   
>> +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
>> +				    struct btrfs_device *src_dev)
>> +{
>> +	struct btrfs_path *path;
>> +	struct btrfs_key key;
>> +	struct btrfs_key found_key;
>> +	struct btrfs_root *root = fs_info->dev_root;
>> +	struct btrfs_dev_extent *dev_extent = NULL;
>> +	struct btrfs_block_group_cache *cache;
>> +	struct extent_buffer *l;
>> +	int slot;
>> +	int ret;
>> +	u64 chunk_offset, length;
>> +
>> +	path = btrfs_alloc_path();
>> +	if (!path)
>> +		return -ENOMEM;
>> +
>> +	path->reada = READA_FORWARD;
>> +	path->search_commit_root = 1;
>> +	path->skip_locking = 1;
>> +
>> +	key.objectid = src_dev->devid;
>> +	key.offset = 0ull;
>> +	key.type = BTRFS_DEV_EXTENT_KEY;
>> +
>> +	while (1) {
>> +		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
>> +		if (ret < 0)
>> +			break;
>> +		if (ret > 0) {
>> +			if (path->slots[0] >=
>> +			    btrfs_header_nritems(path->nodes[0])) {
>> +				ret = btrfs_next_leaf(root, path);
>> +				if (ret < 0)
>> +					break;
>> +				if (ret > 0) {
>> +					ret = 0;
>> +					break;
>> +				}
>> +			} else {
>> +				ret = 0;
>> +			}
>> +		}
>> +
>> +		l = path->nodes[0];
>> +		slot = path->slots[0];
>> +
>> +		btrfs_item_key_to_cpu(l, &found_key, slot);
>> +
>> +		if (found_key.objectid != src_dev->devid)
>> +			break;
>> +
>> +		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
>> +			break;
>> +
>> +		if (found_key.offset < key.offset)
>> +			break;
>> +
>> +		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
>> +		length = btrfs_dev_extent_length(l, dev_extent);
>> +
>> +		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
>> +
>> +		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
>> +		if (!cache)
>> +			goto skip;
>> +
>> +		cache->to_copy = 1;
>> +
>> +		btrfs_put_block_group(cache);
>> +
>> +skip:
>> +		key.offset = found_key.offset + length;
>> +		btrfs_release_path(path);
>> +	}
>> +
>> +	btrfs_free_path(path);
>> +
>> +	return ret;
>> +}
>> +
>>   static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>>   		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
>>   		int read_src)
>> @@ -439,6 +528,13 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>>   	}
>>   
>>   	need_unlock = true;
>> +
>> +	mutex_lock(&fs_info->chunk_mutex);
>> +	ret = mark_block_group_to_copy(fs_info, src_device);
>> +	mutex_unlock(&fs_info->chunk_mutex);
>> +	if (ret)
>> +		return ret;
>> +
>>   	down_write(&dev_replace->rwsem);
>>   	switch (dev_replace->replace_state) {
>>   	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index ff4d55d6ef04..268365dd9a5d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -29,6 +29,7 @@
>>   #include "qgroup.h"
>>   #include "ref-verify.h"
>>   #include "rcu-string.h"
>> +#include "dev-replace.h"
>>   
>>   #undef SCRAMBLE_DELAYED_REFS
>>   
>> @@ -2022,7 +2023,31 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
>>   			if (btrfs_dev_is_sequential(stripe->dev,
>>   						    stripe->physical) &&
>>   			    stripe->length == stripe->dev->zone_size) {
>> -				ret = blkdev_reset_zones(stripe->dev->bdev,
>> +				struct btrfs_device *dev = stripe->dev;
>> +
>> +				ret = blkdev_reset_zones(dev->bdev,
>> +							 stripe->physical >>
>> +								 SECTOR_SHIFT,
>> +							 stripe->length >>
>> +								 SECTOR_SHIFT,
>> +							 GFP_NOFS);
>> +				if (!ret)
>> +					discarded_bytes += stripe->length;
>> +				else
>> +					break;
>> +				set_bit(stripe->physical >>
>> +					dev->zone_size_shift,
>> +					dev->empty_zones);
>> +
>> +				if (!btrfs_dev_replace_is_ongoing(
>> +					    &fs_info->dev_replace) ||
>> +				    stripe->dev != fs_info->dev_replace.srcdev)
>> +					continue;
>> +
>> +				/* send to target as well */
>> +				dev = fs_info->dev_replace.tgtdev;
>> +
>> +				ret = blkdev_reset_zones(dev->bdev,
> 
> This is unrelated to dev replace isn't it?  Please make this it's own patch, and
> it's own helper while you are at it.  Thanks,
> 
> Josef
> 

Actually, patch 0015 introduced the zone reset here, and this patch extends
that code to also reset the corresponding zone on the target device while a
dev-replace is ongoing. The diff is messed up here.

I'll add the reset helper in the next version.
Thanks,
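For illustration only, here is a userspace sketch of the kind of reset helper
discussed above: reset a zone on the source device and, while a dev-replace is
running, mirror the reset on the target so both write pointers stay in sync.
The toy device model and all names are assumptions, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_ZONES 8

/* Toy model of a zoned device: a per-zone write pointer plus an
 * empty-zone flag, loosely mirroring dev->empty_zones in the patch.
 * Everything here is illustrative. */
struct sim_zdev {
	unsigned int wp[SIM_ZONES];     /* write pointer, in blocks */
	unsigned char empty[SIM_ZONES]; /* 1 = zone known empty */
};

/* The helper itself: reset one zone and record it as empty. */
static void sim_reset_zone(struct sim_zdev *dev, int zone)
{
	dev->wp[zone] = 0;
	dev->empty[zone] = 1;
}

/* Discard path: always reset the zone on the source device and,
 * while a dev-replace is running, mirror the reset on the target. */
static void sim_discard_zone(struct sim_zdev *src, struct sim_zdev *tgt,
			     bool replace_running, int zone)
{
	sim_reset_zone(src, zone);
	if (replace_running && tgt)
		sim_reset_zone(tgt, zone);
}
```

In the real code the reset is blkdev_reset_zones() plus the empty_zones bitmap
update shown in the diff; the sketch only captures the control flow.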


* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-17  3:16     ` Damien Le Moal
  2019-06-18  0:00       ` David Sterba
@ 2019-06-18 13:33       ` Josef Bacik
  2019-06-19 10:32         ` Damien Le Moal
  1 sibling, 1 reply; 79+ messages in thread
From: Josef Bacik @ 2019-06-18 13:33 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Josef Bacik, Naohiro Aota, linux-btrfs, David Sterba,
	Chris Mason, Qu Wenruo, Nikolay Borisov, linux-kernel,
	Hannes Reinecke, linux-fsdevel, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
> Josef,
> 
> On 2019/06/13 23:15, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:17PM +0900, Naohiro Aota wrote:
> >> Sequential allocation is not enough to maintain sequential delivery of
> >> write IOs to the device. Various features (async compress, async checksum,
> >> ...) of btrfs affect ordering of the IOs. This patch introduces submit
> >> buffer to sort WRITE bios belonging to a block group and sort them out
> >> sequentially in increasing block address to achieve sequential write
> >> sequences with __btrfs_map_bio().
> >>
> >> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > 
> > I hate everything about this.  Can't we just use the plugging infrastructure for
> > this and then make sure it re-orders the bios before submitting them?  Also
> > what's to prevent the block layer scheduler from re-arranging these io's?
> > Thanks,
> 
> The block I/O scheduler reorders requests in LBA order, but that happens for a
> newly inserted request against pending requests. If there are no pending
> requests because all requests were already issued, no ordering happen, and even
> worse, if the drive queue is not full yet (e.g. there are free tags), then the
> newly inserted request will be dispatched almost immediately, preventing
> reordering with subsequent incoming write requests to happen.
> 

This sounds like we're depending on specific behavior from the ioscheduler,
which means we're going to have a sad day at some point in the future.

> The other problem is that the mq-deadline scheduler does not track zone WP
> position. Write request issuing is done regardless of the current WP value,
> solely based on LBA ordering. This means that mq-deadline will not prevent
> out-of-order, or rather, unaligned write requests. These will not be detected
> and dispatched whenever possible. The reasons for this are that:
> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
> that management at the block IO scheduler level is inefficient.

I'm not saying it has to manage the WP pointer, and in fact I'm not saying the
scheduler has to do anything at all.  We just need a more generic way to make
sure that bio's submitted in order are kept in order.  So perhaps a hmzoned
scheduler that does just that, and is pinned for these devices.

> 2) Adding zone WP management at the block IO scheduler level would also need a
> write error processing path to resync the WP value in case of failed writes. But
> the user/FS also needs that anyway. Again duplicated functionalities.

Again, no not really.  My point is I want as little block layer knowledge in
btrfs as possible.  I accept we should probably keep track of the WP, it just
makes it easier on everybody if we allocate sequentially.  I'll even allow that
we need to handle the write errors and adjust our WP stuff internally when
things go wrong.

What I'm having a hard time swallowing is having a io scheduler in btrfs proper.
We just ripped out the old one we had because it broke cgroups.  It just adds
extra complexity to an already complex mess.

> 3) The block layer will need a timeout to force issue or cancel pending
> unaligned write requests. This is necessary in case the drive user stops issuing
> writes (for whatever reasons) or the scheduler is being switched. This would
> unnecessarily cause write I/O errors or cause deadlocks if the request queue
> quiesce mode is entered at the wrong time (and I do not see a good way to deal
> with that).

Again we could just pin the hmzoned scheduler to those devices so you can't
switch them.  Or make a hmzoned blk plug and pin no scheduler to these devices.

> 
> blk-mq is already complicated enough. Adding this to the block IO scheduler will
> unnecessarily complicate things further for no real benefits. I would like to
> point out the dm-zoned device mapper and f2fs which are both already dealing
> with write ordering and write error processing directly. Both are fairly
> straightforward but completely different and each optimized for their own structure.
> 

So we're duplicating this effort in 2 places already and adding a 3rd place
seems like a solid plan?  Device-mapper it makes sense, we're sitting squarely
in the block layer so moving around bio's/requests is its very reason for
existing.  I'm not sold on the file system needing to take up this behavior.
This needs to be handled in a more generic way so that all file systems can
share the same mechanism.

I'd even go so far as to say that you could just require using a dm device with
these hmzoned block devices and then handle all of that logic in there if you
didn't feel like doing it generically.  We're already talking about esoteric
devices that require special care to use, adding the extra requirement of
needing to go through device-mapper to use it wouldn't be that big of a stretch.
Thanks,

Josef


* Re: [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG
  2019-06-18  7:42     ` Naohiro Aota
@ 2019-06-18 13:35       ` Josef Bacik
  0 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-18 13:35 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Josef Bacik, linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Tue, Jun 18, 2019 at 07:42:46AM +0000, Naohiro Aota wrote:
> On 2019/06/13 23:09, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:14PM +0900, Naohiro Aota wrote:
> >> If the btrfs volume has mirrored block groups, it unconditionally makes
> >> un-mirrored block groups read only. When we have mirrored block groups
> >> but none of them is writable, this drops the last writable block
> >> groups. So, check that we have at least one writable mirrored block
> >> group before setting un-mirrored block groups read only.
> >>
> > 
> > I don't understand why you want this.  Thanks,
> > 
> > Josef
> > 
> 
> This is necessary to handle e.g. the btrfs/124 case.
> 
> When we mount a degraded RAID1 FS and write to it, and then
> re-mount with the full set of devices, the write pointers of the
> corresponding zones of a written BG differ.  Patch 07 marks such
> block groups as "wp_broken" and makes them read only.  In this
> situation, we only have read-only RAID1 BGs because of "wp_broken",
> and the un-mirrored BGs are also marked read only because we have
> RAID1 BGs.  As a result, all the BGs are now read only, so we
> cannot even start a rebalance to fix the situation.

Ah ok, please add this explanation to the changelog.  Thanks,

Josef


* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-18  8:28     ` Naohiro Aota
@ 2019-06-18 13:37       ` Josef Bacik
  0 siblings, 0 replies; 79+ messages in thread
From: Josef Bacik @ 2019-06-18 13:37 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Josef Bacik, linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche

On Tue, Jun 18, 2019 at 08:28:07AM +0000, Naohiro Aota wrote:
> On 2019/06/13 23:07, Josef Bacik wrote:
> > On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
> >> @@ -9616,7 +9701,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force)
> >>   	}
> >>   
> >>   	num_bytes = cache->key.offset - cache->reserved - cache->pinned -
> >> -		    cache->bytes_super - btrfs_block_group_used(&cache->item);
> >> +		    cache->bytes_super - cache->unusable -
> >> +		    btrfs_block_group_used(&cache->item);
> >>   	sinfo_used = btrfs_space_info_used(sinfo, true);
> >>   
> >>   	if (sinfo_used + num_bytes + min_allocable_bytes <=
> >> @@ -9766,6 +9852,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache)
> >>   	if (!--cache->ro) {
> >>   		num_bytes = cache->key.offset - cache->reserved -
> >>   			    cache->pinned - cache->bytes_super -
> >> +			    cache->unusable -
> >>   			    btrfs_block_group_used(&cache->item);
> > 
> > You've done this in a few places, but not all the places, most notably
> > btrfs_space_info_used() which is used in the space reservation code a lot.
> 
> I added "unusable" to struct btrfs_block_group_cache, but added
> nothing to struct btrfs_space_info. Once an extent is allocated and
> freed in an ALLOC_SEQ block group, such an extent is never reused
> until we remove the BG. I'm accounting the size of such regions
> in "cache->unusable" and in "space_info->bytes_readonly". So,
> btrfs_space_info_used() does not need the modification.
> 
> I admit it's confusing here. I can add "bytes_zone_unusable" to
> struct btrfs_space_info, if it's better.
> 

Ah you're right, sorry I just read it as space_info.  Yes please add
bytes_zone_unusable, I'd like to be as verbose as possible about where our space
actually is.  I know if I go to debug something and see a huge amount in
read_only I'll be confused.  Thanks,

Josef
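To make the accounting being discussed concrete, here is a minimal userspace
sketch of a sequential (ALLOC_SEQ-style) block group in which space freed
below the allocation pointer becomes zone-unusable instead of free again.
Field and function names are illustrative, not the eventual kernel ones:

```c
#include <assert.h>

/* Toy sequential block group: blocks are handed out strictly from
 * alloc_ptr upward; space freed below the pointer cannot be reused
 * until the zone is reset, so it is accounted as zone_unusable
 * rather than returned to the free-space counter. */
struct sim_seq_bg {
	unsigned long long size;
	unsigned long long alloc_ptr;     /* next free byte */
	unsigned long long zone_unusable; /* freed-but-unreusable bytes */
};

/* Allocate len bytes at the pointer; returns the offset, or -1 if
 * there is no sequential room left. */
static long long sim_alloc(struct sim_seq_bg *bg, unsigned long long len)
{
	if (bg->alloc_ptr + len > bg->size)
		return -1;
	bg->alloc_ptr += len;
	return (long long)(bg->alloc_ptr - len);
}

/* Freeing below the pointer only grows the dead region. */
static void sim_free(struct sim_seq_bg *bg, unsigned long long offset,
		     unsigned long long len)
{
	if (offset < bg->alloc_ptr)
		bg->zone_unusable += len;
}

/* Usable free space excludes the region below the pointer. */
static unsigned long long sim_free_space(const struct sim_seq_bg *bg)
{
	return bg->size - bg->alloc_ptr;
}
```

Reporting zone_unusable separately (rather than folding it into
bytes_readonly) is exactly what makes the space accounting debuggable.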


* Re: [PATCH 11/19] btrfs: introduce submit buffer
  2019-06-18 13:33       ` Josef Bacik
@ 2019-06-19 10:32         ` Damien Le Moal
  0 siblings, 0 replies; 79+ messages in thread
From: Damien Le Moal @ 2019-06-19 10:32 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Naohiro Aota, linux-btrfs, David Sterba, Chris Mason, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

On 2019/06/18 22:34, Josef Bacik wrote:
> On Mon, Jun 17, 2019 at 03:16:05AM +0000, Damien Le Moal wrote:
>> The block I/O scheduler reorders requests in LBA order, but that happens for a
>> newly inserted request against pending requests. If there are no pending
>> requests because all requests were already issued, no ordering happen, and even
>> worse, if the drive queue is not full yet (e.g. there are free tags), then the
>> newly inserted request will be dispatched almost immediately, preventing
>> reordering with subsequent incoming write requests to happen.
>>
> 
> This sounds like we're depending on specific behavior from the ioscheduler,
> which means we're going to have a sad day at some point in the future.

In a sense yes, we are. But my team and I always make sure that such a sad day
does not come. We are always making sure that HM-zoned drives can be used and
work as expected (all RCs and stable versions are tested weekly). For now,
getting guarantees on write request ordering mandates the use of the mq-deadline
scheduler, as it is currently the only one providing these guarantees. I just
sent a patch
to ensure that this scheduler is always available with CONFIG_BLK_DEV_ZONED
enabled (see commit b9aef63aca77 "block: force select mq-deadline for zoned
block devices") and automatically configuring it for HM zoned devices is simply
a matter of adding an udev rule to the system (mq-deadline is the default
scheduler for spinning rust anyway).

>> The other problem is that the mq-deadline scheduler does not track zone WP
>> position. Write request issuing is done regardless of the current WP value,
>> solely based on LBA ordering. This means that mq-deadline will not prevent
>> out-of-order, or rather, unaligned write requests. These will not be detected
>> and dispatched whenever possible. The reasons for this are that:
>> 1) the disk user (the FS) has to manage zone WP positions anyway. So duplicating
>> that management at the block IO scheduler level is inefficient.
> 
> I'm not saying it has to manage the WP pointer, and in fact I'm not saying the
> scheduler has to do anything at all.  We just need a more generic way to make
> sure that bio's submitted in order are kept in order.  So perhaps a hmzoned
> scheduler that does just that, and is pinned for these devices.

This is exactly what mq-deadline does for HM devices: it guarantees that the
write bio submission order is preserved when dispatching requests to the disk.
The only missing part is "pinned for these devices". This is not possible now.
A user can still change the scheduler to, say, BFQ. But in that case, unaligned
write errors will show up very quickly, so this is easy to debug. Not ideal, I
agree, but that can be fixed independently of BtrFS support for hmzoned disks.

>> 2) Adding zone WP management at the block IO scheduler level would also need a
>> write error processing path to resync the WP value in case of failed writes. But
>> the user/FS also needs that anyway. Again duplicated functionalities.
> 
> Again, no not really.  My point is I want as little block layer knowledge in
> btrfs as possible.  I accept we should probably keep track of the WP, it just
> makes it easier on everybody if we allocate sequentially.  I'll even allow that
> we need to handle the write errors and adjust our WP stuff internally when
> things go wrong.
> 
> What I'm having a hard time swallowing is having a io scheduler in btrfs proper.
> We just ripped out the old one we had because it broke cgroups.  It just adds
> extra complexity to an already complex mess.

I understand your point. It makes perfect sense. The "IO scheduler" added for
the hmzoned case is only the method proposed to implement the sequential write
issuing guarantees. The sequential allocation was relatively easy to achieve,
but what is really needed is an atomic "sequential block allocation + write BIO
issuing for these blocks" so that the block IO scheduler sees sequential write
streams per zone. If only the sequential allocation is achieved, write bios
serving these blocks may be reordered at the FS level and result in write
failures, since the block layer scheduler only guarantees preserving the
issuing order and gives no reordering guarantees for unaligned writes.
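The submit-buffer idea can be sketched in a few lines of userspace C:
out-of-order write "bios" are parked in a buffer and released strictly at the
current write pointer, so the device only ever sees a sequential stream per
zone. This is a toy model of the mechanism, not the patch's actual code, and
all names are made up:

```c
#include <assert.h>

#define MAXQ 16

/* A write "bio": start address and length, in blocks. */
struct sim_bio { unsigned int start, len; };

/* Toy per-block-group submit buffer. */
struct sim_submit_buf {
	struct sim_bio q[MAXQ];
	int nr;
	unsigned int wp; /* next address the device will accept */
};

/* Dispatch every queued bio that has become contiguous with wp.
 * Returns how many bios were issued by this call. */
static int sim_flush(struct sim_submit_buf *sb)
{
	int issued = 0, progress = 1;

	while (progress) {
		progress = 0;
		for (int i = 0; i < sb->nr; i++) {
			if (sb->q[i].start == sb->wp) {
				sb->wp += sb->q[i].len;     /* "issue" it */
				sb->q[i] = sb->q[--sb->nr]; /* dequeue */
				issued++;
				progress = 1;
				break;
			}
		}
	}
	return issued;
}

/* Queue one write; anything now in order goes out immediately. */
static int sim_submit(struct sim_submit_buf *sb, unsigned int start,
		      unsigned int len)
{
	sb->q[sb->nr].start = start;
	sb->q[sb->nr].len = len;
	sb->nr++;
	return sim_flush(sb);
}
```

The real implementation additionally has to persist held bios safely and
handle write errors, which is where the complexity Josef objects to comes in.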

>> 3) The block layer will need a timeout to force issue or cancel pending
>> unaligned write requests. This is necessary in case the drive user stops issuing
>> writes (for whatever reasons) or the scheduler is being switched. This would
>> unnecessarily cause write I/O errors or cause deadlocks if the request queue
>> quiesce mode is entered at the wrong time (and I do not see a good way to deal
>> with that).
> 
> Again we could just pin the hmzoned scheduler to those devices so you can't
> switch them.  Or make a hmzoned blk plug and pin no scheduler to these devices.

That is not enough. Pinning the schedulers or using the plugs cannot guarantee
that write requests issued out of order will always be correctly reordered.
Even worse, we cannot implement this, for multiple reasons, as I stated before.

One example that may illustrates this more easily is this: imagine a user doing
buffered I/Os to an hm disk (e.g. dd if=/dev/zero of=/dev/sdX). The first part
of this execution, that is, allocate a free page, copy the user data and add the
page to the page cache as dirty, is in fact equivalent to an FS sequential block
allocation (the dirty pages are allocated in offset order and added to the page
cache in that same order).

Most of the time, this will work just fine because the page cache dirty page
writeback code is mostly sequential. Dirty pages for an inode are found in
offset order, packed into write bios and issued sequentially. But start putting
memory pressure on the system, or executing "sync" or other applications in
parallel, and you will start seeing unaligned write errors because the page
cache atomicity is per page so different contexts may end up grabbing dirty
pages in order (as expected) but issuing interleaved write bios out of order.
And this type of problem *cannot* be handled in the block layer (plug or
scheduler) because stopping execution of a bio expecting that another bio will
come is very dangerous as there are no guarantees that such bio will ever be
issued. In the case of the page cache flush, this is actually a real eventuality
as memory allocation needed for issuing a bio may depend on the completion of
already issued bios, and if we cannot dispatch those, then we can deadlock.

This is an extreme example. This is unlikely but still a real possibility.
Similarly to your position, that is, the FS should not know anything about the
block layer, the block layer position is that it cannot rely on a specific
behavior from the upper layers. Essentially, all bios are independent and
treated as such.

For HM devices, we needed sequential write guarantees, but could not break the
independence of write requests. So what we did is simply guarantee that the
dispatch order is preserved from the issuing order, nothing else. There is no
"buffering" possible and no checks regarding the sequentiality of writes.

As a result, the sequential write constraint of the disks is directly exposed to
the disk user (FS or DM).

>> blk-mq is already complicated enough. Adding this to the block IO scheduler will
>> unnecessarily complicate things further for no real benefits. I would like to
>> point out the dm-zoned device mapper and f2fs which are both already dealing
>> with write ordering and write error processing directly. Both are fairly
>> straightforward but completely different and each optimized for their own structure.
>>
> 
> So we're duplicating this effort in 2 places already and adding a 3rd place
> seems like a solid plan?  Device-mapper it makes sense, we're sitting squarely
> in the block layer so moving around bio's/requests is its very reason for
> existing.  I'm not sold on the file system needing to take up this behavior.
> This needs to be handled in a more generic way so that all file systems can
> share the same mechanism.

I understand your point. But I am afraid it is not easily possible. The reason
is that for an FS to achieve sequential write streams in zones, one needs an
atomic (or serialized) execution of "block allocation + write bio issuing".
Both combined achieve a sequential write stream that mq-deadline will preserve,
and everything will work as intended. This is obviously not easily possible in a
generic manner for all FSes. In f2fs, this was rather easy to do without
changing a lot of code by simply using a mutex to have the 2 operations
atomically executed without any noticeable performance impact. A similar method
in BtrFS is not possible because of async checksum and async compression which
can result in btrfs_map_bio() execution in an order that is different from the
extent allocation order.

> 
> I'd even go so far as to say that you could just require using a dm device with
> these hmzoned block devices and then handle all of that logic in there if you
> didn't feel like doing it generically.  We're already talking about esoteric
> devices that require special care to use, adding the extra requirement of
> needing to go through device-mapper to use it wouldn't be that big of a stretch.

HM drives are not so "esoteric" anymore. Entire data centers are starting to
run on them. And getting BtrFS to work natively on HM drives would be a huge
step toward facilitating their use, and would remove this "esoteric" label :)

Back to your point, using a dm to do the reordering is possible, but it
requires temporary persistent backup of the out-of-order BIOs for the reasons
pointed out above (dependency of memory allocation failure/success on bio
completion). This is basically what dm-zoned does, using conventional zones to
store out-of-order writes. Such a generic DM is enough to run any file system
(ext4 or XFS run perfectly fine on dm-zoned), but it comes at the cost of
needing garbage collection, with a huge impact on performance. The simple
addition of Naohiro's write bio ordering feature in BtrFS avoids all this and
preserves performance. I really understand your desire to reduce complexity.
But in the end, this is only a "sorted list" that is well controlled within
btrfs itself and avoids dependency on the behavior of other components besides
the block IO scheduler.

We could envision making such a feature generic, implementing it as a block
layer object. But its use would still be needed in btrfs. Since f2fs and
dm-zoned do not require it, btrfs would be the sole user, so for now at least,
I think such a generic implementation has little value. We can work on
isolating this bio reordering code more, so that it is easier to remove and
replace with a future generic implementation. Would that help in addressing
your concerns?

Thank you for your comments.

Best regards.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 02/19] btrfs: Get zone information of zoned block devices
  2019-06-18  6:42     ` Naohiro Aota
@ 2019-06-27 15:11       ` David Sterba
  0 siblings, 0 replies; 79+ messages in thread
From: David Sterba @ 2019-06-27 15:11 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: dsterba, linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Qu Wenruo, Nikolay Borisov, linux-kernel, Hannes Reinecke,
	linux-fsdevel, Damien Le Moal, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

On Tue, Jun 18, 2019 at 06:42:09AM +0000, Naohiro Aota wrote:
> >> +	device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> >> +				    sizeof(*device->seq_zones), GFP_KERNEL);
> > 
> > What's the expected range for the allocation size? There's one bit per
> > zone, so one 4KiB page can hold up to 32768 zones, with 1GiB it's 32TiB
> > of space on the drive. Ok that seems safe for now.
> 
> Typically, zone size is 256MB (as default value in tcmu-runner). On such device,
> we need one 4KB page per 8TB disk space. Still it's quite safe.

Ok, and for drives up to 16TB the allocation is 8KiB, which the allocator
is usually able to find.

> >> +	u8  zone_size_shift;
> > 
> > So the zone_size is always power of two? I may be missing something, but
> > I wonder if the calculations based on shifts are safe.
> 
> The kernel ZBD support has a restriction that
> "The zone size must also be equal to a power of 2 number of logical blocks."
> http://zonedstorage.io/introduction/linux-support/#zbd-support-restrictions
> 
> So, the zone_size is guaranteed to be a power of two.

Ok. I don't remember if there are assertions, but I would like to see them
in the filesystem code independently anyway, as mount-time sanity
checks.


* Re: [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode
  2019-06-18  8:49     ` Naohiro Aota
@ 2019-06-27 15:28       ` David Sterba
  0 siblings, 0 replies; 79+ messages in thread
From: David Sterba @ 2019-06-27 15:28 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: dsterba, linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Qu Wenruo, Nikolay Borisov, linux-kernel, Hannes Reinecke,
	linux-fsdevel, Damien Le Moal, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

On Tue, Jun 18, 2019 at 08:49:00AM +0000, Naohiro Aota wrote:
> On 2019/06/18 7:29, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:13PM +0900, Naohiro Aota wrote:
> >> +	u64 unusable;
> > 
> > 'unusable' is specific to the zones, so 'zone_unusable' would make it
> > clear. The terminilogy around space is confusing already (we have
> > unused, free, reserved, allocated, slack).
> 
> Sure. I will change the name.
> 
> Or, is it better to add a new struct "btrfs_seq_alloc_info" and move all
> these variables there? Then, I can just add one pointer to the struct here.

There are 4 new members, but the block group structure is large already
(528 bytes) so adding a few more will not make the allocations worse.
There are also holes or inefficient types used so the size can be
squeezed a bit, but this is unrelated to this patchset.


* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-18  9:01     ` Naohiro Aota
@ 2019-06-27 15:35       ` David Sterba
  0 siblings, 0 replies; 79+ messages in thread
From: David Sterba @ 2019-06-27 15:35 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: dsterba, linux-btrfs, David Sterba, Chris Mason, Josef Bacik,
	Qu Wenruo, Nikolay Borisov, linux-kernel, Hannes Reinecke,
	linux-fsdevel, Damien Le Moal, Matias Bjørling,
	Johannes Thumshirn, Bart Van Assche

On Tue, Jun 18, 2019 at 09:01:35AM +0000, Naohiro Aota wrote:
> On 2019/06/18 7:53, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> >> When in HMZONED mode, make sure that device super blocks are located in
> >> randomly writable zones of zoned block devices. That is, do not write super
> >> blocks in sequential write required zones of host-managed zoned block
> >> devices as update would not be possible.
> > 
> > This could be explained in more detail. My understanding is that the 1st
> > and 2nd copy superblocks is skipped at write time but the zone
> > containing the superblocks is not excluded from allocations. Ie. regular
> > data can appear in place where the superblocks would exist on
> > non-hmzoned filesystem. Is that correct?
> 
> Correct. You can see regular data stored at the usual SB locations on an HMZONED fs.
> 
> > The other option is to completely exclude the zone that contains the
> > superblock copies.
> > 
> > primary sb			 64K
> > 1st copy			 64M
> > 2nd copy			256G
> > 
> > Depends on the drives, but I think the size of the random write zone
> > will very often cover primary and 1st copy. So there's at least some
> > backup copy.
> > 
> > The 2nd copy will be in the sequential-only zone, so the whole zone
> > needs to be excluded in exclude_super_stripes. But it's not, so this
> > means data can go there.  I think the zone should be left empty.
> > 
> 
> I see. That's safer for older kernels/userland, right? By keeping that zone empty,
> we can prevent old ones from mis-interpreting data as an SB.

That's not only for older kernels: the superblock locations are known,
and the contents should not depend on the type of device on which the
filesystem was created. This can be considered part of the on-disk format.


* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
  2019-06-13 14:12   ` Josef Bacik
  2019-06-17 22:53   ` David Sterba
@ 2019-06-28  3:55   ` Anand Jain
  2019-06-28  6:39     ` Naohiro Aota
  2 siblings, 1 reply; 79+ messages in thread
From: Anand Jain @ 2019-06-28  3:55 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

On 7/6/19 9:10 PM, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.

  By design, all copies of the SB must be updated at each transaction;
  as they are redundant copies, they must match at the end of
  each transaction.

  Instead of skipping the SB updates, why not alter the number of
  copies at the time of mkfs.btrfs?

Thanks, Anand


> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   fs/btrfs/disk-io.c     | 11 +++++++++++
>   fs/btrfs/disk-io.h     |  1 +
>   fs/btrfs/extent-tree.c |  4 ++++
>   fs/btrfs/scrub.c       |  2 ++
>   4 files changed, 18 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7c1404c76768..ddbb02906042 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>   	return latest;
>   }
>   
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
> +{
> +	/* any address is good on a regular (zone_size == 0) device */
> +	/* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
> +	return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
> +}
> +
>   /*
>    * Write superblock @sb to the @device. Do not wait for completion, all the
>    * buffer heads we write are pinned.
> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>   		    device->commit_total_bytes)
>   			break;
> +		if (!btrfs_check_super_location(device, bytenr))
> +			continue;
>   
>   		btrfs_set_super_bytenr(sb, bytenr);
>   
> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>   		    device->commit_total_bytes)
>   			break;
> +		if (!btrfs_check_super_location(device, bytenr))
> +			continue;
>   
>   		bh = __find_get_block(device->bdev,
>   				      bytenr / BTRFS_BDEV_BLOCKSIZE,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index a0161aa1ea0b..70e97cd6fa76 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
>   		struct page *page, size_t pg_offset, u64 start, u64 len,
>   		int create);
>   int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
>   int __init btrfs_end_io_wq_init(void);
>   void __cold btrfs_end_io_wq_exit(void);
>   
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d41d840fe5c..ae2c895d08c4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
>   			return ret;
>   	}
>   
> +	/* we won't have super stripes in sequential zones */
> +	if (cache->alloc_type == BTRFS_ALLOC_SEQ)
> +		return 0;
> +
>   	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>   		bytenr = btrfs_sb_offset(i);
>   		ret = btrfs_rmap_block(fs_info, cache->key.objectid,
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f7b29f9db5e2..36ad4fad7eaf 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >
>   		    scrub_dev->commit_total_bytes)
>   			break;
> +		if (!btrfs_check_super_location(scrub_dev, bytenr))
> +			continue;
>   
>   		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
>   				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
> 



* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-28  3:55   ` Anand Jain
@ 2019-06-28  6:39     ` Naohiro Aota
  2019-06-28  6:52       ` Anand Jain
  0 siblings, 1 reply; 79+ messages in thread
From: Naohiro Aota @ 2019-06-28  6:39 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs, David Sterba
  Cc: Chris Mason, Josef Bacik, Qu Wenruo, Nikolay Borisov,
	linux-kernel, Hannes Reinecke, linux-fsdevel, Damien Le Moal,
	Matias Bjørling, Johannes Thumshirn, Bart Van Assche

On 2019/06/28 12:56, Anand Jain wrote:
> On 7/6/19 9:10 PM, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices as update would not be possible.
> 
>    By design, all copies of the SB must be updated at each transaction;
>    as they are redundant copies, they must match at the end of
>    each transaction.
> 
>    Instead of skipping the SB updates, why not alter the number of
>    copies at the time of mkfs.btrfs?
> 
> Thanks, Anand

That is exactly what the patched code does. It updates all the SB
copies, but it just avoids writing a copy to sequential write
required zones. mkfs.btrfs does the same. So, all the available SB
copies always match after a transaction. At the SB location in a
sequential write required zone, you will see a zeroed region (in the
next version of the patch series), but that is easy to ignore: it
lacks even BTRFS_MAGIC.

The number of SB copies available on an HMZONED device will vary
with its zone size and zone layout.

Thanks,

> 
>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>    fs/btrfs/disk-io.c     | 11 +++++++++++
>>    fs/btrfs/disk-io.h     |  1 +
>>    fs/btrfs/extent-tree.c |  4 ++++
>>    fs/btrfs/scrub.c       |  2 ++
>>    4 files changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7c1404c76768..ddbb02906042 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>    	return latest;
>>    }
>>    
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>> +{
>> +	/* any address is good on a regular (zone_size == 0) device */
>> +	/* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
>> +	return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
>> +}
>> +
>>    /*
>>     * Write superblock @sb to the @device. Do not wait for completion, all the
>>     * buffer heads we write are pinned.
>> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
>>    		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>    		    device->commit_total_bytes)
>>    			break;
>> +		if (!btrfs_check_super_location(device, bytenr))
>> +			continue;
>>    
>>    		btrfs_set_super_bytenr(sb, bytenr);
>>    
>> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
>>    		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>    		    device->commit_total_bytes)
>>    			break;
>> +		if (!btrfs_check_super_location(device, bytenr))
>> +			continue;
>>    
>>    		bh = __find_get_block(device->bdev,
>>    				      bytenr / BTRFS_BDEV_BLOCKSIZE,
>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>> index a0161aa1ea0b..70e97cd6fa76 100644
>> --- a/fs/btrfs/disk-io.h
>> +++ b/fs/btrfs/disk-io.h
>> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
>>    		struct page *page, size_t pg_offset, u64 start, u64 len,
>>    		int create);
>>    int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
>>    int __init btrfs_end_io_wq_init(void);
>>    void __cold btrfs_end_io_wq_exit(void);
>>    
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 3d41d840fe5c..ae2c895d08c4 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
>>    			return ret;
>>    	}
>>    
>> +	/* we won't have super stripes in sequential zones */
>> +	if (cache->alloc_type == BTRFS_ALLOC_SEQ)
>> +		return 0;
>> +
>>    	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>>    		bytenr = btrfs_sb_offset(i);
>>    		ret = btrfs_rmap_block(fs_info, cache->key.objectid,
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index f7b29f9db5e2..36ad4fad7eaf 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>>    		if (bytenr + BTRFS_SUPER_INFO_SIZE >
>>    		    scrub_dev->commit_total_bytes)
>>    			break;
>> +		if (!btrfs_check_super_location(scrub_dev, bytenr))
>> +			continue;
>>    
>>    		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
>>    				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
>>
> 
> 



* Re: [PATCH 09/19] btrfs: limit super block locations in HMZONED mode
  2019-06-28  6:39     ` Naohiro Aota
@ 2019-06-28  6:52       ` Anand Jain
  0 siblings, 0 replies; 79+ messages in thread
From: Anand Jain @ 2019-06-28  6:52 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: linux-btrfs, David Sterba, Chris Mason, Josef Bacik, Qu Wenruo,
	Nikolay Borisov, linux-kernel, Hannes Reinecke, linux-fsdevel,
	Damien Le Moal, Matias Bjørling, Johannes Thumshirn,
	Bart Van Assche



> On 28 Jun 2019, at 2:39 PM, Naohiro Aota <Naohiro.Aota@wdc.com> wrote:
> 
> On 2019/06/28 12:56, Anand Jain wrote:
>> On 7/6/19 9:10 PM, Naohiro Aota wrote:
>>> When in HMZONED mode, make sure that device super blocks are located in
>>> randomly writable zones of zoned block devices. That is, do not write super
>>> blocks in sequential write required zones of host-managed zoned block
>>> devices as update would not be possible.
>> 
>>   By design, all copies of the SB must be updated at each transaction;
>>   as they are redundant copies, they must match at the end of
>>   each transaction.
>> 
>>   Instead of skipping the SB updates, why not alter the number of
>>   copies at the time of mkfs.btrfs?
>> 
>> Thanks, Anand
> 
> That is exactly what the patched code does. It updates all the SB
> copies, but it just avoids writing a copy to sequential write
> required zones. mkfs.btrfs does the same. So, all the available SB
> copies always match after a transaction. At the SB location in a
> sequential write required zone, you will see a zeroed region (in the
> next version of the patch series), but that is easy to ignore: it
> lacks even BTRFS_MAGIC.
> 

 Right, I saw the related btrfs-progs patches later; there were
 piles of emails after a vacation. ;-)

> The number of SB copies available on an HMZONED device will vary
> with its zone size and zone layout.


Thanks, Anand

> Thanks,
> 
>> 
>>> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>>> ---
>>>   fs/btrfs/disk-io.c     | 11 +++++++++++
>>>   fs/btrfs/disk-io.h     |  1 +
>>>   fs/btrfs/extent-tree.c |  4 ++++
>>>   fs/btrfs/scrub.c       |  2 ++
>>>   4 files changed, 18 insertions(+)
>>> 
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 7c1404c76768..ddbb02906042 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>>   	return latest;
>>>   }
>>> 
>>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>>> +{
>>> +	/* any address is good on a regular (zone_size == 0) device */
>>> +	/* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
>>> +	return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
>>> +}
>>> +
>>>   /*
>>>    * Write superblock @sb to the @device. Do not wait for completion, all the
>>>    * buffer heads we write are pinned.
>>> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
>>>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>>   		    device->commit_total_bytes)
>>>   			break;
>>> +		if (!btrfs_check_super_location(device, bytenr))
>>> +			continue;
>>> 
>>>   		btrfs_set_super_bytenr(sb, bytenr);
>>> 
>>> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
>>>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>>   		    device->commit_total_bytes)
>>>   			break;
>>> +		if (!btrfs_check_super_location(device, bytenr))
>>> +			continue;
>>> 
>>>   		bh = __find_get_block(device->bdev,
>>>   				      bytenr / BTRFS_BDEV_BLOCKSIZE,
>>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>>> index a0161aa1ea0b..70e97cd6fa76 100644
>>> --- a/fs/btrfs/disk-io.h
>>> +++ b/fs/btrfs/disk-io.h
>>> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
>>>   		struct page *page, size_t pg_offset, u64 start, u64 len,
>>>   		int create);
>>>   int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
>>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
>>>   int __init btrfs_end_io_wq_init(void);
>>>   void __cold btrfs_end_io_wq_exit(void);
>>> 
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3d41d840fe5c..ae2c895d08c4 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
>>>   			return ret;
>>>   	}
>>> 
>>> +	/* we won't have super stripes in sequential zones */
>>> +	if (cache->alloc_type == BTRFS_ALLOC_SEQ)
>>> +		return 0;
>>> +
>>>   	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>>>   		bytenr = btrfs_sb_offset(i);
>>>   		ret = btrfs_rmap_block(fs_info, cache->key.objectid,
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index f7b29f9db5e2..36ad4fad7eaf 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>>>   		if (bytenr + BTRFS_SUPER_INFO_SIZE >
>>>   		    scrub_dev->commit_total_bytes)
>>>   			break;
>>> +		if (!btrfs_check_super_location(scrub_dev, bytenr))
>>> +			continue;
>>> 
>>>   		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
>>>   				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
>>> 
>> 
>> 
> 




Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-07 13:10 [PATCH v2 00/19] btrfs zoned block device support Naohiro Aota
2019-06-07 13:10 ` [PATCH 01/19] btrfs: introduce HMZONED feature flag Naohiro Aota
2019-06-07 13:10 ` [PATCH 02/19] btrfs: Get zone information of zoned block devices Naohiro Aota
2019-06-13 13:58   ` Josef Bacik
2019-06-18  6:04     ` Naohiro Aota
2019-06-13 13:58   ` Josef Bacik
2019-06-17 18:57   ` David Sterba
2019-06-18  6:42     ` Naohiro Aota
2019-06-27 15:11       ` David Sterba
2019-06-07 13:10 ` [PATCH 03/19] btrfs: Check and enable HMZONED mode Naohiro Aota
2019-06-13 13:57   ` Josef Bacik
2019-06-18  6:43     ` Naohiro Aota
2019-06-07 13:10 ` [PATCH 04/19] btrfs: disable fallocate in " Naohiro Aota
2019-06-07 13:10 ` [PATCH 05/19] btrfs: disable direct IO " Naohiro Aota
2019-06-13 14:00   ` Josef Bacik
2019-06-18  8:17     ` Naohiro Aota
2019-06-07 13:10 ` [PATCH 06/19] btrfs: align dev extent allocation to zone boundary Naohiro Aota
2019-06-07 13:10 ` [PATCH 07/19] btrfs: do sequential extent allocation in HMZONED mode Naohiro Aota
2019-06-13 14:07   ` Josef Bacik
2019-06-18  8:28     ` Naohiro Aota
2019-06-18 13:37       ` Josef Bacik
2019-06-17 22:30   ` David Sterba
2019-06-18  8:49     ` Naohiro Aota
2019-06-27 15:28       ` David Sterba
2019-06-07 13:10 ` [PATCH 08/19] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Naohiro Aota
2019-06-13 14:09   ` Josef Bacik
2019-06-18  7:42     ` Naohiro Aota
2019-06-18 13:35       ` Josef Bacik
2019-06-07 13:10 ` [PATCH 09/19] btrfs: limit super block locations in HMZONED mode Naohiro Aota
2019-06-13 14:12   ` Josef Bacik
2019-06-18  8:51     ` Naohiro Aota
2019-06-17 22:53   ` David Sterba
2019-06-18  9:01     ` Naohiro Aota
2019-06-27 15:35       ` David Sterba
2019-06-28  3:55   ` Anand Jain
2019-06-28  6:39     ` Naohiro Aota
2019-06-28  6:52       ` Anand Jain
2019-06-07 13:10 ` [PATCH 10/19] btrfs: rename btrfs_map_bio() Naohiro Aota
2019-06-07 13:10 ` [PATCH 11/19] btrfs: introduce submit buffer Naohiro Aota
2019-06-13 14:14   ` Josef Bacik
2019-06-17  3:16     ` Damien Le Moal
2019-06-18  0:00       ` David Sterba
2019-06-18  4:04         ` Damien Le Moal
2019-06-18 13:33       ` Josef Bacik
2019-06-19 10:32         ` Damien Le Moal
2019-06-07 13:10 ` [PATCH 12/19] btrfs: expire submit buffer on timeout Naohiro Aota
2019-06-13 14:15   ` Josef Bacik
2019-06-17  3:19     ` Damien Le Moal
2019-06-07 13:10 ` [PATCH 13/19] btrfs: avoid sync IO prioritization on checksum in HMZONED mode Naohiro Aota
2019-06-13 14:17   ` Josef Bacik
2019-06-07 13:10 ` [PATCH 14/19] btrfs: redirty released extent buffers in sequential BGs Naohiro Aota
2019-06-13 14:24   ` Josef Bacik
2019-06-18  9:09     ` Naohiro Aota
2019-06-07 13:10 ` [PATCH 15/19] btrfs: reset zones of unused block groups Naohiro Aota
2019-06-07 13:10 ` [PATCH 16/19] btrfs: wait existing extents before truncating Naohiro Aota
2019-06-13 14:25   ` Josef Bacik
2019-06-07 13:10 ` [PATCH 17/19] btrfs: shrink delayed allocation size in HMZONED mode Naohiro Aota
2019-06-13 14:27   ` Josef Bacik
2019-06-07 13:10 ` [PATCH 18/19] btrfs: support dev-replace " Naohiro Aota
2019-06-13 14:33   ` Josef Bacik
2019-06-18  9:14     ` Naohiro Aota
2019-06-07 13:10 ` [PATCH 19/19] btrfs: enable to mount HMZONED incompat flag Naohiro Aota
2019-06-07 13:17 ` [PATCH 01/12] btrfs-progs: build: Check zoned block device support Naohiro Aota
2019-06-07 13:17   ` [PATCH 02/12] btrfs-progs: utils: Introduce queue_param Naohiro Aota
2019-06-07 13:17   ` [PATCH 03/12] btrfs-progs: add new HMZONED feature flag Naohiro Aota
2019-06-07 13:17   ` [PATCH 04/12] btrfs-progs: Introduce zone block device helper functions Naohiro Aota
2019-06-07 13:17   ` [PATCH 05/12] btrfs-progs: load and check zone information Naohiro Aota
2019-06-07 13:17   ` [PATCH 06/12] btrfs-progs: avoid writing super block to sequential zones Naohiro Aota
2019-06-07 13:17   ` [PATCH 07/12] btrfs-progs: support discarding zoned device Naohiro Aota
2019-06-07 13:17   ` [PATCH 08/12] btrfs-progs: volume: align chunk allocation to zones Naohiro Aota
2019-06-07 13:17   ` [PATCH 09/12] btrfs-progs: do sequential allocation Naohiro Aota
2019-06-07 13:17   ` [PATCH 10/12] btrfs-progs: mkfs: Zoned block device support Naohiro Aota
2019-06-07 13:17   ` [PATCH 11/12] btrfs-progs: device-add: support HMZONED device Naohiro Aota
2019-06-07 13:17   ` [PATCH 12/12] btrfs-progs: introduce support for dev-place " Naohiro Aota
2019-06-12 17:51 ` [PATCH v2 00/19] btrfs zoned block device support David Sterba
2019-06-13  4:59   ` Naohiro Aota
2019-06-13 13:46     ` David Sterba
2019-06-14  2:07       ` Naohiro Aota
2019-06-17  2:44       ` Damien Le Moal
