linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] device type and create chunk
@ 2022-01-18 15:18 Anand Jain
  2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain
  2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain
  0 siblings, 2 replies; 13+ messages in thread
From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw)
  To: linux-btrfs

I had these patches as part of experiments with the readpolicy I am
sending it now. This is different from the allocation_hint mode patch-set
where I use the device type to make the allocation destination automatic.

Patch 1/2 keeps the device's type in the struct btrfs_device so that we
could maintain the status if there are mixed devices in the
filesystem.

And if so, then patch 2/2 shall take care of arranging the disks by the
order of latency so that metadata chunks can pick disk with low latency
and data can pick the disk with higher latency.

By having fewer restrictions and not hard coding the chunk allocation
destination helps to cause the spillover to the available disk space
instead of causing the spurious ENOSPC.

Anand Jain (2):
  btrfs: keep device type in the struct btrfs_device
  btrfs: create chunk device type aware

 fs/btrfs/dev-replace.c |  2 +
 fs/btrfs/volumes.c     | 90 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     | 26 +++++++++++-
 3 files changed, 117 insertions(+), 1 deletion(-)

-- 
2.33.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] btrfs: keep device type in the struct btrfs_device
  2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain
@ 2022-01-18 15:18 ` Anand Jain
  2022-01-26 16:53   ` David Sterba
  2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain
  1 sibling, 1 reply; 13+ messages in thread
From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw)
  To: linux-btrfs

Preparation to make data/metadata chunks allocations based on the device
types- keep the identified device type in the struct btrfs_device.

This patch adds a member 'dev_type' to hold the defined device types in
the struct btrfs_devices.

Also, add a helper function and a struct btrfs_fs_devices member
'mixed_dev_type' to know if the filesystem contains the mixed device
types.

Struct btrfs_device has an existing member 'type' that stages and writes
back to the on-disk format. This patch does not use it. As just an
in-memory only data will suffice the requirement here.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/dev-replace.c |  2 ++
 fs/btrfs/volumes.c     | 45 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     | 26 +++++++++++++++++++++++-
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 71fd99b48283..3731c7d1c6b7 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->dev_stats_valid = 1;
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_devices;
+	device->dev_type = btrfs_get_device_type(device);
 
 	ret = btrfs_get_dev_zone_info(device, false);
 	if (ret)
@@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	list_add(&device->dev_list, &fs_devices->devices);
 	fs_devices->num_devices++;
 	fs_devices->open_devices++;
+	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
 	mutex_unlock(&fs_devices->device_list_mutex);
 
 	*device_out = device;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9d50e035e61d..da3d6d0f5bc3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 			     device->generation > (*latest_dev)->generation)) {
 				*latest_dev = device;
 			}
+			device->dev_type = btrfs_get_device_type(device);
 			continue;
 		}
 
@@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
 		__btrfs_free_extra_devids(seed_dev, &latest_dev);
 
 	fs_devices->latest_dev = latest_dev;
+	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
 
 	mutex_unlock(&uuid_mutex);
 }
@@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
 
 	num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1;
 	btrfs_set_super_num_devices(fs_info->super_copy, num_devices);
+
+	cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices);
+
 	mutex_unlock(&fs_devices->device_list_mutex);
 
 	/*
@@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
 	return ret;
 }
 
+bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices)
+{
+	struct btrfs_device *device;
+	int type_rot = 0;
+	int type_nonrot = 0;
+
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+
+		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+			continue;
+
+		switch (device->dev_type) {
+		case BTRFS_DEV_TYPE_ROT:
+			type_rot++;
+			break;
+		case BTRFS_DEV_TYPE_NONROT:
+		default:
+			type_nonrot++;
+		}
+	}
+
+	if (type_rot && type_nonrot)
+		return true;
+	else
+		return false;
+}
+
+enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device)
+{
+	if (bdev_is_zoned(device->bdev))
+		return BTRFS_DEV_TYPE_ZONED;
+
+	if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
+		return BTRFS_DEV_TYPE_NONROT;
+
+	return BTRFS_DEV_TYPE_ROT;
+}
+
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
 {
 	struct btrfs_root *root = fs_info->dev_root;
@@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
 	device->dev_stats_valid = 1;
+	device->dev_type = btrfs_get_device_type(device);
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 
 	if (seeding_dev) {
@@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
 
 	fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev));
+	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
 
 	orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
 	btrfs_set_super_total_bytes(fs_info->super_copy,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6a790b85edd8..5be31161601d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -52,6 +52,16 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 #define BTRFS_DEV_STATE_NO_READA	(5)
 
+/*
+ * Device class types arranged by their IO latency from low to high.
+ */
+enum btrfs_dev_types {
+	BTRFS_DEV_TYPE_MEM = 1,
+	BTRFS_DEV_TYPE_NONROT,
+	BTRFS_DEV_TYPE_ROT,
+	BTRFS_DEV_TYPE_ZONED,
+};
+
 struct btrfs_zoned_device_info;
 
 struct btrfs_device {
@@ -101,9 +111,16 @@ struct btrfs_device {
 
 	/* optimal io width for this device */
 	u32 io_width;
-	/* type and info about this device */
+
+	/* Type and info about this device, on-disk (currently unused) */
 	u64 type;
 
+	/*
+	 * Device type (in memory only) at some point, merge to the on-disk
+	 * member 'type' above.
+	 */
+	enum btrfs_dev_types dev_type;
+
 	/* minimal io size for this device */
 	u32 sector_size;
 
@@ -296,6 +313,11 @@ struct btrfs_fs_devices {
 	 */
 	bool rotating;
 
+	/*
+	 * True when devices belong more than one type.
+	 */
+	bool mixed_dev_types;
+
 	struct btrfs_fs_info *fs_info;
 	/* sysfs kobjects */
 	struct kobject fsid_kobj;
@@ -636,5 +658,7 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device);
+bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices);
 
 #endif
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain
  2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain
@ 2022-01-18 15:18 ` Anand Jain
  2022-01-26 17:01   ` David Sterba
  2022-01-26 17:38   ` David Sterba
  1 sibling, 2 replies; 13+ messages in thread
From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw)
  To: linux-btrfs

Mixed device types configuration exist. Most commonly, HDD mixed with
SSD/NVME device types. This use case prefers that the data chunk
allocates on HDD and the metadata chunk allocates on the SSD/NVME.

As of now, in the function gather_device_info() called from
btrfs_create_chunk(), we sort the devices based on unallocated space
only.

After this patch, this function will check for mixed device types. And
will sort the devices based on enum btrfs_device_types. That is, sort if
the allocation type is metadata and reverse-sort if the allocation type
is data.

The advantage of this method is that data/metadata allocation distribution
based on the device type happens automatically without any manual
configuration.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index da3d6d0f5bc3..77fba78555d7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * Sort the devices in its ascending order of latency value.
+ */
+static int btrfs_cmp_device_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return 1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return -1;
+	return 0;
+}
+
+static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+	struct btrfs_device *dev_a = di_a->dev;
+	struct btrfs_device *dev_b = di_b->dev;
+
+	if (dev_a->dev_type > dev_b->dev_type)
+		return -1;
+	if (dev_a->dev_type < dev_b->dev_type)
+		return 1;
+	return 0;
+}
+
 /*
  * sort the devices in descending order by max_avail, total_avail
  */
@@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * Sort devices by their latency. Ascending order of latency for
+	 * metadata and descending order of latency for the data chunks for
+	 * mixed device types.
+	 */
+	if (fs_devices->mixed_dev_types) {
+		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_rev_latency, NULL);
+		else
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_latency, NULL);
+	}
+
 	return 0;
 }
 
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device
  2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain
@ 2022-01-26 16:53   ` David Sterba
  2022-01-29 16:24     ` Anand Jain
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2022-01-26 16:53 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote:
> Preparation to make data/metadata chunks allocations based on the device
> types- keep the identified device type in the struct btrfs_device.
> 
> This patch adds a member 'dev_type' to hold the defined device types in
> the struct btrfs_devices.
> 
> Also, add a helper function and a struct btrfs_fs_devices member
> 'mixed_dev_type' to know if the filesystem contains the mixed device
> types.
> 
> Struct btrfs_device has an existing member 'type' that stages and writes
> back to the on-disk format. This patch does not use it. As just an
> in-memory only data will suffice the requirement here.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/dev-replace.c |  2 ++
>  fs/btrfs/volumes.c     | 45 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h     | 26 +++++++++++++++++++++++-
>  3 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 71fd99b48283..3731c7d1c6b7 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>  	device->dev_stats_valid = 1;
>  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>  	device->fs_devices = fs_devices;
> +	device->dev_type = btrfs_get_device_type(device);
>  
>  	ret = btrfs_get_dev_zone_info(device, false);
>  	if (ret)
> @@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>  	list_add(&device->dev_list, &fs_devices->devices);
>  	fs_devices->num_devices++;
>  	fs_devices->open_devices++;
> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>  	mutex_unlock(&fs_devices->device_list_mutex);
>  
>  	*device_out = device;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9d50e035e61d..da3d6d0f5bc3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>  			     device->generation > (*latest_dev)->generation)) {
>  				*latest_dev = device;
>  			}
> +			device->dev_type = btrfs_get_device_type(device);
>  			continue;
>  		}
>  
> @@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
>  		__btrfs_free_extra_devids(seed_dev, &latest_dev);
>  
>  	fs_devices->latest_dev = latest_dev;
> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>  
>  	mutex_unlock(&uuid_mutex);
>  }
> @@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
>  
>  	num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1;
>  	btrfs_set_super_num_devices(fs_info->super_copy, num_devices);
> +
> +	cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices);
> +
>  	mutex_unlock(&fs_devices->device_list_mutex);
>  
>  	/*
> @@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
>  	return ret;
>  }
>  
> +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices)
> +{
> +	struct btrfs_device *device;
> +	int type_rot = 0;
> +	int type_nonrot = 0;
> +
> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
> +
> +		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
> +			continue;
> +
> +		switch (device->dev_type) {
> +		case BTRFS_DEV_TYPE_ROT:
> +			type_rot++;
> +			break;
> +		case BTRFS_DEV_TYPE_NONROT:
> +		default:
> +			type_nonrot++;
> +		}
> +	}
> +
> +	if (type_rot && type_nonrot)
> +		return true;
> +	else
> +		return false;
> +}
> +
> +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device)
> +{
> +	if (bdev_is_zoned(device->bdev))
> +		return BTRFS_DEV_TYPE_ZONED;
> +
> +	if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
> +		return BTRFS_DEV_TYPE_NONROT;
> +
> +	return BTRFS_DEV_TYPE_ROT;
> +}
> +
>  int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
>  {
>  	struct btrfs_root *root = fs_info->dev_root;
> @@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
>  	device->mode = FMODE_EXCL;
>  	device->dev_stats_valid = 1;
> +	device->dev_type = btrfs_get_device_type(device);
>  	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>  
>  	if (seeding_dev) {
> @@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
>  
>  	fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev));
> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>  
>  	orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>  	btrfs_set_super_total_bytes(fs_info->super_copy,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6a790b85edd8..5be31161601d 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -52,6 +52,16 @@ struct btrfs_io_geometry {
>  #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>  #define BTRFS_DEV_STATE_NO_READA	(5)
>  
> +/*
> + * Device class types arranged by their IO latency from low to high.
> + */
> +enum btrfs_dev_types {
> +	BTRFS_DEV_TYPE_MEM = 1,
> +	BTRFS_DEV_TYPE_NONROT,
> +	BTRFS_DEV_TYPE_ROT,
> +	BTRFS_DEV_TYPE_ZONED,

I think this should be separate, in one list define all sensible device
types and then add arrays sorted by latency, or other factors.

The zoned devices are mostly HDD but ther are also SSD-like using ZNS,
so that can't be both under BTRFS_DEV_TYPE_ZONED and behind
BTRFS_DEV_TYPE_ROT as if it had worse latency.

I'm not sure how much we should try to guess the device types, the ones
you have so far are almost all we can get without peeking too much into
the devices/queues internals.

Also here's the terminology question if we should still consider
rotational device status as the main criteria, because then SSD, NVMe,
RAM-backed are all non-rotational but with different latency
characteristics.

> +};
> +
>  struct btrfs_zoned_device_info;
>  
>  struct btrfs_device {
> @@ -101,9 +111,16 @@ struct btrfs_device {
>  
>  	/* optimal io width for this device */
>  	u32 io_width;
> -	/* type and info about this device */
> +
> +	/* Type and info about this device, on-disk (currently unused) */
>  	u64 type;
>  
> +	/*
> +	 * Device type (in memory only) at some point, merge to the on-disk
> +	 * member 'type' above.
> +	 */
> +	enum btrfs_dev_types dev_type;

So at some point this is planned to be user configurable? We should be
always able to detect the device type at mount type so I wonder if this
needs to be stored on disk.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain
@ 2022-01-26 17:01   ` David Sterba
  2022-01-29 16:24     ` Anand Jain
  2022-01-26 17:38   ` David Sterba
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2022-01-26 17:01 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
> 
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.
> 
> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.
> 
> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)

It's strange to see dev_type being compared while the function compares
the latency. As pointed in the first patch, this could simply refer to a
array of device types sorted by latency.

> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)

Regarding the naming, I think we've been using _desc (descending) as
suffix for the reverse sort functions, while ascending is the default.

> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;

To avoid code duplication this could be

	return -btrfs_cmp_device_latency(a, b);

> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);
> +	}
> +
>  	return 0;
>  }
>  
> -- 
> 2.33.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain
  2022-01-26 17:01   ` David Sterba
@ 2022-01-26 17:38   ` David Sterba
  2022-01-29 16:46     ` Anand Jain
  2022-01-30 22:28     ` Goffredo Baroncelli
  1 sibling, 2 replies; 13+ messages in thread
From: David Sterba @ 2022-01-26 17:38 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
> Mixed device types configuration exist. Most commonly, HDD mixed with
> SSD/NVME device types. This use case prefers that the data chunk
> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
> 
> As of now, in the function gather_device_info() called from
> btrfs_create_chunk(), we sort the devices based on unallocated space
> only.

Yes, and this guarantees maximizing the used space for the raid
profiles.

> After this patch, this function will check for mixed device types. And
> will sort the devices based on enum btrfs_device_types. That is, sort if
> the allocation type is metadata and reverse-sort if the allocation type
> is data.

And this changes the above, because a small fast device can be depleted
first.

> The advantage of this method is that data/metadata allocation distribution
> based on the device type happens automatically without any manual
> configuration.

Yeah, but the default behaviour may not be suitable for all users so
some policy will have to be done anyway.

I vaguely remember some comments regarding mixed setups, along lines
that "if there's a fast flash device I'd rather see ENOSPC and either
delete files or add more devices than to let everything work but with
the risk of storing metadata on the slow devices."

> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da3d6d0f5bc3..77fba78555d7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/*
> + * Sort the devices in its ascending order of latency value.
> + */
> +static int btrfs_cmp_device_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return 1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return -1;
> +	return 0;
> +}
> +
> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +	struct btrfs_device *dev_a = di_a->dev;
> +	struct btrfs_device *dev_b = di_b->dev;
> +
> +	if (dev_a->dev_type > dev_b->dev_type)
> +		return -1;
> +	if (dev_a->dev_type < dev_b->dev_type)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * sort the devices in descending order by max_avail, total_avail
>   */
> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> +	/*
> +	 * Sort devices by their latency. Ascending order of latency for
> +	 * metadata and descending order of latency for the data chunks for
> +	 * mixed device types.
> +	 */
> +	if (fs_devices->mixed_dev_types) {
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_rev_latency, NULL);
> +		else
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_latency, NULL);

In case there are mixed devices the sort happens twice and because as
implemented in kernel sort() is not stable so even if device have same
amount of data they can get reorderd wildly. The remaingin space is
still a factor we need to take into account to avoid ENOSPC on the chunk
level.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device
  2022-01-26 16:53   ` David Sterba
@ 2022-01-29 16:24     ` Anand Jain
  2022-02-01 17:06       ` David Sterba
  0 siblings, 1 reply; 13+ messages in thread
From: Anand Jain @ 2022-01-29 16:24 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 27/01/2022 00:53, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote:
>> Preparation to make data/metadata chunks allocations based on the device
>> types- keep the identified device type in the struct btrfs_device.
>>
>> This patch adds a member 'dev_type' to hold the defined device types in
>> the struct btrfs_devices.
>>
>> Also, add a helper function and a struct btrfs_fs_devices member
>> 'mixed_dev_type' to know if the filesystem contains the mixed device
>> types.
>>
>> Struct btrfs_device has an existing member 'type' that stages and writes
>> back to the on-disk format. This patch does not use it. As just an
>> in-memory only data will suffice the requirement here.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/dev-replace.c |  2 ++
>>   fs/btrfs/volumes.c     | 45 ++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h     | 26 +++++++++++++++++++++++-
>>   3 files changed, 72 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index 71fd99b48283..3731c7d1c6b7 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>   	device->dev_stats_valid = 1;
>>   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>>   	device->fs_devices = fs_devices;
>> +	device->dev_type = btrfs_get_device_type(device);
>>   
>>   	ret = btrfs_get_dev_zone_info(device, false);
>>   	if (ret)
>> @@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>   	list_add(&device->dev_list, &fs_devices->devices);
>>   	fs_devices->num_devices++;
>>   	fs_devices->open_devices++;
>> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>>   	mutex_unlock(&fs_devices->device_list_mutex);
>>   
>>   	*device_out = device;
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 9d50e035e61d..da3d6d0f5bc3 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>   			     device->generation > (*latest_dev)->generation)) {
>>   				*latest_dev = device;
>>   			}
>> +			device->dev_type = btrfs_get_device_type(device);
>>   			continue;
>>   		}
>>   
>> @@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices)
>>   		__btrfs_free_extra_devids(seed_dev, &latest_dev);
>>   
>>   	fs_devices->latest_dev = latest_dev;
>> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>>   
>>   	mutex_unlock(&uuid_mutex);
>>   }
>> @@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info,
>>   
>>   	num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1;
>>   	btrfs_set_super_num_devices(fs_info->super_copy, num_devices);
>> +
>> +	cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices);
>> +
>>   	mutex_unlock(&fs_devices->device_list_mutex);
>>   
>>   	/*
>> @@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans)
>>   	return ret;
>>   }
>>   
>> +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices)
>> +{
>> +	struct btrfs_device *device;
>> +	int type_rot = 0;
>> +	int type_nonrot = 0;
>> +
>> +	list_for_each_entry(device, &fs_devices->devices, dev_list) {
>> +
>> +		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
>> +			continue;
>> +
>> +		switch (device->dev_type) {
>> +		case BTRFS_DEV_TYPE_ROT:
>> +			type_rot++;
>> +			break;
>> +		case BTRFS_DEV_TYPE_NONROT:
>> +		default:
>> +			type_nonrot++;
>> +		}
>> +	}
>> +
>> +	if (type_rot && type_nonrot)
>> +		return true;
>> +	else
>> +		return false;
>> +}
>> +
>> +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device)
>> +{
>> +	if (bdev_is_zoned(device->bdev))
>> +		return BTRFS_DEV_TYPE_ZONED;
>> +
>> +	if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
>> +		return BTRFS_DEV_TYPE_NONROT;
>> +
>> +	return BTRFS_DEV_TYPE_ROT;
>> +}
>> +
>>   int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path)
>>   {
>>   	struct btrfs_root *root = fs_info->dev_root;
>> @@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
>>   	device->mode = FMODE_EXCL;
>>   	device->dev_stats_valid = 1;
>> +	device->dev_type = btrfs_get_device_type(device);
>>   	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
>>   
>>   	if (seeding_dev) {
>> @@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
>>   
>>   	fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev));
>> +	fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices);
>>   
>>   	orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>>   	btrfs_set_super_total_bytes(fs_info->super_copy,
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 6a790b85edd8..5be31161601d 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -52,6 +52,16 @@ struct btrfs_io_geometry {
>>   #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
>>   #define BTRFS_DEV_STATE_NO_READA	(5)
>>   
>> +/*
>> + * Device class types arranged by their IO latency from low to high.
>> + */
>> +enum btrfs_dev_types {
>> +	BTRFS_DEV_TYPE_MEM = 1,
>> +	BTRFS_DEV_TYPE_NONROT,
>> +	BTRFS_DEV_TYPE_ROT,
>> +	BTRFS_DEV_TYPE_ZONED,
> 
> I think this should be separate, in one list define all sensible device
> types and then add arrays sorted by latency, or other factors.
>
> The zoned devices are mostly HDD but ther are also SSD-like using ZNS,
> so that can't be both under BTRFS_DEV_TYPE_ZONED and behind
> BTRFS_DEV_TYPE_ROT as if it had worse latency.

I still do not have a complete idea of its implantation using an array. 
Do you mean something like as below,

enum btrfs_dev_types {
	::
};

struct btrfs_dev_type {
	enum btrfs_dev_types dev_type;
	int priority_latency;
};


I think we can identify a ZNS and set its relative latency to a value 
lower (better) than rotational.


> I'm not sure how much we should try to guess the device types, the ones
> you have so far are almost all we can get without peeking too much into
> the devices/queues internals.

Agree.

> Also here's the terminology question if we should still consider
> rotational device status as the main criteria, because then SSD, NVMe,
> RAM-backed are all non-rotational but with different latency
> characteristics.

Right. The expected performance also depends on the interconnect type
which is undetectable.

IMO we shouldn't worry about the non-rational's interconnect types
(M2/PCIe/NVMe/SATA/SAS) even though they have different performances.

Because some of these interconnects are compatible with each-other 
medium might change its interconnect during the lifecycle of the data.

Then left with are the types of mediums- rotational, non-rotational and
zoned which, we can identify safely.

Use-cases plan to mix these types of mediums to achieve the 
cost-per-byte advantage. As of now, I think our target can be to help these
kind of planned mixing.

>> +};
>> +
>>   struct btrfs_zoned_device_info;
>>   
>>   struct btrfs_device {
>> @@ -101,9 +111,16 @@ struct btrfs_device {
>>   
>>   	/* optimal io width for this device */
>>   	u32 io_width;
>> -	/* type and info about this device */
>> +
>> +	/* Type and info about this device, on-disk (currently unused) */
>>   	u64 type;
>>   
>> +	/*
>> +	 * Device type (in memory only) at some point, merge to the on-disk
>> +	 * member 'type' above.
>> +	 */
>> +	enum btrfs_dev_types dev_type;
> 
> So at some point this is planned to be user configurable? We should be
> always able to detect the device type at mount type so I wonder if this
> needs to be stored on disk.

I didn't find any planned configs (white papers) discussing the
advantages of mixing non-rotational drives with different interconnect
types. If any then, we may have to provide the manual configuration.
(If the mixing is across the medium types like non-rotational with
either zoned or HDD, then the users don't have to configure as we can
optimize the allocation automatically for performance.)

Also, hot cache device or device grouping (when implemented) need the
user to configure.

And so maybe part of in-memory 'dev_type' would be in-sync with on-disk
'type' at some point. IMO.

Thanks, Anand


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-26 17:01   ` David Sterba
@ 2022-01-29 16:24     ` Anand Jain
  0 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2022-01-29 16:24 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 27/01/2022 01:01, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
>>
>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
>>
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
> 
> It's strange to see dev_type being compared while the function compares
> the latency. As pointed in the first patch, this could simply refer to a
> array of device types sorted by latency.
> 
  Agreed. This will change.

>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
> 
> Regarding the naming, I think we've been using _desc (descending) as
> suffix for the reverse sort functions, while ascending is the default.
> 

  Ok. Noted.

>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
> 
> To avoid code duplication this could be
> 
> 	return -btrfs_cmp_device_latency(a, b);

  Oh right. I will do that.

Thanks, Anand


>> +}
>> +
>>   /*
>>    * sort the devices in descending order by max_avail, total_avail
>>    */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> -- 
>> 2.33.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-26 17:38   ` David Sterba
@ 2022-01-29 16:46     ` Anand Jain
  2022-01-30 22:15       ` Goffredo Baroncelli
  2022-01-30 22:28     ` Goffredo Baroncelli
  1 sibling, 1 reply; 13+ messages in thread
From: Anand Jain @ 2022-01-29 16:46 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 27/01/2022 01:38, David Sterba wrote:
> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>> Mixed device types configuration exist. Most commonly, HDD mixed with
>> SSD/NVME device types. This use case prefers that the data chunk
>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>
>> As of now, in the function gather_device_info() called from
>> btrfs_create_chunk(), we sort the devices based on unallocated space
>> only.
> 

> Yes, and this guarantees maximizing the used space for the raid
> profiles.

  Oops. Thanks for reminding me of that. More below.

>> After this patch, this function will check for mixed device types. And
>> will sort the devices based on enum btrfs_device_types. That is, sort if
>> the allocation type is metadata and reverse-sort if the allocation type
>> is data.
> 
> And this changes the above, because a small fast device can be depleted
> first.


Both sort by size or latency do _not_ help if
  given_raid.devs_max == 0 (raid0, raid5, raid6) OR given_raid.devs_max 
== num_devices.

It helps only when given_raid.devs_max != 0 and given_raid.devs_max < 
num_devices.

Sort by size does not help Single and Dup profiles.

So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {

Mixed devs types with different sizes if sorted by free size:
  is pro for  raid1, raid1c3, raid1c4, raid10
  doesn't matter for single, dup

Mixed devs types with different sizes if sorted by latency:
  is pro for single and dup
  is con for raid1, raid1c3, raid1c4, raid10 (depends)
}


So,

If given_raid.devs_max == num_devices we don't need any type of sorting.

If given_raid.devs_max = 0 (raid0, raid5, raid6) we don't need any type 
of sorting.

And sort devs by latency for Single and Dup profiles only.

For rest of profiles sort devs by size only if given_raid.devs_max < 
num_devices.


>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
> 
> Yeah, but the default behaviour may not be suitable for all users so
> some policy will have to be done anyway.

  Right. If nothing is configured even when provided then also it should
  fallback to the default behaviour.

> I vaguely remember some comments regarding mixed setups, along lines
> that "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."

  It entirely depends on the use-case. An option like following will
  solve it better:
    mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>

  If metadata_nospc_on_faster_devs=error is preferred then it also
  implies that data_nospc_on_slower_devs=error.

  Also, the use cases which prefer to use the error option should
  remember it is difficult to estimate the data/metadata ratio
  beforehand.

>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index da3d6d0f5bc3..77fba78555d7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   	return 0;
>>   }
>>   
>> +/*
>> + * Sort the devices in its ascending order of latency value.
>> + */
>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return 1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return -1;
>> +	return 0;
>> +}
>> +
>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +	struct btrfs_device *dev_a = di_a->dev;
>> +	struct btrfs_device *dev_b = di_b->dev;
>> +
>> +	if (dev_a->dev_type > dev_b->dev_type)
>> +		return -1;
>> +	if (dev_a->dev_type < dev_b->dev_type)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>   /*
>>    * sort the devices in descending order by max_avail, total_avail
>>    */
>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * Sort devices by their latency. Ascending order of latency for
>> +	 * metadata and descending order of latency for the data chunks for
>> +	 * mixed device types.
>> +	 */
>> +	if (fs_devices->mixed_dev_types) {
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_rev_latency, NULL);
>> +		else
>> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_latency, NULL);
> 
> In case there are mixed devices the sort happens twice and because as
> implemented in kernel sort() is not stable so even if device have same
> amount of data they can get reorderd wildly. The remaingin space is
> still a factor we need to take into account to avoid ENOSPC on the chunk
> level.


I didn't get this part. How about if it is this way:

    if (mixed && metadata && (single || dup)) {
      ndevs=0
      pick all non-rotational ndevs++ with free space >= required space
      if (ndevs == 0) {
        if (user_option->metadata_nospc_on_faster_devs == error)
             return -ENOSPC;
        pick all rotational
      }
      sort-by-size-select-top
    }

    if (mixed && data && (single || dup)) {
      ndevs=0
      pick all rotational ndevs++ with free space >= required space
      if (ndevs == 0) {
        if (user_option->data_nospc_on_faster_devs == error)
             return -ENOSPC;
        pick all non-rotational
      }
      sort-by-size-select-top
    }

Thanks, Anand

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-29 16:46     ` Anand Jain
@ 2022-01-30 22:15       ` Goffredo Baroncelli
  0 siblings, 0 replies; 13+ messages in thread
From: Goffredo Baroncelli @ 2022-01-30 22:15 UTC (permalink / raw)
  To: Anand Jain, dsterba, linux-btrfs

On 29/01/2022 17.46, Anand Jain wrote:
> 
> 
> On 27/01/2022 01:38, David Sterba wrote:
>> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote:
>>> Mixed device types configuration exist. Most commonly, HDD mixed with
>>> SSD/NVME device types. This use case prefers that the data chunk
>>> allocates on HDD and the metadata chunk allocates on the SSD/NVME.
>>>
>>> As of now, in the function gather_device_info() called from
>>> btrfs_create_chunk(), we sort the devices based on unallocated space
>>> only.
>>
> 
>> Yes, and this guarantees maximizing the used space for the raid
>> profiles.
> 
>   Oops. Thanks for reminding me of that. More below.
> 
>>> After this patch, this function will check for mixed device types. And
>>> will sort the devices based on enum btrfs_device_types. That is, sort if
>>> the allocation type is metadata and reverse-sort if the allocation type
>>> is data.
>>
>> And this changes the above, because a small fast device can be depleted
>> first.
> 
> 
> Both sort by size or latency do _not_ help if
>   given_raid.devs_max == 0 (raid0, raid5, raid6) OR given_raid.devs_max == num_devices.
> 
> It helps only when given_raid.devs_max != 0 and given_raid.devs_max < num_devices.
> 
> Sort by size does not help Single and Dup profiles.
> 
> So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) {
> 
> Mixed devs types with different sizes if sorted by free size:
>   is pro for  raid1, raid1c3, raid1c4, raid10
>   doesn't matter for single, dup
> 
> Mixed devs types with different sizes if sorted by latency:
>   is pro for single and dup
>   is con for raid1, raid1c3, raid1c4, raid10 (depends)
> }

May be I remember bad; but in the "single" cases a new BG is allocated
in the "more empty" disks. The same for the "dup". In this it is not different from
RAID1xx. Only in the case of RAID5/6 the size doesn't matter, because a new chunk is
allocate in each disk anyway.

Pay attention that you can have a "mixed" environment of different disks sizes:
  2 x ssd (100GB + 200GB)
  2 x HDD (1T + 2T)

So it could make sense to spread the metadata by priority (only to ssd) AND by "emptiness"
(i.e. each 3 BG, one is allocated on the smaller disks and two in the bigger one).

> 
> So,
> 
> If given_raid.devs_max == num_devices we don't need any type of sorting.
> 
> If given_raid.devs_max = 0 (raid0, raid5, raid6) we don't need any type of sorting.
> 
> And sort devs by latency for Single and Dup profiles only.
> 
> For rest of profiles sort devs by size only if given_raid.devs_max < num_devices.
> 
> 
>>> The advantage of this method is that data/metadata allocation distribution
>>> based on the device type happens automatically without any manual
>>> configuration.
>>
>> Yeah, but the default behaviour may not be suitable for all users so
>> some policy will have to be done anyway.
> 
>   Right. If nothing is configured even when provided then also it should
>   fallback to the default behaviour.
> 
>> I vaguely remember some comments regarding mixed setups, along lines
>> that "if there's a fast flash device I'd rather see ENOSPC and either
>> delete files or add more devices than to let everything work but with
>> the risk of storing metadata on the slow devices."
> 
>   It entirely depends on the use-case. An option like following will
>   solve it better:
>     mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error>
> 
>   If metadata_nospc_on_faster_devs=error is preferred then it also
>   implies that data_nospc_on_slower_devs=error.
> 
>   Also, the use cases which prefer to use the error option should
>   remember it is difficult to estimate the data/metadata ratio
>   beforehand.
> 
>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>> ---
>>>   fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 45 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index da3d6d0f5bc3..77fba78555d7 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>>       return 0;
>>>   }
>>> +/*
>>> + * Sort the devices in its ascending order of latency value.
>>> + */
>>> +static int btrfs_cmp_device_latency(const void *a, const void *b)
>>> +{
>>> +    const struct btrfs_device_info *di_a = a;
>>> +    const struct btrfs_device_info *di_b = b;
>>> +    struct btrfs_device *dev_a = di_a->dev;
>>> +    struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +    if (dev_a->dev_type > dev_b->dev_type)
>>> +        return 1;
>>> +    if (dev_a->dev_type < dev_b->dev_type)
>>> +        return -1;
>>> +    return 0;
>>> +}
>>> +
>>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b)
>>> +{
>>> +    const struct btrfs_device_info *di_a = a;
>>> +    const struct btrfs_device_info *di_b = b;
>>> +    struct btrfs_device *dev_a = di_a->dev;
>>> +    struct btrfs_device *dev_b = di_b->dev;
>>> +
>>> +    if (dev_a->dev_type > dev_b->dev_type)
>>> +        return -1;
>>> +    if (dev_a->dev_type < dev_b->dev_type)
>>> +        return 1;
>>> +    return 0;
>>> +}
>>> +
>>>   /*
>>>    * sort the devices in descending order by max_avail, total_avail
>>>    */
>>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>>       sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>>            btrfs_cmp_device_info, NULL);
>>> +    /*
>>> +     * Sort devices by their latency. Ascending order of latency for
>>> +     * metadata and descending order of latency for the data chunks for
>>> +     * mixed device types.
>>> +     */
>>> +    if (fs_devices->mixed_dev_types) {
>>> +        if (ctl->type & BTRFS_BLOCK_GROUP_DATA)
>>> +            sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +                 btrfs_cmp_device_rev_latency, NULL);
>>> +        else
>>> +            sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +                 btrfs_cmp_device_latency, NULL);
>>
>> In case there are mixed devices the sort happens twice and because as
>> implemented in kernel sort() is not stable so even if device have same
>> amount of data they can get reorderd wildly. The remaingin space is
>> still a factor we need to take into account to avoid ENOSPC on the chunk
>> level.
> 
> 
> I didn't get this part. How about if it is this way:
> 
>     if (mixed && metadata && (single || dup)) {
>       ndevs=0
>       pick all non-rotational ndevs++ with free space >= required space
>       if (ndevs == 0) {
>         if (user_option->metadata_nospc_on_faster_devs == error)
>              return -ENOSPC;
>         pick all rotational
>       }
>       sort-by-size-select-top
>     }
> 
>     if (mixed && data && (single || dup)) {
>       ndevs=0
>       pick all rotational ndevs++ with free space >= required space
>       if (ndevs == 0) {
>         if (user_option->data_nospc_on_faster_devs == error)
>              return -ENOSPC;
>         pick all non-rotational
>       }
>       sort-by-size-select-top
>     }
> 
> Thanks, Anand

I think that you have to behave as allocation_hint patches set:
The sort is one, and it sorts by
- by priority
- if the disks have the same priority, by free space
- if the disks have the same priority and free space, by max avail
...


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] btrfs: create chunk device type aware
  2022-01-26 17:38   ` David Sterba
  2022-01-29 16:46     ` Anand Jain
@ 2022-01-30 22:28     ` Goffredo Baroncelli
  1 sibling, 0 replies; 13+ messages in thread
From: Goffredo Baroncelli @ 2022-01-30 22:28 UTC (permalink / raw)
  To: dsterba, Anand Jain, linux-btrfs

On 26/01/2022 18.38, David Sterba wrote:
>> The advantage of this method is that data/metadata allocation distribution
>> based on the device type happens automatically without any manual
>> configuration.
> Yeah, but the default behaviour may not be suitable for all users so
> some policy will have to be done anyway.
> 
> I vaguely remember some comments regarding mixed setups, along lines
> that "if there's a fast flash device I'd rather see ENOSPC and either
> delete files or add more devices than to let everything work but with
> the risk of storing metadata on the slow devices."
> 

I confirm that. There are two aspects that impacted the "allocation_hint"
patches set discussions:
1) the criteria to order the disks
2) allow/permit/deny a kind of BG to be hosted by a device

Regarding 1), initially the first set of patches considered an automatic
behavior on the basis of the "rotational" attribute. Soon it was pointed out
that in the "not rotational" class there are different options (like SSD, NVME...).
The conclusion was that the "priority" should not be cabled in the btrfs code.
(e.g. we could consider the reliability ?)

Regarding 2), my initial patches set only ordered the disks and alloed any
disk to be used. Some user asked me to prevent to use certain disks
for (e.g.) the data; this to prevent the data BG to consume all the
available space [*]

These discussions leads me to create 4 "classes" for disks
- ONLY_METADATA
- PREFERRED_METADATA
- PREFERRED_DATA
- ONLY_DATA

Where the last one is not suitable for metadata, and the first one is not
suitable for data. The middle ones allow one type but only of the other disks are full.

Another differences between the Anand patches and the my ones, is that
in the "allocation_hint" for striped profiles (like raid5, which spans all
the available disks), the devices involved should have the same classes.

I.e., if btrfs has to allocate a RAID5 metadata chunk,
- first tries to use ONLY_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA and
   PREFERRED_METADATA disks.
- If the disks are not enough, then it tries to use ONLY_METADATA,
   PREFERRED_METADATA and PREFERRED_DATA disks.
- If the disks are not enough, then -ENOSPC

What the Anand patches has more than allocation_hint patches, is that
these handle the case of different latency disks, giving higher
priority to the disks with lower latency. If this is a requirements
we can reserve some bits to add a priority of a disk, and then
we can sort the disk by:

- allocation_hint class
- priority
- max avail
- free space

BR
G.Baroncelli

[*] I don't want to open another discussion, but this seems to me more a "quota"
problem than a "disk allocation" problem...

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device
  2022-01-29 16:24     ` Anand Jain
@ 2022-02-01 17:06       ` David Sterba
  2022-02-03 12:56         ` Anand Jain
  0 siblings, 1 reply; 13+ messages in thread
From: David Sterba @ 2022-02-01 17:06 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs

On Sun, Jan 30, 2022 at 12:24:15AM +0800, Anand Jain wrote:
> On 27/01/2022 00:53, David Sterba wrote:
> > On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote:
> >> +/*
> >> + * Device class types arranged by their IO latency from low to high.
> >> + */
> >> +enum btrfs_dev_types {
> >> +	BTRFS_DEV_TYPE_MEM = 1,
> >> +	BTRFS_DEV_TYPE_NONROT,
> >> +	BTRFS_DEV_TYPE_ROT,
> >> +	BTRFS_DEV_TYPE_ZONED,
> > 
> > I think this should be separate, in one list define all sensible device
> > types and then add arrays sorted by latency, or other factors.
> >
> > The zoned devices are mostly HDD but ther are also SSD-like using ZNS,
> > so that can't be both under BTRFS_DEV_TYPE_ZONED and behind
> > BTRFS_DEV_TYPE_ROT as if it had worse latency.
> 
> I still do not have a complete idea of its implantation using an array. 
> Do you mean something like as below,
> 
> enum btrfs_dev_types {
> 	::
> };
> 
> struct btrfs_dev_type {
> 	enum btrfs_dev_types dev_type;
> 	int priority_latency;
> };

	enum btrfs_dev_types {
		BTRFS_DEV_TYPE_HDD,
		BTRFS_DEV_TYPE_SSD,
		BTRFS_DEV_TYPE_NVME,
		BTRFS_DEV_TYPE_ZONED_HDD,
		BTRFS_DEV_TYPE_ZONED_ZNS,
	};

	enum btrfs_dev_types btrfs_devices_by_latency[] = {
		BTRFS_DEV_TYPE_NVME,
		BTRFS_DEV_TYPE_SSD,
		BTRFS_DEV_TYPE_ZONED_ZNS,
		BTRFS_DEV_TYPE_HDD,
		BTRFS_DEV_TYPE_ZONED_HDD,
	};

	enum btrfs_dev_types btrfs_devices_by_capacity[] = {
		BTRFS_DEV_TYPE_ZONED_HDD,
		BTRFS_DEV_TYPE_HDD,
		BTRFS_DEV_TYPE_SSD,
		BTRFS_DEV_TYPE_ZONED_ZNS,
		BTRFS_DEV_TYPE_NVME,
	};

Then if the chunk allocator has a selected policy (here by latency and
by capacity), it would use the list as additional sorting criteria.

Optimizing for latency: device 1 (SSD) vs device 2 (NVME), pick NVME
even if device 1 would be otherwise picked due to the capacity criteria
(as we have it now).

> I think we can identify a ZNS and set its relative latency to a value 
> lower (better) than rotational.

The device classes are general I don't mean to measure the latecy
exactly, it's usually typical for the whole class and eg. NVME is
considered better than SSD and SSD better than HDD.

> > I'm not sure how much we should try to guess the device types, the ones
> > you have so far are almost all we can get without peeking too much into
> > the devices/queues internals.
> 
> Agree.
> 
> > Also here's the terminology question if we should still consider
> > rotational device status as the main criteria, because then SSD, NVMe,
> > RAM-backed are all non-rotational but with different latency
> > characteristics.
> 
> Right. The expected performance also depends on the interconnect type
> which is undetectable.
> 
> IMO we shouldn't worry about the non-rational's interconnect types
> (M2/PCIe/NVMe/SATA/SAS) even though they have different performances.

Yeah, that's why I'd stay just with the general storage type, not the
type of connection.

> Because some of these interconnects are compatible with each-other 
> medium might change its interconnect during the lifecycle of the data.
> 
> Then left with are the types of mediums- rotational, non-rotational and
> zoned which, we can identify safely.
> 
> Use-cases plan to mix these types of mediums to achieve the 
> cost-per-byte advantage. As of now, I think our target can be to help these
> kind of planned mixing.
> 
> >> +};
> >> +
> >>   struct btrfs_zoned_device_info;
> >>   
> >>   struct btrfs_device {
> >> @@ -101,9 +111,16 @@ struct btrfs_device {
> >>   
> >>   	/* optimal io width for this device */
> >>   	u32 io_width;
> >> -	/* type and info about this device */
> >> +
> >> +	/* Type and info about this device, on-disk (currently unused) */
> >>   	u64 type;
> >>   
> >> +	/*
> >> +	 * Device type (in memory only) at some point, merge to the on-disk
> >> +	 * member 'type' above.
> >> +	 */
> >> +	enum btrfs_dev_types dev_type;
> > 
> > So at some point this is planned to be user configurable? We should be
> > always able to detect the device type at mount type so I wonder if this
> > needs to be stored on disk.
> 
> I didn't find any planned configs (white papers) discussing the
> advantages of mixing non-rotational drives with different interconnect
> types. If any then, we may have to provide the manual configuration.
> (If the mixing is across the medium types like non-rotational with
> either zoned or HDD, then the users don't have to configure as we can
> optimize the allocation automatically for performance.)
> 
> Also, hot cache device or device grouping (when implemented) need the
> user to configure.
> 
> And so maybe part of in-memory 'dev_type' would be in-sync with on-disk
> 'type' at some point. IMO.

Yeah, the tentative plan for such features is to first make it in-memory
only so we can test it without also defining the permanent on-disk
format. From user perspective if there's a reasonable auto-tuning
behaviour without a need for configuration then it should be
implemented. Once we start adding knobs for various storage types and
what should be stored where, it's a beginning of chaos.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device
  2022-02-01 17:06       ` David Sterba
@ 2022-02-03 12:56         ` Anand Jain
  0 siblings, 0 replies; 13+ messages in thread
From: Anand Jain @ 2022-02-03 12:56 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 02/02/2022 01:06, David Sterba wrote:
> On Sun, Jan 30, 2022 at 12:24:15AM +0800, Anand Jain wrote:
>> On 27/01/2022 00:53, David Sterba wrote:
>>> On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote:
>>>> +/*
>>>> + * Device class types arranged by their IO latency from low to high.
>>>> + */
>>>> +enum btrfs_dev_types {
>>>> +	BTRFS_DEV_TYPE_MEM = 1,
>>>> +	BTRFS_DEV_TYPE_NONROT,
>>>> +	BTRFS_DEV_TYPE_ROT,
>>>> +	BTRFS_DEV_TYPE_ZONED,
>>>
>>> I think this should be separate, in one list define all sensible device
>>> types and then add arrays sorted by latency, or other factors.
>>>
>>> The zoned devices are mostly HDD but ther are also SSD-like using ZNS,
>>> so that can't be both under BTRFS_DEV_TYPE_ZONED and behind
>>> BTRFS_DEV_TYPE_ROT as if it had worse latency.
>>
>> I still do not have a complete idea of its implantation using an array.
>> Do you mean something like as below,
>>
>> enum btrfs_dev_types {
>> 	::
>> };
>>
>> struct btrfs_dev_type {
>> 	enum btrfs_dev_types dev_type;
>> 	int priority_latency;
>> };
> 
> 	enum btrfs_dev_types {
> 		BTRFS_DEV_TYPE_HDD,
> 		BTRFS_DEV_TYPE_SSD,
> 		BTRFS_DEV_TYPE_NVME,
> 		BTRFS_DEV_TYPE_ZONED_HDD,
> 		BTRFS_DEV_TYPE_ZONED_ZNS,
> 	};
> 
> 	enum btrfs_dev_types btrfs_devices_by_latency[] = {
> 		BTRFS_DEV_TYPE_NVME,
> 		BTRFS_DEV_TYPE_SSD,
> 		BTRFS_DEV_TYPE_ZONED_ZNS,
> 		BTRFS_DEV_TYPE_HDD,
> 		BTRFS_DEV_TYPE_ZONED_HDD,
> 	};
> 

  Ah. It is a cleaner way. Thanks.

> 	enum btrfs_dev_types btrfs_devices_by_capacity[] = {
> 		BTRFS_DEV_TYPE_ZONED_HDD,
> 		BTRFS_DEV_TYPE_HDD,
> 		BTRFS_DEV_TYPE_SSD,
> 		BTRFS_DEV_TYPE_ZONED_ZNS,
> 		BTRFS_DEV_TYPE_NVME,
> 	};
> 
 >
> Then if the chunk allocator has a selected policy (here by latency and
> by capacity), it would use the list as additional sorting criteria.

> Optimizing for latency: device 1 (SSD) vs device 2 (NVME), pick NVME
> even if device 1 would be otherwise picked due to the capacity criteria
> (as we have it now).

  Hmm. Above, btrfs_devices_by_capacity[] contains device types
  (not device itself).
  Do you mean the btrfs_devices_by_capacity[]  used as the device-type
  priority list. And, then to sort the devices by the capacity in that
  type?

>> I think we can identify a ZNS and set its relative latency to a value
>> lower (better) than rotational.
> 
> The device classes are general I don't mean to measure the latecy
> exactly, it's usually typical for the whole class and eg. NVME is
> considered better than SSD and SSD better than HDD.
> 

  Got it.

>>> I'm not sure how much we should try to guess the device types, the ones
>>> you have so far are almost all we can get without peeking too much into
>>> the devices/queues internals.
>>
>> Agree.
>>
>>> Also here's the terminology question if we should still consider
>>> rotational device status as the main criteria, because then SSD, NVMe,
>>> RAM-backed are all non-rotational but with different latency
>>> characteristics.
>>
>> Right. The expected performance also depends on the interconnect type
>> which is undetectable.
>>
>> IMO we shouldn't worry about the non-rational's interconnect types
>> (M2/PCIe/NVMe/SATA/SAS) even though they have different performances.
> 
> Yeah, that's why I'd stay just with the general storage type, not the
> type of connection.
> 

  Hmm. We might have challenges to distinguish between SSD and NVME.
  Both are non-rotational. They differ by interface type. Unless we read
  the class name from the
     struct block_device::struct device bd_device::class
  that may be considered as a dirty hack.

>> Because some of these interconnects are compatible with each-other
>> medium might change its interconnect during the lifecycle of the data.
>>
>> Then left with are the types of mediums- rotational, non-rotational and
>> zoned which, we can identify safely.
>>
>> Use-cases plan to mix these types of mediums to achieve the
>> cost-per-byte advantage. As of now, I think our target can be to help these
>> kind of planned mixing.
>>
>>>> +};
>>>> +
>>>>    struct btrfs_zoned_device_info;
>>>>    
>>>>    struct btrfs_device {
>>>> @@ -101,9 +111,16 @@ struct btrfs_device {
>>>>    
>>>>    	/* optimal io width for this device */
>>>>    	u32 io_width;
>>>> -	/* type and info about this device */
>>>> +
>>>> +	/* Type and info about this device, on-disk (currently unused) */
>>>>    	u64 type;
>>>>    
>>>> +	/*
>>>> +	 * Device type (in memory only) at some point, merge to the on-disk
>>>> +	 * member 'type' above.
>>>> +	 */
>>>> +	enum btrfs_dev_types dev_type;
>>>
>>> So at some point this is planned to be user configurable? We should be
>>> always able to detect the device type at mount type so I wonder if this
>>> needs to be stored on disk.
>>
>> I didn't find any planned configs (white papers) discussing the
>> advantages of mixing non-rotational drives with different interconnect
>> types. If any then, we may have to provide the manual configuration.
>> (If the mixing is across the medium types like non-rotational with
>> either zoned or HDD, then the users don't have to configure as we can
>> optimize the allocation automatically for performance.)
>>
>> Also, hot cache device or device grouping (when implemented) need the
>> user to configure.
>>
>> And so maybe part of in-memory 'dev_type' would be in-sync with on-disk
>> 'type' at some point. IMO.
> 
> Yeah, the tentative plan for such features is to first make it in-memory
> only so we can test it without also defining the permanent on-disk
> format. From user perspective if there's a reasonable auto-tuning
> behaviour without a need for configuration then it should be
> implemented. Once we start adding knobs for various storage types and
> what should be stored where, it's a beginning of chaos.

Yep. Thanks.

Anand

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-02-03 12:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain
2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain
2022-01-26 16:53   ` David Sterba
2022-01-29 16:24     ` Anand Jain
2022-02-01 17:06       ` David Sterba
2022-02-03 12:56         ` Anand Jain
2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain
2022-01-26 17:01   ` David Sterba
2022-01-29 16:24     ` Anand Jain
2022-01-26 17:38   ` David Sterba
2022-01-29 16:46     ` Anand Jain
2022-01-30 22:15       ` Goffredo Baroncelli
2022-01-30 22:28     ` Goffredo Baroncelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).