* [PATCH 0/2] device type and create chunk @ 2022-01-18 15:18 Anand Jain 2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain 2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain 0 siblings, 2 replies; 13+ messages in thread From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw) To: linux-btrfs I had these patches as part of experiments with the readpolicy I am sending it now. This is different from the allocation_hint mode patch-set where I use the device type to make the allocation destination automatic. Patch 1/2 keeps the device's type in the struct btrfs_device so that we could maintain the status if there are mixed devices in the filesystem. And if so, then patch 2/2 shall take care of arranging the disks by the order of latency so that metadata chunks can pick disk with low latency and data can pick the disk with higher latency. By having fewer restrictions and not hard coding the chunk allocation destination helps to cause the spillover to the available disk space instead of causing the spurious ENOSPC. Anand Jain (2): btrfs: keep device type in the struct btrfs_device btrfs: create chunk device type aware fs/btrfs/dev-replace.c | 2 + fs/btrfs/volumes.c | 90 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 26 +++++++++++- 3 files changed, 117 insertions(+), 1 deletion(-) -- 2.33.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] btrfs: keep device type in the struct btrfs_device 2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain @ 2022-01-18 15:18 ` Anand Jain 2022-01-26 16:53 ` David Sterba 2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain 1 sibling, 1 reply; 13+ messages in thread From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw) To: linux-btrfs Preparation to make data/metadata chunks allocations based on the device types- keep the identified device type in the struct btrfs_device. This patch adds a member 'dev_type' to hold the defined device types in the struct btrfs_devices. Also, add a helper function and a struct btrfs_fs_devices member 'mixed_dev_type' to know if the filesystem contains the mixed device types. Struct btrfs_device has an existing member 'type' that stages and writes back to the on-disk format. This patch does not use it. As just an in-memory only data will suffice the requirement here. Signed-off-by: Anand Jain <anand.jain@oracle.com> --- fs/btrfs/dev-replace.c | 2 ++ fs/btrfs/volumes.c | 45 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 26 +++++++++++++++++++++++- 3 files changed, 72 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 71fd99b48283..3731c7d1c6b7 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, device->dev_stats_valid = 1; set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); device->fs_devices = fs_devices; + device->dev_type = btrfs_get_device_type(device); ret = btrfs_get_dev_zone_info(device, false); if (ret) @@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, list_add(&device->dev_list, &fs_devices->devices); fs_devices->num_devices++; fs_devices->open_devices++; + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); mutex_unlock(&fs_devices->device_list_mutex); *device_out = device; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9d50e035e61d..da3d6d0f5bc3 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, device->generation > (*latest_dev)->generation)) { *latest_dev = device; } + device->dev_type = btrfs_get_device_type(device); continue; } @@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices) __btrfs_free_extra_devids(seed_dev, &latest_dev); fs_devices->latest_dev = latest_dev; + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); mutex_unlock(&uuid_mutex); } @@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1; btrfs_set_super_num_devices(fs_info->super_copy, num_devices); + + cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices); + mutex_unlock(&fs_devices->device_list_mutex); /* @@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans) return ret; } +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices) +{ + struct btrfs_device *device; + int type_rot = 0; + int type_nonrot = 0; + + list_for_each_entry(device, &fs_devices->devices, dev_list) { + + if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) + continue; + + switch (device->dev_type) { + case BTRFS_DEV_TYPE_ROT: + type_rot++; + break; + case BTRFS_DEV_TYPE_NONROT: + default: + type_nonrot++; + } + } + + if (type_rot && type_nonrot) + return true; + else + return false; +} + +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device) +{ + if (bdev_is_zoned(device->bdev)) + return BTRFS_DEV_TYPE_ZONED; + + if (blk_queue_nonrot(bdev_get_queue(device->bdev))) + return BTRFS_DEV_TYPE_NONROT; + + return BTRFS_DEV_TYPE_ROT; +} + int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path) { struct btrfs_root *root = fs_info->dev_root; @@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state); device->mode = FMODE_EXCL; device->dev_stats_valid = 1; + device->dev_type = btrfs_get_device_type(device); set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); if (seeding_dev) { @@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path atomic64_add(device->total_bytes, &fs_info->free_chunk_space); fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev)); + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy); btrfs_set_super_total_bytes(fs_info->super_copy, diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 6a790b85edd8..5be31161601d 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -52,6 +52,16 @@ struct btrfs_io_geometry { #define BTRFS_DEV_STATE_FLUSH_SENT (4) #define BTRFS_DEV_STATE_NO_READA (5) +/* + * Device class types arranged by their IO latency from low to high. + */ +enum btrfs_dev_types { + BTRFS_DEV_TYPE_MEM = 1, + BTRFS_DEV_TYPE_NONROT, + BTRFS_DEV_TYPE_ROT, + BTRFS_DEV_TYPE_ZONED, +}; + struct btrfs_zoned_device_info; struct btrfs_device { @@ -101,9 +111,16 @@ struct btrfs_device { /* optimal io width for this device */ u32 io_width; - /* type and info about this device */ + + /* Type and info about this device, on-disk (currently unused) */ u64 type; + /* + * Device type (in memory only) at some point, merge to the on-disk + * member 'type' above. + */ + enum btrfs_dev_types dev_type; + /* minimal io size for this device */ u32 sector_size; @@ -296,6 +313,11 @@ struct btrfs_fs_devices { */ bool rotating; + /* + * True when devices belong more than one type. + */ + bool mixed_dev_types; + struct btrfs_fs_info *fs_info; /* sysfs kobjects */ struct kobject fsid_kobj; @@ -636,5 +658,7 @@ int btrfs_bg_type_to_factor(u64 flags); const char *btrfs_bg_type_to_raid_name(u64 flags); int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info); bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical); +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device); +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices); #endif -- 2.33.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device 2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain @ 2022-01-26 16:53 ` David Sterba 2022-01-29 16:24 ` Anand Jain 0 siblings, 1 reply; 13+ messages in thread From: David Sterba @ 2022-01-26 16:53 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote: > Preparation to make data/metadata chunks allocations based on the device > types- keep the identified device type in the struct btrfs_device. > > This patch adds a member 'dev_type' to hold the defined device types in > the struct btrfs_devices. > > Also, add a helper function and a struct btrfs_fs_devices member > 'mixed_dev_type' to know if the filesystem contains the mixed device > types. > > Struct btrfs_device has an existing member 'type' that stages and writes > back to the on-disk format. This patch does not use it. As just an > in-memory only data will suffice the requirement here. > > Signed-off-by: Anand Jain <anand.jain@oracle.com> > --- > fs/btrfs/dev-replace.c | 2 ++ > fs/btrfs/volumes.c | 45 ++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/volumes.h | 26 +++++++++++++++++++++++- > 3 files changed, 72 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c > index 71fd99b48283..3731c7d1c6b7 100644 > --- a/fs/btrfs/dev-replace.c > +++ b/fs/btrfs/dev-replace.c > @@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, > device->dev_stats_valid = 1; > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > device->fs_devices = fs_devices; > + device->dev_type = btrfs_get_device_type(device); > > ret = btrfs_get_dev_zone_info(device, false); > if (ret) > @@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, > list_add(&device->dev_list, &fs_devices->devices); > fs_devices->num_devices++; > fs_devices->open_devices++; > + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); > mutex_unlock(&fs_devices->device_list_mutex); > > *device_out = device; > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 9d50e035e61d..da3d6d0f5bc3 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, > device->generation > (*latest_dev)->generation)) { > *latest_dev = device; > } > + device->dev_type = btrfs_get_device_type(device); > continue; > } > > @@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices) > __btrfs_free_extra_devids(seed_dev, &latest_dev); > > fs_devices->latest_dev = latest_dev; > + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); > > mutex_unlock(&uuid_mutex); > } > @@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, > > num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1; > btrfs_set_super_num_devices(fs_info->super_copy, num_devices); > + > + cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices); > + > mutex_unlock(&fs_devices->device_list_mutex); > > /* > @@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans) > return ret; > } > > +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices) > +{ > + struct btrfs_device *device; > + int type_rot = 0; > + int type_nonrot = 0; > + > + list_for_each_entry(device, &fs_devices->devices, dev_list) { > + > + if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) > + continue; > + > + switch (device->dev_type) { > + case BTRFS_DEV_TYPE_ROT: > + type_rot++; > + break; > + case BTRFS_DEV_TYPE_NONROT: > + default: > + type_nonrot++; > + } > + } > + > + if (type_rot && type_nonrot) > + return true; > + else > + return false; > +} > + > +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device) > +{ > + if (bdev_is_zoned(device->bdev)) > + return BTRFS_DEV_TYPE_ZONED; > + > + if (blk_queue_nonrot(bdev_get_queue(device->bdev))) > + return BTRFS_DEV_TYPE_NONROT; > + > + return BTRFS_DEV_TYPE_ROT; > +} > + > int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path) > { > struct btrfs_root *root = fs_info->dev_root; > @@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state); > device->mode = FMODE_EXCL; > device->dev_stats_valid = 1; > + device->dev_type = btrfs_get_device_type(device); > set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); > > if (seeding_dev) { > @@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path > atomic64_add(device->total_bytes, &fs_info->free_chunk_space); > > fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev)); > + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); > > orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy); > btrfs_set_super_total_bytes(fs_info->super_copy, > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 6a790b85edd8..5be31161601d 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -52,6 +52,16 @@ struct btrfs_io_geometry { > #define BTRFS_DEV_STATE_FLUSH_SENT (4) > #define BTRFS_DEV_STATE_NO_READA (5) > > +/* > + * Device class types arranged by their IO latency from low to high. > + */ > +enum btrfs_dev_types { > + BTRFS_DEV_TYPE_MEM = 1, > + BTRFS_DEV_TYPE_NONROT, > + BTRFS_DEV_TYPE_ROT, > + BTRFS_DEV_TYPE_ZONED, I think this should be separate, in one list define all sensible device types and then add arrays sorted by latency, or other factors. The zoned devices are mostly HDD but ther are also SSD-like using ZNS, so that can't be both under BTRFS_DEV_TYPE_ZONED and behind BTRFS_DEV_TYPE_ROT as if it had worse latency. I'm not sure how much we should try to guess the device types, the ones you have so far are almost all we can get without peeking too much into the devices/queues internals. Also here's the terminology question if we should still consider rotational device status as the main criteria, because then SSD, NVMe, RAM-backed are all non-rotational but with different latency characteristics. > +}; > + > struct btrfs_zoned_device_info; > > struct btrfs_device { > @@ -101,9 +111,16 @@ struct btrfs_device { > > /* optimal io width for this device */ > u32 io_width; > - /* type and info about this device */ > + > + /* Type and info about this device, on-disk (currently unused) */ > u64 type; > > + /* > + * Device type (in memory only) at some point, merge to the on-disk > + * member 'type' above. > + */ > + enum btrfs_dev_types dev_type; So at some point this is planned to be user configurable? We should be always able to detect the device type at mount type so I wonder if this needs to be stored on disk. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device 2022-01-26 16:53 ` David Sterba @ 2022-01-29 16:24 ` Anand Jain 2022-02-01 17:06 ` David Sterba 0 siblings, 1 reply; 13+ messages in thread From: Anand Jain @ 2022-01-29 16:24 UTC (permalink / raw) To: dsterba, linux-btrfs On 27/01/2022 00:53, David Sterba wrote: > On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote: >> Preparation to make data/metadata chunks allocations based on the device >> types- keep the identified device type in the struct btrfs_device. >> >> This patch adds a member 'dev_type' to hold the defined device types in >> the struct btrfs_devices. >> >> Also, add a helper function and a struct btrfs_fs_devices member >> 'mixed_dev_type' to know if the filesystem contains the mixed device >> types. >> >> Struct btrfs_device has an existing member 'type' that stages and writes >> back to the on-disk format. This patch does not use it. As just an >> in-memory only data will suffice the requirement here. >> >> Signed-off-by: Anand Jain <anand.jain@oracle.com> >> --- >> fs/btrfs/dev-replace.c | 2 ++ >> fs/btrfs/volumes.c | 45 ++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/volumes.h | 26 +++++++++++++++++++++++- >> 3 files changed, 72 insertions(+), 1 deletion(-) >> >> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c >> index 71fd99b48283..3731c7d1c6b7 100644 >> --- a/fs/btrfs/dev-replace.c >> +++ b/fs/btrfs/dev-replace.c >> @@ -325,6 +325,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, >> device->dev_stats_valid = 1; >> set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); >> device->fs_devices = fs_devices; >> + device->dev_type = btrfs_get_device_type(device); >> >> ret = btrfs_get_dev_zone_info(device, false); >> if (ret) >> @@ -334,6 +335,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, >> list_add(&device->dev_list, &fs_devices->devices); >> fs_devices->num_devices++; >> fs_devices->open_devices++; >> + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); >> mutex_unlock(&fs_devices->device_list_mutex); >> >> *device_out = device; >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 9d50e035e61d..da3d6d0f5bc3 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -1041,6 +1041,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, >> device->generation > (*latest_dev)->generation)) { >> *latest_dev = device; >> } >> + device->dev_type = btrfs_get_device_type(device); >> continue; >> } >> >> @@ -1084,6 +1085,7 @@ void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices) >> __btrfs_free_extra_devids(seed_dev, &latest_dev); >> >> fs_devices->latest_dev = latest_dev; >> + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); >> >> mutex_unlock(&uuid_mutex); >> } >> @@ -2183,6 +2185,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, >> >> num_devices = btrfs_super_num_devices(fs_info->super_copy) - 1; >> btrfs_set_super_num_devices(fs_info->super_copy, num_devices); >> + >> + cur_devices->mixed_dev_types = btrfs_has_mixed_dev_types(cur_devices); >> + >> mutex_unlock(&fs_devices->device_list_mutex); >> >> /* >> @@ -2584,6 +2589,44 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans) >> return ret; >> } >> >> +bool btrfs_has_mixed_dev_types(struct btrfs_fs_devices *fs_devices) >> +{ >> + struct btrfs_device *device; >> + int type_rot = 0; >> + int type_nonrot = 0; >> + >> + list_for_each_entry(device, &fs_devices->devices, dev_list) { >> + >> + if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state)) >> + continue; >> + >> + switch (device->dev_type) { >> + case BTRFS_DEV_TYPE_ROT: >> + type_rot++; >> + break; >> + case BTRFS_DEV_TYPE_NONROT: >> + default: >> + type_nonrot++; >> + } >> + } >> + >> + if (type_rot && type_nonrot) >> + return true; >> + else >> + return false; >> +} >> + >> +enum btrfs_dev_types btrfs_get_device_type(struct btrfs_device *device) >> +{ >> + if (bdev_is_zoned(device->bdev)) >> + return BTRFS_DEV_TYPE_ZONED; >> + >> + if (blk_queue_nonrot(bdev_get_queue(device->bdev))) >> + return BTRFS_DEV_TYPE_NONROT; >> + >> + return BTRFS_DEV_TYPE_ROT; >> +} >> + >> int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path) >> { >> struct btrfs_root *root = fs_info->dev_root; >> @@ -2675,6 +2718,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >> clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state); >> device->mode = FMODE_EXCL; >> device->dev_stats_valid = 1; >> + device->dev_type = btrfs_get_device_type(device); >> set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE); >> >> if (seeding_dev) { >> @@ -2710,6 +2754,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path >> atomic64_add(device->total_bytes, &fs_info->free_chunk_space); >> >> fs_devices->rotating = !blk_queue_nonrot(bdev_get_queue(bdev)); >> + fs_devices->mixed_dev_types = btrfs_has_mixed_dev_types(fs_devices); >> >> orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy); >> btrfs_set_super_total_bytes(fs_info->super_copy, >> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h >> index 6a790b85edd8..5be31161601d 100644 >> --- a/fs/btrfs/volumes.h >> +++ b/fs/btrfs/volumes.h >> @@ -52,6 +52,16 @@ struct btrfs_io_geometry { >> #define BTRFS_DEV_STATE_FLUSH_SENT (4) >> #define BTRFS_DEV_STATE_NO_READA (5) >> >> +/* >> + * Device class types arranged by their IO latency from low to high. >> + */ >> +enum btrfs_dev_types { >> + BTRFS_DEV_TYPE_MEM = 1, >> + BTRFS_DEV_TYPE_NONROT, >> + BTRFS_DEV_TYPE_ROT, >> + BTRFS_DEV_TYPE_ZONED, > > I think this should be separate, in one list define all sensible device > types and then add arrays sorted by latency, or other factors. > > The zoned devices are mostly HDD but ther are also SSD-like using ZNS, > so that can't be both under BTRFS_DEV_TYPE_ZONED and behind > BTRFS_DEV_TYPE_ROT as if it had worse latency. I still do not have a complete idea of its implantation using an array. Do you mean something like as below, enum btrfs_dev_types { :: }; struct btrfs_dev_type { enum btrfs_dev_types dev_type; int priority_latency; }; I think we can identify a ZNS and set its relative latency to a value lower (better) than rotational. > I'm not sure how much we should try to guess the device types, the ones > you have so far are almost all we can get without peeking too much into > the devices/queues internals. Agree. > Also here's the terminology question if we should still consider > rotational device status as the main criteria, because then SSD, NVMe, > RAM-backed are all non-rotational but with different latency > characteristics. Right. The expected performance also depends on the interconnect type which is undetectable. IMO we shouldn't worry about the non-rational's interconnect types (M2/PCIe/NVMe/SATA/SAS) even though they have different performances. Because some of these interconnects are compatible with each-other medium might change its interconnect during the lifecycle of the data. Then left with are the types of mediums- rotational, non-rotational and zoned which, we can identify safely. Use-cases plan to mix these types of mediums to achieve the cost-per-byte advantage. As of now, I think our target can be to help these kind of planned mixing. >> +}; >> + >> struct btrfs_zoned_device_info; >> >> struct btrfs_device { >> @@ -101,9 +111,16 @@ struct btrfs_device { >> >> /* optimal io width for this device */ >> u32 io_width; >> - /* type and info about this device */ >> + >> + /* Type and info about this device, on-disk (currently unused) */ >> u64 type; >> >> + /* >> + * Device type (in memory only) at some point, merge to the on-disk >> + * member 'type' above. >> + */ >> + enum btrfs_dev_types dev_type; > > So at some point this is planned to be user configurable? We should be > always able to detect the device type at mount type so I wonder if this > needs to be stored on disk. I didn't find any planned configs (white papers) discussing the advantages of mixing non-rotational drives with different interconnect types. If any then, we may have to provide the manual configuration. (If the mixing is across the medium types like non-rotational with either zoned or HDD, then the users don't have to configure as we can optimize the allocation automatically for performance.) Also, hot cache device or device grouping (when implemented) need the user to configure. And so maybe part of in-memory 'dev_type' would be in-sync with on-disk 'type' at some point. IMO. Thanks, Anand ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device 2022-01-29 16:24 ` Anand Jain @ 2022-02-01 17:06 ` David Sterba 2022-02-03 12:56 ` Anand Jain 0 siblings, 1 reply; 13+ messages in thread From: David Sterba @ 2022-02-01 17:06 UTC (permalink / raw) To: Anand Jain; +Cc: dsterba, linux-btrfs On Sun, Jan 30, 2022 at 12:24:15AM +0800, Anand Jain wrote: > On 27/01/2022 00:53, David Sterba wrote: > > On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote: > >> +/* > >> + * Device class types arranged by their IO latency from low to high. > >> + */ > >> +enum btrfs_dev_types { > >> + BTRFS_DEV_TYPE_MEM = 1, > >> + BTRFS_DEV_TYPE_NONROT, > >> + BTRFS_DEV_TYPE_ROT, > >> + BTRFS_DEV_TYPE_ZONED, > > > > I think this should be separate, in one list define all sensible device > > types and then add arrays sorted by latency, or other factors. > > > > The zoned devices are mostly HDD but ther are also SSD-like using ZNS, > > so that can't be both under BTRFS_DEV_TYPE_ZONED and behind > > BTRFS_DEV_TYPE_ROT as if it had worse latency. > > I still do not have a complete idea of its implantation using an array. > Do you mean something like as below, > > enum btrfs_dev_types { > :: > }; > > struct btrfs_dev_type { > enum btrfs_dev_types dev_type; > int priority_latency; > }; enum btrfs_dev_types { BTRFS_DEV_TYPE_HDD, BTRFS_DEV_TYPE_SSD, BTRFS_DEV_TYPE_NVME, BTRFS_DEV_TYPE_ZONED_HDD, BTRFS_DEV_TYPE_ZONED_ZNS, }; enum btrfs_dev_types btrfs_devices_by_latency[] = { BTRFS_DEV_TYPE_NVME, BTRFS_DEV_TYPE_SSD, BTRFS_DEV_TYPE_ZONED_ZNS, BTRFS_DEV_TYPE_HDD, BTRFS_DEV_TYPE_ZONED_HDD, }; enum btrfs_dev_types btrfs_devices_by_capacity[] = { BTRFS_DEV_TYPE_ZONED_HDD, BTRFS_DEV_TYPE_HDD, BTRFS_DEV_TYPE_SSD, BTRFS_DEV_TYPE_ZONED_ZNS, BTRFS_DEV_TYPE_NVME, }; Then if the chunk allocator has a selected policy (here by latency and by capacity), it would use the list as additional sorting criteria. Optimizing for latency: device 1 (SSD) vs device 2 (NVME), pick NVME even if device 1 would be otherwise picked due to the capacity criteria (as we have it now). > I think we can identify a ZNS and set its relative latency to a value > lower (better) than rotational. The device classes are general I don't mean to measure the latecy exactly, it's usually typical for the whole class and eg. NVME is considered better than SSD and SSD better than HDD. > > I'm not sure how much we should try to guess the device types, the ones > > you have so far are almost all we can get without peeking too much into > > the devices/queues internals. > > Agree. > > > Also here's the terminology question if we should still consider > > rotational device status as the main criteria, because then SSD, NVMe, > > RAM-backed are all non-rotational but with different latency > > characteristics. > > Right. The expected performance also depends on the interconnect type > which is undetectable. > > IMO we shouldn't worry about the non-rational's interconnect types > (M2/PCIe/NVMe/SATA/SAS) even though they have different performances. Yeah, that's why I'd stay just with the general storage type, not the type of connection. > Because some of these interconnects are compatible with each-other > medium might change its interconnect during the lifecycle of the data. > > Then left with are the types of mediums- rotational, non-rotational and > zoned which, we can identify safely. > > Use-cases plan to mix these types of mediums to achieve the > cost-per-byte advantage. As of now, I think our target can be to help these > kind of planned mixing. > > >> +}; > >> + > >> struct btrfs_zoned_device_info; > >> > >> struct btrfs_device { > >> @@ -101,9 +111,16 @@ struct btrfs_device { > >> > >> /* optimal io width for this device */ > >> u32 io_width; > >> - /* type and info about this device */ > >> + > >> + /* Type and info about this device, on-disk (currently unused) */ > >> u64 type; > >> > >> + /* > >> + * Device type (in memory only) at some point, merge to the on-disk > >> + * member 'type' above. > >> + */ > >> + enum btrfs_dev_types dev_type; > > > > So at some point this is planned to be user configurable? We should be > > always able to detect the device type at mount type so I wonder if this > > needs to be stored on disk. > > I didn't find any planned configs (white papers) discussing the > advantages of mixing non-rotational drives with different interconnect > types. If any then, we may have to provide the manual configuration. > (If the mixing is across the medium types like non-rotational with > either zoned or HDD, then the users don't have to configure as we can > optimize the allocation automatically for performance.) > > Also, hot cache device or device grouping (when implemented) need the > user to configure. > > And so maybe part of in-memory 'dev_type' would be in-sync with on-disk > 'type' at some point. IMO. Yeah, the tentative plan for such features is to first make it in-memory only so we can test it without also defining the permanent on-disk format. From user perspective if there's a reasonable auto-tuning behaviour without a need for configuration then it should be implemented. Once we start adding knobs for various storage types and what should be stored where, it's a beginning of chaos. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] btrfs: keep device type in the struct btrfs_device 2022-02-01 17:06 ` David Sterba @ 2022-02-03 12:56 ` Anand Jain 0 siblings, 0 replies; 13+ messages in thread From: Anand Jain @ 2022-02-03 12:56 UTC (permalink / raw) To: dsterba, linux-btrfs On 02/02/2022 01:06, David Sterba wrote: > On Sun, Jan 30, 2022 at 12:24:15AM +0800, Anand Jain wrote: >> On 27/01/2022 00:53, David Sterba wrote: >>> On Tue, Jan 18, 2022 at 11:18:01PM +0800, Anand Jain wrote: >>>> +/* >>>> + * Device class types arranged by their IO latency from low to high. >>>> + */ >>>> +enum btrfs_dev_types { >>>> + BTRFS_DEV_TYPE_MEM = 1, >>>> + BTRFS_DEV_TYPE_NONROT, >>>> + BTRFS_DEV_TYPE_ROT, >>>> + BTRFS_DEV_TYPE_ZONED, >>> >>> I think this should be separate, in one list define all sensible device >>> types and then add arrays sorted by latency, or other factors. >>> >>> The zoned devices are mostly HDD but ther are also SSD-like using ZNS, >>> so that can't be both under BTRFS_DEV_TYPE_ZONED and behind >>> BTRFS_DEV_TYPE_ROT as if it had worse latency. >> >> I still do not have a complete idea of its implantation using an array. >> Do you mean something like as below, >> >> enum btrfs_dev_types { >> :: >> }; >> >> struct btrfs_dev_type { >> enum btrfs_dev_types dev_type; >> int priority_latency; >> }; > > enum btrfs_dev_types { > BTRFS_DEV_TYPE_HDD, > BTRFS_DEV_TYPE_SSD, > BTRFS_DEV_TYPE_NVME, > BTRFS_DEV_TYPE_ZONED_HDD, > BTRFS_DEV_TYPE_ZONED_ZNS, > }; > > enum btrfs_dev_types btrfs_devices_by_latency[] = { > BTRFS_DEV_TYPE_NVME, > BTRFS_DEV_TYPE_SSD, > BTRFS_DEV_TYPE_ZONED_ZNS, > BTRFS_DEV_TYPE_HDD, > BTRFS_DEV_TYPE_ZONED_HDD, > }; > Ah. It is a cleaner way. Thanks. > enum btrfs_dev_types btrfs_devices_by_capacity[] = { > BTRFS_DEV_TYPE_ZONED_HDD, > BTRFS_DEV_TYPE_HDD, > BTRFS_DEV_TYPE_SSD, > BTRFS_DEV_TYPE_ZONED_ZNS, > BTRFS_DEV_TYPE_NVME, > }; > > > Then if the chunk allocator has a selected policy (here by latency and > by capacity), it would use the list as additional sorting criteria. > Optimizing for latency: device 1 (SSD) vs device 2 (NVME), pick NVME > even if device 1 would be otherwise picked due to the capacity criteria > (as we have it now). Hmm. Above, btrfs_devices_by_capacity[] contains device types (not device itself). Do you mean the btrfs_devices_by_capacity[] used as the device-type priority list. And, then to sort the devices by the capacity in that type? >> I think we can identify a ZNS and set its relative latency to a value >> lower (better) than rotational. > > The device classes are general I don't mean to measure the latecy > exactly, it's usually typical for the whole class and eg. NVME is > considered better than SSD and SSD better than HDD. > Got it. >>> I'm not sure how much we should try to guess the device types, the ones >>> you have so far are almost all we can get without peeking too much into >>> the devices/queues internals. >> >> Agree. >> >>> Also here's the terminology question if we should still consider >>> rotational device status as the main criteria, because then SSD, NVMe, >>> RAM-backed are all non-rotational but with different latency >>> characteristics. >> >> Right. The expected performance also depends on the interconnect type >> which is undetectable. >> >> IMO we shouldn't worry about the non-rational's interconnect types >> (M2/PCIe/NVMe/SATA/SAS) even though they have different performances. > > Yeah, that's why I'd stay just with the general storage type, not the > type of connection. > Hmm. We might have challenges to distinguish between SSD and NVME. Both are non-rotational. They differ by interface type. Unless we read the class name from the struct block_device::struct device bd_device::class that may be considered as a dirty hack. >> Because some of these interconnects are compatible with each-other >> medium might change its interconnect during the lifecycle of the data. >> >> Then left with are the types of mediums- rotational, non-rotational and >> zoned which, we can identify safely. >> >> Use-cases plan to mix these types of mediums to achieve the >> cost-per-byte advantage. As of now, I think our target can be to help these >> kind of planned mixing. >> >>>> +}; >>>> + >>>> struct btrfs_zoned_device_info; >>>> >>>> struct btrfs_device { >>>> @@ -101,9 +111,16 @@ struct btrfs_device { >>>> >>>> /* optimal io width for this device */ >>>> u32 io_width; >>>> - /* type and info about this device */ >>>> + >>>> + /* Type and info about this device, on-disk (currently unused) */ >>>> u64 type; >>>> >>>> + /* >>>> + * Device type (in memory only) at some point, merge to the on-disk >>>> + * member 'type' above. >>>> + */ >>>> + enum btrfs_dev_types dev_type; >>> >>> So at some point this is planned to be user configurable? We should be >>> always able to detect the device type at mount type so I wonder if this >>> needs to be stored on disk. >> >> I didn't find any planned configs (white papers) discussing the >> advantages of mixing non-rotational drives with different interconnect >> types. If any then, we may have to provide the manual configuration. >> (If the mixing is across the medium types like non-rotational with >> either zoned or HDD, then the users don't have to configure as we can >> optimize the allocation automatically for performance.) >> >> Also, hot cache device or device grouping (when implemented) need the >> user to configure. >> >> And so maybe part of in-memory 'dev_type' would be in-sync with on-disk >> 'type' at some point. IMO. > > Yeah, the tentative plan for such features is to first make it in-memory > only so we can test it without also defining the permanent on-disk > format. From user perspective if there's a reasonable auto-tuning > behaviour without a need for configuration then it should be > implemented. Once we start adding knobs for various storage types and > what should be stored where, it's a beginning of chaos. Yep. Thanks. Anand ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/2] btrfs: create chunk device type aware 2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain 2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain @ 2022-01-18 15:18 ` Anand Jain 2022-01-26 17:01 ` David Sterba 2022-01-26 17:38 ` David Sterba 1 sibling, 2 replies; 13+ messages in thread From: Anand Jain @ 2022-01-18 15:18 UTC (permalink / raw) To: linux-btrfs Mixed device types configuration exist. Most commonly, HDD mixed with SSD/NVME device types. This use case prefers that the data chunk allocates on HDD and the metadata chunk allocates on the SSD/NVME. As of now, in the function gather_device_info() called from btrfs_create_chunk(), we sort the devices based on unallocated space only. After this patch, this function will check for mixed device types. And will sort the devices based on enum btrfs_device_types. That is, sort if the allocation type is metadata and reverse-sort if the allocation type is data. The advantage of this method is that data/metadata allocation distribution based on the device type happens automatically without any manual configuration. Signed-off-by: Anand Jain <anand.jain@oracle.com> --- fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index da3d6d0f5bc3..77fba78555d7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, return 0; } +/* + * Sort the devices in its ascending order of latency value. + */ +static int btrfs_cmp_device_latency(const void *a, const void *b) +{ + const struct btrfs_device_info *di_a = a; + const struct btrfs_device_info *di_b = b; + struct btrfs_device *dev_a = di_a->dev; + struct btrfs_device *dev_b = di_b->dev; + + if (dev_a->dev_type > dev_b->dev_type) + return 1; + if (dev_a->dev_type < dev_b->dev_type) + return -1; + return 0; +} + +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) +{ + const struct btrfs_device_info *di_a = a; + const struct btrfs_device_info *di_b = b; + struct btrfs_device *dev_a = di_a->dev; + struct btrfs_device *dev_b = di_b->dev; + + if (dev_a->dev_type > dev_b->dev_type) + return -1; + if (dev_a->dev_type < dev_b->dev_type) + return 1; + return 0; +} + /* * sort the devices in descending order by max_avail, total_avail */ @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, sort(devices_info, ndevs, sizeof(struct btrfs_device_info), btrfs_cmp_device_info, NULL); + /* + * Sort devices by their latency. Ascending order of latency for + * metadata and descending order of latency for the data chunks for + * mixed device types. + */ + if (fs_devices->mixed_dev_types) { + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), + btrfs_cmp_device_rev_latency, NULL); + else + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), + btrfs_cmp_device_latency, NULL); + } + return 0; } -- 2.33.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain @ 2022-01-26 17:01 ` David Sterba 2022-01-29 16:24 ` Anand Jain 2022-01-26 17:38 ` David Sterba 1 sibling, 1 reply; 13+ messages in thread From: David Sterba @ 2022-01-26 17:01 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote: > Mixed device types configuration exist. Most commonly, HDD mixed with > SSD/NVME device types. This use case prefers that the data chunk > allocates on HDD and the metadata chunk allocates on the SSD/NVME. > > As of now, in the function gather_device_info() called from > btrfs_create_chunk(), we sort the devices based on unallocated space > only. > > After this patch, this function will check for mixed device types. And > will sort the devices based on enum btrfs_device_types. That is, sort if > the allocation type is metadata and reverse-sort if the allocation type > is data. > > The advantage of this method is that data/metadata allocation distribution > based on the device type happens automatically without any manual > configuration. > > Signed-off-by: Anand Jain <anand.jain@oracle.com> > --- > fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 45 insertions(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index da3d6d0f5bc3..77fba78555d7 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, > return 0; > } > > +/* > + * Sort the devices in its ascending order of latency value. > + */ > +static int btrfs_cmp_device_latency(const void *a, const void *b) > +{ > + const struct btrfs_device_info *di_a = a; > + const struct btrfs_device_info *di_b = b; > + struct btrfs_device *dev_a = di_a->dev; > + struct btrfs_device *dev_b = di_b->dev; > + > + if (dev_a->dev_type > dev_b->dev_type) It's strange to see dev_type being compared while the function compares the latency. As pointed in the first patch, this could simply refer to a array of device types sorted by latency. > + return 1; > + if (dev_a->dev_type < dev_b->dev_type) > + return -1; > + return 0; > +} > + > +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) Regarding the naming, I think we've been using _desc (descending) as suffix for the reverse sort functions, while ascending is the default. > +{ > + const struct btrfs_device_info *di_a = a; > + const struct btrfs_device_info *di_b = b; > + struct btrfs_device *dev_a = di_a->dev; > + struct btrfs_device *dev_b = di_b->dev; > + > + if (dev_a->dev_type > dev_b->dev_type) > + return -1; > + if (dev_a->dev_type < dev_b->dev_type) > + return 1; > + return 0; To avoid code duplication this could be return -btrfs_cmp_device_latency(a, b); > +} > + > /* > * sort the devices in descending order by max_avail, total_avail > */ > @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, > sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > btrfs_cmp_device_info, NULL); > > + /* > + * Sort devices by their latency. Ascending order of latency for > + * metadata and descending order of latency for the data chunks for > + * mixed device types. > + */ > + if (fs_devices->mixed_dev_types) { > + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) > + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > + btrfs_cmp_device_rev_latency, NULL); > + else > + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > + btrfs_cmp_device_latency, NULL); > + } > + > return 0; > } > > -- > 2.33.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-26 17:01 ` David Sterba @ 2022-01-29 16:24 ` Anand Jain 0 siblings, 0 replies; 13+ messages in thread From: Anand Jain @ 2022-01-29 16:24 UTC (permalink / raw) To: dsterba, linux-btrfs On 27/01/2022 01:01, David Sterba wrote: > On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote: >> Mixed device types configuration exist. Most commonly, HDD mixed with >> SSD/NVME device types. This use case prefers that the data chunk >> allocates on HDD and the metadata chunk allocates on the SSD/NVME. >> >> As of now, in the function gather_device_info() called from >> btrfs_create_chunk(), we sort the devices based on unallocated space >> only. >> >> After this patch, this function will check for mixed device types. And >> will sort the devices based on enum btrfs_device_types. That is, sort if >> the allocation type is metadata and reverse-sort if the allocation type >> is data. >> >> The advantage of this method is that data/metadata allocation distribution >> based on the device type happens automatically without any manual >> configuration. >> >> Signed-off-by: Anand Jain <anand.jain@oracle.com> >> --- >> fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 45 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index da3d6d0f5bc3..77fba78555d7 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, >> return 0; >> } >> >> +/* >> + * Sort the devices in its ascending order of latency value. >> + */ >> +static int btrfs_cmp_device_latency(const void *a, const void *b) >> +{ >> + const struct btrfs_device_info *di_a = a; >> + const struct btrfs_device_info *di_b = b; >> + struct btrfs_device *dev_a = di_a->dev; >> + struct btrfs_device *dev_b = di_b->dev; >> + >> + if (dev_a->dev_type > dev_b->dev_type) > > It's strange to see dev_type being compared while the function compares > the latency. As pointed in the first patch, this could simply refer to a > array of device types sorted by latency. > Agreed. This will change. >> + return 1; >> + if (dev_a->dev_type < dev_b->dev_type) >> + return -1; >> + return 0; >> +} >> + >> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) > > Regarding the naming, I think we've been using _desc (descending) as > suffix for the reverse sort functions, while ascending is the default. > Ok. Noted. >> +{ >> + const struct btrfs_device_info *di_a = a; >> + const struct btrfs_device_info *di_b = b; >> + struct btrfs_device *dev_a = di_a->dev; >> + struct btrfs_device *dev_b = di_b->dev; >> + >> + if (dev_a->dev_type > dev_b->dev_type) >> + return -1; >> + if (dev_a->dev_type < dev_b->dev_type) >> + return 1; >> + return 0; > > To avoid code duplication this could be > > return -btrfs_cmp_device_latency(a, b); Oh right. I will do that. Thanks, Anand >> +} >> + >> /* >> * sort the devices in descending order by max_avail, total_avail >> */ >> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, >> sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> btrfs_cmp_device_info, NULL); >> >> + /* >> + * Sort devices by their latency. Ascending order of latency for >> + * metadata and descending order of latency for the data chunks for >> + * mixed device types. >> + */ >> + if (fs_devices->mixed_dev_types) { >> + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) >> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> + btrfs_cmp_device_rev_latency, NULL); >> + else >> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> + btrfs_cmp_device_latency, NULL); >> + } >> + >> return 0; >> } >> >> -- >> 2.33.1 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain 2022-01-26 17:01 ` David Sterba @ 2022-01-26 17:38 ` David Sterba 2022-01-29 16:46 ` Anand Jain 2022-01-30 22:28 ` Goffredo Baroncelli 1 sibling, 2 replies; 13+ messages in thread From: David Sterba @ 2022-01-26 17:38 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote: > Mixed device types configuration exist. Most commonly, HDD mixed with > SSD/NVME device types. This use case prefers that the data chunk > allocates on HDD and the metadata chunk allocates on the SSD/NVME. > > As of now, in the function gather_device_info() called from > btrfs_create_chunk(), we sort the devices based on unallocated space > only. Yes, and this guarantees maximizing the used space for the raid profiles. > After this patch, this function will check for mixed device types. And > will sort the devices based on enum btrfs_device_types. That is, sort if > the allocation type is metadata and reverse-sort if the allocation type > is data. And this changes the above, because a small fast device can be depleted first. > The advantage of this method is that data/metadata allocation distribution > based on the device type happens automatically without any manual > configuration. Yeah, but the default behaviour may not be suitable for all users so some policy will have to be done anyway. I vaguely remember some comments regarding mixed setups, along lines that "if there's a fast flash device I'd rather see ENOSPC and either delete files or add more devices than to let everything work but with the risk of storing metadata on the slow devices." > Signed-off-by: Anand Jain <anand.jain@oracle.com> > --- > fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 45 insertions(+) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index da3d6d0f5bc3..77fba78555d7 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, > return 0; > } > > +/* > + * Sort the devices in its ascending order of latency value. > + */ > +static int btrfs_cmp_device_latency(const void *a, const void *b) > +{ > + const struct btrfs_device_info *di_a = a; > + const struct btrfs_device_info *di_b = b; > + struct btrfs_device *dev_a = di_a->dev; > + struct btrfs_device *dev_b = di_b->dev; > + > + if (dev_a->dev_type > dev_b->dev_type) > + return 1; > + if (dev_a->dev_type < dev_b->dev_type) > + return -1; > + return 0; > +} > + > +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) > +{ > + const struct btrfs_device_info *di_a = a; > + const struct btrfs_device_info *di_b = b; > + struct btrfs_device *dev_a = di_a->dev; > + struct btrfs_device *dev_b = di_b->dev; > + > + if (dev_a->dev_type > dev_b->dev_type) > + return -1; > + if (dev_a->dev_type < dev_b->dev_type) > + return 1; > + return 0; > +} > + > /* > * sort the devices in descending order by max_avail, total_avail > */ > @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, > sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > btrfs_cmp_device_info, NULL); > > + /* > + * Sort devices by their latency. Ascending order of latency for > + * metadata and descending order of latency for the data chunks for > + * mixed device types. > + */ > + if (fs_devices->mixed_dev_types) { > + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) > + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > + btrfs_cmp_device_rev_latency, NULL); > + else > + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), > + btrfs_cmp_device_latency, NULL); In case there are mixed devices the sort happens twice and because as implemented in kernel sort() is not stable so even if device have same amount of data they can get reorderd wildly. The remaingin space is still a factor we need to take into account to avoid ENOSPC on the chunk level. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-26 17:38 ` David Sterba @ 2022-01-29 16:46 ` Anand Jain 2022-01-30 22:15 ` Goffredo Baroncelli 2022-01-30 22:28 ` Goffredo Baroncelli 1 sibling, 1 reply; 13+ messages in thread From: Anand Jain @ 2022-01-29 16:46 UTC (permalink / raw) To: dsterba, linux-btrfs On 27/01/2022 01:38, David Sterba wrote: > On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote: >> Mixed device types configuration exist. Most commonly, HDD mixed with >> SSD/NVME device types. This use case prefers that the data chunk >> allocates on HDD and the metadata chunk allocates on the SSD/NVME. >> >> As of now, in the function gather_device_info() called from >> btrfs_create_chunk(), we sort the devices based on unallocated space >> only. > > Yes, and this guarantees maximizing the used space for the raid > profiles. Oops. Thanks for reminding me of that. More below. >> After this patch, this function will check for mixed device types. And >> will sort the devices based on enum btrfs_device_types. That is, sort if >> the allocation type is metadata and reverse-sort if the allocation type >> is data. > > And this changes the above, because a small fast device can be depleted > first. Both sort by size or latency do _not_ help if given_raid.devs_max == 0 (raid0, raid5, raid6) OR given_raid.devs_max == num_devices. It helps only when given_raid.devs_max != 0 and given_raid.devs_max < num_devices. Sort by size does not help Single and Dup profiles. So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) { Mixed devs types with different sizes if sorted by free size: is pro for raid1, raid1c3, raid1c4, raid10 doesn't matter for single, dup Mixed devs types with different sizes if sorted by latency: is pro for single and dup is con for raid1, raid1c3, raid1c4, raid10 (depends) } So, If given_raid.devs_max == num_devices we don't need any type of sorting. If given_raid.devs_max = 0 (raid0, raid5, raid6) we don't need any type of sorting. And sort devs by latency for Single and Dup profiles only. For rest of profiles sort devs by size only if given_raid.devs_max < num_devices. >> The advantage of this method is that data/metadata allocation distribution >> based on the device type happens automatically without any manual >> configuration. > > Yeah, but the default behaviour may not be suitable for all users so > some policy will have to be done anyway. Right. If nothing is configured even when provided then also it should fallback to the default behaviour. > I vaguely remember some comments regarding mixed setups, along lines > that "if there's a fast flash device I'd rather see ENOSPC and either > delete files or add more devices than to let everything work but with > the risk of storing metadata on the slow devices." It entirely depends on the use-case. An option like following will solve it better: mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error> If metadata_nospc_on_faster_devs=error is preferred then it also implies that data_nospc_on_slower_devs=error. Also, the use cases which prefer to use the error option should remember it is difficult to estimate the data/metadata ratio beforehand. >> Signed-off-by: Anand Jain <anand.jain@oracle.com> >> --- >> fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 45 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index da3d6d0f5bc3..77fba78555d7 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, >> return 0; >> } >> >> +/* >> + * Sort the devices in its ascending order of latency value. >> + */ >> +static int btrfs_cmp_device_latency(const void *a, const void *b) >> +{ >> + const struct btrfs_device_info *di_a = a; >> + const struct btrfs_device_info *di_b = b; >> + struct btrfs_device *dev_a = di_a->dev; >> + struct btrfs_device *dev_b = di_b->dev; >> + >> + if (dev_a->dev_type > dev_b->dev_type) >> + return 1; >> + if (dev_a->dev_type < dev_b->dev_type) >> + return -1; >> + return 0; >> +} >> + >> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) >> +{ >> + const struct btrfs_device_info *di_a = a; >> + const struct btrfs_device_info *di_b = b; >> + struct btrfs_device *dev_a = di_a->dev; >> + struct btrfs_device *dev_b = di_b->dev; >> + >> + if (dev_a->dev_type > dev_b->dev_type) >> + return -1; >> + if (dev_a->dev_type < dev_b->dev_type) >> + return 1; >> + return 0; >> +} >> + >> /* >> * sort the devices in descending order by max_avail, total_avail >> */ >> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, >> sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> btrfs_cmp_device_info, NULL); >> >> + /* >> + * Sort devices by their latency. Ascending order of latency for >> + * metadata and descending order of latency for the data chunks for >> + * mixed device types. >> + */ >> + if (fs_devices->mixed_dev_types) { >> + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) >> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> + btrfs_cmp_device_rev_latency, NULL); >> + else >> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >> + btrfs_cmp_device_latency, NULL); > > In case there are mixed devices the sort happens twice and because as > implemented in kernel sort() is not stable so even if device have same > amount of data they can get reorderd wildly. The remaingin space is > still a factor we need to take into account to avoid ENOSPC on the chunk > level. I didn't get this part. How about if it is this way: if (mixed && metadata && (single || dup)) { ndevs=0 pick all non-rotational ndevs++ with free space >= required space if (ndevs == 0) { if (user_option->metadata_nospc_on_faster_devs == error) return -ENOSPC; pick all rotational } sort-by-size-select-top } if (mixed && data && (single || dup)) { ndevs=0 pick all rotational ndevs++ with free space >= required space if (ndevs == 0) { if (user_option->data_nospc_on_faster_devs == error) return -ENOSPC; pick all non-rotational } sort-by-size-select-top } Thanks, Anand ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-29 16:46 ` Anand Jain @ 2022-01-30 22:15 ` Goffredo Baroncelli 0 siblings, 0 replies; 13+ messages in thread From: Goffredo Baroncelli @ 2022-01-30 22:15 UTC (permalink / raw) To: Anand Jain, dsterba, linux-btrfs On 29/01/2022 17.46, Anand Jain wrote: > > > On 27/01/2022 01:38, David Sterba wrote: >> On Tue, Jan 18, 2022 at 11:18:02PM +0800, Anand Jain wrote: >>> Mixed device types configuration exist. Most commonly, HDD mixed with >>> SSD/NVME device types. This use case prefers that the data chunk >>> allocates on HDD and the metadata chunk allocates on the SSD/NVME. >>> >>> As of now, in the function gather_device_info() called from >>> btrfs_create_chunk(), we sort the devices based on unallocated space >>> only. >> > >> Yes, and this guarantees maximizing the used space for the raid >> profiles. > > Oops. Thanks for reminding me of that. More below. > >>> After this patch, this function will check for mixed device types. And >>> will sort the devices based on enum btrfs_device_types. That is, sort if >>> the allocation type is metadata and reverse-sort if the allocation type >>> is data. >> >> And this changes the above, because a small fast device can be depleted >> first. > > > Both sort by size or latency do _not_ help if > given_raid.devs_max == 0 (raid0, raid5, raid6) OR given_raid.devs_max == num_devices. > > It helps only when given_raid.devs_max != 0 and given_raid.devs_max < num_devices. > > Sort by size does not help Single and Dup profiles. > > So, if (given_raid.devs_max != 0 && given_raid.devs_max < num_devices) { > > Mixed devs types with different sizes if sorted by free size: > is pro for raid1, raid1c3, raid1c4, raid10 > doesn't matter for single, dup > > Mixed devs types with different sizes if sorted by latency: > is pro for single and dup > is con for raid1, raid1c3, raid1c4, raid10 (depends) > } May be I remember bad; but in the "single" cases a new BG is allocated in the "more empty" disks. The same for the "dup". In this it is not different from RAID1xx. Only in the case of RAID5/6 the size doesn't matter, because a new chunk is allocate in each disk anyway. Pay attention that you can have a "mixed" environment of different disks sizes: 2 x ssd (100GB + 200GB) 2 x HDD (1T + 2T) So it could make sense to spread the metadata by priority (only to ssd) AND by "emptiness" (i.e. each 3 BG, one is allocated on the smaller disks and two in the bigger one). > > So, > > If given_raid.devs_max == num_devices we don't need any type of sorting. > > If given_raid.devs_max = 0 (raid0, raid5, raid6) we don't need any type of sorting. > > And sort devs by latency for Single and Dup profiles only. > > For rest of profiles sort devs by size only if given_raid.devs_max < num_devices. > > >>> The advantage of this method is that data/metadata allocation distribution >>> based on the device type happens automatically without any manual >>> configuration. >> >> Yeah, but the default behaviour may not be suitable for all users so >> some policy will have to be done anyway. > > Right. If nothing is configured even when provided then also it should > fallback to the default behaviour. > >> I vaguely remember some comments regarding mixed setups, along lines >> that "if there's a fast flash device I'd rather see ENOSPC and either >> delete files or add more devices than to let everything work but with >> the risk of storing metadata on the slow devices." > > It entirely depends on the use-case. An option like following will > solve it better: > mount -o metadata_nospc_on_faster_devs=<use-slower-devs>|<error> > > If metadata_nospc_on_faster_devs=error is preferred then it also > implies that data_nospc_on_slower_devs=error. > > Also, the use cases which prefer to use the error option should > remember it is difficult to estimate the data/metadata ratio > beforehand. > >>> Signed-off-by: Anand Jain <anand.jain@oracle.com> >>> --- >>> fs/btrfs/volumes.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 45 insertions(+) >>> >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index da3d6d0f5bc3..77fba78555d7 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -5060,6 +5060,37 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info, >>> return 0; >>> } >>> +/* >>> + * Sort the devices in its ascending order of latency value. >>> + */ >>> +static int btrfs_cmp_device_latency(const void *a, const void *b) >>> +{ >>> + const struct btrfs_device_info *di_a = a; >>> + const struct btrfs_device_info *di_b = b; >>> + struct btrfs_device *dev_a = di_a->dev; >>> + struct btrfs_device *dev_b = di_b->dev; >>> + >>> + if (dev_a->dev_type > dev_b->dev_type) >>> + return 1; >>> + if (dev_a->dev_type < dev_b->dev_type) >>> + return -1; >>> + return 0; >>> +} >>> + >>> +static int btrfs_cmp_device_rev_latency(const void *a, const void *b) >>> +{ >>> + const struct btrfs_device_info *di_a = a; >>> + const struct btrfs_device_info *di_b = b; >>> + struct btrfs_device *dev_a = di_a->dev; >>> + struct btrfs_device *dev_b = di_b->dev; >>> + >>> + if (dev_a->dev_type > dev_b->dev_type) >>> + return -1; >>> + if (dev_a->dev_type < dev_b->dev_type) >>> + return 1; >>> + return 0; >>> +} >>> + >>> /* >>> * sort the devices in descending order by max_avail, total_avail >>> */ >>> @@ -5292,6 +5323,20 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices, >>> sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >>> btrfs_cmp_device_info, NULL); >>> + /* >>> + * Sort devices by their latency. Ascending order of latency for >>> + * metadata and descending order of latency for the data chunks for >>> + * mixed device types. >>> + */ >>> + if (fs_devices->mixed_dev_types) { >>> + if (ctl->type & BTRFS_BLOCK_GROUP_DATA) >>> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >>> + btrfs_cmp_device_rev_latency, NULL); >>> + else >>> + sort(devices_info, ndevs, sizeof(struct btrfs_device_info), >>> + btrfs_cmp_device_latency, NULL); >> >> In case there are mixed devices the sort happens twice and because as >> implemented in kernel sort() is not stable so even if device have same >> amount of data they can get reorderd wildly. The remaingin space is >> still a factor we need to take into account to avoid ENOSPC on the chunk >> level. > > > I didn't get this part. How about if it is this way: > > if (mixed && metadata && (single || dup)) { > ndevs=0 > pick all non-rotational ndevs++ with free space >= required space > if (ndevs == 0) { > if (user_option->metadata_nospc_on_faster_devs == error) > return -ENOSPC; > pick all rotational > } > sort-by-size-select-top > } > > if (mixed && data && (single || dup)) { > ndevs=0 > pick all rotational ndevs++ with free space >= required space > if (ndevs == 0) { > if (user_option->data_nospc_on_faster_devs == error) > return -ENOSPC; > pick all non-rotational > } > sort-by-size-select-top > } > > Thanks, Anand I think that you have to behave as allocation_hint patches set: The sort is one, and it sorts by - by priority - if the disks have the same priority, by free space - if the disks have the same priority and free space, by max avail ... -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] btrfs: create chunk device type aware 2022-01-26 17:38 ` David Sterba 2022-01-29 16:46 ` Anand Jain @ 2022-01-30 22:28 ` Goffredo Baroncelli 1 sibling, 0 replies; 13+ messages in thread From: Goffredo Baroncelli @ 2022-01-30 22:28 UTC (permalink / raw) To: dsterba, Anand Jain, linux-btrfs On 26/01/2022 18.38, David Sterba wrote: >> The advantage of this method is that data/metadata allocation distribution >> based on the device type happens automatically without any manual >> configuration. > Yeah, but the default behaviour may not be suitable for all users so > some policy will have to be done anyway. > > I vaguely remember some comments regarding mixed setups, along lines > that "if there's a fast flash device I'd rather see ENOSPC and either > delete files or add more devices than to let everything work but with > the risk of storing metadata on the slow devices." > I confirm that. There are two aspects that impacted the "allocation_hint" patches set discussions: 1) the criteria to order the disks 2) allow/permit/deny a kind of BG to be hosted by a device Regarding 1), initially the first set of patches considered an automatic behavior on the basis of the "rotational" attribute. Soon it was pointed out that in the "not rotational" class there are different options (like SSD, NVME...). The conclusion was that the "priority" should not be cabled in the btrfs code. (e.g. we could consider the reliability ?) Regarding 2), my initial patches set only ordered the disks and alloed any disk to be used. Some user asked me to prevent to use certain disks for (e.g.) the data; this to prevent the data BG to consume all the available space [*] These discussions leads me to create 4 "classes" for disks - ONLY_METADATA - PREFERRED_METADATA - PREFERRED_DATA - ONLY_DATA Where the last one is not suitable for metadata, and the first one is not suitable for data. The middle ones allow one type but only of the other disks are full. Another differences between the Anand patches and the my ones, is that in the "allocation_hint" for striped profiles (like raid5, which spans all the available disks), the devices involved should have the same classes. I.e., if btrfs has to allocate a RAID5 metadata chunk, - first tries to use ONLY_METADATA disks. - If the disks are not enough, then it tries to use ONLY_METADATA and PREFERRED_METADATA disks. - If the disks are not enough, then it tries to use ONLY_METADATA, PREFERRED_METADATA and PREFERRED_DATA disks. - If the disks are not enough, then -ENOSPC What the Anand patches has more than allocation_hint patches, is that these handle the case of different latency disks, giving higher priority to the disks with lower latency. If this is a requirements we can reserve some bits to add a priority of a disk, and then we can sort the disk by: - allocation_hint class - priority - max avail - free space BR G.Baroncelli [*] I don't want to open another discussion, but this seems to me more a "quota" problem than a "disk allocation" problem... -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2022-02-03 12:57 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-01-18 15:18 [PATCH 0/2] device type and create chunk Anand Jain 2022-01-18 15:18 ` [PATCH 1/2] btrfs: keep device type in the struct btrfs_device Anand Jain 2022-01-26 16:53 ` David Sterba 2022-01-29 16:24 ` Anand Jain 2022-02-01 17:06 ` David Sterba 2022-02-03 12:56 ` Anand Jain 2022-01-18 15:18 ` [PATCH 2/2] btrfs: create chunk device type aware Anand Jain 2022-01-26 17:01 ` David Sterba 2022-01-29 16:24 ` Anand Jain 2022-01-26 17:38 ` David Sterba 2022-01-29 16:46 ` Anand Jain 2022-01-30 22:15 ` Goffredo Baroncelli 2022-01-30 22:28 ` Goffredo Baroncelli
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.