linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
@ 2021-12-17 18:47 Goffredo Baroncelli
  2021-12-17 18:47 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Hi all,

This patches set was born after some discussion between me, Zygo and Josef.
Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.

Some further information about a real use case can be found in
https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/

Reently Shafeeq told me that he is interested too, due to the performance gain.

In V8 revision I switched away from an ioctl API in favor of a sysfs API (
see patch #2 and #3).

In V9 I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
Moreover I renamed dev_info->type to dev_info->flags.

The idea behind this patches set, is to dedicate some disks (the fastest one)
to the metadata chunk. My initial idea was a "soft" hint. However Zygo
asked an option for a "strong" hint (== mandatory). The result is that
each disk can be "tagged" by one of the following flags:
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
- BTRFS_DEV_ALLOCATION_DATA_ONLY

When the chunk allocator search a disks to allocate a chunk, scans the disks
in an order decided by these tags. For metadata, the order is:
*_METADATA_ONLY
*_PREFERRED_METADATA
*_PREFERRED_DATA

The *_DATA_ONLY are not eligible from metadata chunk allocation.

For the data chunk, the order is reversed, and the *_METADATA_ONLY are
excluded.

The exact sort logic is to sort first for the "tag", and then for the space
available. If there is no space available, the next "tag" disks set are
selected.

To set these tags, a new property called "allocation_hint" was created.
There is a dedicated btrfs-prog patches set [[PATCH V9] btrfs-progs:
allocation_hint disk property].

$ sudo mount /dev/loop0 /mnt/test-btrfs/
$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=PREFERRED_METADATA
devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:           1.34GiB
    Device unallocated:           1.41GiB
    Device missing:             0.00B
    Used:             400.89MiB
    Free (estimated):           1.04GiB    (min: 1.04GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB
   /dev/loop2     127.00MiB
   /dev/loop3     127.00MiB
   /dev/loop4     127.00MiB
   /dev/loop5     127.00MiB

Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
   /dev/loop1     256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop0      32.00MiB

Unallocated:
   /dev/loop0     704.00MiB
   /dev/loop1     480.00MiB
   /dev/loop2       1.00MiB
   /dev/loop3       1.00MiB
   /dev/loop4       1.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7     128.00MiB

# change the tag of some disks

$ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY

$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo btrfs bal start --full-balance /mnt/test-btrfs/
$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:         735.00MiB
    Device unallocated:           2.03GiB
    Device missing:             0.00B
    Used:             400.72MiB
    Free (estimated):           1.10GiB    (min: 1.10GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB

Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
   /dev/loop5     127.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop7      32.00MiB

Unallocated:
   /dev/loop0     736.00MiB
   /dev/loop1     736.00MiB
   /dev/loop2     128.00MiB
   /dev/loop3     128.00MiB
   /dev/loop4     128.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7      96.00MiB


#As you can see all the metadata were placed on the disk loop5/loop7 even if
#the most empty one are loop0 and loop1.



TODO:
- more tests
- the tool which show the space available should consider the tagging (eg
  the disks tagged by _METADATA_ONLY should be excluded from the data
  availability)
- allow btrfs-prog to change the allocation_hint even when the filesystem
  is not mounted.


Comments are welcome
BR
G.Baroncelli

Revision:
V9:
- rename dev_item->type to dev_item->flags
- rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint

V8:
- drop the ioctl API, instead use a sysfs one

V7:
- make more room in the struct btrfs_ioctl_dev_properties up to 1K
- leave in btrfs_tree.h only the costants
- removed the mount option (sic)
- correct an 'use before check' in the while loop (signaled
  by Zygo)
- add a 2nd sort to be sure that the device_info array is in the
  expected order

V6:
- add further values to the hints: add the possibility to
  exclude a disk for a chunk type 


Goffredo Baroncelli (6):
  btrfs: add flags to give an hint to the chunk allocator
  btrfs: export the device allocation_hint property in sysfs
  btrfs: change the device allocation_hint property via sysfs
  btrfs: add allocation_hint mode
  btrfs: rename dev_item->type to dev_item->flags
  btrfs: add allocation_hint option.

 fs/btrfs/ctree.h                |  18 +++++-
 fs/btrfs/disk-io.c              |   4 +-
 fs/btrfs/super.c                |  17 ++++++
 fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
 fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h              |   7 ++-
 include/uapi/linux/btrfs_tree.h |  20 +++++-
 7 files changed, 232 insertions(+), 12 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2022-01-05 22:10   ` Boris Burkov
  2021-12-17 18:47 ` [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following flags to give an hint about which chunk should be
allocated in which a disk:

- BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
 include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 5416f1f1a77a..55da906c2eac 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -386,6 +386,22 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+
+/* btrfs chunk allocation hint */
+#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
+/* btrfs chunk allocation hint mask */
+#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
+	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) -1)
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA	(0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA	(1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
  2021-12-17 18:47 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2022-01-05 21:57   ` Boris Burkov
  2021-12-17 18:47 ` [PATCH 3/6] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Eport the device allocation_hint property via
/sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index beb7f72d50b8..a8d918700d2b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1575,6 +1575,17 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
 
+static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
+					struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
+		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK );
+}
+BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
 /*
  * Information about one device.
  *
@@ -1588,6 +1599,7 @@ static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, scrub_speed_max),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, allocation_hint),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] btrfs: change the device allocation_hint property via sysfs
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
  2021-12-17 18:47 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
  2021-12-17 18:47 ` [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2021-12-17 18:47 ` [PATCH 4/6] btrfs: add allocation_hint mode Goffredo Baroncelli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c   | 62 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a8d918700d2b..53acc66065dd 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1584,7 +1584,67 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
 		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK );
 }
-BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
+static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
+				 struct kobj_attribute *a,
+				 const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_device *device;
+	int ret;
+	struct btrfs_trans_handle *trans;
+
+	u64 type, prev_type;
+
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+	fs_info = device->fs_info;
+	if (!fs_info)
+		return -EPERM;
+
+	root = fs_info->chunk_root;
+	if (sb_rdonly(fs_info->sb))
+		return -EROFS;
+
+	ret = kstrtou64(buf, 0, &type);
+	if (ret < 0)
+		return -EINVAL;
+
+	/* for now, allow to touch only the 'allocation hint' bits */
+	if (type & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
+		return -EINVAL;
+
+	/* check if a change is really needed */
+	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+		return len;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+
+	ret = btrfs_update_device(trans, device);
+
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto abort;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		goto abort;
+
+	return len;
+abort:
+	device->type = prev_type;
+	return  ret;
+}
+BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
+				      btrfs_devinfo_allocation_hint_store);
+
 
 /*
  * Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a7071f34fe64..806b599c6a46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2859,7 +2859,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9cf1d93a3d66..5097c0c12a8e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -638,5 +638,7 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+                                       struct btrfs_device *device);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] btrfs: add allocation_hint mode
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2021-12-17 18:47 ` [PATCH 3/6] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2022-01-05 23:48   ` Boris Burkov
  2021-12-17 18:47 ` [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are evaluated. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
tagged disks. If no space is found or the space found is not enough (eg.
in raid5, only two disks are available), then also the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_DATA are considered. If even in this
case this the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
for a metadata BG allocation.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
so the default behavior happens. If the user prefer to store the
metadata in the faster disks (e.g. the SSD), he can tag these with
BTRFS_DEV_ALLOCATION_PREFERRED_DATA: in this case the data BG go in the
BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BG in the
others, until there is enough space. Only if one disks set is filled,
the other is occupied.

WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
this means that this disk will never be used for allocating metadata
increasing the likelihood of exhausting the metadata space.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/volumes.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  1 +
 2 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 806b599c6a46..beee7d1ae79d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,16 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
 	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
 }
 
+#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
+		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
+	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA] = 0,
+	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA] = 1,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
+};
+
 const char *btrfs_bg_type_to_raid_name(u64 flags)
 {
 	const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5037,13 +5047,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 }
 
 /*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
  */
 static int btrfs_cmp_device_info(const void *a, const void *b)
 {
 	const struct btrfs_device_info *di_a = a;
 	const struct btrfs_device_info *di_b = b;
 
+	if (di_a->alloc_hint > di_b->alloc_hint)
+		return -1;
+	if (di_a->alloc_hint < di_b->alloc_hint)
+		return 1;
 	if (di_a->max_avail > di_b->max_avail)
 		return -1;
 	if (di_a->max_avail < di_b->max_avail)
@@ -5206,6 +5221,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int hint;
+	int i;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -5258,16 +5275,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+
+		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+			/*
+			 * if mixed bg set all the alloc_hint
+			 * fields to the same value, so the sorting
+			 * is not affected
+			 */
+			devices_info[ndevs].alloc_hint = 0;
+		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_METADATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
+				continue;
+			/*
+			 * if a data chunk must be allocated,
+			 * sort also by hint (data disk
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+		} else { /* BTRFS_BLOCK_GROUP_METADATA */
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_DATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
+				continue;
+			/*
+			 * if a data chunk must be allocated,
+			 * sort also by hint (metadata hint
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+		}
+
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	/*
+	 * no devices available
+	 */
+	if (!ndevs)
+		return 0;
+
 	/*
 	 * now sort the devices by hole size / available space
 	 */
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * select the minimum set of disks grouped by hint that
+	 * can host the chunk
+	 */
+	ndevs = 0;
+	while (ndevs < ctl->ndevs) {
+		hint = devices_info[ndevs++].alloc_hint;
+		while (ndevs < ctl->ndevs &&
+		       devices_info[ndevs].alloc_hint == hint)
+				ndevs++;
+		if (ndevs >= ctl->devs_min)
+			break;
+	}
+
+	BUG_ON(ndevs > ctl->ndevs);
+	ctl->ndevs = ndevs;
+
+	/*
+	 * the next layers require the devices_info ordered by
+	 * max_avail. If we are returing two (or more) different
+	 * group of alloc_hint, this is not always true. So sort
+	 * these gain.
+	 */
+
+	for (i = 0 ; i < ndevs ; i++)
+		devices_info[i].alloc_hint = 0;
+
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5097c0c12a8e..61c0cba045e9 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -406,6 +406,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int alloc_hint;
 };
 
 struct btrfs_raid_attr {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2021-12-17 18:47 ` [PATCH 4/6] btrfs: add allocation_hint mode Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2022-01-05 23:50   ` Boris Burkov
  2021-12-17 18:47 ` [PATCH 6/6] btrfs: add allocation_hint option Goffredo Baroncelli
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Rename the field type of dev_item from 'type' to 'flags' changing the
struct btrfs_device and btrfs_dev_item.

Signed-off-by: Goffredo Baroncelli <krejack@inwind.it>
---
 fs/btrfs/ctree.h                |  4 ++--
 fs/btrfs/disk-io.c              |  2 +-
 fs/btrfs/sysfs.c                | 17 +++++++++--------
 fs/btrfs/volumes.c              | 10 +++++-----
 fs/btrfs/volumes.h              |  4 ++--
 include/uapi/linux/btrfs_tree.h |  4 ++--
 6 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 459d00211181..778c7c807289 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1669,7 +1669,7 @@ static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
 }
 
 
-BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_FUNCS(device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
 BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
 BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
@@ -1682,7 +1682,7 @@ BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
 BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
 BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
 
-BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
 			 total_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index fc7dd5109806..02ffb8bc7d6b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4342,7 +4342,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 			continue;
 
 		btrfs_set_stack_device_generation(dev_item, 0);
-		btrfs_set_stack_device_type(dev_item, dev->type);
+		btrfs_set_stack_device_flags(dev_item, dev->flags);
 		btrfs_set_stack_device_id(dev_item, dev->devid);
 		btrfs_set_stack_device_total_bytes(dev_item,
 						   dev->commit_total_bytes);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 53acc66065dd..be4196a1645c 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1582,7 +1582,7 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 						   devid_kobj);
 
 	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
-		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK );
+		device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK );
 }
 
 static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
@@ -1595,7 +1595,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 	int ret;
 	struct btrfs_trans_handle *trans;
 
-	u64 type, prev_type;
+	u64 flags, prev_flags;
 
 	device = container_of(kobj, struct btrfs_device, devid_kobj);
 	fs_info = device->fs_info;
@@ -1606,24 +1606,25 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 	if (sb_rdonly(fs_info->sb))
 		return -EROFS;
 
-	ret = kstrtou64(buf, 0, &type);
+	ret = kstrtou64(buf, 0, &flags);
 	if (ret < 0)
 		return -EINVAL;
 
 	/* for now, allow to touch only the 'allocation hint' bits */
-	if (type & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
+	if (flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
 		return -EINVAL;
 
 	/* check if a change is really needed */
-	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+	if ((device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK) == flags)
 		return len;
 
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
-	prev_type = device->type;
-	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+	prev_flags = device->flags;
+	device->flags = (device->flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK) |
+			flags;
 
 	ret = btrfs_update_device(trans, device);
 
@@ -1639,7 +1640,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 
 	return len;
 abort:
-	device->type = prev_type;
+	device->flags = prev_flags;
 	return  ret;
 }
 BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index beee7d1ae79d..9184570c51b0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1876,7 +1876,7 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
 	btrfs_set_device_generation(leaf, dev_item, 0);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -2900,7 +2900,7 @@ noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -5285,7 +5285,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = 0;
 		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_METADATA_ONLY disks
@@ -5299,7 +5299,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
 		} else { /* BTRFS_BLOCK_GROUP_METADATA */
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_DATA_ONLY disks
@@ -7293,7 +7293,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
 	device->commit_total_bytes = device->disk_total_bytes;
 	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
 	device->commit_bytes_used = device->bytes_used;
-	device->type = btrfs_device_type(leaf, dev_item);
+	device->flags = btrfs_device_flags(leaf, dev_item);
 	device->io_align = btrfs_device_io_align(leaf, dev_item);
 	device->io_width = btrfs_device_io_width(leaf, dev_item);
 	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 61c0cba045e9..27ecf062d50c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -96,8 +96,8 @@ struct btrfs_device {
 
 	/* optimal io width for this device */
 	u32 io_width;
-	/* type and info about this device */
-	u64 type;
+	/* device flags (e.g. allocation hint) */
+	u64 flags;
 
 	/* minimal io size for this device */
 	u32 sector_size;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 55da906c2eac..f9891c94a75e 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -421,8 +421,8 @@ struct btrfs_dev_item {
 	/* minimal io size for this device */
 	__le32 sector_size;
 
-	/* type and info about this device */
-	__le64 type;
+	/* device flags (e.g. allocation hint) */
+	__le64 flags;
 
 	/* expected generation for this device */
 	__le64 generation;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] btrfs: add allocation_hint option.
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2021-12-17 18:47 ` [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
@ 2021-12-17 18:47 ` Goffredo Baroncelli
  2022-01-05  2:44 ` [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Boris Burkov
  2022-01-05 22:21 ` Boris Burkov
  7 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:47 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add allocation_hint mount option. This option accepts the following values:

- 0 (default):  the chunks allocator ignores the disk hints
- 1:            the chunks allocator considers the disk hints

Signed-off-by: Goffredo Baroncelli <kreijack@winwind.it>
---
 fs/btrfs/ctree.h   | 14 ++++++++++++++
 fs/btrfs/disk-io.c |  2 ++
 fs/btrfs/super.c   | 17 +++++++++++++++++
 fs/btrfs/volumes.c |  9 ++++++---
 4 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 778c7c807289..bb31cdcaf959 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -620,6 +620,15 @@ enum btrfs_exclusive_operation {
 	BTRFS_EXCLOP_SWAP_ACTIVATE,
 };
 
+/*
+ * allocation_hint mode
+ */
+
+enum btrfs_allocation_hint_modes {
+	BTRFS_ALLOCATION_HINT_DISABLED,
+	BTRFS_ALLOCATION_HINT_ENABLED
+};
+
 struct btrfs_fs_info {
 	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
 	unsigned long flags;
@@ -1021,6 +1030,11 @@ struct btrfs_fs_info {
 		u64 zoned;
 	};
 
+	/* allocation_hint mode */
+	int allocation_hint_mode;
+
+	/* Max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 02ffb8bc7d6b..09d365a689c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3160,6 +3160,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	spin_lock_init(&fs_info->swapfile_pins_lock);
 	fs_info->swapfile_pins = RB_ROOT;
 
+	fs_info->allocation_hint_mode = BTRFS_ALLOCATION_HINT_DISABLED;
+
 	fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH;
 	INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work);
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a1c54a2c787c..68911152420a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -373,6 +373,7 @@ enum {
 	Opt_thread_pool,
 	Opt_treelog, Opt_notreelog,
 	Opt_user_subvol_rm_allowed,
+	Opt_allocation_hint,
 
 	/* Rescue options */
 	Opt_rescue,
@@ -446,6 +447,7 @@ static const match_table_t tokens = {
 	{Opt_treelog, "treelog"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_allocation_hint, "allocation_hint=%d"},
 
 	/* Rescue options */
 	{Opt_rescue, "rescue=%s"},
@@ -903,6 +905,19 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_allocation_hint:
+			ret = match_int(&args[0], &intarg);
+			if (ret || (intarg != 1 && intarg != 0)) {
+				btrfs_err(info, "invalid allocation_hint= parameter\n");
+				ret = -EINVAL;
+				goto out;
+			}
+			if (intarg)
+				btrfs_info(info, "allocation_hint enabled");
+			else
+				btrfs_info(info, "allocation_hint disabled");
+			info->allocation_hint_mode = intarg;
+			break;
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1497,6 +1512,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (info->allocation_hint_mode)
+		seq_puts(seq, ",allocation_hint=1");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9184570c51b0..15302c068008 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5276,10 +5276,13 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 
-		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
-		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+		if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
+		    info->allocation_hint_mode ==
+		     BTRFS_ALLOCATION_HINT_DISABLED) {
 			/*
-			 * if mixed bg set all the alloc_hint
+			 * if mixed bg or the allocator hint is
+			 * disabled, set all the alloc_hint
 			 * fields to the same value, so the sorting
 			 * is not affected
 			 */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (5 preceding siblings ...)
  2021-12-17 18:47 ` [PATCH 6/6] btrfs: add allocation_hint option Goffredo Baroncelli
@ 2022-01-05  2:44 ` Boris Burkov
  2022-01-05  9:16   ` Goffredo Baroncelli
  2022-01-05 22:21 ` Boris Burkov
  7 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2022-01-05  2:44 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On Fri, Dec 17, 2021 at 07:47:16PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Hi all,
> 
> This patches set was born after some discussion between me, Zygo and Josef.
> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
> 
> Some further information about a real use case can be found in
> https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
> 
> Reently Shafeeq told me that he is interested too, due to the performance gain.
> 
> In V8 revision I switched away from an ioctl API in favor of a sysfs API (
> see patch #2 and #3).
> 
> In V9 I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
> Moreover I renamed dev_info->type to dev_info->flags.
> 
> The idea behind this patches set, is to dedicate some disks (the fastest one)
> to the metadata chunk. My initial idea was a "soft" hint. However Zygo
> asked an option for a "strong" hint (== mandatory). The result is that
> each disk can be "tagged" by one of the following flags:
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> 
> When the chunk allocator search a disks to allocate a chunk, scans the disks
> in an order decided by these tags. For metadata, the order is:
> *_METADATA_ONLY
> *_PREFERRED_METADATA
> *_PREFERRED_DATA
> 
> The *_DATA_ONLY are not eligible from metadata chunk allocation.
> 
> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
> excluded.
> 
> The exact sort logic is to sort first for the "tag", and then for the space
> available. If there is no space available, the next "tag" disks set are
> selected.
> 
> To set these tags, a new property called "allocation_hint" was created.
> There is a dedicated btrfs-prog patches set [[PATCH V9] btrfs-progs:
> allocation_hint disk property].
> 
> $ sudo mount /dev/loop0 /mnt/test-btrfs/
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=PREFERRED_METADATA
> devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
> devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
> devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
> devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>     Device size:           2.75GiB
>     Device allocated:           1.34GiB
>     Device unallocated:           1.41GiB
>     Device missing:             0.00B
>     Used:             400.89MiB
>     Free (estimated):           1.04GiB    (min: 1.04GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              1.00
>     Global reserve:           3.25MiB    (used: 0.00B)
>     Multiple profiles:                no
> 
> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>    /dev/loop0     288.00MiB
>    /dev/loop1     288.00MiB
>    /dev/loop2     127.00MiB
>    /dev/loop3     127.00MiB
>    /dev/loop4     127.00MiB
>    /dev/loop5     127.00MiB
> 
> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>    /dev/loop1     256.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/loop0      32.00MiB
> 
> Unallocated:
>    /dev/loop0     704.00MiB
>    /dev/loop1     480.00MiB
>    /dev/loop2       1.00MiB
>    /dev/loop3       1.00MiB
>    /dev/loop4       1.00MiB
>    /dev/loop5       1.00MiB
>    /dev/loop6     128.00MiB
>    /dev/loop7     128.00MiB
> 
> # change the tag of some disks
> 
> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
> 
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
> devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
> devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
> devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
> devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>     Device size:           2.75GiB
>     Device allocated:         735.00MiB
>     Device unallocated:           2.03GiB
>     Device missing:             0.00B
>     Used:             400.72MiB
>     Free (estimated):           1.10GiB    (min: 1.10GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              1.00
>     Global reserve:           3.25MiB    (used: 0.00B)
>     Multiple profiles:                no
> 
> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>    /dev/loop0     288.00MiB
>    /dev/loop1     288.00MiB
> 
> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>    /dev/loop5     127.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/loop7      32.00MiB
> 
> Unallocated:
>    /dev/loop0     736.00MiB
>    /dev/loop1     736.00MiB
>    /dev/loop2     128.00MiB
>    /dev/loop3     128.00MiB
>    /dev/loop4     128.00MiB
>    /dev/loop5       1.00MiB
>    /dev/loop6     128.00MiB
>    /dev/loop7      96.00MiB
> 
> 
> #As you can see all the metadata were placed on the disk loop5/loop7 even if
> #the most empty one are loop0 and loop1.
> 
> 
> 
> TODO:
> - more tests
> - the tool which show the space available should consider the tagging (eg
>   the disks tagged by _METADATA_ONLY should be excluded from the data
>   availability)
> - allow btrfs-prog to change the allocation_hint even when the filesystem
>   is not mounted.
> 
> 
> Comments are welcome

This is cool, thanks for building it!

I'm playing with setting this up for a test I'm working on where I want
to send data to a dm-zero device. To that end, I applied this patchset
on top of misc-next and ran:

$ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
$ mount /dev/vg0/lv0 /mnt/lol
$ btrfs device add /dev/mapper/zero-data /mnt/lol
$ btrfs fi usage /mnt/lol
Overall:
    Device size:                  50.01TiB
    Device allocated:             20.00MiB
    Device unallocated:           50.01TiB
    Device missing:                  0.00B
    Used:                        128.00KiB
    Free (estimated):             50.01TiB      (min: 50.01TiB)
    Free (statfs, df):            50.01TiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                3.25MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:8.00MiB, Used:0.00B (0.00%)
   /dev/mapper/vg0-lv0     8.00MiB

Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
   /dev/mapper/vg0-lv0     8.00MiB

System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
   /dev/mapper/vg0-lv0     4.00MiB

Unallocated:
   /dev/mapper/vg0-lv0     9.98GiB
   /dev/mapper/zero-data          50.00TiB

$ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
$ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY

$ btrfs balance start --full-balance /mnt/lol
Done, had to relocate 3 out of 3 chunks

$ btrfs fi usage /mnt/lol
Overall:
    Device size:                  50.01TiB
    Device allocated:              2.03GiB
    Device unallocated:           50.01TiB
    Device missing:                  0.00B
    Used:                        640.00KiB
    Free (estimated):             50.01TiB      (min: 50.01TiB)
    Free (statfs, df):            50.01TiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                3.25MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
   /dev/mapper/zero-data           1.00GiB

Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
   /dev/mapper/zero-data           1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/mapper/zero-data          32.00MiB

Unallocated:
   /dev/mapper/vg0-lv0    10.00GiB
   /dev/mapper/zero-data          50.00TiB


I expected that I would have data on /dev/mapper/zero-data and metadata
on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
device. Attempting to actually use the file system eventually fails, since
the metadata is black-holed :)

Did I make some mistake in how I used it, or is this a bug?

Thanks,
Boris

> BR
> G.Baroncelli
> 
> Revision:
> V9:
> - rename dev_item->type to dev_item->flags
> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> 
> V8:
> - drop the ioctl API, instead use a sysfs one
> 
> V7:
> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> - leave in btrfs_tree.h only the costants
> - removed the mount option (sic)
> - correct an 'use before check' in the while loop (signaled
>   by Zygo)
> - add a 2nd sort to be sure that the device_info array is in the
>   expected order
> 
> V6:
> - add further values to the hints: add the possibility to
>   exclude a disk for a chunk type 
> 
> 
> Goffredo Baroncelli (6):
>   btrfs: add flags to give an hint to the chunk allocator
>   btrfs: export the device allocation_hint property in sysfs
>   btrfs: change the device allocation_hint property via sysfs
>   btrfs: add allocation_hint mode
>   btrfs: rename dev_item->type to dev_item->flags
>   btrfs: add allocation_hint option.
> 
>  fs/btrfs/ctree.h                |  18 +++++-
>  fs/btrfs/disk-io.c              |   4 +-
>  fs/btrfs/super.c                |  17 ++++++
>  fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
>  fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h              |   7 ++-
>  include/uapi/linux/btrfs_tree.h |  20 +++++-
>  7 files changed, 232 insertions(+), 12 deletions(-)
> 
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2022-01-05  2:44 ` [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Boris Burkov
@ 2022-01-05  9:16   ` Goffredo Baroncelli
  2022-01-05 17:55     ` Boris Burkov
  2022-01-05 18:07     ` Zygo Blaxell
  0 siblings, 2 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2022-01-05  9:16 UTC (permalink / raw)
  To: Boris Burkov
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

Hi Boris,



On 1/5/22 03:44, Boris Burkov wrote:
[...]
> 
> This is cool, thanks for building it!
> 
> I'm playing with setting this up for a test I'm working on where I want
> to send data to a dm-zero device. To that end, I applied this patchset
> on top of misc-next and ran:
> 
> $ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
> $ mount /dev/vg0/lv0 /mnt/lol

You should mount the filesystem with

$ mount -o allocation_hint=1 /dev/vg0/lv0 /mnt/lol


In the previous iteration I missed the patch #6, which activates this option. You can drop patch #6 and avoid to pass this option.

Please give me a feedback if this resolve.

BR
G.Baroncelli

> $ btrfs device add /dev/mapper/zero-data /mnt/lol
> $ btrfs fi usage /mnt/lol
> Overall:
>      Device size:                  50.01TiB
>      Device allocated:             20.00MiB
>      Device unallocated:           50.01TiB
>      Device missing:                  0.00B
>      Used:                        128.00KiB
>      Free (estimated):             50.01TiB      (min: 50.01TiB)
>      Free (statfs, df):            50.01TiB
>      Data ratio:                       1.00
>      Metadata ratio:                   1.00
>      Global reserve:                3.25MiB      (used: 0.00B)
>      Multiple profiles:                  no
> 
> Data,single: Size:8.00MiB, Used:0.00B (0.00%)
>     /dev/mapper/vg0-lv0     8.00MiB
> 
> Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
>     /dev/mapper/vg0-lv0     8.00MiB
> 
> System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
>     /dev/mapper/vg0-lv0     4.00MiB
> 
> Unallocated:
>     /dev/mapper/vg0-lv0     9.98GiB
>     /dev/mapper/zero-data          50.00TiB
> 
> $ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
> $ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY
> 
> $ btrfs balance start --full-balance /mnt/lol
> Done, had to relocate 3 out of 3 chunks
> 
> $ btrfs fi usage /mnt/lol
> Overall:
>      Device size:                  50.01TiB
>      Device allocated:              2.03GiB
>      Device unallocated:           50.01TiB
>      Device missing:                  0.00B
>      Used:                        640.00KiB
>      Free (estimated):             50.01TiB      (min: 50.01TiB)
>      Free (statfs, df):            50.01TiB
>      Data ratio:                       1.00
>      Metadata ratio:                   1.00
>      Global reserve:                3.25MiB      (used: 0.00B)
>      Multiple profiles:                  no
> 
> Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
>     /dev/mapper/zero-data           1.00GiB
> 
> Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
>     /dev/mapper/zero-data           1.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>     /dev/mapper/zero-data          32.00MiB
> 
> Unallocated:
>     /dev/mapper/vg0-lv0    10.00GiB
>     /dev/mapper/zero-data          50.00TiB
> 
> 
> I expected that I would have data on /dev/mapper/zero-data and metadata
> on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
> device. Attempting to actually use the file system eventually fails, since
> the metadata is black-holed :)
> 
> Did I make some mistake in how I used it, or is this a bug?
> 
> Thanks,
> Boris
> 
>> BR
>> G.Baroncelli
>>
>> Revision:
>> V9:
>> - rename dev_item->type to dev_item->flags
>> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
>>
>> V8:
>> - drop the ioctl API, instead use a sysfs one
>>
>> V7:
>> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
>> - leave in btrfs_tree.h only the costants
>> - removed the mount option (sic)
>> - correct an 'use before check' in the while loop (signaled
>>    by Zygo)
>> - add a 2nd sort to be sure that the device_info array is in the
>>    expected order
>>
>> V6:
>> - add further values to the hints: add the possibility to
>>    exclude a disk for a chunk type
>>
>>
>> Goffredo Baroncelli (6):
>>    btrfs: add flags to give an hint to the chunk allocator
>>    btrfs: export the device allocation_hint property in sysfs
>>    btrfs: change the device allocation_hint property via sysfs
>>    btrfs: add allocation_hint mode
>>    btrfs: rename dev_item->type to dev_item->flags
>>    btrfs: add allocation_hint option.
>>
>>   fs/btrfs/ctree.h                |  18 +++++-
>>   fs/btrfs/disk-io.c              |   4 +-
>>   fs/btrfs/super.c                |  17 ++++++
>>   fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
>>   fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
>>   fs/btrfs/volumes.h              |   7 ++-
>>   include/uapi/linux/btrfs_tree.h |  20 +++++-
>>   7 files changed, 232 insertions(+), 12 deletions(-)
>>
>> -- 
>> 2.34.1
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2022-01-05  9:16   ` Goffredo Baroncelli
@ 2022-01-05 17:55     ` Boris Burkov
  2022-01-05 18:07     ` Zygo Blaxell
  1 sibling, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 17:55 UTC (permalink / raw)
  To: kreijack
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones

On Wed, Jan 05, 2022 at 10:16:08AM +0100, Goffredo Baroncelli wrote:
> Hi Boris,
> 
> 
> 
> On 1/5/22 03:44, Boris Burkov wrote:
> [...]
> > 
> > This is cool, thanks for building it!
> > 
> > I'm playing with setting this up for a test I'm working on where I want
> > to send data to a dm-zero device. To that end, I applied this patchset
> > on top of misc-next and ran:
> > 
> > $ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
> > $ mount /dev/vg0/lv0 /mnt/lol
> 
> You should mount the filesystem with
> 
> $ mount -o allocation_hint=1 /dev/vg0/lv0 /mnt/lol
> 

With this option, I got the expected usage output:

Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
   /dev/mapper/zero-data           1.00GiB

Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
   /dev/mapper/vg0-lv0     1.00GiB

Sorry I missed that, and thanks for the quick reply.

> 
> In the previous iteration I missed the patch #6, which activates this option. You can drop patch #6 and avoid to pass this option.
> 
> Please give me a feedback if this resolve.
> 
> BR
> G.Baroncelli
> 
> > $ btrfs device add /dev/mapper/zero-data /mnt/lol
> > $ btrfs fi usage /mnt/lol
> > Overall:
> >      Device size:                  50.01TiB
> >      Device allocated:             20.00MiB
> >      Device unallocated:           50.01TiB
> >      Device missing:                  0.00B
> >      Used:                        128.00KiB
> >      Free (estimated):             50.01TiB      (min: 50.01TiB)
> >      Free (statfs, df):            50.01TiB
> >      Data ratio:                       1.00
> >      Metadata ratio:                   1.00
> >      Global reserve:                3.25MiB      (used: 0.00B)
> >      Multiple profiles:                  no
> > 
> > Data,single: Size:8.00MiB, Used:0.00B (0.00%)
> >     /dev/mapper/vg0-lv0     8.00MiB
> > 
> > Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
> >     /dev/mapper/vg0-lv0     8.00MiB
> > 
> > System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
> >     /dev/mapper/vg0-lv0     4.00MiB
> > 
> > Unallocated:
> >     /dev/mapper/vg0-lv0     9.98GiB
> >     /dev/mapper/zero-data          50.00TiB
> > 
> > $ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
> > $ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY
> > 
> > $ btrfs balance start --full-balance /mnt/lol
> > Done, had to relocate 3 out of 3 chunks
> > 
> > $ btrfs fi usage /mnt/lol
> > Overall:
> >      Device size:                  50.01TiB
> >      Device allocated:              2.03GiB
> >      Device unallocated:           50.01TiB
> >      Device missing:                  0.00B
> >      Used:                        640.00KiB
> >      Free (estimated):             50.01TiB      (min: 50.01TiB)
> >      Free (statfs, df):            50.01TiB
> >      Data ratio:                       1.00
> >      Metadata ratio:                   1.00
> >      Global reserve:                3.25MiB      (used: 0.00B)
> >      Multiple profiles:                  no
> > 
> > Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
> >     /dev/mapper/zero-data           1.00GiB
> > 
> > Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
> >     /dev/mapper/zero-data           1.00GiB
> > 
> > System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
> >     /dev/mapper/zero-data          32.00MiB
> > 
> > Unallocated:
> >     /dev/mapper/vg0-lv0    10.00GiB
> >     /dev/mapper/zero-data          50.00TiB
> > 
> > 
> > I expected that I would have data on /dev/mapper/zero-data and metadata
> > on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
> > device. Attempting to actually use the file system eventually fails, since
> > the metadata is black-holed :)
> > 
> > Did I make some mistake in how I used it, or is this a bug?
> > 
> > Thanks,
> > Boris
> > 
> > > BR
> > > G.Baroncelli
> > > 
> > > Revision:
> > > V9:
> > > - rename dev_item->type to dev_item->flags
> > > - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> > > 
> > > V8:
> > > - drop the ioctl API, instead use a sysfs one
> > > 
> > > V7:
> > > - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> > > - leave in btrfs_tree.h only the costants
> > > - removed the mount option (sic)
> > > - correct an 'use before check' in the while loop (signaled
> > >    by Zygo)
> > > - add a 2nd sort to be sure that the device_info array is in the
> > >    expected order
> > > 
> > > V6:
> > > - add further values to the hints: add the possibility to
> > >    exclude a disk for a chunk type
> > > 
> > > 
> > > Goffredo Baroncelli (6):
> > >    btrfs: add flags to give an hint to the chunk allocator
> > >    btrfs: export the device allocation_hint property in sysfs
> > >    btrfs: change the device allocation_hint property via sysfs
> > >    btrfs: add allocation_hint mode
> > >    btrfs: rename dev_item->type to dev_item->flags
> > >    btrfs: add allocation_hint option.
> > > 
> > >   fs/btrfs/ctree.h                |  18 +++++-
> > >   fs/btrfs/disk-io.c              |   4 +-
> > >   fs/btrfs/super.c                |  17 ++++++
> > >   fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
> > >   fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
> > >   fs/btrfs/volumes.h              |   7 ++-
> > >   include/uapi/linux/btrfs_tree.h |  20 +++++-
> > >   7 files changed, 232 insertions(+), 12 deletions(-)
> > > 
> > > -- 
> > > 2.34.1
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2022-01-05  9:16   ` Goffredo Baroncelli
  2022-01-05 17:55     ` Boris Burkov
@ 2022-01-05 18:07     ` Zygo Blaxell
  2022-01-05 18:16       ` Goffredo Baroncelli
  1 sibling, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2022-01-05 18:07 UTC (permalink / raw)
  To: kreijack
  Cc: Boris Burkov, linux-btrfs, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones

On Wed, Jan 05, 2022 at 10:16:08AM +0100, Goffredo Baroncelli wrote:
> Hi Boris,
> 
> 
> 
> On 1/5/22 03:44, Boris Burkov wrote:
> [...]
> > 
> > This is cool, thanks for building it!
> > 
> > I'm playing with setting this up for a test I'm working on where I want
> > to send data to a dm-zero device. To that end, I applied this patchset
> > on top of misc-next and ran:
> > 
> > $ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
> > $ mount /dev/vg0/lv0 /mnt/lol
> 
> You should mount the filesystem with
> 
> $ mount -o allocation_hint=1 /dev/vg0/lv0 /mnt/lol
> 
> In the previous iteration I missed the patch #6, which activates this option. You can drop patch #6 and avoid to pass this option.

Can we drop the mount option from the series?  It isn't needed.

Or, if we must have it (and I am in no way conceding that we do),
at least make it default to enabled.  Or turn the mount option into a
disable flag under the 'rescue=' option to make it clear this option is
not intended to be used in normal operation.

> Please give me a feedback if this resolve.
> 
> BR
> G.Baroncelli
> 
> > $ btrfs device add /dev/mapper/zero-data /mnt/lol
> > $ btrfs fi usage /mnt/lol
> > Overall:
> >      Device size:                  50.01TiB
> >      Device allocated:             20.00MiB
> >      Device unallocated:           50.01TiB
> >      Device missing:                  0.00B
> >      Used:                        128.00KiB
> >      Free (estimated):             50.01TiB      (min: 50.01TiB)
> >      Free (statfs, df):            50.01TiB
> >      Data ratio:                       1.00
> >      Metadata ratio:                   1.00
> >      Global reserve:                3.25MiB      (used: 0.00B)
> >      Multiple profiles:                  no
> > 
> > Data,single: Size:8.00MiB, Used:0.00B (0.00%)
> >     /dev/mapper/vg0-lv0     8.00MiB
> > 
> > Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
> >     /dev/mapper/vg0-lv0     8.00MiB
> > 
> > System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
> >     /dev/mapper/vg0-lv0     4.00MiB
> > 
> > Unallocated:
> >     /dev/mapper/vg0-lv0     9.98GiB
> >     /dev/mapper/zero-data          50.00TiB
> > 
> > $ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
> > $ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY
> > 
> > $ btrfs balance start --full-balance /mnt/lol
> > Done, had to relocate 3 out of 3 chunks
> > 
> > $ btrfs fi usage /mnt/lol
> > Overall:
> >      Device size:                  50.01TiB
> >      Device allocated:              2.03GiB
> >      Device unallocated:           50.01TiB
> >      Device missing:                  0.00B
> >      Used:                        640.00KiB
> >      Free (estimated):             50.01TiB      (min: 50.01TiB)
> >      Free (statfs, df):            50.01TiB
> >      Data ratio:                       1.00
> >      Metadata ratio:                   1.00
> >      Global reserve:                3.25MiB      (used: 0.00B)
> >      Multiple profiles:                  no
> > 
> > Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
> >     /dev/mapper/zero-data           1.00GiB
> > 
> > Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
> >     /dev/mapper/zero-data           1.00GiB
> > 
> > System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
> >     /dev/mapper/zero-data          32.00MiB
> > 
> > Unallocated:
> >     /dev/mapper/vg0-lv0    10.00GiB
> >     /dev/mapper/zero-data          50.00TiB
> > 
> > 
> > I expected that I would have data on /dev/mapper/zero-data and metadata
> > on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
> > device. Attempting to actually use the file system eventually fails, since
> > the metadata is black-holed :)
> > 
> > Did I make some mistake in how I used it, or is this a bug?
> > 
> > Thanks,
> > Boris
> > 
> > > BR
> > > G.Baroncelli
> > > 
> > > Revision:
> > > V9:
> > > - rename dev_item->type to dev_item->flags
> > > - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> > > 
> > > V8:
> > > - drop the ioctl API, instead use a sysfs one
> > > 
> > > V7:
> > > - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> > > - leave in btrfs_tree.h only the costants
> > > - removed the mount option (sic)
> > > - correct an 'use before check' in the while loop (signaled
> > >    by Zygo)
> > > - add a 2nd sort to be sure that the device_info array is in the
> > >    expected order
> > > 
> > > V6:
> > > - add further values to the hints: add the possibility to
> > >    exclude a disk for a chunk type
> > > 
> > > 
> > > Goffredo Baroncelli (6):
> > >    btrfs: add flags to give an hint to the chunk allocator
> > >    btrfs: export the device allocation_hint property in sysfs
> > >    btrfs: change the device allocation_hint property via sysfs
> > >    btrfs: add allocation_hint mode
> > >    btrfs: rename dev_item->type to dev_item->flags
> > >    btrfs: add allocation_hint option.
> > > 
> > >   fs/btrfs/ctree.h                |  18 +++++-
> > >   fs/btrfs/disk-io.c              |   4 +-
> > >   fs/btrfs/super.c                |  17 ++++++
> > >   fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
> > >   fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
> > >   fs/btrfs/volumes.h              |   7 ++-
> > >   include/uapi/linux/btrfs_tree.h |  20 +++++-
> > >   7 files changed, 232 insertions(+), 12 deletions(-)
> > > 
> > > -- 
> > > 2.34.1
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2022-01-05 18:07     ` Zygo Blaxell
@ 2022-01-05 18:16       ` Goffredo Baroncelli
  2022-01-05 18:29         ` Boris Burkov
  0 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2022-01-05 18:16 UTC (permalink / raw)
  To: Zygo Blaxell, Josef Bacik
  Cc: Boris Burkov, linux-btrfs, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones

On 05/01/2022 19.07, Zygo Blaxell wrote:
> On Wed, Jan 05, 2022 at 10:16:08AM +0100, Goffredo Baroncelli wrote:
>> Hi Boris,
>>
>>
>>
>> On 1/5/22 03:44, Boris Burkov wrote:
>> [...]
>>>
>>> This is cool, thanks for building it!
>>>
>>> I'm playing with setting this up for a test I'm working on where I want
>>> to send data to a dm-zero device. To that end, I applied this patchset
>>> on top of misc-next and ran:
>>>
>>> $ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
>>> $ mount /dev/vg0/lv0 /mnt/lol
>>
>> You should mount the filesystem with
>>
>> $ mount -o allocation_hint=1 /dev/vg0/lv0 /mnt/lol
>>
>> In the previous iteration I missed the patch #6, which activates this option. You can drop patch #6 and avoid to pass this option.
> 
> Can we drop the mount option from the series?  It isn't needed.
> 
> Or, if we must have it (and I am in no way conceding that we do),
> at least make it default to enabled.  Or turn the mount option into a
> disable flag under the 'rescue=' option to make it clear this option is
> not intended to be used in normal operation.

Frankly speaking it was a my mistake to add this patch. It was in the
queue, but in the last patches sets I dropped it. However in the last
one I forgot to drop it manually so it reappeared :-)

However I like your suggestion to add as 'rescue' option, where the
default is "enabled".

@Josef,
do you started to play with this patch ? If not can I send an update
where the main change is the renaming of the properties from

PREFERRED_<X> to <X>_PREFERRED

(e.g. PREFERRED_DATA to DATA_PREFERRED) which are more correct ?

BR
G.Baroncelli
> 
>> Please give me a feedback if this resolve.
>>
>> BR
>> G.Baroncelli
>>
>>> $ btrfs device add /dev/mapper/zero-data /mnt/lol
>>> $ btrfs fi usage /mnt/lol
>>> Overall:
>>>       Device size:                  50.01TiB
>>>       Device allocated:             20.00MiB
>>>       Device unallocated:           50.01TiB
>>>       Device missing:                  0.00B
>>>       Used:                        128.00KiB
>>>       Free (estimated):             50.01TiB      (min: 50.01TiB)
>>>       Free (statfs, df):            50.01TiB
>>>       Data ratio:                       1.00
>>>       Metadata ratio:                   1.00
>>>       Global reserve:                3.25MiB      (used: 0.00B)
>>>       Multiple profiles:                  no
>>>
>>> Data,single: Size:8.00MiB, Used:0.00B (0.00%)
>>>      /dev/mapper/vg0-lv0     8.00MiB
>>>
>>> Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
>>>      /dev/mapper/vg0-lv0     8.00MiB
>>>
>>> System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
>>>      /dev/mapper/vg0-lv0     4.00MiB
>>>
>>> Unallocated:
>>>      /dev/mapper/vg0-lv0     9.98GiB
>>>      /dev/mapper/zero-data          50.00TiB
>>>
>>> $ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
>>> $ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY
>>>
>>> $ btrfs balance start --full-balance /mnt/lol
>>> Done, had to relocate 3 out of 3 chunks
>>>
>>> $ btrfs fi usage /mnt/lol
>>> Overall:
>>>       Device size:                  50.01TiB
>>>       Device allocated:              2.03GiB
>>>       Device unallocated:           50.01TiB
>>>       Device missing:                  0.00B
>>>       Used:                        640.00KiB
>>>       Free (estimated):             50.01TiB      (min: 50.01TiB)
>>>       Free (statfs, df):            50.01TiB
>>>       Data ratio:                       1.00
>>>       Metadata ratio:                   1.00
>>>       Global reserve:                3.25MiB      (used: 0.00B)
>>>       Multiple profiles:                  no
>>>
>>> Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
>>>      /dev/mapper/zero-data           1.00GiB
>>>
>>> Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
>>>      /dev/mapper/zero-data           1.00GiB
>>>
>>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>>      /dev/mapper/zero-data          32.00MiB
>>>
>>> Unallocated:
>>>      /dev/mapper/vg0-lv0    10.00GiB
>>>      /dev/mapper/zero-data          50.00TiB
>>>
>>>
>>> I expected that I would have data on /dev/mapper/zero-data and metadata
>>> on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
>>> device. Attempting to actually use the file system eventually fails, since
>>> the metadata is black-holed :)
>>>
>>> Did I make some mistake in how I used it, or is this a bug?
>>>
>>> Thanks,
>>> Boris
>>>
>>>> BR
>>>> G.Baroncelli
>>>>
>>>> Revision:
>>>> V9:
>>>> - rename dev_item->type to dev_item->flags
>>>> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
>>>>
>>>> V8:
>>>> - drop the ioctl API, instead use a sysfs one
>>>>
>>>> V7:
>>>> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
>>>> - leave in btrfs_tree.h only the costants
>>>> - removed the mount option (sic)
>>>> - correct an 'use before check' in the while loop (signaled
>>>>     by Zygo)
>>>> - add a 2nd sort to be sure that the device_info array is in the
>>>>     expected order
>>>>
>>>> V6:
>>>> - add further values to the hints: add the possibility to
>>>>     exclude a disk for a chunk type
>>>>
>>>>
>>>> Goffredo Baroncelli (6):
>>>>     btrfs: add flags to give an hint to the chunk allocator
>>>>     btrfs: export the device allocation_hint property in sysfs
>>>>     btrfs: change the device allocation_hint property via sysfs
>>>>     btrfs: add allocation_hint mode
>>>>     btrfs: rename dev_item->type to dev_item->flags
>>>>     btrfs: add allocation_hint option.
>>>>
>>>>    fs/btrfs/ctree.h                |  18 +++++-
>>>>    fs/btrfs/disk-io.c              |   4 +-
>>>>    fs/btrfs/super.c                |  17 ++++++
>>>>    fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
>>>>    fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
>>>>    fs/btrfs/volumes.h              |   7 ++-
>>>>    include/uapi/linux/btrfs_tree.h |  20 +++++-
>>>>    7 files changed, 232 insertions(+), 12 deletions(-)
>>>>
>>>> -- 
>>>> 2.34.1
>>>>
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2022-01-05 18:16       ` Goffredo Baroncelli
@ 2022-01-05 18:29         ` Boris Burkov
  0 siblings, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 18:29 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Zygo Blaxell, Josef Bacik, linux-btrfs, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones

On Wed, Jan 05, 2022 at 07:16:04PM +0100, Goffredo Baroncelli wrote:
> On 05/01/2022 19.07, Zygo Blaxell wrote:
> > On Wed, Jan 05, 2022 at 10:16:08AM +0100, Goffredo Baroncelli wrote:
> > > Hi Boris,
> > > 
> > > 
> > > 
> > > On 1/5/22 03:44, Boris Burkov wrote:
> > > [...]
> > > > 
> > > > This is cool, thanks for building it!
> > > > 
> > > > I'm playing with setting this up for a test I'm working on where I want
> > > > to send data to a dm-zero device. To that end, I applied this patchset
> > > > on top of misc-next and ran:
> > > > 
> > > > $ mkfs.btrfs -f /dev/vg0/lv0 -dsingle -msingle
> > > > $ mount /dev/vg0/lv0 /mnt/lol
> > > 
> > > You should mount the filesystem with
> > > 
> > > $ mount -o allocation_hint=1 /dev/vg0/lv0 /mnt/lol
> > > 
> > > In the previous iteration I missed the patch #6, which activates this option. You can drop patch #6 and avoid to pass this option.
> > 
> > Can we drop the mount option from the series?  It isn't needed.
> > 
> > Or, if we must have it (and I am in no way conceding that we do),
> > at least make it default to enabled.  Or turn the mount option into a
> > disable flag under the 'rescue=' option to make it clear this option is
> > not intended to be used in normal operation.
> 
> Frankly speaking it was a my mistake to add this patch. It was in the
> queue, but in the last patches sets I dropped it. However in the last
> one I forgot to drop it manually so it reappeared :-)
> 
> However I like your suggestion to add as 'rescue' option, where the
> default is "enabled".

A mount option adds a lot of testing burden:
enabling it where it was disabled
disabling it where it was enabled
does the above work on remount
does it always print what's expected in /proc/mounts
etc..

So I think it should have a strong justification for adding it, and the
xfstests will need to reflect the above.

Unless it's the best way to support some otherwise impossible recovery
for a realistic failure mode, I would personally suggest just skipping
it. However, I only skimmed through the discussion about recovery in the
older thread, FWIW.

> 
> @Josef,
> do you started to play with this patch ? If not can I send an update
> where the main change is the renaming of the properties from
> 
> PREFERRED_<X> to <X>_PREFERRED
> 
> (e.g. PREFERRED_DATA to DATA_PREFERRED) which are more correct ?
> 
> BR
> G.Baroncelli
> > 
> > > Please give me a feedback if this resolve.
> > > 
> > > BR
> > > G.Baroncelli
> > > 
> > > > $ btrfs device add /dev/mapper/zero-data /mnt/lol
> > > > $ btrfs fi usage /mnt/lol
> > > > Overall:
> > > >       Device size:                  50.01TiB
> > > >       Device allocated:             20.00MiB
> > > >       Device unallocated:           50.01TiB
> > > >       Device missing:                  0.00B
> > > >       Used:                        128.00KiB
> > > >       Free (estimated):             50.01TiB      (min: 50.01TiB)
> > > >       Free (statfs, df):            50.01TiB
> > > >       Data ratio:                       1.00
> > > >       Metadata ratio:                   1.00
> > > >       Global reserve:                3.25MiB      (used: 0.00B)
> > > >       Multiple profiles:                  no
> > > > 
> > > > Data,single: Size:8.00MiB, Used:0.00B (0.00%)
> > > >      /dev/mapper/vg0-lv0     8.00MiB
> > > > 
> > > > Metadata,single: Size:8.00MiB, Used:112.00KiB (1.37%)
> > > >      /dev/mapper/vg0-lv0     8.00MiB
> > > > 
> > > > System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
> > > >      /dev/mapper/vg0-lv0     4.00MiB
> > > > 
> > > > Unallocated:
> > > >      /dev/mapper/vg0-lv0     9.98GiB
> > > >      /dev/mapper/zero-data          50.00TiB
> > > > 
> > > > $ ./btrfs property set -t device /dev/mapper/zero-data allocation_hint DATA_ONLY
> > > > $ ./btrfs property set -t device /dev/vg0/lv0 allocation_hint METADATA_ONLY
> > > > 
> > > > $ btrfs balance start --full-balance /mnt/lol
> > > > Done, had to relocate 3 out of 3 chunks
> > > > 
> > > > $ btrfs fi usage /mnt/lol
> > > > Overall:
> > > >       Device size:                  50.01TiB
> > > >       Device allocated:              2.03GiB
> > > >       Device unallocated:           50.01TiB
> > > >       Device missing:                  0.00B
> > > >       Used:                        640.00KiB
> > > >       Free (estimated):             50.01TiB      (min: 50.01TiB)
> > > >       Free (statfs, df):            50.01TiB
> > > >       Data ratio:                       1.00
> > > >       Metadata ratio:                   1.00
> > > >       Global reserve:                3.25MiB      (used: 0.00B)
> > > >       Multiple profiles:                  no
> > > > 
> > > > Data,single: Size:1.00GiB, Used:512.00KiB (0.05%)
> > > >      /dev/mapper/zero-data           1.00GiB
> > > > 
> > > > Metadata,single: Size:1.00GiB, Used:112.00KiB (0.01%)
> > > >      /dev/mapper/zero-data           1.00GiB
> > > > 
> > > > System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
> > > >      /dev/mapper/zero-data          32.00MiB
> > > > 
> > > > Unallocated:
> > > >      /dev/mapper/vg0-lv0    10.00GiB
> > > >      /dev/mapper/zero-data          50.00TiB
> > > > 
> > > > 
> > > > I expected that I would have data on /dev/mapper/zero-data and metadata
> > > > on /dev/mapper/vg0-lv0, but it seems both of them were written to the zero
> > > > device. Attempting to actually use the file system eventually fails, since
> > > > the metadata is black-holed :)
> > > > 
> > > > Did I make some mistake in how I used it, or is this a bug?
> > > > 
> > > > Thanks,
> > > > Boris
> > > > 
> > > > > BR
> > > > > G.Baroncelli
> > > > > 
> > > > > Revision:
> > > > > V9:
> > > > > - rename dev_item->type to dev_item->flags
> > > > > - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> > > > > 
> > > > > V8:
> > > > > - drop the ioctl API, instead use a sysfs one
> > > > > 
> > > > > V7:
> > > > > - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> > > > > - leave in btrfs_tree.h only the costants
> > > > > - removed the mount option (sic)
> > > > > - correct an 'use before check' in the while loop (signaled
> > > > >     by Zygo)
> > > > > - add a 2nd sort to be sure that the device_info array is in the
> > > > >     expected order
> > > > > 
> > > > > V6:
> > > > > - add further values to the hints: add the possibility to
> > > > >     exclude a disk for a chunk type
> > > > > 
> > > > > 
> > > > > Goffredo Baroncelli (6):
> > > > >     btrfs: add flags to give an hint to the chunk allocator
> > > > >     btrfs: export the device allocation_hint property in sysfs
> > > > >     btrfs: change the device allocation_hint property via sysfs
> > > > >     btrfs: add allocation_hint mode
> > > > >     btrfs: rename dev_item->type to dev_item->flags
> > > > >     btrfs: add allocation_hint option.
> > > > > 
> > > > >    fs/btrfs/ctree.h                |  18 +++++-
> > > > >    fs/btrfs/disk-io.c              |   4 +-
> > > > >    fs/btrfs/super.c                |  17 ++++++
> > > > >    fs/btrfs/sysfs.c                |  73 ++++++++++++++++++++++
> > > > >    fs/btrfs/volumes.c              | 105 ++++++++++++++++++++++++++++++--
> > > > >    fs/btrfs/volumes.h              |   7 ++-
> > > > >    include/uapi/linux/btrfs_tree.h |  20 +++++-
> > > > >    7 files changed, 232 insertions(+), 12 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.34.1
> > > > > 
> > > 
> > > 
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs
  2021-12-17 18:47 ` [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
@ 2022-01-05 21:57   ` Boris Burkov
  0 siblings, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 21:57 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On Fri, Dec 17, 2021 at 07:47:18PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Eport the device allocation_hint property via
> /sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/sysfs.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index beb7f72d50b8..a8d918700d2b 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1575,6 +1575,17 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
>  }
>  BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
>  
> +static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
> +					struct kobj_attribute *a, char *buf)
> +{
> +	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
> +						   devid_kobj);
> +
> +	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
> +		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK );

I think I lightly prefer the string based interface like what was
discussed on V8, but this is fine as well, especially with the progs
change in mind to add the extra usability.

> +}
> +BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
> +
>  /*
>   * Information about one device.
>   *
> @@ -1588,6 +1599,7 @@ static struct attribute *devid_attrs[] = {
>  	BTRFS_ATTR_PTR(devid, replace_target),
>  	BTRFS_ATTR_PTR(devid, scrub_speed_max),
>  	BTRFS_ATTR_PTR(devid, writeable),
> +	BTRFS_ATTR_PTR(devid, allocation_hint),
>  	NULL
>  };
>  ATTRIBUTE_GROUPS(devid);
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator
  2021-12-17 18:47 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
@ 2022-01-05 22:10   ` Boris Burkov
  2022-01-06  8:53     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 22:10 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On Fri, Dec 17, 2021 at 07:47:17PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Add the following flags to give an hint about which chunk should be
> allocated in which a disk:
> 
> - BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA
>   preferred data chunk, but metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA
>   preferred metadata chunk, but data chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>   only metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>   only data chunk allowed

Weighing in on the naming discussion:

I think DATA_ONLY > ONLY_DATA, with my best argument for this subjective
opinion being that it follows the example of read-only.

Therefore, I think you should go with DATA_ONLY, DATA_PREFERRED,
METADATA_ONLY, METADATA_PREFERRED. Definitely put ONLY/PREFERRED on the
same side of DATA/METADATA for all four, regardless of which you choose.

Looks good otherwise.

> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
> ---
>  include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 5416f1f1a77a..55da906c2eac 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -386,6 +386,22 @@ struct btrfs_key {
>  	__u64 offset;
>  } __attribute__ ((__packed__));
>  
> +/* dev_item.type */
> +
> +/* btrfs chunk allocation hint */
> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> +/* btrfs chunk allocation hint mask */
> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) -1)
> +/* preferred data chunk, but metadata chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA	(0ULL)
> +/* preferred metadata chunk, but data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA	(1ULL)
> +/* only metadata chunk are allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
> +/* only data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
> +
>  struct btrfs_dev_item {
>  	/* the internal btrfs device id */
>  	__le64 devid;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode
  2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (6 preceding siblings ...)
  2022-01-05  2:44 ` [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Boris Burkov
@ 2022-01-05 22:21 ` Boris Burkov
  7 siblings, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 22:21 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

I noticed this patchset still suffers from some checkpatch failures with
stuff like spaces around parens and minus signs, tabs vs spaces, etc. We
generally try to keep checkpatch happy in btrfs, though of course reason
should always prevail.

FWIW, I have it rebased on kdave/misc-next and ran:
checkpatch.pl -g kdave/misc-next..

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] btrfs: add allocation_hint mode
  2021-12-17 18:47 ` [PATCH 4/6] btrfs: add allocation_hint mode Goffredo Baroncelli
@ 2022-01-05 23:48   ` Boris Burkov
  2022-01-06 10:09     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 23:48 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On Fri, Dec 17, 2021 at 07:47:20PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> The chunk allocation policy is modified as follow.
> 
> Each disk may have one of the following tags:
> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> 
> During a *mixed data/metadata* chunk allocation, BTRFS works as
> usual.
> 
> During a *data* chunk allocation, the space are searched first in
> BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
> tagged disks. If no space is found or the space found is not enough (eg.
> in raid5, only two disks are available), then also the disks tagged
> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are evaluated. If even in this
> case this the space is not sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
> for a data BG allocation.
> 
> During a *metadata* chunk allocation, the space are searched first in
> BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> tagged disks. If no space is found or the space found is not enough (eg.
> in raid5, only two disks are available), then also the disks tagged
> BTRFS_DEV_ALLOCATION_PREFERRED_DATA are considered. If even in this
> case this the space is not sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
> for a metadata BG allocation.
> 
> By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
> so the default behavior happens. If the user prefer to store the
> metadata in the faster disks (e.g. the SSD), he can tag these with
> BTRFS_DEV_ALLOCATION_PREFERRED_DATA: in this case the data BG go in the

is this a typo? I would expect they should mark the disk with
METADATA_PREFERRED to get metadata to go there. Also, that value is
already the default, so setting it should be a no-op.

> BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BG in the
> others, until there is enough space. Only if one disks set is filled,

I think this may be another typo: "is not enough space" seems to make
more sense.

> the other is occupied.
> 
> WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
> this means that this disk will never be used for allocating metadata
> increasing the likelihood of exhausting the metadata space.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/volumes.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 94 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 806b599c6a46..beee7d1ae79d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -184,6 +184,16 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>  	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>  }
>  
> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
> +

This logic with -1..2 and then negating the hint is quite clever. I
would add a comment to make it obvious what's happening, or a helper
static function that sets a hint given the device_info, the ctl, and the
device. You could also consider numbering from to 3 and flipping the
order by doing (count - hint), which keeps things positive, at least.

I think your algorithm is fine, though.

> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
> +	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA] = 0,
> +	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA] = 1,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
> +};
> +
>  const char *btrfs_bg_type_to_raid_name(u64 flags)
>  {
>  	const int index = btrfs_bg_flags_to_raid_index(flags);
> @@ -5037,13 +5047,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  }
>  
>  /*
> - * sort the devices in descending order by max_avail, total_avail
> + * sort the devices in descending order by alloc_hint,
> + * max_avail, total_avail
>   */
>  static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
>  	const struct btrfs_device_info *di_a = a;
>  	const struct btrfs_device_info *di_b = b;
>  
> +	if (di_a->alloc_hint > di_b->alloc_hint)
> +		return -1;
> +	if (di_a->alloc_hint < di_b->alloc_hint)
> +		return 1;
>  	if (di_a->max_avail > di_b->max_avail)
>  		return -1;
>  	if (di_a->max_avail < di_b->max_avail)
> @@ -5206,6 +5221,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int hint;
> +	int i;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -5258,16 +5275,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +
> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> +			/*
> +			 * if mixed bg set all the alloc_hint
> +			 * fields to the same value, so the sorting
> +			 * is not affected
> +			 */
> +			devices_info[ndevs].alloc_hint = 0;
> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_METADATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,
> +			 * sort also by hint (data disk
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_DATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,

typo: metadata chunk

> +			 * sort also by hint (metadata hint
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> +		}
> +
>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	/*
> +	 * no devices available
> +	 */
> +	if (!ndevs)
> +		return 0;
> +

What doesn't handle 0 devices properly? As far as I can tell, sort will
be fine, and we'll skip the while and for loops.

>  	/*
>  	 * now sort the devices by hole size / available space

modify the comment to include that this sort cares about hint.

>  	 */
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  

"restarting" the function here feels off to me. I'm wondering if you
could refactor your logic to make it clearer and avoid the ugly logic
reset mid function. You are going from:

- filter/prepare devices
- sort by avail

to

- filter/prepare devices
- sort by hint/avail
- take all by hint until we take enough
- sort by avail

that second step of "take by hint; sort by avail" could possibly be a
new filter function run after gather_device_info and before
decide_stripe_size. You could name it something like
"reduce_devices_by_hint" or "take_devices_by_hint".

> +	/*
> +	 * select the minimum set of disks grouped by hint that
> +	 * can host the chunk
> +	 */
> +	ndevs = 0;
> +	while (ndevs < ctl->ndevs) {
> +		hint = devices_info[ndevs++].alloc_hint;
> +		while (ndevs < ctl->ndevs &&
> +		       devices_info[ndevs].alloc_hint == hint)
> +				ndevs++;
> +		if (ndevs >= ctl->devs_min)
> +			break;
> +	}
> +
> +	BUG_ON(ndevs > ctl->ndevs);

I think this BUG_ON is overly paranoid. Is it defensive against a future
logic error with the nested while loops? (which I agree one should be
careful with...)

> +	ctl->ndevs = ndevs;
> +
> +	/*
> +	 * the next layers require the devices_info ordered by
> +	 * max_avail. If we are returing two (or more) different

typo: returning

> +	 * group of alloc_hint, this is not always true. So sort
> +	 * these gain.

typo: again

> +	 */
> +
> +	for (i = 0 ; i < ndevs ; i++)
> +		devices_info[i].alloc_hint = 0;
> +
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 5097c0c12a8e..61c0cba045e9 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -406,6 +406,7 @@ struct btrfs_device_info {
>  	u64 dev_offset;
>  	u64 max_avail;
>  	u64 total_avail;
> +	int alloc_hint;
>  };
>  
>  struct btrfs_raid_attr {
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags
  2021-12-17 18:47 ` [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
@ 2022-01-05 23:50   ` Boris Burkov
  0 siblings, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2022-01-05 23:50 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On Fri, Dec 17, 2021 at 07:47:21PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Rename the field type of dev_item from 'type' to 'flags' changing the
> struct btrfs_device and btrfs_dev_item.
> 
> Signed-off-by: Goffredo Baroncelli <krejack@inwind.it>
> ---
>  fs/btrfs/ctree.h                |  4 ++--
>  fs/btrfs/disk-io.c              |  2 +-
>  fs/btrfs/sysfs.c                | 17 +++++++++--------
>  fs/btrfs/volumes.c              | 10 +++++-----
>  fs/btrfs/volumes.h              |  4 ++--
>  include/uapi/linux/btrfs_tree.h |  4 ++--
>  6 files changed, 21 insertions(+), 20 deletions(-)

Can you also change the comment in include/uapi/linux/btrfs_tree.h to
reflect changing from dev_item.type to dev_item.flags?

LGTM otherwise.

> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 459d00211181..778c7c807289 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1669,7 +1669,7 @@ static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
>  }
>  
>  
> -BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
> +BTRFS_SETGET_FUNCS(device_flags, struct btrfs_dev_item, flags, 64);
>  BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
>  BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
>  BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
> @@ -1682,7 +1682,7 @@ BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
>  BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
>  BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
>  
> -BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
> +BTRFS_SETGET_STACK_FUNCS(stack_device_flags, struct btrfs_dev_item, flags, 64);
>  BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
>  			 total_bytes, 64);
>  BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index fc7dd5109806..02ffb8bc7d6b 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4342,7 +4342,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
>  			continue;
>  
>  		btrfs_set_stack_device_generation(dev_item, 0);
> -		btrfs_set_stack_device_type(dev_item, dev->type);
> +		btrfs_set_stack_device_flags(dev_item, dev->flags);
>  		btrfs_set_stack_device_id(dev_item, dev->devid);
>  		btrfs_set_stack_device_total_bytes(dev_item,
>  						   dev->commit_total_bytes);
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 53acc66065dd..be4196a1645c 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1582,7 +1582,7 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
>  						   devid_kobj);
>  
>  	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
> -		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK );
> +		device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK );
>  }
>  
>  static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
> @@ -1595,7 +1595,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
>  	int ret;
>  	struct btrfs_trans_handle *trans;
>  
> -	u64 type, prev_type;
> +	u64 flags, prev_flags;
>  
>  	device = container_of(kobj, struct btrfs_device, devid_kobj);
>  	fs_info = device->fs_info;
> @@ -1606,24 +1606,25 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
>  	if (sb_rdonly(fs_info->sb))
>  		return -EROFS;
>  
> -	ret = kstrtou64(buf, 0, &type);
> +	ret = kstrtou64(buf, 0, &flags);
>  	if (ret < 0)
>  		return -EINVAL;
>  
>  	/* for now, allow to touch only the 'allocation hint' bits */
> -	if (type & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
> +	if (flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
>  		return -EINVAL;
>  
>  	/* check if a change is really needed */
> -	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
> +	if ((device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK) == flags)
>  		return len;
>  
>  	trans = btrfs_start_transaction(root, 1);
>  	if (IS_ERR(trans))
>  		return PTR_ERR(trans);
>  
> -	prev_type = device->type;
> -	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
> +	prev_flags = device->flags;
> +	device->flags = (device->flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK) |
> +			flags;
>  
>  	ret = btrfs_update_device(trans, device);
>  
> @@ -1639,7 +1640,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
>  
>  	return len;
>  abort:
> -	device->type = prev_type;
> +	device->flags = prev_flags;
>  	return  ret;
>  }
>  BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index beee7d1ae79d..9184570c51b0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1876,7 +1876,7 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
>  
>  	btrfs_set_device_id(leaf, dev_item, device->devid);
>  	btrfs_set_device_generation(leaf, dev_item, 0);
> -	btrfs_set_device_type(leaf, dev_item, device->type);
> +	btrfs_set_device_flags(leaf, dev_item, device->flags);
>  	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
>  	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
>  	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
> @@ -2900,7 +2900,7 @@ noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>  	dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
>  
>  	btrfs_set_device_id(leaf, dev_item, device->devid);
> -	btrfs_set_device_type(leaf, dev_item, device->type);
> +	btrfs_set_device_flags(leaf, dev_item, device->flags);
>  	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
>  	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
>  	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
> @@ -5285,7 +5285,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  			 */
>  			devices_info[ndevs].alloc_hint = 0;
>  		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> -			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
>  
>  			/*
>  			 * skip BTRFS_DEV_METADATA_ONLY disks
> @@ -5299,7 +5299,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  			 */
>  			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>  		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> -			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
>  
>  			/*
>  			 * skip BTRFS_DEV_DATA_ONLY disks
> @@ -7293,7 +7293,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
>  	device->commit_total_bytes = device->disk_total_bytes;
>  	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
>  	device->commit_bytes_used = device->bytes_used;
> -	device->type = btrfs_device_type(leaf, dev_item);
> +	device->flags = btrfs_device_flags(leaf, dev_item);
>  	device->io_align = btrfs_device_io_align(leaf, dev_item);
>  	device->io_width = btrfs_device_io_width(leaf, dev_item);
>  	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 61c0cba045e9..27ecf062d50c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -96,8 +96,8 @@ struct btrfs_device {
>  
>  	/* optimal io width for this device */
>  	u32 io_width;
> -	/* type and info about this device */
> -	u64 type;
> +	/* device flags (e.g. allocation hint) */
> +	u64 flags;
>  
>  	/* minimal io size for this device */
>  	u32 sector_size;
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index 55da906c2eac..f9891c94a75e 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -421,8 +421,8 @@ struct btrfs_dev_item {
>  	/* minimal io size for this device */
>  	__le32 sector_size;
>  
> -	/* type and info about this device */
> -	__le64 type;
> +	/* device flags (e.g. allocation hint) */
> +	__le64 flags;
>  
>  	/* expected generation for this device */
>  	__le64 generation;
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator
  2022-01-05 22:10   ` Boris Burkov
@ 2022-01-06  8:53     ` Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2022-01-06  8:53 UTC (permalink / raw)
  To: Boris Burkov
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On 05/01/2022 23.10, Boris Burkov wrote:
> On Fri, Dec 17, 2021 at 07:47:17PM +0100, Goffredo Baroncelli wrote:
[...]
> 
> Therefore, I think you should go with DATA_ONLY, DATA_PREFERRED,
> METADATA_ONLY, METADATA_PREFERRED. Definitely put ONLY/PREFERRED on the
> same side of DATA/METADATA for all four, regardless of which you choose.

This was the conclusion that I already reached with Zygo. The next patches set
will rename the flag in this way.


> 
> Looks good otherwise.
> 
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
>> ---
>>   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index 5416f1f1a77a..55da906c2eac 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -386,6 +386,22 @@ struct btrfs_key {
>>   	__u64 offset;
>>   } __attribute__ ((__packed__));
>>   
>> +/* dev_item.type */
>> +
>> +/* btrfs chunk allocation hint */
>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
>> +/* btrfs chunk allocation hint mask */
>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
>> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) -1)
>> +/* preferred data chunk, but metadata chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA	(0ULL)
>> +/* preferred metadata chunk, but data chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA	(1ULL)
>> +/* only metadata chunk are allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
>> +/* only data chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
>> +
>>   struct btrfs_dev_item {
>>   	/* the internal btrfs device id */
>>   	__le64 devid;
>> -- 
>> 2.34.1
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] btrfs: add allocation_hint mode
  2022-01-05 23:48   ` Boris Burkov
@ 2022-01-06 10:09     ` Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2022-01-06 10:09 UTC (permalink / raw)
  To: Boris Burkov
  Cc: linux-btrfs, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Goffredo Baroncelli

On 06/01/2022 00.48, Boris Burkov wrote:
> On Fri, Dec 17, 2021 at 07:47:20PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> The chunk allocation policy is modified as follow.
>>
>> Each disk may have one of the following tags:
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
>>
>> During a *mixed data/metadata* chunk allocation, BTRFS works as
>> usual.
>>
>> During a *data* chunk allocation, the space are searched first in
>> BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
>> tagged disks. If no space is found or the space found is not enough (eg.
>> in raid5, only two disks are available), then also the disks tagged
>> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are evaluated. If even in this
>> case this the space is not sufficient, -ENOSPC is raised.
>> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
>> for a data BG allocation.
>>
>> During a *metadata* chunk allocation, the space are searched first in
>> BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> tagged disks. If no space is found or the space found is not enough (eg.
>> in raid5, only two disks are available), then also the disks tagged
>> BTRFS_DEV_ALLOCATION_PREFERRED_DATA are considered. If even in this
>> case this the space is not sufficient, -ENOSPC is raised.
>> A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
>> for a metadata BG allocation.
>>
>> By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
>> so the default behavior happens. If the user prefer to store the
>> metadata in the faster disks (e.g. the SSD), he can tag these with
>> BTRFS_DEV_ALLOCATION_PREFERRED_DATA: in this case the data BG go in the
> 
> is this a typo? I would expect they should mark the disk with
> METADATA_PREFERRED to get metadata to go there. Also, that value is
> already the default, so setting it should be a no-op.

Yes it is a typo.
> 
>> BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BG in the
>> others, until there is enough space. Only if one disks set is filled,
> 
> I think this may be another typo: "is not enough space" seems to make
> more sense.
> 

I rephrased a bit the commit message:
----------
btrfs: add allocation_hint mode

The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
in raid5, only two disks are available), then the disks taggged
BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
are also considered. If even in this case this the space is not
sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the same algorithm applies swapping
_DATA_ and _METADATA_.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
so BTRFS behaves as usual.

If the user prefers to store the metadata in the faster disks (e.g. SSD),
he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED disks
and the data BG in the others ones. When a disks set is filled, the
other considered.

----------

>> the other is occupied.
>>
>> WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
>> this means that this disk will never be used for allocating metadata
>> increasing the likelihood of exhausting the metadata space.
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/volumes.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.h |  1 +
>>   2 files changed, 94 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 806b599c6a46..beee7d1ae79d 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -184,6 +184,16 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>>   	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>>   }
>>   
>> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
>> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
>> +
> 
> This logic with -1..2 and then negating the hint is quite clever. I
> would add a comment to make it obvious what's happening, or a helper
> static function that sets a hint given the device_info, the ctl, and the
> device. You could also consider numbering from to 3 and flipping the
> order by doing (count - hint), which keeps things positive, at least.
> 
> I think your algorithm is fine, though

I add a comment like

   /*
    *      The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
    *      good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
    *      (for backward compatibility reason), and the other
    *      values are greater (because the field is unsigned). So we
    *      need a map that rearranges the order giving to _DATA_PREFERRED
    *      an intermediate priority.
    *      These values give to METADATA_ONLY the highest priority, and are
    *      valid for metadata BG allocation. When a data
    *      BG is allocated we negate these values to reverse the priority.
    */
   static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {

.
> 
>> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
>> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
>> +	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_DATA] = 0,
>> +	[BTRFS_DEV_ALLOCATION_HINT_PREFERRED_METADATA] = 1,
>> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
>> +};
>> +
>>   const char *btrfs_bg_type_to_raid_name(u64 flags)
>>   {
>>   	const int index = btrfs_bg_flags_to_raid_index(flags);
>> @@ -5037,13 +5047,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   }
>>   
>>   /*
>> - * sort the devices in descending order by max_avail, total_avail
>> + * sort the devices in descending order by alloc_hint,
>> + * max_avail, total_avail
>>    */
>>   static int btrfs_cmp_device_info(const void *a, const void *b)
>>   {
>>   	const struct btrfs_device_info *di_a = a;
>>   	const struct btrfs_device_info *di_b = b;
>>   
>> +	if (di_a->alloc_hint > di_b->alloc_hint)
>> +		return -1;
>> +	if (di_a->alloc_hint < di_b->alloc_hint)
>> +		return 1;
>>   	if (di_a->max_avail > di_b->max_avail)
>>   		return -1;
>>   	if (di_a->max_avail < di_b->max_avail)
>> @@ -5206,6 +5221,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int hint;
>> +	int i;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -5258,16 +5275,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +
>> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
>> +			/*
>> +			 * if mixed bg set all the alloc_hint
>> +			 * fields to the same value, so the sorting
>> +			 * is not affected
>> +			 */
>> +			devices_info[ndevs].alloc_hint = 0;
>> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_METADATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (data disk
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_DATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
> 
> typo: metadata chunk
> 
>> +			 * sort also by hint (metadata hint
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
>> +		}
>> +
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	/*
>> +	 * no devices available
>> +	 */
>> +	if (!ndevs)
>> +		return 0;
>> +
> 
> What doesn't handle 0 devices properly? As far as I can tell, sort will
> be fine, and we'll skip the while and for loops.
It may be removed; however we already know that the rest of the code
is unneeded...

> 
>>   	/*
>>   	 * now sort the devices by hole size / available space
> 
> modify the comment to include that this sort cares about hint.
ok
> 
>>   	 */
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
> 
> "restarting" the function here feels off to me. I'm wondering if you
> could refactor your logic to make it clearer and avoid the ugly logic
> reset mid function. You are going from:
> 
> - filter/prepare devices
> - sort by avail
> 
> to
> 
> - filter/prepare devices
> - sort by hint/avail
> - take all by hint until we take enough
> - sort by avail
> 
> that second step of "take by hint; sort by avail" could possibly be a
> new filter function run after gather_device_info and before
> decide_stripe_size. You could name it something like
> "reduce_devices_by_hint" or "take_devices_by_hint".

In the next iteration I am refactoring the code of sorting (from
sort by hint/avail') in a separate function.

>> +	/*
>> +	 * select the minimum set of disks grouped by hint that
>> +	 * can host the chunk
>> +	 */
>> +	ndevs = 0;
>> +	while (ndevs < ctl->ndevs) {
>> +		hint = devices_info[ndevs++].alloc_hint;
>> +		while (ndevs < ctl->ndevs &&
>> +		       devices_info[ndevs].alloc_hint == hint)
>> +				ndevs++;
>> +		if (ndevs >= ctl->devs_min)
>> +			break;
>> +	}
>> +
>> +	BUG_ON(ndevs > ctl->ndevs);
> 
> I think this BUG_ON is overly paranoid. Is it defensive against a future
> logic error with the nested while loops? (which I agree one should be
> careful with...)
> 

Yes, because this loop is not so obvious I preferred to leave a BOG_ON
to be sure. Until now it was never triggered, so now I am removing it.


>> +	ctl->ndevs = ndevs;
>> +
>> +	/*
>> +	 * the next layers require the devices_info ordered by
>> +	 * max_avail. If we are returing two (or more) different
> 
> typo: returning
Corrected
> 
>> +	 * group of alloc_hint, this is not always true. So sort
>> +	 * these gain.
> 
> typo: again
Corrected
> 
>> +	 */
>> +
>> +	for (i = 0 ; i < ndevs ; i++)
>> +		devices_info[i].alloc_hint = 0;
>> +
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +
>>   	return 0;
>>   }
>>   
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 5097c0c12a8e..61c0cba045e9 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -406,6 +406,7 @@ struct btrfs_device_info {
>>   	u64 dev_offset;
>>   	u64 max_avail;
>>   	u64 total_avail;
>> +	int alloc_hint;
>>   };
>>   
>>   struct btrfs_raid_attr {
>> -- 
>> 2.34.1
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator
  2022-01-06 17:49 [PATCH 0/6][V10] btrfs: " Goffredo Baroncelli
@ 2022-01-06 17:49 ` Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2022-01-06 17:49 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following flags to give an hint about which chunk should be
allocated in which disk:

- BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
  preferred for data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
  preferred for metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 5416f1f1a77a..02955d5fcd21 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -386,6 +386,22 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+
+/* btrfs chunk allocation hint */
+#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
+/* btrfs chunk allocation hint mask */
+#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
+	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-01-06 17:49 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-17 18:47 [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Goffredo Baroncelli
2021-12-17 18:47 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2022-01-05 22:10   ` Boris Burkov
2022-01-06  8:53     ` Goffredo Baroncelli
2021-12-17 18:47 ` [PATCH 2/6] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
2022-01-05 21:57   ` Boris Burkov
2021-12-17 18:47 ` [PATCH 3/6] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
2021-12-17 18:47 ` [PATCH 4/6] btrfs: add allocation_hint mode Goffredo Baroncelli
2022-01-05 23:48   ` Boris Burkov
2022-01-06 10:09     ` Goffredo Baroncelli
2021-12-17 18:47 ` [PATCH 5/6] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
2022-01-05 23:50   ` Boris Burkov
2021-12-17 18:47 ` [PATCH 6/6] btrfs: add allocation_hint option Goffredo Baroncelli
2022-01-05  2:44 ` [RFC][V9][PATCH 0/6] btrfs: allocation_hint mode Boris Burkov
2022-01-05  9:16   ` Goffredo Baroncelli
2022-01-05 17:55     ` Boris Burkov
2022-01-05 18:07     ` Zygo Blaxell
2022-01-05 18:16       ` Goffredo Baroncelli
2022-01-05 18:29         ` Boris Burkov
2022-01-05 22:21 ` Boris Burkov
2022-01-06 17:49 [PATCH 0/6][V10] btrfs: " Goffredo Baroncelli
2022-01-06 17:49 ` [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).