All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5][V12] btrfs: allocation_hint
@ 2022-03-06 18:14 Goffredo Baroncelli
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Hi all,

This patches set was born after some discussion between me, Zygo and Josef.
Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.

Some further information about a real use case can be found in
https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/

Recently Shafeeq told me that he is interested too, due to the performance gain.

In V8, revision I switched away from an ioctl API in favor of a sysfs API (
see patch #2 and #3).

In V9, I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
Moreover I renamed dev_info->type to dev_info->flags.

In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED; I added
another devinfo property, called major_minor. For now it is unused;
the plan is to use this in btrfs-progs to go from the block device to the
filesystem.
First client will be "btrfs prop get/set allocation_hint", but I see others
possible clients.
Finally I made some cleanup, and remove the mount option 'allocation_hint'

In V12 I removed some features introduced in V11 (like major_minor sysfs
filed and the 'feature' file) and changed the values to write in the
sysfs files: until V12 the values are numericals ones. Now are strings ones:
- METADATA_ONLY
- METADATA_PREFERRED
- DATA_PREFERRED
- DATA_ONLY

The idea behind this patches set, is to dedicate some disks (the fastest one)
to the metadata chunk. My initial idea was a "soft" hint. However Zygo
asked an option for a "strong" hint (== mandatory). The result is that
each disk can be "tagged" by one of the following flags:
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
- BTRFS_DEV_ALLOCATION_DATA_PREFERRED
- BTRFS_DEV_ALLOCATION_DATA_ONLY

When the chunk allocator search a disks to allocate a chunk, scans the disks
in an order decided by these tags. For metadata, the order is:
*_METADATA_ONLY
*_METADATA_PREFERRED
*_DATA_PREFERRED

The *_DATA_ONLY are not eligible from metadata chunk allocation.

For the data chunk, the order is reversed, and the *_METADATA_ONLY are
excluded.

The exact sort logic is to sort first for the "tag", and then for the space
available. If there is no space available, the next "tag" disks set are
selected.

To set these tags, a new property called "allocation_hint" was created.
There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
allocation_hint disk property].

$ sudo mount /dev/loop0 /mnt/test-btrfs/
$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:           1.34GiB
    Device unallocated:           1.41GiB
    Device missing:             0.00B
    Used:             400.89MiB
    Free (estimated):           1.04GiB    (min: 1.04GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB
   /dev/loop2     127.00MiB
   /dev/loop3     127.00MiB
   /dev/loop4     127.00MiB
   /dev/loop5     127.00MiB

Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
   /dev/loop1     256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop0      32.00MiB

Unallocated:
   /dev/loop0     704.00MiB
   /dev/loop1     480.00MiB
   /dev/loop2       1.00MiB
   /dev/loop3       1.00MiB
   /dev/loop4       1.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7     128.00MiB

# change the tag of some disks

$ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY

$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo btrfs bal start --full-balance /mnt/test-btrfs/
$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:         735.00MiB
    Device unallocated:           2.03GiB
    Device missing:             0.00B
    Used:             400.72MiB
    Free (estimated):           1.10GiB    (min: 1.10GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB

Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
   /dev/loop5     127.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop7      32.00MiB

Unallocated:
   /dev/loop0     736.00MiB
   /dev/loop1     736.00MiB
   /dev/loop2     128.00MiB
   /dev/loop3     128.00MiB
   /dev/loop4     128.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7      96.00MiB


#As you can see all the metadata were placed on the disk loop5/loop7 even if
#the most empty one are loop0 and loop1.

Furher works:
- the tool which show the space available should consider the tagging (eg
  the disks tagged by _METADATA_ONLY should be excluded from the data
  availability)
- allow btrfs-prog to change the allocation_hint even when the filesystem
  is not mountead.

In following emails, there will be the btrfs-progs patches set and the xfstest
suite patches set.

Comments are welcome
BR
G.Baroncelli

Revision:
V12:
- removed the property /sys/fs/btrfs/features/allocation_hint
- changed the value to write in <devid>/allocation hint from numericals one
  to string one (METADATA_PREFERRED, DATA_ONLY....)
- removed major_minor sysfs field (added in v11)
V11:
- added the property /sys/fs/btrfs/features/allocation_hint

V10:
- renamed two disk tags/constants:
        - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
        - *_DATA_PREFERRED -> *_DATA_EFERRED
- add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
- revise some commit description
- refactored the code of gather_device_info(): one portion of this code
  is moved in a separate function called sort_and_reduce_device_info()
  for clarity.
- removed the 'allocation_hint' mount option 

V9:
- rename dev_item->type to dev_item->flags
- rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint

V8:
- drop the ioctl API, instead use a sysfs one

V7:
- make more room in the struct btrfs_ioctl_dev_properties up to 1K
- leave in btrfs_tree.h only the costants
- removed the mount option (sic)
- correct an 'use before check' in the while loop (signaled
  by Zygo)
- add a 2nd sort to be sure that the device_info array is in the
  expected order

V6:
- add further values to the hints: add the possibility to
  exclude a disk for a chunk type 



Goffredo Baroncelli (5):
  btrfs: add flags to give an hint to the chunk allocator
  btrfs: export the device allocation_hint property in sysfs
  btrfs: change the device allocation_hint property via sysfs
  btrfs: add allocation_hint mode
  btrfs: rename dev_item->type to dev_item->flags

 fs/btrfs/ctree.h                |   4 +-
 fs/btrfs/disk-io.c              |   2 +-
 fs/btrfs/sysfs.c                | 106 ++++++++++++++++++++++++++++
 fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h              |   7 +-
 include/uapi/linux/btrfs_tree.h |  20 +++++-
 6 files changed, 246 insertions(+), 14 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
@ 2022-03-06 18:14 ` Goffredo Baroncelli
  2022-03-22 17:51   ` Josef Bacik
                     ` (2 more replies)
  2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following flags to give an hint about which chunk should be
allocated in which disk:

- BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
  preferred for data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
  preferred for metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index b069752a8ecf..e0d842c2e616 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -389,6 +389,22 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+
+/* btrfs chunk allocation hint */
+#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
+/* btrfs chunk allocation hint mask */
+#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
+	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
@ 2022-03-06 18:14 ` Goffredo Baroncelli
  2022-03-08 13:25   ` kernel test robot
  2022-03-22 18:05   ` Josef Bacik
  2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Export the device allocation_hint property via
/sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 17389a42a3ab..59d92a385a96 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1578,6 +1578,36 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
 
+
+struct allocation_hint_name_t {
+	const char *name;
+	const u64 value;
+} allocation_hint_name[] = {
+	{ "DATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED },
+	{ "METADATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED },
+	{ "DATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY },
+	{ "METADATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY },
+};
+
+static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
+					struct kobj_attribute *a, char *buf)
+{
+	int i;
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
+		if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) !=
+		    allocation_hint_name[i].value)
+			continue;
+
+		return scnprintf(buf, PAGE_SIZE, "%s\n",
+			allocation_hint_name[i].name);
+	}
+	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");
+}
+BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
 /*
  * Information about one device.
  *
@@ -1591,6 +1621,7 @@ static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, scrub_speed_max),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, allocation_hint),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
  2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
@ 2022-03-06 18:14 ` Goffredo Baroncelli
  2022-03-22 19:19   ` Josef Bacik
  2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

This patch allow to change the allocation_hint property writing
a numerical value in the file.

/sysfs/fs/btrfs/<UUID>/devinfo/<devid>/allocation_hint

To update this field it is added the property "allocation_hint" in
btrfs-prog too.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 59d92a385a96..c6723456c0e1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1606,7 +1606,81 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 	}
 	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");
 }
-BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
+static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
+				 struct kobj_attribute *a,
+				 const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_device *device;
+	int ret;
+	struct btrfs_trans_handle *trans;
+	int i, l;
+	u64 type, prev_type;
+
+	if (len < 1)
+		return -EINVAL;
+
+	/* remove trailing newline */
+	l = len;
+	if (buf[len-1] == '\n')
+		l--;
+
+	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
+		if (l != strlen(allocation_hint_name[i].name))
+			continue;
+
+		if (strncasecmp(allocation_hint_name[i].name, buf, l))
+			continue;
+
+		type = allocation_hint_name[i].value;
+		break;
+	}
+
+	if (i >= ARRAY_SIZE(allocation_hint_name))
+		return -EINVAL;
+
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+	fs_info = device->fs_info;
+	if (!fs_info)
+		return -EPERM;
+
+	root = fs_info->chunk_root;
+	if (sb_rdonly(fs_info->sb))
+		return -EROFS;
+
+	/* check if a change is really needed */
+	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+		return len;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+
+	ret = btrfs_update_device(trans, device);
+
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto abort;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		goto abort;
+
+	return len;
+abort:
+	device->type = prev_type;
+	return  ret;
+}
+BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
+				      btrfs_devinfo_allocation_hint_store);
+
 
 /*
  * Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5e3e13d4940b..d4ac90f5c949 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2846,7 +2846,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bd297f23d19e..93ac27d8097c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -636,5 +636,7 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device);
 
 #endif
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] btrfs: add allocation_hint mode
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
@ 2022-03-06 18:14 ` Goffredo Baroncelli
  2022-03-22 19:47   ` Josef Bacik
  2022-03-06 18:14 ` [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
  2022-03-21 20:29 ` [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
  5 siblings, 1 reply; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
in raid5, only two disks are available), then the disks tagged
BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
are also considered. If even in this case the space is not
sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the same algorithm applies swapping
_DATA_ and _METADATA_.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
so BTRFS behaves as usual.

If the user prefers to store the metadata in the faster disks (e.g. SSD),
he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
disks and the data BG in the others ones. When a disks set is filled, the
other is considered.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |   1 +
 2 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d4ac90f5c949..7b37db9bb887 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
 	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
 }
 
+#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
+		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
+
+/*
+ *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
+ *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
+ *	(for backward compatibility reason), and the other
+ *	values are greater (because the field is unsigned). So we
+ *	need a map that rearranges the order giving to _DATA_PREFERRED
+ *	an intermediate priority.
+ *	These values give to METADATA_ONLY the highest priority, and are
+ *	valid for metadata BG allocation. When a data
+ *	BG is allocated we negate these values to reverse the priority.
+ */
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
+};
+
 const char *btrfs_bg_type_to_raid_name(u64 flags)
 {
 	const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5030,13 +5051,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 }
 
 /*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
  */
 static int btrfs_cmp_device_info(const void *a, const void *b)
 {
 	const struct btrfs_device_info *di_a = a;
 	const struct btrfs_device_info *di_b = b;
 
+	if (di_a->alloc_hint > di_b->alloc_hint)
+		return -1;
+	if (di_a->alloc_hint < di_b->alloc_hint)
+		return 1;
 	if (di_a->max_avail > di_b->max_avail)
 		return -1;
 	if (di_a->max_avail < di_b->max_avail)
@@ -5199,6 +5225,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int hint;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -5251,17 +5278,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+
+		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+			/*
+			 * if mixed bg set all the alloc_hint
+			 * fields to the same value, so the sorting
+			 * is not affected
+			 */
+			devices_info[ndevs].alloc_hint = 0;
+		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_METADATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
+				continue;
+			/*
+			 * if a data chunk must be allocated,
+			 * sort also by hint (data disk
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+		} else { /* BTRFS_BLOCK_GROUP_METADATA */
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_DATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
+				continue;
+			/*
+			 * if a metadata chunk must be allocated,
+			 * sort also by hint (metadata hint
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+		}
+
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	return 0;
+}
+
+static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
+					struct btrfs_device_info *devices_info)
+{
+	int ndevs, hint, i;
+
+	ndevs = ctl->ndevs;
 	/*
-	 * now sort the devices by hole size / available space
+	 * now sort the devices by hint / hole size / available space
 	 */
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
-	return 0;
+	/*
+	 * select the minimum set of disks grouped by hint that
+	 * can host the chunk
+	 */
+	ndevs = 0;
+	while (ndevs < ctl->ndevs) {
+		hint = devices_info[ndevs++].alloc_hint;
+		while (ndevs < ctl->ndevs) {
+			if (devices_info[ndevs].alloc_hint != hint)
+				break;
+			ndevs++;
+		}
+		if (ndevs >= ctl->devs_min)
+			break;
+	}
+
+	ctl->ndevs = ndevs;
+
+	/*
+	 * the next layers require the devices_info ordered by
+	 * max_avail. If we are returning two (or more) different
+	 * group of alloc_hint, this is not always true. So sort
+	 * these again.
+	 */
+
+	for (i = 0 ; i < ndevs ; i++)
+		devices_info[i].alloc_hint = 0;
+
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+
 }
 
 static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
@@ -5513,6 +5618,8 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	sort_and_reduce_device_info(&ctl, devices_info);
+
 	ret = decide_stripe_size(fs_devices, &ctl, devices_info);
 	if (ret < 0) {
 		block_group = ERR_PTR(ret);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 93ac27d8097c..b066f9af216a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -404,6 +404,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int alloc_hint;
 };
 
 struct btrfs_raid_attr {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
@ 2022-03-06 18:14 ` Goffredo Baroncelli
  2022-03-21 20:29 ` [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
  5 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-06 18:14 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Rename the field type of dev_item from 'type' to 'flags' changing the
struct btrfs_device and btrfs_dev_item.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h                |  4 ++--
 fs/btrfs/disk-io.c              |  2 +-
 fs/btrfs/sysfs.c                | 15 ++++++++-------
 fs/btrfs/volumes.c              | 10 +++++-----
 fs/btrfs/volumes.h              |  4 ++--
 include/uapi/linux/btrfs_tree.h |  4 ++--
 6 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4db17bd05a21..afa47061a47a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1707,7 +1707,7 @@ static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
 }
 
 
-BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_FUNCS(device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
 BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
 BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
@@ -1720,7 +1720,7 @@ BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
 BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
 BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
 
-BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
 			 total_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6a0b4dbd70e9..0c2f5a98b9df 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4407,7 +4407,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 			continue;
 
 		btrfs_set_stack_device_generation(dev_item, 0);
-		btrfs_set_stack_device_type(dev_item, dev->type);
+		btrfs_set_stack_device_flags(dev_item, dev->flags);
 		btrfs_set_stack_device_id(dev_item, dev->devid);
 		btrfs_set_stack_device_total_bytes(dev_item,
 						   dev->commit_total_bytes);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c6723456c0e1..8d0581c5383d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1597,7 +1597,7 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 						   devid_kobj);
 
 	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
-		if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) !=
+		if ((device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK) !=
 		    allocation_hint_name[i].value)
 			continue;
 
@@ -1617,7 +1617,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 	int ret;
 	struct btrfs_trans_handle *trans;
 	int i, l;
-	u64 type, prev_type;
+	u64 flags, prev_flags;
 
 	if (len < 1)
 		return -EINVAL;
@@ -1634,7 +1634,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 		if (strncasecmp(allocation_hint_name[i].name, buf, l))
 			continue;
 
-		type = allocation_hint_name[i].value;
+		flags = allocation_hint_name[i].value;
 		break;
 	}
 
@@ -1651,15 +1651,16 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 		return -EROFS;
 
 	/* check if a change is really needed */
-	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+	if ((device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK) == flags)
 		return len;
 
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
-	prev_type = device->type;
-	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+	prev_flags = device->flags;
+	device->flags = (device->flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK) |
+			flags;
 
 	ret = btrfs_update_device(trans, device);
 
@@ -1675,7 +1676,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 
 	return len;
 abort:
-	device->type = prev_type;
+	device->flags = prev_flags;
 	return  ret;
 }
 BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7b37db9bb887..728e3a7582bc 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1871,7 +1871,7 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
 	btrfs_set_device_generation(leaf, dev_item, 0);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -2898,7 +2898,7 @@ noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -5288,7 +5288,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = 0;
 		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_METADATA_ONLY disks
@@ -5302,7 +5302,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
 		} else { /* BTRFS_BLOCK_GROUP_METADATA */
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_DATA_ONLY disks
@@ -7308,7 +7308,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
 	device->commit_total_bytes = device->disk_total_bytes;
 	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
 	device->commit_bytes_used = device->bytes_used;
-	device->type = btrfs_device_type(leaf, dev_item);
+	device->flags = btrfs_device_flags(leaf, dev_item);
 	device->io_align = btrfs_device_io_align(leaf, dev_item);
 	device->io_width = btrfs_device_io_width(leaf, dev_item);
 	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b066f9af216a..6230d911e7af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -101,8 +101,8 @@ struct btrfs_device {
 
 	/* optimal io width for this device */
 	u32 io_width;
-	/* type and info about this device */
-	u64 type;
+	/* device flags (e.g. allocation hint) */
+	u64 flags;
 
 	/* minimal io size for this device */
 	u32 sector_size;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index e0d842c2e616..bfe0b1a7f3a1 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -424,8 +424,8 @@ struct btrfs_dev_item {
 	/* minimal io size for this device */
 	__le32 sector_size;
 
-	/* type and info about this device */
-	__le64 type;
+	/* device flags (e.g. allocation hint) */
+	__le64 flags;
 
 	/* expected generation for this device */
 	__le64 generation;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs
  2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
@ 2022-03-08 13:25   ` kernel test robot
  2022-03-22 18:05   ` Josef Bacik
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2022-03-08 13:25 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: kbuild-all, Zygo Blaxell, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Boris Burkov,
	Goffredo Baroncelli

Hi Goffredo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on v5.17-rc7 next-20220308]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Goffredo-Baroncelli/btrfs-allocation_hint/20220307-145335
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: microblaze-randconfig-s032-20220307 (https://download.01.org/0day-ci/archive/20220308/202203082127.TtxMcqDK-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 11.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/641cd29d0792eb2e702b6c6a226fce5b4a655e20
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Goffredo-Baroncelli/btrfs-allocation_hint/20220307-145335
        git checkout 641cd29d0792eb2e702b6c6a226fce5b4a655e20
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=microblaze SHELL=/bin/bash fs/btrfs/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> fs/btrfs/sysfs.c:1583:3: sparse: sparse: symbol 'allocation_hint_name' was not declared. Should it be static?
   fs/btrfs/sysfs.c:630:9: sparse: sparse: context imbalance in 'btrfs_show_u64' - different lock contexts for basic block

vim +/allocation_hint_name +1583 fs/btrfs/sysfs.c

  1578	
  1579	
  1580	struct allocation_hint_name_t {
  1581		const char *name;
  1582		const u64 value;
> 1583	} allocation_hint_name[] = {
  1584		{ "DATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED },
  1585		{ "METADATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED },
  1586		{ "DATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY },
  1587		{ "METADATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY },
  1588	};
  1589	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5][V12] btrfs: allocation_hint
  2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2022-03-06 18:14 ` [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
@ 2022-03-21 20:29 ` Goffredo Baroncelli
  5 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-21 20:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

Gentle ping
On 06/03/2022 19.14, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Hi all,
> 
> This patches set was born after some discussion between me, Zygo and Josef.
> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
> 
> Some further information about a real use case can be found in
> https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
> 
> Recently Shafeeq told me that he is interested too, due to the performance gain.
> 
> In V8, revision I switched away from an ioctl API in favor of a sysfs API (
> see patch #2 and #3).
> 
> In V9, I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
> Moreover I renamed dev_info->type to dev_info->flags.
> 
> In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED; I added
> another devinfo property, called major_minor. For now it is unused;
> the plan is to use this in btrfs-progs to go from the block device to the
> filesystem.
> First client will be "btrfs prop get/set allocation_hint", but I see others
> possible clients.
> Finally I made some cleanup, and remove the mount option 'allocation_hint'
> 
> In V12 I removed some features introduced in V11 (like major_minor sysfs
> filed and the 'feature' file) and changed the values to write in the
> sysfs files: until V12 the values are numericals ones. Now are strings ones:
> - METADATA_ONLY
> - METADATA_PREFERRED
> - DATA_PREFERRED
> - DATA_ONLY
> 
> The idea behind this patches set, is to dedicate some disks (the fastest one)
> to the metadata chunk. My initial idea was a "soft" hint. However Zygo
> asked an option for a "strong" hint (== mandatory). The result is that
> each disk can be "tagged" by one of the following flags:
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> 
> When the chunk allocator search a disks to allocate a chunk, scans the disks
> in an order decided by these tags. For metadata, the order is:
> *_METADATA_ONLY
> *_METADATA_PREFERRED
> *_DATA_PREFERRED
> 
> The *_DATA_ONLY are not eligible from metadata chunk allocation.
> 
> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
> excluded.
> 
> The exact sort logic is to sort first for the "tag", and then for the space
> available. If there is no space available, the next "tag" disks set are
> selected.
> 
> To set these tags, a new property called "allocation_hint" was created.
> There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
> allocation_hint disk property].
> 
> $ sudo mount /dev/loop0 /mnt/test-btrfs/
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
> devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>      Device size:           2.75GiB
>      Device allocated:           1.34GiB
>      Device unallocated:           1.41GiB
>      Device missing:             0.00B
>      Used:             400.89MiB
>      Free (estimated):           1.04GiB    (min: 1.04GiB)
>      Data ratio:                  2.00
>      Metadata ratio:              1.00
>      Global reserve:           3.25MiB    (used: 0.00B)
>      Multiple profiles:                no
> 
> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>     /dev/loop0     288.00MiB
>     /dev/loop1     288.00MiB
>     /dev/loop2     127.00MiB
>     /dev/loop3     127.00MiB
>     /dev/loop4     127.00MiB
>     /dev/loop5     127.00MiB
> 
> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>     /dev/loop1     256.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>     /dev/loop0      32.00MiB
> 
> Unallocated:
>     /dev/loop0     704.00MiB
>     /dev/loop1     480.00MiB
>     /dev/loop2       1.00MiB
>     /dev/loop3       1.00MiB
>     /dev/loop4       1.00MiB
>     /dev/loop5       1.00MiB
>     /dev/loop6     128.00MiB
>     /dev/loop7     128.00MiB
> 
> # change the tag of some disks
> 
> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
> 
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
> devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>      Device size:           2.75GiB
>      Device allocated:         735.00MiB
>      Device unallocated:           2.03GiB
>      Device missing:             0.00B
>      Used:             400.72MiB
>      Free (estimated):           1.10GiB    (min: 1.10GiB)
>      Data ratio:                  2.00
>      Metadata ratio:              1.00
>      Global reserve:           3.25MiB    (used: 0.00B)
>      Multiple profiles:                no
> 
> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>     /dev/loop0     288.00MiB
>     /dev/loop1     288.00MiB
> 
> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>     /dev/loop5     127.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>     /dev/loop7      32.00MiB
> 
> Unallocated:
>     /dev/loop0     736.00MiB
>     /dev/loop1     736.00MiB
>     /dev/loop2     128.00MiB
>     /dev/loop3     128.00MiB
>     /dev/loop4     128.00MiB
>     /dev/loop5       1.00MiB
>     /dev/loop6     128.00MiB
>     /dev/loop7      96.00MiB
> 
> 
> #As you can see all the metadata were placed on the disk loop5/loop7 even if
> #the most empty one are loop0 and loop1.
> 
> Furher works:
> - the tool which show the space available should consider the tagging (eg
>    the disks tagged by _METADATA_ONLY should be excluded from the data
>    availability)
> - allow btrfs-prog to change the allocation_hint even when the filesystem
>    is not mountead.
> 
> In following emails, there will be the btrfs-progs patches set and the xfstest
> suite patches set.
> 
> Comments are welcome
> BR
> G.Baroncelli
> 
> Revision:
> V12:
> - removed the property /sys/fs/btrfs/features/allocation_hint
> - changed the value to write in <devid>/allocation hint from numericals one
>    to string one (METADATA_PREFERRED, DATA_ONLY....)
> - removed major_minor sysfs field (added in v11)
> V11:
> - added the property /sys/fs/btrfs/features/allocation_hint
> 
> V10:
> - renamed two disk tags/constants:
>          - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
>          - *_DATA_PREFERRED -> *_DATA_EFERRED
> - add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
> - revise some commit description
> - refactored the code of gather_device_info(): one portion of this code
>    is moved in a separate function called sort_and_reduce_device_info()
>    for clarity.
> - removed the 'allocation_hint' mount option
> 
> V9:
> - rename dev_item->type to dev_item->flags
> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> 
> V8:
> - drop the ioctl API, instead use a sysfs one
> 
> V7:
> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> - leave in btrfs_tree.h only the costants
> - removed the mount option (sic)
> - correct an 'use before check' in the while loop (signaled
>    by Zygo)
> - add a 2nd sort to be sure that the device_info array is in the
>    expected order
> 
> V6:
> - add further values to the hints: add the possibility to
>    exclude a disk for a chunk type
> 
> 
> 
> Goffredo Baroncelli (5):
>    btrfs: add flags to give an hint to the chunk allocator
>    btrfs: export the device allocation_hint property in sysfs
>    btrfs: change the device allocation_hint property via sysfs
>    btrfs: add allocation_hint mode
>    btrfs: rename dev_item->type to dev_item->flags
> 
>   fs/btrfs/ctree.h                |   4 +-
>   fs/btrfs/disk-io.c              |   2 +-
>   fs/btrfs/sysfs.c                | 106 ++++++++++++++++++++++++++++
>   fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
>   fs/btrfs/volumes.h              |   7 +-
>   include/uapi/linux/btrfs_tree.h |  20 +++++-
>   6 files changed, 246 insertions(+), 14 deletions(-)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
@ 2022-03-22 17:51   ` Josef Bacik
  2022-03-22 17:56   ` Josef Bacik
  2022-03-22 19:52   ` Josef Bacik
  2 siblings, 0 replies; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 17:51 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Add the following flags to give an hint about which chunk should be
> allocated in which disk:
> 
> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>   preferred for data chunk, but metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>   preferred for metadata chunk, but data chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>   only metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>   only data chunk allowed
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index b069752a8ecf..e0d842c2e616 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -389,6 +389,22 @@ struct btrfs_key {
>  	__u64 offset;
>  } __attribute__ ((__packed__));
>  
> +/* dev_item.type */
> +
> +/* btrfs chunk allocation hint */
> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> +/* btrfs chunk allocation hint mask */
> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> +/* preferred data chunk, but metadata chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> +/* preferred metadata chunk, but data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
> +/* only metadata chunk are allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
> +/* only data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
> +

Just a style thing here, format it to match everything else, and have the bit
stuff later, as well as newlines between everything since we're commenting.
Also we don't need the () around the values, generally it's best to follow the
style of the surrounding code when in doubt.

I know Dave probably has stronger opinions on this than I do, but I think
something like this would be better.

/*
 * btrfs_dev_item.type values.
 *
 * The _PREFERRED options indicate we prefer to allocate the respective type on
 * this device, but will fall back to allocating any chunk type if we run out of
 * space.  The _ONLY options indicate we will only allocate the given type from
 * this device and no other types.
 */
#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	0ULL
#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	1ULL
....

#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
#define BTRFS_DEV_ALLOCATION_HINT_MASK		\
	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)

This will make it a bit clearer and easier to read.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
  2022-03-22 17:51   ` Josef Bacik
@ 2022-03-22 17:56   ` Josef Bacik
  2022-03-22 18:49     ` Goffredo Baroncelli
  2022-03-23  3:26     ` Zygo Blaxell
  2022-03-22 19:52   ` Josef Bacik
  2 siblings, 2 replies; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 17:56 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Add the following flags to give an hint about which chunk should be
> allocated in which disk:
> 
> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>   preferred for data chunk, but metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>   preferred for metadata chunk, but data chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>   only metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>   only data chunk allowed
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index b069752a8ecf..e0d842c2e616 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -389,6 +389,22 @@ struct btrfs_key {
>  	__u64 offset;
>  } __attribute__ ((__packed__));
>  
> +/* dev_item.type */
> +
> +/* btrfs chunk allocation hint */
> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> +/* btrfs chunk allocation hint mask */
> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> +/* preferred data chunk, but metadata chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)

Actually don't we have 0 set on type already?  So this will make all existing
file systems have DATA_PREFERRED, when in reality they may not want that
behavior.  So should we start at 1?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs
  2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
  2022-03-08 13:25   ` kernel test robot
@ 2022-03-22 18:05   ` Josef Bacik
  2022-03-22 19:02     ` Goffredo Baroncelli
  1 sibling, 1 reply; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 18:05 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:40PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Export the device allocation_hint property via
> /sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/sysfs.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 17389a42a3ab..59d92a385a96 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1578,6 +1578,36 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
>  }
>  BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
>  
> +
> +struct allocation_hint_name_t {
> +	const char *name;
> +	const u64 value;
> +} allocation_hint_name[] = {
> +	{ "DATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED },
> +	{ "METADATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED },
> +	{ "DATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY },
> +	{ "METADATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY },
> +};
> +

The ktest robot complained about this, but also we don't need to be this clever,
also it looks better to have the flags first with standard tabbing.

struct allocation_hint_name {
	const u64 val;
	const char *name;
};

static struct allocation_hint_name allocation_hints[] = {
	{ BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED,	"DATA_PREFERRED"	},
	{ BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED,	"METADATA_PREFERRED"	},
	...
};


> +static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
> +					struct kobj_attribute *a, char *buf)
> +{
> +	int i;
> +	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
> +						   devid_kobj);
> +
> +	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
> +		if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) !=
> +		    allocation_hint_name[i].value)
> +			continue;
> +
> +		return scnprintf(buf, PAGE_SIZE, "%s\n",
> +			allocation_hint_name[i].name);
> +	}
> +	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");

Since we have fs'es that won't use this you should just spit out "NONE".
Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 17:56   ` Josef Bacik
@ 2022-03-22 18:49     ` Goffredo Baroncelli
  2022-04-06 19:32       ` Goffredo Baroncelli
  2022-03-23  3:26     ` Zygo Blaxell
  1 sibling, 1 reply; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-22 18:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On 22/03/2022 18.56, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Add the following flags to give an hint about which chunk should be
>> allocated in which disk:
>>
>> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>>    preferred for data chunk, but metadata chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>>    preferred for metadata chunk, but data chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>>    only metadata chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>>    only data chunk allowed
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index b069752a8ecf..e0d842c2e616 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -389,6 +389,22 @@ struct btrfs_key {
>>   	__u64 offset;
>>   } __attribute__ ((__packed__));
>>   
>> +/* dev_item.type */
>> +
>> +/* btrfs chunk allocation hint */
>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
>> +/* btrfs chunk allocation hint mask */
>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
>> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
>> +/* preferred data chunk, but metadata chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> 
> Actually don't we have 0 set on type already?  So this will make all existing
> file systems have DATA_PREFERRED, when in reality they may not want that
> behavior.  So should we start at 1?  Thanks,

Yes, the default is 0 (now DATA_PREFERRED).
If we have all the disks set to DATA_PREFERRED (or METADATA_PREFERRED), all the disks
have the same priority so the chunks allocator works as usual.

If we have DATA_PREFERRED=1, it is not clear to me which would be the expected behavior when the allocator has to choice one of the followings disks:
- TYPE=0
- DATA_PREFERRED
- DATA_ONLY

It should ignore TYPE=0 ? or it should block the 'allocation_hint' policy ? Or TYPE=0 should have the lowest (or highest) priority ?

DATA_PREFERRED to me seems a safe default, because at worst we have only missing performance penalty (i.e. it is a faster disk where we should put METADATA)

BR

> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs
  2022-03-22 18:05   ` Josef Bacik
@ 2022-03-22 19:02     ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-22 19:02 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On 22/03/2022 19.05, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:40PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Export the device allocation_hint property via
>> /sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/sysfs.c | 31 +++++++++++++++++++++++++++++++
>>   1 file changed, 31 insertions(+)
>>
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 17389a42a3ab..59d92a385a96 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -1578,6 +1578,36 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
>>   }
>>   BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
>>   
>> +
>> +struct allocation_hint_name_t {
>> +	const char *name;
>> +	const u64 value;
>> +} allocation_hint_name[] = {
>> +	{ "DATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED },
>> +	{ "METADATA_PREFERRED", BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED },
>> +	{ "DATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY },
>> +	{ "METADATA_ONLY", BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY },
>> +};
>> +
> 
> The ktest robot complained about this, but also we don't need to be this clever,
> also it looks better to have the flags first with standard tabbing.
> 
Yes, ktest is more fine than gcc. I didn't tough about the fact that outside sysfs.c
allocation_hints[] is not used. If in the future it will be used, we can remove "static".

> struct allocation_hint_name {
> 	const u64 val;
> 	const char *name;
> };
> 
> static struct allocation_hint_name allocation_hints[] = {
> 	{ BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED,	"DATA_PREFERRED"	},
> 	{ BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED,	"METADATA_PREFERRED"	},
> 	...
> };
> 
Ok

> 
>> +static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
>> +					struct kobj_attribute *a, char *buf)
>> +{
>> +	int i;
>> +	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
>> +						   devid_kobj);
>> +
>> +	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
>> +		if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) !=
>> +		    allocation_hint_name[i].value)
>> +			continue;
>> +
>> +		return scnprintf(buf, PAGE_SIZE, "%s\n",
>> +			allocation_hint_name[i].name);
>> +	}
>> +	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");
> 
> Since we have fs'es that won't use this you should just spit out "NONE".
> Thanks,
> 
To me NONE means -> I dont' use it
Unknown means -> it is not a understood value

Because BTRFS_DEV_ALLOCATION_HINT_MASK is 2 bits, and we have 4 possible values now it is impossible to get UNKNOWN.
If future if we increase the allocation-hint bits, and we don't map all the possible combination, it is possible that in the disk it is stored an UNKNOWN value. But it may be VALID for a more newer kernel.


> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs
  2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
@ 2022-03-22 19:19   ` Josef Bacik
  2022-03-22 19:52     ` Goffredo Baroncelli
  0 siblings, 1 reply; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 19:19 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:41PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> This patch allow to change the allocation_hint property writing
> a numerical value in the file.
> 
> /sysfs/fs/btrfs/<UUID>/devinfo/<devid>/allocation_hint
> 
> To update this field it is added the property "allocation_hint" in
> btrfs-prog too.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/sysfs.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/volumes.c |  2 +-
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 78 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 59d92a385a96..c6723456c0e1 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1606,7 +1606,81 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
>  	}
>  	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");
>  }
> -BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
> +
> +static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
> +				 struct kobj_attribute *a,
> +				 const char *buf, size_t len)

If you're using vim, use

set cindent
set cino=(0

This will give you the proper formatting we expect, this should be something
like

statice ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
						    struct kobj_attribute *a,
						    const char *buf, size_t len)


> +{
> +	struct btrfs_fs_info *fs_info;
> +	struct btrfs_root *root;
> +	struct btrfs_device *device;
> +	int ret;
> +	struct btrfs_trans_handle *trans;
> +	int i, l;
> +	u64 type, prev_type;
> +
> +	if (len < 1)
> +		return -EINVAL;
> +
> +	/* remove trailing newline */
> +	l = len;
> +	if (buf[len-1] == '\n')
> +		l--;
> +

This is unnecessary, because lower you can do...

> +	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
> +		if (l != strlen(allocation_hint_name[i].name))
> +			continue;
> +
> +		if (strncasecmp(allocation_hint_name[i].name, buf, l))
> +			continue;
> +

strmatch(buf, allocation_hint_name[i].name)

I would make a separate patch to change strmatch to do strncasecmp instead, but
then you can just use that helper.

> +		type = allocation_hint_name[i].value;
> +		break;
> +	}
> +
> +	if (i >= ARRAY_SIZE(allocation_hint_name))
> +		return -EINVAL;
> +
> +	device = container_of(kobj, struct btrfs_device, devid_kobj);
> +	fs_info = device->fs_info;
> +	if (!fs_info)
> +		return -EPERM;
> +
> +	root = fs_info->chunk_root;
> +	if (sb_rdonly(fs_info->sb))
> +		return -EROFS;
> +
> +	/* check if a change is really needed */
> +	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
> +		return len;
> +
> +	trans = btrfs_start_transaction(root, 1);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	prev_type = device->type;
> +	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
> +
> +	ret = btrfs_update_device(trans, device);
> +
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, ret);
> +		btrfs_end_transaction(trans);
> +		goto abort;
> +	}
> +
> +	ret = btrfs_commit_transaction(trans);
> +	if (ret < 0)
> +		goto abort;
> +
> +	return len;
> +abort:
> +	device->type = prev_type;
> +	return  ret;

Extra whitespace here.

> +}
> +BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
> +				      btrfs_devinfo_allocation_hint_store);
> +
>  
>  /*
>   * Information about one device.
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5e3e13d4940b..d4ac90f5c949 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2846,7 +2846,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	return ret;
>  }
>  
> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
> +noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>  					struct btrfs_device *device)
>  {
>  	int ret;

You can drop the noinline thing here as well, and make sure to fix the
indentation.

> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index bd297f23d19e..93ac27d8097c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -636,5 +636,7 @@ int btrfs_bg_type_to_factor(u64 flags);
>  const char *btrfs_bg_type_to_raid_name(u64 flags);
>  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
>  bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> +			struct btrfs_device *device);

Make the indentation correct.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] btrfs: add allocation_hint mode
  2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
@ 2022-03-22 19:47   ` Josef Bacik
  2022-03-22 20:56     ` Goffredo Baroncelli
  0 siblings, 1 reply; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 19:47 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:42PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> The chunk allocation policy is modified as follow.
> 
> Each disk may have one of the following tags:
> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> 
> During a *mixed data/metadata* chunk allocation, BTRFS works as
> usual.
> 
> During a *data* chunk allocation, the space are searched first in
> BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
> in raid5, only two disks are available), then the disks tagged
> BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
> enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> are also considered. If even in this case the space is not
> sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
> for a data BG allocation.
> 
> During a *metadata* chunk allocation, the same algorithm applies swapping
> _DATA_ and _METADATA_.
> 
> By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
> so BTRFS behaves as usual.
> 
> If the user prefers to store the metadata in the faster disks (e.g. SSD),
> he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
> case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> disks and the data BG in the others ones. When a disks set is filled, the
> other is considered.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |   1 +
>  2 files changed, 111 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d4ac90f5c949..7b37db9bb887 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>  	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>  }
>  
> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
> +
> +/*
> + *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
> + *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
> + *	(for backward compatibility reason), and the other
> + *	values are greater (because the field is unsigned). So we
> + *	need a map that rearranges the order giving to _DATA_PREFERRED
> + *	an intermediate priority.
> + *	These values give to METADATA_ONLY the highest priority, and are
> + *	valid for metadata BG allocation. When a data
> + *	BG is allocated we negate these values to reverse the priority.
> + */
> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
> +};

This should be const int, not const char.  Also the formatting for the comment
is awkward, it's 1 space between the * and the word, so

/*
 * The order of ....
 *
 * These values give METADATA_ONLY the highest priority...
 */

Also the -1 thing is weird and unclear.  In fact I think it's problematic, I'll
explain below.

> +
>  const char *btrfs_bg_type_to_raid_name(u64 flags)
>  {
>  	const int index = btrfs_bg_flags_to_raid_index(flags);
> @@ -5030,13 +5051,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  }
>  
>  /*
> - * sort the devices in descending order by max_avail, total_avail
> + * sort the devices in descending order by alloc_hint,
> + * max_avail, total_avail
>   */
>  static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
>  	const struct btrfs_device_info *di_a = a;
>  	const struct btrfs_device_info *di_b = b;
>  
> +	if (di_a->alloc_hint > di_b->alloc_hint)
> +		return -1;
> +	if (di_a->alloc_hint < di_b->alloc_hint)
> +		return 1;

This is making things awkward, instead I think we change this to sort_r, which
uses cmp(a, b, priv).  You pass in priv which is the type we want, DATA or
METADATA or whatever.  Then you can do

if (priv == data) {
	/* do the sorting so DATA_ONLY is on top, then DATA_PREFERRED, etc. */
} else {
	/* do the METADATA thing instead. */
}

>  	if (di_a->max_avail > di_b->max_avail)
>  		return -1;
>  	if (di_a->max_avail < di_b->max_avail)
> @@ -5199,6 +5225,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int hint;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -5251,17 +5278,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +
> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> +			/*
> +			 * if mixed bg set all the alloc_hint
> +			 * fields to the same value, so the sorting
> +			 * is not affected
> +			 */
> +			devices_info[ndevs].alloc_hint = 0;
> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_METADATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,
> +			 * sort also by hint (data disk
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_DATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
> +				continue;
> +			/*
> +			 * if a metadata chunk must be allocated,
> +			 * sort also by hint (metadata hint
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> +		}
> +

Shouldn't we be doing this before adding the device to devices_info?  That way
for _ONLY we just don't even add the disk to the devices_info.

>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	return 0;
> +}
> +
> +static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
> +					struct btrfs_device_info *devices_info)
> +{
> +	int ndevs, hint, i;
> +
> +	ndevs = ctl->ndevs;
>  	/*
> -	 * now sort the devices by hole size / available space
> +	 * now sort the devices by hint / hole size / available space
>  	 */
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> -	return 0;
> +	/*
> +	 * select the minimum set of disks grouped by hint that
> +	 * can host the chunk
> +	 */
> +	ndevs = 0;
> +	while (ndevs < ctl->ndevs) {
> +		hint = devices_info[ndevs++].alloc_hint;
> +		while (ndevs < ctl->ndevs) {
> +			if (devices_info[ndevs].alloc_hint != hint)
> +				break;
> +			ndevs++;
> +		}
> +		if (ndevs >= ctl->devs_min)
> +			break;
> +	}
> +
> +	ctl->ndevs = ndevs;
> +
> +	/*
> +	 * the next layers require the devices_info ordered by
> +	 * max_avail. If we are returning two (or more) different
> +	 * group of alloc_hint, this is not always true. So sort
> +	 * these again.
> +	 */
> +
> +	for (i = 0 ; i < ndevs ; i++)
> +		devices_info[i].alloc_hint = 0;
> +
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);

With my sort_r suggestion I think we no longer need the second sort.  It'll get
you the devices you want in order of most preferred alloc hint, and with the
max_avail.  I'd double check with printk's, but you should be able to drop all
this.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
  2022-03-22 17:51   ` Josef Bacik
  2022-03-22 17:56   ` Josef Bacik
@ 2022-03-22 19:52   ` Josef Bacik
  2022-03-22 20:25     ` Goffredo Baroncelli
  2 siblings, 1 reply; 25+ messages in thread
From: Josef Bacik @ 2022-03-22 19:52 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Add the following flags to give an hint about which chunk should be
> allocated in which disk:
> 
> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>   preferred for data chunk, but metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>   preferred for metadata chunk, but data chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>   only metadata chunk allowed
> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>   only data chunk allowed
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index b069752a8ecf..e0d842c2e616 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -389,6 +389,22 @@ struct btrfs_key {
>  	__u64 offset;
>  } __attribute__ ((__packed__));
>  
> +/* dev_item.type */
> +
> +/* btrfs chunk allocation hint */
> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> +/* btrfs chunk allocation hint mask */
> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> +/* preferred data chunk, but metadata chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> +/* preferred metadata chunk, but data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
> +/* only metadata chunk are allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
> +/* only data chunk allowed */
> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
> +

I also just realized you're using these as flags, so they need to be

(1ULL << 0)
(1ULL << 1)
(1ULL << 2)
(1ULL << 3)

Thanks,

Josef

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs
  2022-03-22 19:19   ` Josef Bacik
@ 2022-03-22 19:52     ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-22 19:52 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On 22/03/2022 20.19, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:41PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> This patch allow to change the allocation_hint property writing
>> a numerical value in the file.
>>
>> /sysfs/fs/btrfs/<UUID>/devinfo/<devid>/allocation_hint
>>
>> To update this field it is added the property "allocation_hint" in
>> btrfs-prog too.
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/sysfs.c   | 76 +++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.c |  2 +-
>>   fs/btrfs/volumes.h |  2 ++
>>   3 files changed, 78 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 59d92a385a96..c6723456c0e1 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -1606,7 +1606,81 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
>>   	}
>>   	return scnprintf(buf, PAGE_SIZE, "<UNKNOWN>\n");
>>   }
>> -BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
>> +
>> +static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
>> +				 struct kobj_attribute *a,
>> +				 const char *buf, size_t len)
> 
> If you're using vim, use
> 
> set cindent
> set cino=(0
> 
> This will give you the proper formatting we expect, this should be something
> like
> 
> statice ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
> 						    struct kobj_attribute *a,
> 						    const char *buf, size_t len)
> 
> 
>> +{
>> +	struct btrfs_fs_info *fs_info;
>> +	struct btrfs_root *root;
>> +	struct btrfs_device *device;
>> +	int ret;
>> +	struct btrfs_trans_handle *trans;
>> +	int i, l;
>> +	u64 type, prev_type;
>> +
>> +	if (len < 1)
>> +		return -EINVAL;
>> +
>> +	/* remove trailing newline */
>> +	l = len;
>> +	if (buf[len-1] == '\n')
>> +		l--;
>> +
> 
> This is unnecessary, because lower you can do...
> 
>> +	for (i = 0 ; i < ARRAY_SIZE(allocation_hint_name) ; i++) {
>> +		if (l != strlen(allocation_hint_name[i].name))
>> +			continue;
>> +
>> +		if (strncasecmp(allocation_hint_name[i].name, buf, l))
>> +			continue;
>> +
> 
> strmatch(buf, allocation_hint_name[i].name)
> 
> I would make a separate patch to change strmatch to do strncasecmp instead, but
> then you can just use that helper.

Looking at the code of strmatch()

   /*
    * Look for an exact string @string in @buffer with possible leading or
    * trailing whitespace
    */
   static bool strmatch(const char *buffer, const char *string)
   {
           const size_t len = strlen(string);
   
           /* Skip leading whitespace */
           buffer = skip_spaces(buffer);
   
           /* Match entire string, check if the rest is whitespace or empty */
           if (strncmp(string, buffer, len) == 0 &&
               strlen(skip_spaces(buffer + len)) == 0)
                   return true;
   
           return false;
   }

Is buf always zero terminated ?


> 
>> +		type = allocation_hint_name[i].value;
>> +		break;
>> +	}
>> +
>> +	if (i >= ARRAY_SIZE(allocation_hint_name))
>> +		return -EINVAL;
>> +
>> +	device = container_of(kobj, struct btrfs_device, devid_kobj);
>> +	fs_info = device->fs_info;
>> +	if (!fs_info)
>> +		return -EPERM;
>> +
>> +	root = fs_info->chunk_root;
>> +	if (sb_rdonly(fs_info->sb))
>> +		return -EROFS;
>> +
>> +	/* check if a change is really needed */
>> +	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
>> +		return len;
>> +
>> +	trans = btrfs_start_transaction(root, 1);
>> +	if (IS_ERR(trans))
>> +		return PTR_ERR(trans);
>> +
>> +	prev_type = device->type;
>> +	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
>> +
>> +	ret = btrfs_update_device(trans, device);
>> +
>> +	if (ret < 0) {
>> +		btrfs_abort_transaction(trans, ret);
>> +		btrfs_end_transaction(trans);
>> +		goto abort;
>> +	}
>> +
>> +	ret = btrfs_commit_transaction(trans);
>> +	if (ret < 0)
>> +		goto abort;
>> +
>> +	return len;
>> +abort:
>> +	device->type = prev_type;
>> +	return  ret;
> 
> Extra whitespace here.
Ok
> 
>> +}
>> +BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
>> +				      btrfs_devinfo_allocation_hint_store);
>> +
>>   
>>   /*
>>    * Information about one device.
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 5e3e13d4940b..d4ac90f5c949 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -2846,7 +2846,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>>   	return ret;
>>   }
>>   
>> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>> +noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
>>   					struct btrfs_device *device)
>>   {
>>   	int ret;
> 
> You can drop the noinline thing here as well, and make sure to fix the
> indentation.
Ok
> 
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index bd297f23d19e..93ac27d8097c 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -636,5 +636,7 @@ int btrfs_bg_type_to_factor(u64 flags);
>>   const char *btrfs_bg_type_to_raid_name(u64 flags);
>>   int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
>>   bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
>> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>> +			struct btrfs_device *device);
> 
> Make the indentation correct.  Thanks,

It is correct. Are the '+' and the tab that misaligned the patch.

> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 19:52   ` Josef Bacik
@ 2022-03-22 20:25     ` Goffredo Baroncelli
  2022-03-23  2:56       ` Zygo Blaxell
  2022-03-25 14:59       ` Josef Bacik
  0 siblings, 2 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-22 20:25 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On 22/03/2022 20.52, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Add the following flags to give an hint about which chunk should be
>> allocated in which disk:
>>
>> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>>    preferred for data chunk, but metadata chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>>    preferred for metadata chunk, but data chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>>    only metadata chunk allowed
>> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>>    only data chunk allowed
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>> index b069752a8ecf..e0d842c2e616 100644
>> --- a/include/uapi/linux/btrfs_tree.h
>> +++ b/include/uapi/linux/btrfs_tree.h
>> @@ -389,6 +389,22 @@ struct btrfs_key {
>>   	__u64 offset;
>>   } __attribute__ ((__packed__));
>>   
>> +/* dev_item.type */
>> +
>> +/* btrfs chunk allocation hint */
>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
>> +/* btrfs chunk allocation hint mask */
>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
>> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
>> +/* preferred data chunk, but metadata chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
>> +/* preferred metadata chunk, but data chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
>> +/* only metadata chunk are allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
>> +/* only data chunk allowed */
>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
>> +
> 
> I also just realized you're using these as flags, so they need to be
> 
> (1ULL << 0)
> (1ULL << 1)
> (1ULL << 2)
> (1ULL << 3)
> 

Could you elaborate a bit ? These are mutual exclusive values...

> Thanks,
> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] btrfs: add allocation_hint mode
  2022-03-22 19:47   ` Josef Bacik
@ 2022-03-22 20:56     ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-22 20:56 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 22/03/2022 20.47, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:42PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> The chunk allocation policy is modified as follow.
>>
>> Each disk may have one of the following tags:
>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
>>
>> During a *mixed data/metadata* chunk allocation, BTRFS works as
>> usual.
>>
>> During a *data* chunk allocation, the space are searched first in
>> BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
>> in raid5, only two disks are available), then the disks tagged
>> BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
>> enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> are also considered. If even in this case the space is not
>> sufficient, -ENOSPC is raised.
>> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
>> for a data BG allocation.
>>
>> During a *metadata* chunk allocation, the same algorithm applies swapping
>> _DATA_ and _METADATA_.
>>
>> By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
>> so BTRFS behaves as usual.
>>
>> If the user prefers to store the metadata in the faster disks (e.g. SSD),
>> he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
>> case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> disks and the data BG in the others ones. When a disks set is filled, the
>> other is considered.
>>
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
>>   fs/btrfs/volumes.h |   1 +
>>   2 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d4ac90f5c949..7b37db9bb887 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
>>   	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
>>   }
>>   
>> +#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
>> +		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
>> +
>> +/*
>> + *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
>> + *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
>> + *	(for backward compatibility reason), and the other
>> + *	values are greater (because the field is unsigned). So we
>> + *	need a map that rearranges the order giving to _DATA_PREFERRED
>> + *	an intermediate priority.
>> + *	These values give to METADATA_ONLY the highest priority, and are
>> + *	valid for metadata BG allocation. When a data
>> + *	BG is allocated we negate these values to reverse the priority.
>> + */
>> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
>> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
>> +	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
>> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
>> +	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
>> +};
> 
> This should be const int, not const char.  Also the formatting for the comment
> is awkward, it's 1 space between the * and the word, so
> 
Ok

> /*
>   * The order of ....
>   *
>   * These values give METADATA_ONLY the highest priority...
>   */
> 
> Also the -1 thing is weird and unclear.  In fact I think it's problematic, I'll
> explain below.

To me it was more natural to have BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED -> 0.
However we can rearrange this as we want, like below.

+static const int alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = 0,
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 1,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 2,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 3,
+};

I don't see any problem

> 
>> +
>>   const char *btrfs_bg_type_to_raid_name(u64 flags)
>>   {
>>   	const int index = btrfs_bg_flags_to_raid_index(flags);
>> @@ -5030,13 +5051,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   }
>>   
>>   /*
>> - * sort the devices in descending order by max_avail, total_avail
>> + * sort the devices in descending order by alloc_hint,
>> + * max_avail, total_avail
>>    */
>>   static int btrfs_cmp_device_info(const void *a, const void *b)
>>   {
>>   	const struct btrfs_device_info *di_a = a;
>>   	const struct btrfs_device_info *di_b = b;
>>   
>> +	if (di_a->alloc_hint > di_b->alloc_hint)
>> +		return -1;
>> +	if (di_a->alloc_hint < di_b->alloc_hint)
>> +		return 1;
> 
> This is making things awkward, instead I think we change this to sort_r, which
> uses cmp(a, b, priv).  You pass in priv which is the type we want, DATA or
> METADATA or whatever.  Then you can do

To me it seemed a natural extension of the previous code.

What I dislike is the use (see below) of alloc_hint_map[]:

	hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
	devices_info[ndevs].alloc_hint = -alloc_hint_map[hint]; /* for data */

and
	hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
	devices_info[ndevs].alloc_hint = +alloc_hint_map[hint]; /* for metadata */

However this is necessary because the values of device->type & BTRFS_DEV_ALLOCATION_HINT_MASK
are not in the same order of priority allocation. And this is due to the fact that
device->type & BTRFS_DEV_ALLOCATION_HINT_MASK == 0 means DATA_PREFERRED

Let me to summarize:

device->type & 					    alloc_hint_map[hint]
BTRFS_DEV_ALLOCATION_HINT_MASK    meaning           (aka priority)
-------------------------------  ------------------ --------------------
0                                DATA_PREFERRED		 0
1				 METADATA_PREFERRED      1
2				 METADATA_ONLY           2
3                                DATA_ONLY              -1


If you sort the table above by priority, you have the hints in the correct order


device->type & 					    alloc_hint_map[hint]
BTRFS_DEV_ALLOCATION_HINT_MASK    meaning           (aka priority)
-------------------------------  ------------------ --------------------
3                                DATA_ONLY              -1
0                                DATA_PREFERRED		 0
1				 METADATA_PREFERRED      1
2				 METADATA_ONLY           2


To understand the tables above, I have to point out:
- device->type & BTRFS_DEV_ALLOCATION_HINT_MASK is zero by default
- when all the disks are in a default state, we don't want that the
   allocation_hint policy creates mess
- this means that the default state has to be DATA_PREFERRED (or METADATA_PREFERRED),
   otherwise we cannot allocate METADATA (or DATA)
- for now we have only four values. But I am expect to have more in the future (e.g.
   different priorities because different disks speeds: hdd vs sdd vs nvme). So
   I preferred to avoid more complex scheme like: first bit=data|medata,
   second bit=only|preferred. Instead having an array that maps the values stored to disk
   to a priority give us the maximum of flexibility.

> 
> if (priv == data) {
> 	/* do the sorting so DATA_ONLY is on top, then DATA_PREFERRED, etc. */
> } else {
> 	/* do the METADATA thing instead. */
> }
> 

Yes, in the past I thought about something similar.

However, pay attention that you need to ignore the DATA_ONLY when you need to allocate METADATA.
I don't see a simple way to do that in btrfs_cmp_device_info().
So in any case you need some "if(s)" in gather_device_info() to skip the unwanted devices.
And because of that, it is simpler to adjust the field "devices_info[ndevs].alloc_hint"
to the sorting that we need (data, metadata o nothing) in these "if(s)"

Moreover I liked the simplicity of btrfs_cmp_device_info(), so I preferred to don't complicate it.

>>   	if (di_a->max_avail > di_b->max_avail)
>>   		return -1;
>>   	if (di_a->max_avail < di_b->max_avail)
>> @@ -5199,6 +5225,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int hint;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -5251,17 +5278,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +
>> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
>> +			/*
>> +			 * if mixed bg set all the alloc_hint
>> +			 * fields to the same value, so the sorting
>> +			 * is not affected
>> +			 */
>> +			devices_info[ndevs].alloc_hint = 0;
>> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_METADATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (data disk
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_DATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a metadata chunk must be allocated,
>> +			 * sort also by hint (metadata hint
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
>> +		}
>> +
> 
> Shouldn't we be doing this before adding the device to devices_info?  That way
> for _ONLY we just don't even add the disk to the devices_info.

Yes, it can be safely moved before
> 
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	return 0;
>> +}
>> +
>> +static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
>> +					struct btrfs_device_info *devices_info)
>> +{
>> +	int ndevs, hint, i;
>> +
>> +	ndevs = ctl->ndevs;
>>   	/*
>> -	 * now sort the devices by hole size / available space
>> +	 * now sort the devices by hint / hole size / available space
>>   	 */
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> -	return 0;
>> +	/*
>> +	 * select the minimum set of disks grouped by hint that
>> +	 * can host the chunk
>> +	 */
>> +	ndevs = 0;
>> +	while (ndevs < ctl->ndevs) {
>> +		hint = devices_info[ndevs++].alloc_hint;
>> +		while (ndevs < ctl->ndevs) {
>> +			if (devices_info[ndevs].alloc_hint != hint)
>> +				break;
>> +			ndevs++;
>> +		}
>> +		if (ndevs >= ctl->devs_min)
>> +			break;
>> +	}
>> +
>> +	ctl->ndevs = ndevs;
>> +
>> +	/*
>> +	 * the next layers require the devices_info ordered by
>> +	 * max_avail. If we are returning two (or more) different
>> +	 * group of alloc_hint, this is not always true. So sort
>> +	 * these again.
>> +	 */
>> +
>> +	for (i = 0 ; i < ndevs ; i++)
>> +		devices_info[i].alloc_hint = 0;
>> +
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
> 
> With my sort_r suggestion I think we no longer need the second sort.  It'll get
> you the devices you want in order of most preferred alloc hint, and with the
> max_avail.  I'd double check with printk's, but you should be able to drop all
> this.  Thanks,

It is not so simple. Assuming that you have the following situation
(after the first sort):

sda: DATA_PREFERRED		free=1GB
sdb: DATA_PREFERRED		free=0GB
sdc: METADATA_PREFERRED		free=2GB
sdd: METADATA_PREFERRED		free=1GB

And you need to allocate 1 DATA chunk in RAID1, you need to use the
following combination:

sda: DATA_PREFERRED		free=1GB
sdc: METADATA_PREFERRED		free=2GB

However, as the comment report, you need to pass the array sorted by free avail in
the next layers; this is the reason why you need a 2nd sort.



> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 20:25     ` Goffredo Baroncelli
@ 2022-03-23  2:56       ` Zygo Blaxell
  2022-03-24 19:05         ` Goffredo Baroncelli
  2022-03-25 14:59       ` Josef Bacik
  1 sibling, 1 reply; 25+ messages in thread
From: Zygo Blaxell @ 2022-03-23  2:56 UTC (permalink / raw)
  To: kreijack
  Cc: Josef Bacik, linux-btrfs, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Tue, Mar 22, 2022 at 09:25:45PM +0100, Goffredo Baroncelli wrote:
> On 22/03/2022 20.52, Josef Bacik wrote:
> > On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > 
> > > Add the following flags to give an hint about which chunk should be
> > > allocated in which disk:
> > > 
> > > - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
> > >    preferred for data chunk, but metadata chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
> > >    preferred for metadata chunk, but data chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
> > >    only metadata chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
> > >    only data chunk allowed
> > > 
> > > Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> > > ---
> > >   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
> > >   1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > > index b069752a8ecf..e0d842c2e616 100644
> > > --- a/include/uapi/linux/btrfs_tree.h
> > > +++ b/include/uapi/linux/btrfs_tree.h
> > > @@ -389,6 +389,22 @@ struct btrfs_key {
> > >   	__u64 offset;
> > >   } __attribute__ ((__packed__));
> > > +/* dev_item.type */
> > > +
> > > +/* btrfs chunk allocation hint */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> > > +/* btrfs chunk allocation hint mask */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> > > +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> > > +/* preferred data chunk, but metadata chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> > > +/* preferred metadata chunk, but data chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
> > > +/* only metadata chunk are allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
> > > +/* only data chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
> > > +
> > 
> > I also just realized you're using these as flags, so they need to be
> > 
> > (1ULL << 0)
> > (1ULL << 1)
> > (1ULL << 2)
> > (1ULL << 3)
> > 
> 
> Could you elaborate a bit ? These are mutual exclusive values...

One of the comments I had on earlier versions was that these should
be bit values.  Bit 0 is data (0) or metadata (1), bit 1 is preferred
(0) or only (2).  Thus "metadata only" is 3, "data preferred" is 0,
"data only" is 2, and "metadata preferred" is 1.  This maintained on-disk
compatibility with the earliest versions that only had the two "preferred"
options encoded as 0 and 1.

At some point this got lost.  Between one of the patch versions and
another I had to change the type numbers on all of my devices.

> > Thanks,
> > 
> > Josef
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 17:56   ` Josef Bacik
  2022-03-22 18:49     ` Goffredo Baroncelli
@ 2022-03-23  3:26     ` Zygo Blaxell
  1 sibling, 0 replies; 25+ messages in thread
From: Zygo Blaxell @ 2022-03-23  3:26 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Goffredo Baroncelli, linux-btrfs, David Sterba,
	Sinnamohideen Shafeeq, Paul Jones, Boris Burkov,
	Goffredo Baroncelli

On Tue, Mar 22, 2022 at 01:56:14PM -0400, Josef Bacik wrote:
> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> > From: Goffredo Baroncelli <kreijack@inwind.it>
> > 
> > Add the following flags to give an hint about which chunk should be
> > allocated in which disk:
> > 
> > - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
> >   preferred for data chunk, but metadata chunk allowed
> > - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
> >   preferred for metadata chunk, but data chunk allowed
> > - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
> >   only metadata chunk allowed
> > - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
> >   only data chunk allowed
> > 
> > Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> > ---
> >  include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > index b069752a8ecf..e0d842c2e616 100644
> > --- a/include/uapi/linux/btrfs_tree.h
> > +++ b/include/uapi/linux/btrfs_tree.h
> > @@ -389,6 +389,22 @@ struct btrfs_key {
> >  	__u64 offset;
> >  } __attribute__ ((__packed__));
> >  
> > +/* dev_item.type */
> > +
> > +/* btrfs chunk allocation hint */
> > +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> > +/* btrfs chunk allocation hint mask */
> > +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> > +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> > +/* preferred data chunk, but metadata chunk allowed */
> > +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> 
> Actually don't we have 0 set on type already?  So this will make all existing
> file systems have DATA_PREFERRED, when in reality they may not want that
> behavior.  So should we start at 1?  Thanks,

If all the disks in the filesystem have the same value, and it's one
of the two _PREFERRED values, then there is no change in behavior from
before:  all devices will be filled with no (difference in) preferences.
So if the default is either DATA_PREFERRED or METADATA_PREFERRED, then
nothing changes for existing filesystems until at least one device is
set to some other type.

Another problem is, what happens if we add a new device to a filesystem
where some devices have _ONLY set?  In that case, the new device has
type 0, and might get either data or metadata allocated on it that we
don't want.

Four possible solutions to choose from:

1.  Leave things as they are, sysadmins will have to set the type on
new devices immediately after add, and possibly balance to fix bad
allocations if they lose the race.

2.  Make device type a parameter to the add-device ioctl, so that
drives can be added with a non-default type already set.

3.  Define "0" as "get the preference value from a mount option, a
sysfs variable, or some on-disk item?" which is another way to do
#2 that doesn't mean having to change an existing ioctl's parameters
or roll a new one (I haven't checked to see if we have a spare u64
on device add).

4.  Define "0" as meaning "last resort", a new preference level where
the device is not preferred for any allocation unless there are no other
devices with a preference set.  Ecch I don't like this one because there's
a possible race where we're converting e.g. a 4-disk raid1 array into 2
data-only 2 metadata-only, and there's a point where we have 2x data-only
1x metadata-only and one type 0 device and it's possible to ENOSPC on
metadata at that moment if type 0 doesn't allow metadata allocation.
This plan would prevent the "data on the wrong drive" failure mode, so
it's in the list, but don't use it unless you can solve the new problem.


> Josef
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-23  2:56       ` Zygo Blaxell
@ 2022-03-24 19:05         ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-24 19:05 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Josef Bacik, linux-btrfs, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 23/03/2022 03.56, Zygo Blaxell wrote:
> On Tue, Mar 22, 2022 at 09:25:45PM +0100, Goffredo Baroncelli wrote:
>> On 22/03/2022 20.52, Josef Bacik wrote:
>>> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>>
>>>> Add the following flags to give an hint about which chunk should be
>>>> allocated in which disk:
>>>>
>>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>>>>     preferred for data chunk, but metadata chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>>>>     preferred for metadata chunk, but data chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>>>>     only metadata chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>>>>     only data chunk allowed
>>>>
>>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>>> ---
>>>>    include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>>>    1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>>>> index b069752a8ecf..e0d842c2e616 100644
>>>> --- a/include/uapi/linux/btrfs_tree.h
>>>> +++ b/include/uapi/linux/btrfs_tree.h
>>>> @@ -389,6 +389,22 @@ struct btrfs_key {
>>>>    	__u64 offset;
>>>>    } __attribute__ ((__packed__));
>>>> +/* dev_item.type */
>>>> +
>>>> +/* btrfs chunk allocation hint */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
>>>> +/* btrfs chunk allocation hint mask */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
>>>> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
>>>> +/* preferred data chunk, but metadata chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
>>>> +/* preferred metadata chunk, but data chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
>>>> +/* only metadata chunk are allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
>>>> +/* only data chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
>>>> +
>>>
>>> I also just realized you're using these as flags, so they need to be
>>>
>>> (1ULL << 0)
>>> (1ULL << 1)
>>> (1ULL << 2)
>>> (1ULL << 3)
>>>
>>
>> Could you elaborate a bit ? These are mutual exclusive values...
> 
> One of the comments I had on earlier versions was that these should
> be bit values.  Bit 0 is data (0) or metadata (1), bit 1 is preferred
> (0) or only (2).  Thus "metadata only" is 3, "data preferred" is 0,
> "data only" is 2, and "metadata preferred" is 1.  This maintained on-disk
> compatibility with the earliest versions that only had the two "preferred"
> options encoded as 0 and 1.

At this point I would prefer to not change this part.
  
> At some point this got lost.  Between one of the patch versions and
> another I had to change the type numbers on all of my devices.

I remember that in the first iterations I used 3 bits, when now I use
only two bits. So it is possible that I changed the values.

> 
>>> Thanks,
>>>
>>> Josef
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 20:25     ` Goffredo Baroncelli
  2022-03-23  2:56       ` Zygo Blaxell
@ 2022-03-25 14:59       ` Josef Bacik
  2022-03-25 18:55         ` Goffredo Baroncelli
  1 sibling, 1 reply; 25+ messages in thread
From: Josef Bacik @ 2022-03-25 14:59 UTC (permalink / raw)
  To: kreijack
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Tue, Mar 22, 2022 at 09:25:45PM +0100, Goffredo Baroncelli wrote:
> On 22/03/2022 20.52, Josef Bacik wrote:
> > On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
> > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > 
> > > Add the following flags to give an hint about which chunk should be
> > > allocated in which disk:
> > > 
> > > - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
> > >    preferred for data chunk, but metadata chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
> > >    preferred for metadata chunk, but data chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
> > >    only metadata chunk allowed
> > > - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
> > >    only data chunk allowed
> > > 
> > > Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> > > ---
> > >   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
> > >   1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> > > index b069752a8ecf..e0d842c2e616 100644
> > > --- a/include/uapi/linux/btrfs_tree.h
> > > +++ b/include/uapi/linux/btrfs_tree.h
> > > @@ -389,6 +389,22 @@ struct btrfs_key {
> > >   	__u64 offset;
> > >   } __attribute__ ((__packed__));
> > > +/* dev_item.type */
> > > +
> > > +/* btrfs chunk allocation hint */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
> > > +/* btrfs chunk allocation hint mask */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
> > > +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
> > > +/* preferred data chunk, but metadata chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
> > > +/* preferred metadata chunk, but data chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
> > > +/* only metadata chunk are allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
> > > +/* only data chunk allowed */
> > > +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
> > > +
> > 
> > I also just realized you're using these as flags, so they need to be
> > 
> > (1ULL << 0)
> > (1ULL << 1)
> > (1ULL << 2)
> > (1ULL << 3)
> > 
> 
> Could you elaborate a bit ? These are mutual exclusive values...

Your set patch is doing

type = (type & ~MASK) | value;

So if you have METADATA_PREFERRED set already, and you do DATA_PREFERRED it'll
be

type = (1) | 0

which is METADATA_PREFERRED.

If you were doing simply

type = <value>

then these values would make sense.  So either we need to treat them as bit
flags, which should be <<, or you need to stop using the | operator.  It sounds
like you want them to be exclusive, so you should just change the setting code?
Thanks,

Josef 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-25 14:59       ` Josef Bacik
@ 2022-03-25 18:55         ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-03-25 18:55 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 25/03/2022 15.59, Josef Bacik wrote:
> On Tue, Mar 22, 2022 at 09:25:45PM +0100, Goffredo Baroncelli wrote:
>> On 22/03/2022 20.52, Josef Bacik wrote:
>>> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>>
>>>> Add the following flags to give an hint about which chunk should be
>>>> allocated in which disk:
>>>>
>>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>>>>     preferred for data chunk, but metadata chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>>>>     preferred for metadata chunk, but data chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>>>>     only metadata chunk allowed
>>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>>>>     only data chunk allowed
>>>>
>>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>>> ---
>>>>    include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>>>    1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>>>> index b069752a8ecf..e0d842c2e616 100644
>>>> --- a/include/uapi/linux/btrfs_tree.h
>>>> +++ b/include/uapi/linux/btrfs_tree.h
>>>> @@ -389,6 +389,22 @@ struct btrfs_key {
>>>>    	__u64 offset;
>>>>    } __attribute__ ((__packed__));
>>>> +/* dev_item.type */
>>>> +
>>>> +/* btrfs chunk allocation hint */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
>>>> +/* btrfs chunk allocation hint mask */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
>>>> +	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
>>>> +/* preferred data chunk, but metadata chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
>>>> +/* preferred metadata chunk, but data chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
>>>> +/* only metadata chunk are allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
>>>> +/* only data chunk allowed */
>>>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
>>>> +
>>>
>>> I also just realized you're using these as flags, so they need to be
>>>
>>> (1ULL << 0)
>>> (1ULL << 1)
>>> (1ULL << 2)
>>> (1ULL << 3)
>>>
>>
>> Could you elaborate a bit ? These are mutual exclusive values...
> 
> Your set patch is doing
> 
> type = (type & ~MASK) | value;



> 
> So if you have METADATA_PREFERRED set already, and you do DATA_PREFERRED it'll
> be
> 
> type = (1) | 0
> 
> which is METADATA_PREFERRED.

May be that I am missing something, however

MASK = 3
~MASK = ~3 = 0xfffffffffffffffc; /* see below */

so

type = (type & ~MASK) | 0 =
        (1 & 0xfffffffffffffffc) | 0 = 0 -> DATA_PREFERRED


May be that you missed '~' ?

In any case, my MASK definition is wrong, it should contain (u64)1

#define BTRFS_DEV_ALLOCATION_HINT_MASK       \
    (((u64)1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)

otherwise on a 32 bit arch, BTRFS_DEV_ALLOCATION_HINT_MASK = 0xfffffffc


> 
> If you were doing simply
> 
> type = <value>
> 
> then these values would make sense.  So either we need to treat them as bit
> flags, which should be <<, or you need to stop using the | operator.  It sounds
> like you want them to be exclusive, so you should just change the setting code?
> Thanks,
> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
  2022-03-22 18:49     ` Goffredo Baroncelli
@ 2022-04-06 19:32       ` Goffredo Baroncelli
  0 siblings, 0 replies; 25+ messages in thread
From: Goffredo Baroncelli @ 2022-04-06 19:32 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

Gentle ping

On 22/03/2022 19.49, Goffredo Baroncelli wrote:
> On 22/03/2022 18.56, Josef Bacik wrote:
>> On Sun, Mar 06, 2022 at 07:14:39PM +0100, Goffredo Baroncelli wrote:
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> Add the following flags to give an hint about which chunk should be
>>> allocated in which disk:
>>>
>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
>>>    preferred for data chunk, but metadata chunk allowed
>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
>>>    preferred for metadata chunk, but data chunk allowed
>>> - BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
>>>    only metadata chunk allowed
>>> - BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
>>>    only data chunk allowed
>>>
>>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>>> ---
>>>   include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
>>>   1 file changed, 16 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
>>> index b069752a8ecf..e0d842c2e616 100644
>>> --- a/include/uapi/linux/btrfs_tree.h
>>> +++ b/include/uapi/linux/btrfs_tree.h
>>> @@ -389,6 +389,22 @@ struct btrfs_key {
>>>       __u64 offset;
>>>   } __attribute__ ((__packed__));
>>> +/* dev_item.type */
>>> +
>>> +/* btrfs chunk allocation hint */
>>> +#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT    2
>>> +/* btrfs chunk allocation hint mask */
>>> +#define BTRFS_DEV_ALLOCATION_HINT_MASK    \
>>> +    ((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
>>> +/* preferred data chunk, but metadata chunk allowed */
>>> +#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED    (0ULL)
>>
>> Actually don't we have 0 set on type already?  So this will make all existing
>> file systems have DATA_PREFERRED, when in reality they may not want that
>> behavior.  So should we start at 1?  Thanks,
> 
> Yes, the default is 0 (now DATA_PREFERRED).
> If we have all the disks set to DATA_PREFERRED (or METADATA_PREFERRED), all the disks
> have the same priority so the chunks allocator works as usual.
> 
> If we have DATA_PREFERRED=1, it is not clear to me which would be the expected behavior when the allocator has to choice one of the followings disks:
> - TYPE=0
> - DATA_PREFERRED
> - DATA_ONLY
> 
> It should ignore TYPE=0 ? or it should block the 'allocation_hint' policy ? Or TYPE=0 should have the lowest (or highest) priority ?
> 
> DATA_PREFERRED to me seems a safe default, because at worst we have only missing performance penalty (i.e. it is a faster disk where we should put METADATA)
> 
> BR
> 
>>
>> Josef
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-04-06 21:02 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-06 18:14 [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2022-03-22 17:51   ` Josef Bacik
2022-03-22 17:56   ` Josef Bacik
2022-03-22 18:49     ` Goffredo Baroncelli
2022-04-06 19:32       ` Goffredo Baroncelli
2022-03-23  3:26     ` Zygo Blaxell
2022-03-22 19:52   ` Josef Bacik
2022-03-22 20:25     ` Goffredo Baroncelli
2022-03-23  2:56       ` Zygo Blaxell
2022-03-24 19:05         ` Goffredo Baroncelli
2022-03-25 14:59       ` Josef Bacik
2022-03-25 18:55         ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 2/5] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
2022-03-08 13:25   ` kernel test robot
2022-03-22 18:05   ` Josef Bacik
2022-03-22 19:02     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 3/5] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
2022-03-22 19:19   ` Josef Bacik
2022-03-22 19:52     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 4/5] btrfs: add allocation_hint mode Goffredo Baroncelli
2022-03-22 19:47   ` Josef Bacik
2022-03-22 20:56     ` Goffredo Baroncelli
2022-03-06 18:14 ` [PATCH 5/5] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
2022-03-21 20:29 ` [PATCH 0/5][V12] btrfs: allocation_hint Goffredo Baroncelli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.