linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7][V11] btrfs: allocation_hint
@ 2022-01-26 20:32 Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 1/7] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Hi all,

This patches set was born after some discussion between me, Zygo and Josef.
Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.

Some further information about a real use case can be found in
https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/

Recently Shafeeq told me that he is interested too, due to the performance gain.

In V8, revision I switched away from an ioctl API in favor of a sysfs API (
see patch #2 and #3).

In V9, I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
Moreover I renamed dev_info->type to dev_info->flags.

In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED; I added
another devinfo property, called major_minor. For now it is unused;
the plan is to use this in btrfs-progs to go from the block device to the filesystem.
First client will be "btrfs prop get/set allocation_hint", but I see others possible
clients.
Finally I made some cleanup, and remove the mount option 'allocation_hint'

In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
to help the detection of the allocation_hint feature.

The idea behind this patches set, is to dedicate some disks (the fastest one)
to the metadata chunk. My initial idea was a "soft" hint. However Zygo
asked an option for a "strong" hint (== mandatory). The result is that
each disk can be "tagged" by one of the following flags:
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
- BTRFS_DEV_ALLOCATION_DATA_PREFERRED
- BTRFS_DEV_ALLOCATION_DATA_ONLY

When the chunk allocator search a disks to allocate a chunk, scans the disks
in an order decided by these tags. For metadata, the order is:
*_METADATA_ONLY
*_METADATA_PREFERRED
*_DATA_PREFERRED

The *_DATA_ONLY are not eligible from metadata chunk allocation.

For the data chunk, the order is reversed, and the *_METADATA_ONLY are
excluded.

The exact sort logic is to sort first for the "tag", and then for the space
available. If there is no space available, the next "tag" disks set are
selected.

To set these tags, a new property called "allocation_hint" was created.
There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
allocation_hint disk property].

$ sudo mount /dev/loop0 /mnt/test-btrfs/
$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:           1.34GiB
    Device unallocated:           1.41GiB
    Device missing:             0.00B
    Used:             400.89MiB
    Free (estimated):           1.04GiB    (min: 1.04GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB
   /dev/loop2     127.00MiB
   /dev/loop3     127.00MiB
   /dev/loop4     127.00MiB
   /dev/loop5     127.00MiB

Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
   /dev/loop1     256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop0      32.00MiB

Unallocated:
   /dev/loop0     704.00MiB
   /dev/loop1     480.00MiB
   /dev/loop2       1.00MiB
   /dev/loop3       1.00MiB
   /dev/loop4       1.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7     128.00MiB

# change the tag of some disks

$ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY

$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo btrfs bal start --full-balance /mnt/test-btrfs/
$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:         735.00MiB
    Device unallocated:           2.03GiB
    Device missing:             0.00B
    Used:             400.72MiB
    Free (estimated):           1.10GiB    (min: 1.10GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB

Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
   /dev/loop5     127.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop7      32.00MiB

Unallocated:
   /dev/loop0     736.00MiB
   /dev/loop1     736.00MiB
   /dev/loop2     128.00MiB
   /dev/loop3     128.00MiB
   /dev/loop4     128.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7      96.00MiB


#As you can see all the metadata were placed on the disk loop5/loop7 even if
#the most empty one are loop0 and loop1.



Furher works:
- the tool which show the space available should consider the tagging (eg
  the disks tagged by _METADATA_ONLY should be excluded from the data
  availability)
- allow btrfs-prog to change the allocation_hint even when the filesystem
  is not mounted.

In following emails, there will be the btrfs-progs patches set and the xfstest
suite patches set.


Comments are welcome
BR
G.Baroncelli

Revision:
V11:
- added the property /sys/fs/btrfs/features/allocation_hint

V10:
- renamed two disk tags/constants:
        - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
        - *_DATA_PREFERRED -> *_DATA_EFERRED
- add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
- revise some commit description
- refactored the code of gather_device_info(): one portion of this code
  is moved in a separate function called sort_and_reduce_device_info()
  for clarity.
- removed the 'allocation_hint' mount option 

V9:
- rename dev_item->type to dev_item->flags
- rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint

V8:
- drop the ioctl API, instead use a sysfs one

V7:
- make more room in the struct btrfs_ioctl_dev_properties up to 1K
- leave in btrfs_tree.h only the costants
- removed the mount option (sic)
- correct an 'use before check' in the while loop (signaled
  by Zygo)
- add a 2nd sort to be sure that the device_info array is in the
  expected order

V6:
- add further values to the hints: add the possibility to
  exclude a disk for a chunk type 



Goffredo Baroncelli (7):
  btrfs: add flags to give an hint to the chunk allocator
  btrfs: export the device allocation_hint property in sysfs
  btrfs: change the device allocation_hint property via sysfs
  btrfs: add allocation_hint mode
  btrfs: rename dev_item->type to dev_item->flags
  btrfs: add major and minor to sysfs
  Add /sys/fs/btrfs/features/allocation_hint

 fs/btrfs/ctree.h                |   4 +-
 fs/btrfs/disk-io.c              |   2 +-
 fs/btrfs/sysfs.c                | 105 +++++++++++++++++++++++++++
 fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h              |   7 +-
 include/uapi/linux/btrfs_tree.h |  20 +++++-
 6 files changed, 245 insertions(+), 14 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/7] btrfs: add flags to give an hint to the chunk allocator
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 2/7] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following flags to give an hint about which chunk should be
allocated in which disk:

- BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED
  preferred for data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED
  preferred for metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 include/uapi/linux/btrfs_tree.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 5416f1f1a77a..02955d5fcd21 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -386,6 +386,22 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+
+/* btrfs chunk allocation hint */
+#define BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT	2
+/* btrfs chunk allocation hint mask */
+#define BTRFS_DEV_ALLOCATION_HINT_MASK	\
+	((1 << BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT) - 1)
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED	(0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED	(1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY		(2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY		(3ULL)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/7] btrfs: export the device allocation_hint property in sysfs
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 1/7] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 3/7] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Export the device allocation_hint property via
/sys/fs/btrfs/<uuid>/devinfo/<devid>/allocation_hint

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index beb7f72d50b8..c1c903187e19 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1575,6 +1575,17 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
 
+static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
+					struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
+		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK);
+}
+BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
 /*
  * Information about one device.
  *
@@ -1588,6 +1599,7 @@ static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, scrub_speed_max),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, allocation_hint),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/7] btrfs: change the device allocation_hint property via sysfs
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 1/7] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 2/7] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 4/7] btrfs: add allocation_hint mode Goffredo Baroncelli
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

This patch allow to change the allocation_hint property writing
a numerical value in the file.

/sysfs/fs/btrfs/<UUID>/devinfo/<devid>/allocation_hint

To update this field it is added the property "allocation_hint" in
btrfs-prog too.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c   | 62 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c1c903187e19..9070d0370343 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1584,7 +1584,67 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
 		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK);
 }
-BTRFS_ATTR(devid, allocation_hint, btrfs_devinfo_allocation_hint_show);
+
+static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
+				 struct kobj_attribute *a,
+				 const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_device *device;
+	int ret;
+	struct btrfs_trans_handle *trans;
+
+	u64 type, prev_type;
+
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+	fs_info = device->fs_info;
+	if (!fs_info)
+		return -EPERM;
+
+	root = fs_info->chunk_root;
+	if (sb_rdonly(fs_info->sb))
+		return -EROFS;
+
+	ret = kstrtou64(buf, 0, &type);
+	if (ret < 0)
+		return -EINVAL;
+
+	/* for now, allow to touch only the 'allocation hint' bits */
+	if (type & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
+		return -EINVAL;
+
+	/* check if a change is really needed */
+	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+		return len;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+
+	ret = btrfs_update_device(trans, device);
+
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto abort;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		goto abort;
+
+	return len;
+abort:
+	device->type = prev_type;
+	return  ret;
+}
+BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
+				      btrfs_devinfo_allocation_hint_store);
+
 
 /*
  * Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a91e51b0ca81..c43a8a36ff5b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2841,7 +2841,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bd297f23d19e..93ac27d8097c 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -636,5 +636,7 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+			struct btrfs_device *device);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/7] btrfs: add allocation_hint mode
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2022-01-26 20:32 ` [PATCH 3/7] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 5/7] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

The chunk allocation policy is modified as follow.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space are searched first in
BTRFS_DEV_ALLOCATION_DATA_ONLY. If the space found is not enough (eg.
in raid5, only two disks are available), then the disks tagged
BTRFS_DEV_ALLOCATION_DATA_PREFERRED are considered. If the space is not
enough again, the disks tagged BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
are also considered. If even in this case the space is not
sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the same algorithm applies swapping
_DATA_ and _METADATA_.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_DATA_PREFERRED,
so BTRFS behaves as usual.

If the user prefers to store the metadata in the faster disks (e.g. SSD),
he can tag these with BTRFS_DEV_ALLOCATION_METADATA_PREFERRED: in this
case the metadata BG go in the BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
disks and the data BG in the others ones. When a disks set is filled, the
other is considered.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/volumes.c | 113 +++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |   1 +
 2 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c43a8a36ff5b..666b67f4e07b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,27 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
 	return BTRFS_RAID_SINGLE; /* BTRFS_BLOCK_GROUP_SINGLE */
 }
 
+#define BTRFS_DEV_ALLOCATION_HINT_COUNT (1ULL << \
+		BTRFS_DEV_ALLOCATION_HINT_BIT_COUNT)
+
+/*
+ *	The order of BTRFS_DEV_ALLOCATION_HINT_* values are not
+ *	good, because BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED is 0
+ *	(for backward compatibility reason), and the other
+ *	values are greater (because the field is unsigned). So we
+ *	need a map that rearranges the order giving to _DATA_PREFERRED
+ *	an intermediate priority.
+ *	These values give to METADATA_ONLY the highest priority, and are
+ *	valid for metadata BG allocation. When a data
+ *	BG is allocated we negate these values to reverse the priority.
+ */
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_HINT_COUNT] = {
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY] = -1,
+	[BTRFS_DEV_ALLOCATION_HINT_DATA_PREFERRED] = 0,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_PREFERRED] = 1,
+	[BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY] = 2,
+};
+
 const char *btrfs_bg_type_to_raid_name(u64 flags)
 {
 	const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5019,13 +5040,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 }
 
 /*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
  */
 static int btrfs_cmp_device_info(const void *a, const void *b)
 {
 	const struct btrfs_device_info *di_a = a;
 	const struct btrfs_device_info *di_b = b;
 
+	if (di_a->alloc_hint > di_b->alloc_hint)
+		return -1;
+	if (di_a->alloc_hint < di_b->alloc_hint)
+		return 1;
 	if (di_a->max_avail > di_b->max_avail)
 		return -1;
 	if (di_a->max_avail < di_b->max_avail)
@@ -5188,6 +5214,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int hint;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -5240,17 +5267,95 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+
+		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+			/*
+			 * if mixed bg set all the alloc_hint
+			 * fields to the same value, so the sorting
+			 * is not affected
+			 */
+			devices_info[ndevs].alloc_hint = 0;
+		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_METADATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_METADATA_ONLY)
+				continue;
+			/*
+			 * if a data chunk must be allocated,
+			 * sort also by hint (data disk
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+		} else { /* BTRFS_BLOCK_GROUP_METADATA */
+			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+
+			/*
+			 * skip BTRFS_DEV_DATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_HINT_DATA_ONLY)
+				continue;
+			/*
+			 * if a metadata chunk must be allocated,
+			 * sort also by hint (metadata hint
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+		}
+
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	return 0;
+}
+
+static void sort_and_reduce_device_info(struct alloc_chunk_ctl *ctl,
+					struct btrfs_device_info *devices_info)
+{
+	int ndevs, hint, i;
+
+	ndevs = ctl->ndevs;
 	/*
-	 * now sort the devices by hole size / available space
+	 * now sort the devices by hint / hole size / available space
 	 */
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
-	return 0;
+	/*
+	 * select the minimum set of disks grouped by hint that
+	 * can host the chunk
+	 */
+	ndevs = 0;
+	while (ndevs < ctl->ndevs) {
+		hint = devices_info[ndevs++].alloc_hint;
+		while (ndevs < ctl->ndevs) {
+			if (devices_info[ndevs].alloc_hint != hint)
+				break;
+			ndevs++;
+		}
+		if (ndevs >= ctl->devs_min)
+			break;
+	}
+
+	ctl->ndevs = ndevs;
+
+	/*
+	 * the next layers require the devices_info ordered by
+	 * max_avail. If we are returning two (or more) different
+	 * group of alloc_hint, this is not always true. So sort
+	 * these again.
+	 */
+
+	for (i = 0 ; i < ndevs ; i++)
+		devices_info[i].alloc_hint = 0;
+
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+
 }
 
 static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl,
@@ -5502,6 +5607,8 @@ struct btrfs_block_group *btrfs_create_chunk(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	sort_and_reduce_device_info(&ctl, devices_info);
+
 	ret = decide_stripe_size(fs_devices, &ctl, devices_info);
 	if (ret < 0) {
 		block_group = ERR_PTR(ret);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 93ac27d8097c..b066f9af216a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -404,6 +404,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int alloc_hint;
 };
 
 struct btrfs_raid_attr {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/7] btrfs: rename dev_item->type to dev_item->flags
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2022-01-26 20:32 ` [PATCH 4/7] btrfs: add allocation_hint mode Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 6/7] btrfs: add major and minor to sysfs Goffredo Baroncelli
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Rename the field type of dev_item from 'type' to 'flags' changing the
struct btrfs_device and btrfs_dev_item.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h                |  4 ++--
 fs/btrfs/disk-io.c              |  2 +-
 fs/btrfs/sysfs.c                | 17 +++++++++--------
 fs/btrfs/volumes.c              | 10 +++++-----
 fs/btrfs/volumes.h              |  4 ++--
 include/uapi/linux/btrfs_tree.h |  4 ++--
 6 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8992e0096163..41a7f4be0c13 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1664,7 +1664,7 @@ static inline void btrfs_set_device_total_bytes(const struct extent_buffer *eb,
 }
 
 
-BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_FUNCS(device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
 BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
 BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
@@ -1677,7 +1677,7 @@ BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
 BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
 BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
 
-BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_flags, struct btrfs_dev_item, flags, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
 			 total_bytes, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 87a5addbedf6..5b1f66eeed46 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4326,7 +4326,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
 			continue;
 
 		btrfs_set_stack_device_generation(dev_item, 0);
-		btrfs_set_stack_device_type(dev_item, dev->type);
+		btrfs_set_stack_device_flags(dev_item, dev->flags);
 		btrfs_set_stack_device_id(dev_item, dev->devid);
 		btrfs_set_stack_device_total_bytes(dev_item,
 						   dev->commit_total_bytes);
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 9070d0370343..42921432c9dc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1582,7 +1582,7 @@ static ssize_t btrfs_devinfo_allocation_hint_show(struct kobject *kobj,
 						   devid_kobj);
 
 	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",
-		device->type & BTRFS_DEV_ALLOCATION_HINT_MASK);
+		device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK);
 }
 
 static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
@@ -1595,7 +1595,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 	int ret;
 	struct btrfs_trans_handle *trans;
 
-	u64 type, prev_type;
+	u64 flags, prev_flags;
 
 	device = container_of(kobj, struct btrfs_device, devid_kobj);
 	fs_info = device->fs_info;
@@ -1606,24 +1606,25 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 	if (sb_rdonly(fs_info->sb))
 		return -EROFS;
 
-	ret = kstrtou64(buf, 0, &type);
+	ret = kstrtou64(buf, 0, &flags);
 	if (ret < 0)
 		return -EINVAL;
 
 	/* for now, allow to touch only the 'allocation hint' bits */
-	if (type & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
+	if (flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK)
 		return -EINVAL;
 
 	/* check if a change is really needed */
-	if ((device->type & BTRFS_DEV_ALLOCATION_HINT_MASK) == type)
+	if ((device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK) == flags)
 		return len;
 
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
-	prev_type = device->type;
-	device->type = (device->type & ~BTRFS_DEV_ALLOCATION_HINT_MASK) | type;
+	prev_flags = device->flags;
+	device->flags = (device->flags & ~BTRFS_DEV_ALLOCATION_HINT_MASK) |
+			flags;
 
 	ret = btrfs_update_device(trans, device);
 
@@ -1639,7 +1640,7 @@ static ssize_t btrfs_devinfo_allocation_hint_store(struct kobject *kobj,
 
 	return len;
 abort:
-	device->type = prev_type;
+	device->flags = prev_flags;
 	return  ret;
 }
 BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 666b67f4e07b..dd2996b0318b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1871,7 +1871,7 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
 	btrfs_set_device_generation(leaf, dev_item, 0);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -2893,7 +2893,7 @@ noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
 
 	btrfs_set_device_id(leaf, dev_item, device->devid);
-	btrfs_set_device_type(leaf, dev_item, device->type);
+	btrfs_set_device_flags(leaf, dev_item, device->flags);
 	btrfs_set_device_io_align(leaf, dev_item, device->io_align);
 	btrfs_set_device_io_width(leaf, dev_item, device->io_width);
 	btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
@@ -5277,7 +5277,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = 0;
 		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_METADATA_ONLY disks
@@ -5291,7 +5291,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 			 */
 			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
 		} else { /* BTRFS_BLOCK_GROUP_METADATA */
-			hint = device->type & BTRFS_DEV_ALLOCATION_HINT_MASK;
+			hint = device->flags & BTRFS_DEV_ALLOCATION_HINT_MASK;
 
 			/*
 			 * skip BTRFS_DEV_DATA_ONLY disks
@@ -7297,7 +7297,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
 	device->commit_total_bytes = device->disk_total_bytes;
 	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
 	device->commit_bytes_used = device->bytes_used;
-	device->type = btrfs_device_type(leaf, dev_item);
+	device->flags = btrfs_device_flags(leaf, dev_item);
 	device->io_align = btrfs_device_io_align(leaf, dev_item);
 	device->io_width = btrfs_device_io_width(leaf, dev_item);
 	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b066f9af216a..6230d911e7af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -101,8 +101,8 @@ struct btrfs_device {
 
 	/* optimal io width for this device */
 	u32 io_width;
-	/* type and info about this device */
-	u64 type;
+	/* device flags (e.g. allocation hint) */
+	u64 flags;
 
 	/* minimal io size for this device */
 	u32 sector_size;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 02955d5fcd21..232c66e7bc43 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -421,8 +421,8 @@ struct btrfs_dev_item {
 	/* minimal io size for this device */
 	__le32 sector_size;
 
-	/* type and info about this device */
-	__le64 type;
+	/* device flags (e.g. allocation hint) */
+	__le64 flags;
 
 	/* expected generation for this device */
 	__le64 generation;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/7] btrfs: add major and minor to sysfs
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2022-01-26 20:32 ` [PATCH 5/7] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-01-26 20:32 ` [PATCH 7/7] Add /sys/fs/btrfs/features/allocation_hint Goffredo Baroncelli
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following property to btrfs sysfs

/sysfs/fs/btrfs/<UUID>/devinfo/<devid>/major_minor

This would help to figure out which block device is involved in
which filesystem.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 42921432c9dc..dee23669a00f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1647,6 +1647,22 @@ BTRFS_ATTR_RW(devid, allocation_hint, btrfs_devinfo_allocation_hint_show,
 				      btrfs_devinfo_allocation_hint_store);
 
 
+static ssize_t btrfs_devinfo_major_minor_show(struct kobject *kobj,
+					struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	if (device->bdev)
+		return scnprintf(buf, PAGE_SIZE, "%d:%d\n",
+			MAJOR(device->bdev->bd_dev),
+			MINOR(device->bdev->bd_dev));
+	else
+		return scnprintf(buf, PAGE_SIZE, "N/A\n");
+}
+
+BTRFS_ATTR(devid, major_minor, btrfs_devinfo_major_minor_show);
+
 /*
  * Information about one device.
  *
@@ -1661,6 +1677,7 @@ static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, scrub_speed_max),
 	BTRFS_ATTR_PTR(devid, writeable),
 	BTRFS_ATTR_PTR(devid, allocation_hint),
+	BTRFS_ATTR_PTR(devid, major_minor),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/7] Add /sys/fs/btrfs/features/allocation_hint
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (5 preceding siblings ...)
  2022-01-26 20:32 ` [PATCH 6/7] btrfs: add major and minor to sysfs Goffredo Baroncelli
@ 2022-01-26 20:32 ` Goffredo Baroncelli
  2022-02-15 18:49 ` [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
  2022-02-28 17:04 ` Josef Bacik
  8 siblings, 0 replies; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-01-26 20:32 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add a new feature sysfs file to simplify the allocation_hit feature
availability check.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index dee23669a00f..0f4a7ab79fe5 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -403,6 +403,20 @@ static ssize_t supported_sectorsizes_show(struct kobject *kobj,
 BTRFS_ATTR(static_feature, supported_sectorsizes,
 	   supported_sectorsizes_show);
 
+static ssize_t allocation_hint_show(struct kobject *kobj,
+					  struct kobj_attribute *a,
+					  char *buf)
+{
+	ssize_t ret;
+
+	/* Only sectorsize == PAGE_SIZE is now supported */
+	ret = sysfs_emit(buf, "1\n");
+
+	return ret;
+}
+BTRFS_ATTR(static_feature, allocation_hint,
+	   allocation_hint_show);
+
 /*
  * Features which only depend on kernel version.
  *
@@ -415,6 +429,7 @@ static struct attribute *btrfs_supported_static_feature_attrs[] = {
 	BTRFS_ATTR_PTR(static_feature, send_stream_version),
 	BTRFS_ATTR_PTR(static_feature, supported_rescue_options),
 	BTRFS_ATTR_PTR(static_feature, supported_sectorsizes),
+	BTRFS_ATTR_PTR(static_feature, allocation_hint),
 	NULL
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (6 preceding siblings ...)
  2022-01-26 20:32 ` [PATCH 7/7] Add /sys/fs/btrfs/features/allocation_hint Goffredo Baroncelli
@ 2022-02-15 18:49 ` Goffredo Baroncelli
  2022-02-16  0:22   ` Qu Wenruo
  2022-02-28 17:04 ` Josef Bacik
  8 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-02-15 18:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	Boris Burkov, linux-btrfs

Hi Josef,

gentle ping...

few months ago you showed some interest in this patches set. Few of the cc-ed person use this patch set.

I know that David showed some interest in the Anand approach (i.e. no knobs, but an automatic behavior looking at the speed of the devices).

At the time when I tried this approach in the first attempts, I got the complain that the kernel may not know the performance differences of the disk (HDD vs SSD vs NVME vs ZONED disk...).

Comments ?

BR
Goffredo


On 26/01/2022 21.32, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Hi all,
> 
> This patches set was born after some discussion between me, Zygo and Josef.
> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
> 
> Some further information about a real use case can be found in
> https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
> 
> Recently Shafeeq told me that he is interested too, due to the performance gain.
> 
> In V8, revision I switched away from an ioctl API in favor of a sysfs API (
> see patch #2 and #3).
> 
> In V9, I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
> Moreover I renamed dev_info->type to dev_info->flags.
> 
> In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED; I added
> another devinfo property, called major_minor. For now it is unused;
> the plan is to use this in btrfs-progs to go from the block device to the filesystem.
> First client will be "btrfs prop get/set allocation_hint", but I see others possible
> clients.
> Finally I made some cleanup, and remove the mount option 'allocation_hint'
> 
> In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
> to help the detection of the allocation_hint feature.
> 
> The idea behind this patches set, is to dedicate some disks (the fastest one)
> to the metadata chunk. My initial idea was a "soft" hint. However Zygo
> asked an option for a "strong" hint (== mandatory). The result is that
> each disk can be "tagged" by one of the following flags:
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> 
> When the chunk allocator search a disks to allocate a chunk, scans the disks
> in an order decided by these tags. For metadata, the order is:
> *_METADATA_ONLY
> *_METADATA_PREFERRED
> *_DATA_PREFERRED
> 
> The *_DATA_ONLY are not eligible from metadata chunk allocation.
> 
> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
> excluded.
> 
> The exact sort logic is to sort first for the "tag", and then for the space
> available. If there is no space available, the next "tag" disks set are
> selected.
> 
> To set these tags, a new property called "allocation_hint" was created.
> There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
> allocation_hint disk property].
> 
> $ sudo mount /dev/loop0 /mnt/test-btrfs/
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
> devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>      Device size:           2.75GiB
>      Device allocated:           1.34GiB
>      Device unallocated:           1.41GiB
>      Device missing:             0.00B
>      Used:             400.89MiB
>      Free (estimated):           1.04GiB    (min: 1.04GiB)
>      Data ratio:                  2.00
>      Metadata ratio:              1.00
>      Global reserve:           3.25MiB    (used: 0.00B)
>      Multiple profiles:                no
> 
> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>     /dev/loop0     288.00MiB
>     /dev/loop1     288.00MiB
>     /dev/loop2     127.00MiB
>     /dev/loop3     127.00MiB
>     /dev/loop4     127.00MiB
>     /dev/loop5     127.00MiB
> 
> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>     /dev/loop1     256.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>     /dev/loop0      32.00MiB
> 
> Unallocated:
>     /dev/loop0     704.00MiB
>     /dev/loop1     480.00MiB
>     /dev/loop2       1.00MiB
>     /dev/loop3       1.00MiB
>     /dev/loop4       1.00MiB
>     /dev/loop5       1.00MiB
>     /dev/loop6     128.00MiB
>     /dev/loop7     128.00MiB
> 
> # change the tag of some disks
> 
> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
> $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
> 
> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
> devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> 
> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
> $ sudo ./btrfs fi us /mnt/test-btrfs/
> Overall:
>      Device size:           2.75GiB
>      Device allocated:         735.00MiB
>      Device unallocated:           2.03GiB
>      Device missing:             0.00B
>      Used:             400.72MiB
>      Free (estimated):           1.10GiB    (min: 1.10GiB)
>      Data ratio:                  2.00
>      Metadata ratio:              1.00
>      Global reserve:           3.25MiB    (used: 0.00B)
>      Multiple profiles:                no
> 
> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>     /dev/loop0     288.00MiB
>     /dev/loop1     288.00MiB
> 
> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>     /dev/loop5     127.00MiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>     /dev/loop7      32.00MiB
> 
> Unallocated:
>     /dev/loop0     736.00MiB
>     /dev/loop1     736.00MiB
>     /dev/loop2     128.00MiB
>     /dev/loop3     128.00MiB
>     /dev/loop4     128.00MiB
>     /dev/loop5       1.00MiB
>     /dev/loop6     128.00MiB
>     /dev/loop7      96.00MiB
> 
> 
> #As you can see all the metadata were placed on the disk loop5/loop7 even if
> #the most empty one are loop0 and loop1.
> 
> 
> 
> Furher works:
> - the tool which show the space available should consider the tagging (eg
>    the disks tagged by _METADATA_ONLY should be excluded from the data
>    availability)
> - allow btrfs-prog to change the allocation_hint even when the filesystem
>    is not mounted.
> 
> In following emails, there will be the btrfs-progs patches set and the xfstest
> suite patches set.
> 
> 
> Comments are welcome
> BR
> G.Baroncelli
> 
> Revision:
> V11:
> - added the property /sys/fs/btrfs/features/allocation_hint
> 
> V10:
> - renamed two disk tags/constants:
>          - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
>          - *_DATA_PREFERRED -> *_DATA_EFERRED
> - add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
> - revise some commit description
> - refactored the code of gather_device_info(): one portion of this code
>    is moved in a separate function called sort_and_reduce_device_info()
>    for clarity.
> - removed the 'allocation_hint' mount option
> 
> V9:
> - rename dev_item->type to dev_item->flags
> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> 
> V8:
> - drop the ioctl API, instead use a sysfs one
> 
> V7:
> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> - leave in btrfs_tree.h only the costants
> - removed the mount option (sic)
> - correct an 'use before check' in the while loop (signaled
>    by Zygo)
> - add a 2nd sort to be sure that the device_info array is in the
>    expected order
> 
> V6:
> - add further values to the hints: add the possibility to
>    exclude a disk for a chunk type
> 
> 
> 
> Goffredo Baroncelli (7):
>    btrfs: add flags to give an hint to the chunk allocator
>    btrfs: export the device allocation_hint property in sysfs
>    btrfs: change the device allocation_hint property via sysfs
>    btrfs: add allocation_hint mode
>    btrfs: rename dev_item->type to dev_item->flags
>    btrfs: add major and minor to sysfs
>    Add /sys/fs/btrfs/features/allocation_hint
> 
>   fs/btrfs/ctree.h                |   4 +-
>   fs/btrfs/disk-io.c              |   2 +-
>   fs/btrfs/sysfs.c                | 105 +++++++++++++++++++++++++++
>   fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
>   fs/btrfs/volumes.h              |   7 +-
>   include/uapi/linux/btrfs_tree.h |  20 +++++-
>   6 files changed, 245 insertions(+), 14 deletions(-)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-15 18:49 ` [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
@ 2022-02-16  0:22   ` Qu Wenruo
  2022-02-16  3:28     ` Zygo Blaxell
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2022-02-16  0:22 UTC (permalink / raw)
  To: kreijack, Josef Bacik
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	Boris Burkov, linux-btrfs



On 2022/2/16 02:49, Goffredo Baroncelli wrote:
> Hi Josef,
>
> gentle ping...
>
> few months ago you showed some interest in this patches set. Few of the
> cc-ed person use this patch set.
>
> I know that David showed some interest in the Anand approach (i.e. no
> knobs, but an automatic behavior looking at the speed of the devices).
>
> At the time when I tried this approach in the first attempts, I got the
> complain that the kernel may not know the performance differences of the
> disk (HDD vs SSD vs NVME vs ZONED disk...).

Sorry I didn't check the patches in details.

But I'm a little concerned about how to accurately determine the
performance of a device.

If doing it automatically, there must be some (commonly very short) time
spent to do the test.

In the very short time, I doubt we can even accurately got a full
picture of a device (from sequential read/write speed to IOPS values)

For spinning disks, the sequential read/write speed even change based on
their LBA address (as their physical location inside the plate can
change their linear velocity, since the angular velocity is fixed).

And even for SSD, IOPS can var dramatically due to cache/controller
difference.


For a proper performance aware setup, I guess the only correct way to
fetch performance characteristics is from the (advanced) user.

Or we may need to spent at least tens of minutes to do proper tests to
get the result.

For regular end users, the difference between SSD and HDD is huge enough
and simply preferring SSD for metadata is good enough.

But for more complex setup, like btrfs over LUKS over LVM (even crosses
several physical devices), I doubt if it's even possible to fetch the
correct performance characteristics automatically.

Thanks,
Qu
>
> Comments ?
>
> BR
> Goffredo
>
>
> On 26/01/2022 21.32, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Hi all,
>>
>> This patches set was born after some discussion between me, Zygo and
>> Josef.
>> Some details can be found in
>> https://github.com/btrfs/btrfs-todo/issues/19.
>>
>> Some further information about a real use case can be found in
>> https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
>>
>>
>> Recently Shafeeq told me that he is interested too, due to the
>> performance gain.
>>
>> In V8, revision I switched away from an ioctl API in favor of a sysfs
>> API (
>> see patch #2 and #3).
>>
>> In V9, I renamed the sysfs interface from devinfo/type to
>> devinfo/allocation_hint.
>> Moreover I renamed dev_info->type to dev_info->flags.
>>
>> In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED;
>> I added
>> another devinfo property, called major_minor. For now it is unused;
>> the plan is to use this in btrfs-progs to go from the block device to
>> the filesystem.
>> First client will be "btrfs prop get/set allocation_hint", but I see
>> others possible
>> clients.
>> Finally I made some cleanup, and remove the mount option
>> 'allocation_hint'
>>
>> In V11 I added a new 'feature' file
>> /sys/fs/btrfs/features/allocation_hint
>> to help the detection of the allocation_hint feature.
>>
>> The idea behind this patches set, is to dedicate some disks (the
>> fastest one)
>> to the metadata chunk. My initial idea was a "soft" hint. However Zygo
>> asked an option for a "strong" hint (== mandatory). The result is that
>> each disk can be "tagged" by one of the following flags:
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>>
>> When the chunk allocator search a disks to allocate a chunk, scans the
>> disks
>> in an order decided by these tags. For metadata, the order is:
>> *_METADATA_ONLY
>> *_METADATA_PREFERRED
>> *_DATA_PREFERRED
>>
>> The *_DATA_ONLY are not eligible from metadata chunk allocation.
>>
>> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
>> excluded.
>>
>> The exact sort logic is to sort first for the "tag", and then for the
>> space
>> available. If there is no space available, the next "tag" disks set are
>> selected.
>>
>> To set these tags, a new property called "allocation_hint" was created.
>> There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
>> allocation_hint disk property].
>>
>> $ sudo mount /dev/loop0 /mnt/test-btrfs/
>> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i
>> allocation_hint; done
>> devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
>> devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
>> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
>> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
>> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
>> devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
>> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
>> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
>>
>> $ sudo ./btrfs fi us /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:           1.34GiB
>>      Device unallocated:           1.41GiB
>>      Device missing:             0.00B
>>      Used:             400.89MiB
>>      Free (estimated):           1.04GiB    (min: 1.04GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>     /dev/loop2     127.00MiB
>>     /dev/loop3     127.00MiB
>>     /dev/loop4     127.00MiB
>>     /dev/loop5     127.00MiB
>>
>> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>>     /dev/loop1     256.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop0      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     704.00MiB
>>     /dev/loop1     480.00MiB
>>     /dev/loop2       1.00MiB
>>     /dev/loop3       1.00MiB
>>     /dev/loop4       1.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7     128.00MiB
>>
>> # change the tag of some disks
>>
>> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
>> $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
>> $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
>>
>> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i
>> allocation_hint; done
>> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
>> devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
>> devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
>> devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
>> devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
>> devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
>> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
>> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
>>
>> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
>> $ sudo ./btrfs fi us /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:         735.00MiB
>>      Device unallocated:           2.03GiB
>>      Device missing:             0.00B
>>      Used:             400.72MiB
>>      Free (estimated):           1.10GiB    (min: 1.10GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>
>> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>>     /dev/loop5     127.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop7      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     736.00MiB
>>     /dev/loop1     736.00MiB
>>     /dev/loop2     128.00MiB
>>     /dev/loop3     128.00MiB
>>     /dev/loop4     128.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7      96.00MiB
>>
>>
>> #As you can see all the metadata were placed on the disk loop5/loop7
>> even if
>> #the most empty one are loop0 and loop1.
>>
>>
>>
>> Furher works:
>> - the tool which show the space available should consider the tagging (eg
>>    the disks tagged by _METADATA_ONLY should be excluded from the data
>>    availability)
>> - allow btrfs-prog to change the allocation_hint even when the filesystem
>>    is not mounted.
>>
>> In following emails, there will be the btrfs-progs patches set and the
>> xfstest
>> suite patches set.
>>
>>
>> Comments are welcome
>> BR
>> G.Baroncelli
>>
>> Revision:
>> V11:
>> - added the property /sys/fs/btrfs/features/allocation_hint
>>
>> V10:
>> - renamed two disk tags/constants:
>>          - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
>>          - *_DATA_PREFERRED -> *_DATA_EFERRED
>> - add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
>> - revise some commit description
>> - refactored the code of gather_device_info(): one portion of this code
>>    is moved in a separate function called sort_and_reduce_device_info()
>>    for clarity.
>> - removed the 'allocation_hint' mount option
>>
>> V9:
>> - rename dev_item->type to dev_item->flags
>> - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
>>
>> V8:
>> - drop the ioctl API, instead use a sysfs one
>>
>> V7:
>> - make more room in the struct btrfs_ioctl_dev_properties up to 1K
>> - leave in btrfs_tree.h only the costants
>> - removed the mount option (sic)
>> - correct an 'use before check' in the while loop (signaled
>>    by Zygo)
>> - add a 2nd sort to be sure that the device_info array is in the
>>    expected order
>>
>> V6:
>> - add further values to the hints: add the possibility to
>>    exclude a disk for a chunk type
>>
>>
>>
>> Goffredo Baroncelli (7):
>>    btrfs: add flags to give an hint to the chunk allocator
>>    btrfs: export the device allocation_hint property in sysfs
>>    btrfs: change the device allocation_hint property via sysfs
>>    btrfs: add allocation_hint mode
>>    btrfs: rename dev_item->type to dev_item->flags
>>    btrfs: add major and minor to sysfs
>>    Add /sys/fs/btrfs/features/allocation_hint
>>
>>   fs/btrfs/ctree.h                |   4 +-
>>   fs/btrfs/disk-io.c              |   2 +-
>>   fs/btrfs/sysfs.c                | 105 +++++++++++++++++++++++++++
>>   fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
>>   fs/btrfs/volumes.h              |   7 +-
>>   include/uapi/linux/btrfs_tree.h |  20 +++++-
>>   6 files changed, 245 insertions(+), 14 deletions(-)
>>
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-16  0:22   ` Qu Wenruo
@ 2022-02-16  3:28     ` Zygo Blaxell
  2022-02-16  4:43       ` Paul Jones
  0 siblings, 1 reply; 22+ messages in thread
From: Zygo Blaxell @ 2022-02-16  3:28 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: kreijack, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, linux-btrfs

On Wed, Feb 16, 2022 at 08:22:55AM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/2/16 02:49, Goffredo Baroncelli wrote:
> > Hi Josef,
> > 
> > gentle ping...
> > 
> > few months ago you showed some interest in this patches set. Few of the
> > cc-ed person use this patch set.
> > 
> > I know that David showed some interest in the Anand approach (i.e. no
> > knobs, but an automatic behavior looking at the speed of the devices).
> > 
> > At the time when I tried this approach in the first attempts, I got the
> > complain that the kernel may not know the performance differences of the
> > disk (HDD vs SSD vs NVME vs ZONED disk...).
> 
> Sorry I didn't check the patches in details.
> 
> But I'm a little concerned about how to accurately determine the
> performance of a device.
> 
> If doing it automatically, there must be some (commonly very short) time
> spent to do the test.
> 
> In the very short time, I doubt we can even accurately got a full
> picture of a device (from sequential read/write speed to IOPS values)
> 
> For spinning disks, the sequential read/write speed even change based on
> their LBA address (as their physical location inside the plate can
> change their linear velocity, since the angular velocity is fixed).
> 
> And even for SSD, IOPS can var dramatically due to cache/controller
> difference.
> 
> 
> For a proper performance aware setup, I guess the only correct way to
> fetch performance characteristics is from the (advanced) user.
> 
> Or we may need to spent at least tens of minutes to do proper tests to
> get the result.
> 
> For regular end users, the difference between SSD and HDD is huge enough
> and simply preferring SSD for metadata is good enough.
> 
> But for more complex setup, like btrfs over LUKS over LVM (even crosses
> several physical devices), I doubt if it's even possible to fetch the
> correct performance characteristics automatically.

I agree with all of the above.

An automatic performance detection/configuration daemon can easily
set the metadata/data preference bits during mkfs, or monitor the
system's iostats and change preferences over time if this is really
useful and desired.  It doesn't have to run in the kernel.

> Thanks,
> Qu
> > 
> > Comments ?
> > 
> > BR
> > Goffredo
> > 
> > 
> > On 26/01/2022 21.32, Goffredo Baroncelli wrote:
> > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > 
> > > Hi all,
> > > 
> > > This patches set was born after some discussion between me, Zygo and
> > > Josef.
> > > Some details can be found in
> > > https://github.com/btrfs/btrfs-todo/issues/19.
> > > 
> > > Some further information about a real use case can be found in
> > > https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
> > > 
> > > 
> > > Recently Shafeeq told me that he is interested too, due to the
> > > performance gain.
> > > 
> > > In V8, revision I switched away from an ioctl API in favor of a sysfs
> > > API (
> > > see patch #2 and #3).
> > > 
> > > In V9, I renamed the sysfs interface from devinfo/type to
> > > devinfo/allocation_hint.
> > > Moreover I renamed dev_info->type to dev_info->flags.
> > > 
> > > In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED;
> > > I added
> > > another devinfo property, called major_minor. For now it is unused;
> > > the plan is to use this in btrfs-progs to go from the block device to
> > > the filesystem.
> > > First client will be "btrfs prop get/set allocation_hint", but I see
> > > others possible
> > > clients.
> > > Finally I made some cleanup, and remove the mount option
> > > 'allocation_hint'
> > > 
> > > In V11 I added a new 'feature' file
> > > /sys/fs/btrfs/features/allocation_hint
> > > to help the detection of the allocation_hint feature.
> > > 
> > > The idea behind this patches set, is to dedicate some disks (the
> > > fastest one)
> > > to the metadata chunk. My initial idea was a "soft" hint. However Zygo
> > > asked an option for a "strong" hint (== mandatory). The result is that
> > > each disk can be "tagged" by one of the following flags:
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> > > - BTRFS_DEV_ALLOCATION_DATA_PREFERRED
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > 
> > > When the chunk allocator search a disks to allocate a chunk, scans the
> > > disks
> > > in an order decided by these tags. For metadata, the order is:
> > > *_METADATA_ONLY
> > > *_METADATA_PREFERRED
> > > *_DATA_PREFERRED
> > > 
> > > The *_DATA_ONLY are not eligible from metadata chunk allocation.
> > > 
> > > For the data chunk, the order is reversed, and the *_METADATA_ONLY are
> > > excluded.
> > > 
> > > The exact sort logic is to sort first for the "tag", and then for the
> > > space
> > > available. If there is no space available, the next "tag" disks set are
> > > selected.
> > > 
> > > To set these tags, a new property called "allocation_hint" was created.
> > > There is a dedicated btrfs-prog patches set [[PATCH V11] btrfs-progs:
> > > allocation_hint disk property].
> > > 
> > > $ sudo mount /dev/loop0 /mnt/test-btrfs/
> > > $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i
> > > allocation_hint; done
> > > devid=1, path=/dev/loop0: allocation_hint=METADATA_PREFERRED
> > > devid=2, path=/dev/loop1: allocation_hint=METADATA_PREFERRED
> > > devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> > > devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> > > devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> > > devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
> > > devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> > > devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> > > 
> > > $ sudo ./btrfs fi us /mnt/test-btrfs/
> > > Overall:
> > >      Device size:           2.75GiB
> > >      Device allocated:           1.34GiB
> > >      Device unallocated:           1.41GiB
> > >      Device missing:             0.00B
> > >      Used:             400.89MiB
> > >      Free (estimated):           1.04GiB    (min: 1.04GiB)
> > >      Data ratio:                  2.00
> > >      Metadata ratio:              1.00
> > >      Global reserve:           3.25MiB    (used: 0.00B)
> > >      Multiple profiles:                no
> > > 
> > > Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
> > >     /dev/loop0     288.00MiB
> > >     /dev/loop1     288.00MiB
> > >     /dev/loop2     127.00MiB
> > >     /dev/loop3     127.00MiB
> > >     /dev/loop4     127.00MiB
> > >     /dev/loop5     127.00MiB
> > > 
> > > Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
> > >     /dev/loop1     256.00MiB
> > > 
> > > System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
> > >     /dev/loop0      32.00MiB
> > > 
> > > Unallocated:
> > >     /dev/loop0     704.00MiB
> > >     /dev/loop1     480.00MiB
> > >     /dev/loop2       1.00MiB
> > >     /dev/loop3       1.00MiB
> > >     /dev/loop4       1.00MiB
> > >     /dev/loop5       1.00MiB
> > >     /dev/loop6     128.00MiB
> > >     /dev/loop7     128.00MiB
> > > 
> > > # change the tag of some disks
> > > 
> > > $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
> > > $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
> > > $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
> > > 
> > > $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i
> > > allocation_hint; done
> > > devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
> > > devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
> > > devid=3, path=/dev/loop2: allocation_hint=DATA_PREFERRED
> > > devid=4, path=/dev/loop3: allocation_hint=DATA_PREFERRED
> > > devid=5, path=/dev/loop4: allocation_hint=DATA_PREFERRED
> > > devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
> > > devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
> > > devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
> > > 
> > > $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
> > > $ sudo ./btrfs fi us /mnt/test-btrfs/
> > > Overall:
> > >      Device size:           2.75GiB
> > >      Device allocated:         735.00MiB
> > >      Device unallocated:           2.03GiB
> > >      Device missing:             0.00B
> > >      Used:             400.72MiB
> > >      Free (estimated):           1.10GiB    (min: 1.10GiB)
> > >      Data ratio:                  2.00
> > >      Metadata ratio:              1.00
> > >      Global reserve:           3.25MiB    (used: 0.00B)
> > >      Multiple profiles:                no
> > > 
> > > Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
> > >     /dev/loop0     288.00MiB
> > >     /dev/loop1     288.00MiB
> > > 
> > > Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
> > >     /dev/loop5     127.00MiB
> > > 
> > > System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
> > >     /dev/loop7      32.00MiB
> > > 
> > > Unallocated:
> > >     /dev/loop0     736.00MiB
> > >     /dev/loop1     736.00MiB
> > >     /dev/loop2     128.00MiB
> > >     /dev/loop3     128.00MiB
> > >     /dev/loop4     128.00MiB
> > >     /dev/loop5       1.00MiB
> > >     /dev/loop6     128.00MiB
> > >     /dev/loop7      96.00MiB
> > > 
> > > 
> > > #As you can see all the metadata were placed on the disk loop5/loop7
> > > even if
> > > #the most empty one are loop0 and loop1.
> > > 
> > > 
> > > 
> > > Furher works:
> > > - the tool which show the space available should consider the tagging (eg
> > >    the disks tagged by _METADATA_ONLY should be excluded from the data
> > >    availability)
> > > - allow btrfs-prog to change the allocation_hint even when the filesystem
> > >    is not mounted.
> > > 
> > > In following emails, there will be the btrfs-progs patches set and the
> > > xfstest
> > > suite patches set.
> > > 
> > > 
> > > Comments are welcome
> > > BR
> > > G.Baroncelli
> > > 
> > > Revision:
> > > V11:
> > > - added the property /sys/fs/btrfs/features/allocation_hint
> > > 
> > > V10:
> > > - renamed two disk tags/constants:
> > >          - *_METADATA_PREFERRED -> *_METADATA_PREFERRED
> > >          - *_DATA_PREFERRED -> *_DATA_EFERRED
> > > - add /sys/fs/btrfs/$UUID/devinfo/$DEVID/major_minor
> > > - revise some commit description
> > > - refactored the code of gather_device_info(): one portion of this code
> > >    is moved in a separate function called sort_and_reduce_device_info()
> > >    for clarity.
> > > - removed the 'allocation_hint' mount option
> > > 
> > > V9:
> > > - rename dev_item->type to dev_item->flags
> > > - rename /sys/fs/btrfs/$UUID/devinfo/type -> allocation_hint
> > > 
> > > V8:
> > > - drop the ioctl API, instead use a sysfs one
> > > 
> > > V7:
> > > - make more room in the struct btrfs_ioctl_dev_properties up to 1K
> > > - leave in btrfs_tree.h only the costants
> > > - removed the mount option (sic)
> > > - correct an 'use before check' in the while loop (signaled
> > >    by Zygo)
> > > - add a 2nd sort to be sure that the device_info array is in the
> > >    expected order
> > > 
> > > V6:
> > > - add further values to the hints: add the possibility to
> > >    exclude a disk for a chunk type
> > > 
> > > 
> > > 
> > > Goffredo Baroncelli (7):
> > >    btrfs: add flags to give an hint to the chunk allocator
> > >    btrfs: export the device allocation_hint property in sysfs
> > >    btrfs: change the device allocation_hint property via sysfs
> > >    btrfs: add allocation_hint mode
> > >    btrfs: rename dev_item->type to dev_item->flags
> > >    btrfs: add major and minor to sysfs
> > >    Add /sys/fs/btrfs/features/allocation_hint
> > > 
> > >   fs/btrfs/ctree.h                |   4 +-
> > >   fs/btrfs/disk-io.c              |   2 +-
> > >   fs/btrfs/sysfs.c                | 105 +++++++++++++++++++++++++++
> > >   fs/btrfs/volumes.c              | 121 ++++++++++++++++++++++++++++++--
> > >   fs/btrfs/volumes.h              |   7 +-
> > >   include/uapi/linux/btrfs_tree.h |  20 +++++-
> > >   6 files changed, 245 insertions(+), 14 deletions(-)
> > > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-16  3:28     ` Zygo Blaxell
@ 2022-02-16  4:43       ` Paul Jones
  2022-02-25 20:18         ` Boris Burkov
  0 siblings, 1 reply; 22+ messages in thread
From: Paul Jones @ 2022-02-16  4:43 UTC (permalink / raw)
  To: Zygo Blaxell, Qu Wenruo
  Cc: kreijack, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Boris Burkov, linux-btrfs

> -----Original Message-----
> From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Sent: Wednesday, 16 February 2022 2:28 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Cc: kreijack@inwind.it; Josef Bacik <josef@toxicpanda.com>; David Sterba
> <dsterba@suse.cz>; Sinnamohideen Shafeeq <shafeeqs@panasas.com>;
> Paul Jones <paul@pauljones.id.au>; Boris Burkov <boris@bur.io>; linux-
> btrfs@vger.kernel.org
> Subject: Re: [PATCH 0/7][V11] btrfs: allocation_hint
> 
> On Wed, Feb 16, 2022 at 08:22:55AM +0800, Qu Wenruo wrote:
> >
> >
> > On 2022/2/16 02:49, Goffredo Baroncelli wrote:
> > > Hi Josef,
> > >
> > > gentle ping...
> > >
> > > few months ago you showed some interest in this patches set. Few of
> > > the cc-ed person use this patch set.
> > >
> > > I know that David showed some interest in the Anand approach (i.e.
> > > no knobs, but an automatic behavior looking at the speed of the devices).
> > >
> > > At the time when I tried this approach in the first attempts, I got
> > > the complain that the kernel may not know the performance
> > > differences of the disk (HDD vs SSD vs NVME vs ZONED disk...).
> >
> > Sorry I didn't check the patches in details.
> >
> > But I'm a little concerned about how to accurately determine the
> > performance of a device.
> >
> > If doing it automatically, there must be some (commonly very short)
> > time spent to do the test.
> >
> > In the very short time, I doubt we can even accurately got a full
> > picture of a device (from sequential read/write speed to IOPS values)
> >
> > For spinning disks, the sequential read/write speed even change based
> > on their LBA address (as their physical location inside the plate can
> > change their linear velocity, since the angular velocity is fixed).
> >
> > And even for SSD, IOPS can var dramatically due to cache/controller
> > difference.
> >
> >
> > For a proper performance aware setup, I guess the only correct way to
> > fetch performance characteristics is from the (advanced) user.
> >
> > Or we may need to spent at least tens of minutes to do proper tests to
> > get the result.
> >
> > For regular end users, the difference between SSD and HDD is huge
> > enough and simply preferring SSD for metadata is good enough.
> >
> > But for more complex setup, like btrfs over LUKS over LVM (even
> > crosses several physical devices), I doubt if it's even possible to
> > fetch the correct performance characteristics automatically.
> 
> I agree with all of the above.
> 
> An automatic performance detection/configuration daemon can easily set
> the metadata/data preference bits during mkfs, or monitor the system's
> iostats and change preferences over time if this is really useful and desired.
> It doesn't have to run in the kernel.

At the moment it's all manual anyway so a bit of a moot point until something is merged.
I've been using this patchset since the beginning and imho it's killer feature is being able to split data and metadata onto different speed devices. It massively speeds up large filesystems that are metadata heavy.

Paul.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-16  4:43       ` Paul Jones
@ 2022-02-25 20:18         ` Boris Burkov
  0 siblings, 0 replies; 22+ messages in thread
From: Boris Burkov @ 2022-02-25 20:18 UTC (permalink / raw)
  To: Paul Jones
  Cc: Zygo Blaxell, Qu Wenruo, kreijack, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq, linux-btrfs

On Wed, Feb 16, 2022 at 04:43:43AM +0000, Paul Jones wrote:
> > -----Original Message-----
> > From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> > Sent: Wednesday, 16 February 2022 2:28 PM
> > To: Qu Wenruo <quwenruo.btrfs@gmx.com>
> > Cc: kreijack@inwind.it; Josef Bacik <josef@toxicpanda.com>; David Sterba
> > <dsterba@suse.cz>; Sinnamohideen Shafeeq <shafeeqs@panasas.com>;
> > Paul Jones <paul@pauljones.id.au>; Boris Burkov <boris@bur.io>; linux-
> > btrfs@vger.kernel.org
> > Subject: Re: [PATCH 0/7][V11] btrfs: allocation_hint
> > 
> > On Wed, Feb 16, 2022 at 08:22:55AM +0800, Qu Wenruo wrote:
> > >
> > >
> > > On 2022/2/16 02:49, Goffredo Baroncelli wrote:
> > > > Hi Josef,
> > > >
> > > > gentle ping...
> > > >
> > > > few months ago you showed some interest in this patches set. Few of
> > > > the cc-ed person use this patch set.
> > > >
> > > > I know that David showed some interest in the Anand approach (i.e.
> > > > no knobs, but an automatic behavior looking at the speed of the devices).
> > > >
> > > > At the time when I tried this approach in the first attempts, I got
> > > > the complain that the kernel may not know the performance
> > > > differences of the disk (HDD vs SSD vs NVME vs ZONED disk...).
> > >
> > > Sorry I didn't check the patches in details.
> > >
> > > But I'm a little concerned about how to accurately determine the
> > > performance of a device.
> > >
> > > If doing it automatically, there must be some (commonly very short)
> > > time spent to do the test.
> > >
> > > In the very short time, I doubt we can even accurately got a full
> > > picture of a device (from sequential read/write speed to IOPS values)
> > >
> > > For spinning disks, the sequential read/write speed even change based
> > > on their LBA address (as their physical location inside the plate can
> > > change their linear velocity, since the angular velocity is fixed).
> > >
> > > And even for SSD, IOPS can var dramatically due to cache/controller
> > > difference.
> > >
> > >
> > > For a proper performance aware setup, I guess the only correct way to
> > > fetch performance characteristics is from the (advanced) user.
> > >
> > > Or we may need to spent at least tens of minutes to do proper tests to
> > > get the result.
> > >
> > > For regular end users, the difference between SSD and HDD is huge
> > > enough and simply preferring SSD for metadata is good enough.
> > >
> > > But for more complex setup, like btrfs over LUKS over LVM (even
> > > crosses several physical devices), I doubt if it's even possible to
> > > fetch the correct performance characteristics automatically.
> > 
> > I agree with all of the above.
> > 
> > An automatic performance detection/configuration daemon can easily set
> > the metadata/data preference bits during mkfs, or monitor the system's
> > iostats and change preferences over time if this is really useful and desired.
> > It doesn't have to run in the kernel.
> 
> At the moment it's all manual anyway so a bit of a moot point until something is merged.
> I've been using this patchset since the beginning and imho it's killer feature is being able to split data and metadata onto different speed devices. It massively speeds up large filesystems that are metadata heavy.
> 
> Paul.

I've also been using it for about a month and it has been great for
manually splitting metadata and data to separate devices for testing
some of the work I've been doing. Works as advertised on the tin, as far
as I'm concerned, and feels like a good addition.

Boris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
                   ` (7 preceding siblings ...)
  2022-02-15 18:49 ` [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
@ 2022-02-28 17:04 ` Josef Bacik
  2022-02-28 21:01   ` Goffredo Baroncelli
  8 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2022-02-28 17:04 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Hi all,
> 
> This patches set was born after some discussion between me, Zygo and Josef.
> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
> 
> Some further information about a real use case can be found in
> https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/
> 
> Recently Shafeeq told me that he is interested too, due to the performance gain.
> 
> In V8, revision I switched away from an ioctl API in favor of a sysfs API (
> see patch #2 and #3).
> 
> In V9, I renamed the sysfs interface from devinfo/type to devinfo/allocation_hint.
> Moreover I renamed dev_info->type to dev_info->flags.
> 
> In V10, I renamed the tag 'PREFERRED' from PREFERRED_X to X_PREFERRED; I added
> another devinfo property, called major_minor. For now it is unused;
> the plan is to use this in btrfs-progs to go from the block device to the filesystem.
> First client will be "btrfs prop get/set allocation_hint", but I see others possible
> clients.
> Finally I made some cleanup, and remove the mount option 'allocation_hint'
> 
> In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
> to help the detection of the allocation_hint feature.
> 

Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
down and done so we can get it merged.  The code overall looks good, I just have
two things I want changed

1. The sysfs file should use a string, not a magic number.  Think how
   /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
   allocation_hint, you cat allocation_hint and it says "none" or
   "metadata_only".  If you want to be fancy you can do exactly like
   queue/scheduler and show the list of options with [] around the selected
   option.

2. I hate the major_minor thing, can you do similar to what we do for devices/
   and symlink the correct device sysfs directory under devid?

Other than that the code is perfect.  I think these two changes make it solid
enough to merge.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-28 17:04 ` Josef Bacik
@ 2022-02-28 21:01   ` Goffredo Baroncelli
  2022-03-01 15:07     ` Josef Bacik
  0 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-02-28 21:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov, Goffredo Baroncelli

Hi Josef,

On 28/02/2022 18.04, Josef Bacik wrote:
> On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Hi all,
>>
[...

>> In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
>> to help the detection of the allocation_hint feature.
>>
> 
> Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
> down and done so we can get it merged.  The code overall looks good, I just have
> two things I want changed
> 
> 1. The sysfs file should use a string, not a magic number.  Think how
>     /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
>     allocation_hint, you cat allocation_hint and it says "none" or
>     "metadata_only".  If you want to be fancy you can do exactly like
>     queue/scheduler and show the list of options with [] around the selected
>     option.

Ok.
> 
> 2. I hate the major_minor thing, can you do similar to what we do for devices/
>     and symlink the correct device sysfs directory under devid?
> 
Ok, do you have any suggestion for the name ? what about bdev ?

> Other than that the code is perfect.  I think these two changes make it solid
> enough to merge.  Thanks,
> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-02-28 21:01   ` Goffredo Baroncelli
@ 2022-03-01 15:07     ` Josef Bacik
  2022-03-01 18:55       ` Goffredo Baroncelli
  0 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2022-03-01 15:07 UTC (permalink / raw)
  To: kreijack
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Mon, Feb 28, 2022 at 10:01:35PM +0100, Goffredo Baroncelli wrote:
> Hi Josef,
> 
> On 28/02/2022 18.04, Josef Bacik wrote:
> > On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
> > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > 
> > > Hi all,
> > > 
> [...
> 
> > > In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
> > > to help the detection of the allocation_hint feature.
> > > 
> > 
> > Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
> > down and done so we can get it merged.  The code overall looks good, I just have
> > two things I want changed
> > 
> > 1. The sysfs file should use a string, not a magic number.  Think how
> >     /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
> >     allocation_hint, you cat allocation_hint and it says "none" or
> >     "metadata_only".  If you want to be fancy you can do exactly like
> >     queue/scheduler and show the list of options with [] around the selected
> >     option.
> 
> Ok.
> > 
> > 2. I hate the major_minor thing, can you do similar to what we do for devices/
> >     and symlink the correct device sysfs directory under devid?
> > 
> Ok, do you have any suggestion for the name ? what about bdev ?
> 

You literally just add a link to the device kobj to the devid kobj.  If you look
at btrfs_sysfs_add_device(), you would do something like this (completely
untested and uncompiled)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 17389a42a3ab..cc570078bf7d 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1450,8 +1450,10 @@ void btrfs_sysfs_remove_device(struct btrfs_device *device)
        devices_kobj = device->fs_info->fs_devices->devices_kobj;
        ASSERT(devices_kobj);
 
-       if (device->bdev)
+       if (device->bdev) {
                sysfs_remove_link(devices_kobj, bdev_kobj(device->bdev)->name);
+               sysfs_remove_link(&device->devid_kobj, bdev_kobj(device->bdev)->name);
+       }
 
        if (device->devid_kobj.state_initialized) {
                kobject_del(&device->devid_kobj);
@@ -1628,10 +1630,23 @@ int btrfs_sysfs_add_device(struct btrfs_device *device)
 
        nofs_flag = memalloc_nofs_save();
 
+       init_completion(&device->kobj_unregister);
+       ret = kobject_init_and_add(&device->devid_kobj, &devid_ktype,
+                                  devinfo_kobj, "%llu", device->devid);
+       if (ret) {
+               kobject_put(&device->devid_kobj);
+               btrfs_warn(device->fs_info,
+                          "devinfo init for devid %llu failed: %d",
+                          device->devid, ret);
+       }
+
        if (device->bdev) {
                struct kobject *disk_kobj = bdev_kobj(device->bdev);
 
                ret = sysfs_create_link(devices_kobj, disk_kobj, disk_kobj->name);
+               if (!ret)
+                       ret = sysfs_create_link(device->devid_kobj, disk_kobj,
+                                               disk_kobj->name);
                if (ret) {
                        btrfs_warn(device->fs_info,
                                "creating sysfs device link for devid %llu failed: %d",
@@ -1640,16 +1655,6 @@ int btrfs_sysfs_add_device(struct btrfs_device *device)
                }
        }
 
-       init_completion(&device->kobj_unregister);
-       ret = kobject_init_and_add(&device->devid_kobj, &devid_ktype,
-                                  devinfo_kobj, "%llu", device->devid);
-       if (ret) {
-               kobject_put(&device->devid_kobj);
-               btrfs_warn(device->fs_info,
-                          "devinfo init for devid %llu failed: %d",
-                          device->devid, ret);
-       }
-
 out:
        memalloc_nofs_restore(nofs_flag);
        return ret;


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-01 15:07     ` Josef Bacik
@ 2022-03-01 18:55       ` Goffredo Baroncelli
  2022-03-01 21:43         ` Josef Bacik
  0 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-03-01 18:55 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 01/03/2022 16.07, Josef Bacik wrote:
> On Mon, Feb 28, 2022 at 10:01:35PM +0100, Goffredo Baroncelli wrote:
>> Hi Josef,
>>
>> On 28/02/2022 18.04, Josef Bacik wrote:
>>> On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>>
>>>> Hi all,
>>>>
>> [...
>>
>>>> In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
>>>> to help the detection of the allocation_hint feature.
>>>>
>>>
>>> Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
>>> down and done so we can get it merged.  The code overall looks good, I just have
>>> two things I want changed
>>>
>>> 1. The sysfs file should use a string, not a magic number.  Think how
>>>      /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
>>>      allocation_hint, you cat allocation_hint and it says "none" or
>>>      "metadata_only".  If you want to be fancy you can do exactly like
>>>      queue/scheduler and show the list of options with [] around the selected
>>>      option.
>>
>> Ok.
>>>
>>> 2. I hate the major_minor thing, can you do similar to what we do for devices/
>>>      and symlink the correct device sysfs directory under devid?
>>>
>> Ok, do you have any suggestion for the name ? what about bdev ?
>>
> 
> You literally just add a link to the device kobj to the devid kobj.  If you look
> at btrfs_sysfs_add_device(), you would do something like this (completely
> untested and uncompiled)

I will give an eye to your code; thanks. However my question was more basic.

Now we have:

.../btrfs/<uuid>/devinfo/<dev-nr>/major_minor

with the link, as you suggested, I think that we will have:

.../btrfs/<uuid>/devinfo/<dev-nr>/bdev -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
                                   ^^^^

(notice 'bdev', which is the name that I asked)

looking at your patch, it seems to me that the link will be named like the device name:

.../btrfs/<uuid>/devinfo/<dev-nr>/sdg3 -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
                                   ^^^^

which is quite convoluted as approach, because the code has to find a name that matches a device (sdg3), instead to look for a fixed name (bdev).

Because I know that every people has a strong, valid (and above all different !) opinion about the name, I want to ask it before issue another patches set.
For the record, I like 'bdev' only because I saw used (by bcache)

IMHO, the btrfs world had been simpler if devices/ sysfs directory was populated by the btrfs-dev-nr instead the device name



> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 17389a42a3ab..cc570078bf7d 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -1450,8 +1450,10 @@ void btrfs_sysfs_remove_device(struct btrfs_device *device)
>          devices_kobj = device->fs_info->fs_devices->devices_kobj;
>          ASSERT(devices_kobj);
>   
> -       if (device->bdev)
> +       if (device->bdev) {
>                  sysfs_remove_link(devices_kobj, bdev_kobj(device->bdev)->name);
> +               sysfs_remove_link(&device->devid_kobj, bdev_kobj(device->bdev)->name);
> +       }
>   
>          if (device->devid_kobj.state_initialized) {
>                  kobject_del(&device->devid_kobj);
> @@ -1628,10 +1630,23 @@ int btrfs_sysfs_add_device(struct btrfs_device *device)
>   
>          nofs_flag = memalloc_nofs_save();
>   
> +       init_completion(&device->kobj_unregister);
> +       ret = kobject_init_and_add(&device->devid_kobj, &devid_ktype,
> +                                  devinfo_kobj, "%llu", device->devid);
> +       if (ret) {
> +               kobject_put(&device->devid_kobj);
> +               btrfs_warn(device->fs_info,
> +                          "devinfo init for devid %llu failed: %d",
> +                          device->devid, ret);
> +       }
> +
>          if (device->bdev) {
>                  struct kobject *disk_kobj = bdev_kobj(device->bdev);
>   
>                  ret = sysfs_create_link(devices_kobj, disk_kobj, disk_kobj->name);
> +               if (!ret)
> +                       ret = sysfs_create_link(device->devid_kobj, disk_kobj,
> +                                               disk_kobj->name);
>                  if (ret) {
>                          btrfs_warn(device->fs_info,
>                                  "creating sysfs device link for devid %llu failed: %d",
> @@ -1640,16 +1655,6 @@ int btrfs_sysfs_add_device(struct btrfs_device *device)
>                  }
>          }
>   
> -       init_completion(&device->kobj_unregister);
> -       ret = kobject_init_and_add(&device->devid_kobj, &devid_ktype,
> -                                  devinfo_kobj, "%llu", device->devid);
> -       if (ret) {
> -               kobject_put(&device->devid_kobj);
> -               btrfs_warn(device->fs_info,
> -                          "devinfo init for devid %llu failed: %d",
> -                          device->devid, ret);
> -       }
> -
>   out:
>          memalloc_nofs_restore(nofs_flag);
>          return ret;
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-01 18:55       ` Goffredo Baroncelli
@ 2022-03-01 21:43         ` Josef Bacik
  2022-03-02 19:30           ` Goffredo Baroncelli
  0 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2022-03-01 21:43 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Tue, Mar 01, 2022 at 07:55:07PM +0100, Goffredo Baroncelli wrote:
> On 01/03/2022 16.07, Josef Bacik wrote:
> > On Mon, Feb 28, 2022 at 10:01:35PM +0100, Goffredo Baroncelli wrote:
> > > Hi Josef,
> > > 
> > > On 28/02/2022 18.04, Josef Bacik wrote:
> > > > On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
> > > > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > > > 
> > > > > Hi all,
> > > > > 
> > > [...
> > > 
> > > > > In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
> > > > > to help the detection of the allocation_hint feature.
> > > > > 
> > > > 
> > > > Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
> > > > down and done so we can get it merged.  The code overall looks good, I just have
> > > > two things I want changed
> > > > 
> > > > 1. The sysfs file should use a string, not a magic number.  Think how
> > > >      /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
> > > >      allocation_hint, you cat allocation_hint and it says "none" or
> > > >      "metadata_only".  If you want to be fancy you can do exactly like
> > > >      queue/scheduler and show the list of options with [] around the selected
> > > >      option.
> > > 
> > > Ok.
> > > > 
> > > > 2. I hate the major_minor thing, can you do similar to what we do for devices/
> > > >      and symlink the correct device sysfs directory under devid?
> > > > 
> > > Ok, do you have any suggestion for the name ? what about bdev ?
> > > 
> > 
> > You literally just add a link to the device kobj to the devid kobj.  If you look
> > at btrfs_sysfs_add_device(), you would do something like this (completely
> > untested and uncompiled)
> 
> I will give an eye to your code; thanks. However my question was more basic.
> 
> Now we have:
> 
> .../btrfs/<uuid>/devinfo/<dev-nr>/major_minor
> 
> with the link, as you suggested, I think that we will have:
> 
> .../btrfs/<uuid>/devinfo/<dev-nr>/bdev -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
>                                   ^^^^
> 
> (notice 'bdev', which is the name that I asked)
> 
> looking at your patch, it seems to me that the link will be named like the device name:
> 
> .../btrfs/<uuid>/devinfo/<dev-nr>/sdg3 -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
>                                   ^^^^
> 
> which is quite convoluted as approach, because the code has to find a name that matches a device (sdg3), instead to look for a fixed name (bdev).
> 
> Because I know that every people has a strong, valid (and above all different !) opinion about the name, I want to ask it before issue another patches set.
> For the record, I like 'bdev' only because I saw used (by bcache)
> 
> IMHO, the btrfs world had been simpler if devices/ sysfs directory was populated by the btrfs-dev-nr instead the device name
>

Ahh ok I see, you make a good point.  I agree it would have been better to have
the dev nrs in devices and then links in there, but here we are.

I think for now drop this patch from this series, since it's another bike
shedding opportunity and I'd rather get the core functionality in.  Do what I
asked in #1 and drop this patch from this series, follow up with a different
series if you feel strongly enough about it and that way we can have that
discussion in that thread and not hold up your feature.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-01 21:43         ` Josef Bacik
@ 2022-03-02 19:30           ` Goffredo Baroncelli
  2022-03-02 21:23             ` Josef Bacik
  0 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-03-02 19:30 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 01/03/2022 22.43, Josef Bacik wrote:
> On Tue, Mar 01, 2022 at 07:55:07PM +0100, Goffredo Baroncelli wrote:
>> On 01/03/2022 16.07, Josef Bacik wrote:
>>> On Mon, Feb 28, 2022 at 10:01:35PM +0100, Goffredo Baroncelli wrote:
>>>> Hi Josef,
>>>>
>>>> On 28/02/2022 18.04, Josef Bacik wrote:
>>>>> On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
>>>>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>> [...
>>>>
>>>>>> In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
>>>>>> to help the detection of the allocation_hint feature.
>>>>>>
>>>>>
>>>>> Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
>>>>> down and done so we can get it merged.  The code overall looks good, I just have
>>>>> two things I want changed
>>>>>
>>>>> 1. The sysfs file should use a string, not a magic number.  Think how
>>>>>       /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
>>>>>       allocation_hint, you cat allocation_hint and it says "none" or
>>>>>       "metadata_only".  If you want to be fancy you can do exactly like
>>>>>       queue/scheduler and show the list of options with [] around the selected
>>>>>       option.
>>>>
>>>> Ok.
>>>>>
>>>>> 2. I hate the major_minor thing, can you do similar to what we do for devices/
>>>>>       and symlink the correct device sysfs directory under devid?
>>>>>
>>>> Ok, do you have any suggestion for the name ? what about bdev ?
>>>>
>>>
>>> You literally just add a link to the device kobj to the devid kobj.  If you look
>>> at btrfs_sysfs_add_device(), you would do something like this (completely
>>> untested and uncompiled)
>>
>> I will give an eye to your code; thanks. However my question was more basic.
>>
>> Now we have:
>>
>> .../btrfs/<uuid>/devinfo/<dev-nr>/major_minor
>>
>> with the link, as you suggested, I think that we will have:
>>
>> .../btrfs/<uuid>/devinfo/<dev-nr>/bdev -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
>>                                    ^^^^
>>
>> (notice 'bdev', which is the name that I asked)
>>
>> looking at your patch, it seems to me that the link will be named like the device name:
>>
>> .../btrfs/<uuid>/devinfo/<dev-nr>/sdg3 -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
>>                                    ^^^^
>>
>> which is quite convoluted as approach, because the code has to find a name that matches a device (sdg3), instead to look for a fixed name (bdev).
>>
>> Because I know that every people has a strong, valid (and above all different !) opinion about the name, I want to ask it before issue another patches set.
>> For the record, I like 'bdev' only because I saw used (by bcache)
>>
>> IMHO, the btrfs world had been simpler if devices/ sysfs directory was populated by the btrfs-dev-nr instead the device name
>>
> 
> Ahh ok I see, you make a good point.  I agree it would have been better to have
> the dev nrs in devices and then links in there, but here we are.
> 
> I think for now drop this patch from this series, since it's another bike
> shedding opportunity and I'd rather get the core functionality in.  Do what I
> asked in #1 and drop this patch from this series, follow up with a different
> series if you feel strongly enough about it and that way we can have that
> discussion in that thread and not hold up your feature.  Thanks,
still be a problem:
- how go from the "dev name" (e.g. /dev/sdg3) to the sysfs field
   (e.g. /sys/btrfs/<uuid>/devinfo/<devid>/allocation_hint) ?

For simple filesystem (e.g. 1 disk), it is trivial (and not useful); for more complex
one (2, 3 disks) it is easy to make mistake.

btrfs-progs relies on major_minor; it is possible to used the BTRFS_IOC_DEV_INFO
but it requires CAP_ADMIN....


> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-02 19:30           ` Goffredo Baroncelli
@ 2022-03-02 21:23             ` Josef Bacik
  2022-03-03 19:01               ` Goffredo Baroncelli
  0 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2022-03-02 21:23 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Wed, Mar 02, 2022 at 08:30:22PM +0100, Goffredo Baroncelli wrote:
> On 01/03/2022 22.43, Josef Bacik wrote:
> > On Tue, Mar 01, 2022 at 07:55:07PM +0100, Goffredo Baroncelli wrote:
> > > On 01/03/2022 16.07, Josef Bacik wrote:
> > > > On Mon, Feb 28, 2022 at 10:01:35PM +0100, Goffredo Baroncelli wrote:
> > > > > Hi Josef,
> > > > > 
> > > > > On 28/02/2022 18.04, Josef Bacik wrote:
> > > > > > On Wed, Jan 26, 2022 at 09:32:07PM +0100, Goffredo Baroncelli wrote:
> > > > > > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > > > > > 
> > > > > > > Hi all,
> > > > > > > 
> > > > > [...
> > > > > 
> > > > > > > In V11 I added a new 'feature' file /sys/fs/btrfs/features/allocation_hint
> > > > > > > to help the detection of the allocation_hint feature.
> > > > > > > 
> > > > > > 
> > > > > > Sorry Goffredo I dropped the ball on this, lets try and get this shit nailed
> > > > > > down and done so we can get it merged.  The code overall looks good, I just have
> > > > > > two things I want changed
> > > > > > 
> > > > > > 1. The sysfs file should use a string, not a magic number.  Think how
> > > > > >       /sys/block/<dev>/queue/scheduler works.  You echo "metadata_only" >
> > > > > >       allocation_hint, you cat allocation_hint and it says "none" or
> > > > > >       "metadata_only".  If you want to be fancy you can do exactly like
> > > > > >       queue/scheduler and show the list of options with [] around the selected
> > > > > >       option.
> > > > > 
> > > > > Ok.
> > > > > > 
> > > > > > 2. I hate the major_minor thing, can you do similar to what we do for devices/
> > > > > >       and symlink the correct device sysfs directory under devid?
> > > > > > 
> > > > > Ok, do you have any suggestion for the name ? what about bdev ?
> > > > > 
> > > > 
> > > > You literally just add a link to the device kobj to the devid kobj.  If you look
> > > > at btrfs_sysfs_add_device(), you would do something like this (completely
> > > > untested and uncompiled)
> > > 
> > > I will give an eye to your code; thanks. However my question was more basic.
> > > 
> > > Now we have:
> > > 
> > > .../btrfs/<uuid>/devinfo/<dev-nr>/major_minor
> > > 
> > > with the link, as you suggested, I think that we will have:
> > > 
> > > .../btrfs/<uuid>/devinfo/<dev-nr>/bdev -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
> > >                                    ^^^^
> > > 
> > > (notice 'bdev', which is the name that I asked)
> > > 
> > > looking at your patch, it seems to me that the link will be named like the device name:
> > > 
> > > .../btrfs/<uuid>/devinfo/<dev-nr>/sdg3 -> ../../../../devices/pci0000:00/0000:00:01.2/0000:01:00.1/ata6/host5/target5:0:0/5:0:0:0/block/sdg/sdg3
> > >                                    ^^^^
> > > 
> > > which is quite convoluted as approach, because the code has to find a name that matches a device (sdg3), instead to look for a fixed name (bdev).
> > > 
> > > Because I know that every people has a strong, valid (and above all different !) opinion about the name, I want to ask it before issue another patches set.
> > > For the record, I like 'bdev' only because I saw used (by bcache)
> > > 
> > > IMHO, the btrfs world had been simpler if devices/ sysfs directory was populated by the btrfs-dev-nr instead the device name
> > > 
> > 
> > Ahh ok I see, you make a good point.  I agree it would have been better to have
> > the dev nrs in devices and then links in there, but here we are.
> > 
> > I think for now drop this patch from this series, since it's another bike
> > shedding opportunity and I'd rather get the core functionality in.  Do what I
> > asked in #1 and drop this patch from this series, follow up with a different
> > series if you feel strongly enough about it and that way we can have that
> > discussion in that thread and not hold up your feature.  Thanks,
> still be a problem:
> - how go from the "dev name" (e.g. /dev/sdg3) to the sysfs field
>   (e.g. /sys/btrfs/<uuid>/devinfo/<devid>/allocation_hint) ?
> 
> For simple filesystem (e.g. 1 disk), it is trivial (and not useful); for more complex
> one (2, 3 disks) it is easy to make mistake.
> 
> btrfs-progs relies on major_minor; it is possible to used the BTRFS_IOC_DEV_INFO
> but it requires CAP_ADMIN....
> 

Well this just made me go look at the code and realize you don't require
CAP_ADMIN for the sysfs knob, which we're going to need.  So using
BTRFS_IOC_DEV_INFO shouldn't be a problem.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-02 21:23             ` Josef Bacik
@ 2022-03-03 19:01               ` Goffredo Baroncelli
  2022-03-04 14:56                 ` Josef Bacik
  0 siblings, 1 reply; 22+ messages in thread
From: Goffredo Baroncelli @ 2022-03-03 19:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On 02/03/2022 22.23, Josef Bacik wrote:
> On Wed, Mar 02, 2022 at 08:30:22PM +0100, Goffredo Baroncelli wrote:
[...]
>>
>> For simple filesystem (e.g. 1 disk), it is trivial (and not useful); for more complex
>> one (2, 3 disks) it is easy to make mistake.
>>
>> btrfs-progs relies on major_minor; it is possible to used the BTRFS_IOC_DEV_INFO
>> but it requires CAP_ADMIN....
>>
> 
> Well this just made me go look at the code and realize you don't require
> CAP_ADMIN for the sysfs knob, which we're going to need.  So using
> BTRFS_IOC_DEV_INFO shouldn't be a problem.  Thanks,

I am not sure to be understood completely your answer, so I recap what I am doing:
- replace the "int" interface in favor of a "string" interface (not "echo 123 > allocaton_hint"
   but "echo DATA_ONLY > allocaton_hint") [DONE]
- remove the "kernel" patch related to "major_minor" [DONE]
- update the btrfs-progs patch to use BTRFS_IOC_DEV_INFO instead of <devid>/major_minor [WIP]


This will have the following consequences:
- any user is still capable to read <devid>/allocaton_hint
- only root is able to use "btrfs prop get allocaton_hint <devname>" (before any user)
- only root is able to update <devid>/allocaton_hint
- only root is able to use "btrfs prop set allocation_hint <devname>"

The 2nd point may be relaxed allowing to use BTRFS_IOC_DEV_INFO even to not root user (
I don't think that there is any sensitive data exported by BTRFS_IOC_DEV_INFO); but this will
be done as separate patch.

> 
> Josef
BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/7][V11] btrfs: allocation_hint
  2022-03-03 19:01               ` Goffredo Baroncelli
@ 2022-03-04 14:56                 ` Josef Bacik
  0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2022-03-04 14:56 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq,
	Paul Jones, Boris Burkov

On Thu, Mar 03, 2022 at 08:01:00PM +0100, Goffredo Baroncelli wrote:
> On 02/03/2022 22.23, Josef Bacik wrote:
> > On Wed, Mar 02, 2022 at 08:30:22PM +0100, Goffredo Baroncelli wrote:
> [...]
> > > 
> > > For simple filesystem (e.g. 1 disk), it is trivial (and not useful); for more complex
> > > one (2, 3 disks) it is easy to make mistake.
> > > 
> > > btrfs-progs relies on major_minor; it is possible to used the BTRFS_IOC_DEV_INFO
> > > but it requires CAP_ADMIN....
> > > 
> > 
> > Well this just made me go look at the code and realize you don't require
> > CAP_ADMIN for the sysfs knob, which we're going to need.  So using
> > BTRFS_IOC_DEV_INFO shouldn't be a problem.  Thanks,
> 
> I am not sure to be understood completely your answer, so I recap what I am doing:
> - replace the "int" interface in favor of a "string" interface (not "echo 123 > allocaton_hint"
>   but "echo DATA_ONLY > allocaton_hint") [DONE]
> - remove the "kernel" patch related to "major_minor" [DONE]
> - update the btrfs-progs patch to use BTRFS_IOC_DEV_INFO instead of <devid>/major_minor [WIP]
> 
> 
> This will have the following consequences:
> - any user is still capable to read <devid>/allocaton_hint
> - only root is able to use "btrfs prop get allocaton_hint <devname>" (before any user)

Ah yeah I see how this is annoying, oh well it can't be helped.

> - only root is able to update <devid>/allocaton_hint
> - only root is able to use "btrfs prop set allocation_hint <devname>"
> 
> The 2nd point may be relaxed allowing to use BTRFS_IOC_DEV_INFO even to not root user (
> I don't think that there is any sensitive data exported by BTRFS_IOC_DEV_INFO); but this will
> be done as separate patch.

Agreed, I think that's a reasonable thing, and yes separately please.

This is all good, thanks Goffredo!

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-03-04 14:57 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26 20:32 [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 1/7] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 2/7] btrfs: export the device allocation_hint property in sysfs Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 3/7] btrfs: change the device allocation_hint property via sysfs Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 4/7] btrfs: add allocation_hint mode Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 5/7] btrfs: rename dev_item->type to dev_item->flags Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 6/7] btrfs: add major and minor to sysfs Goffredo Baroncelli
2022-01-26 20:32 ` [PATCH 7/7] Add /sys/fs/btrfs/features/allocation_hint Goffredo Baroncelli
2022-02-15 18:49 ` [PATCH 0/7][V11] btrfs: allocation_hint Goffredo Baroncelli
2022-02-16  0:22   ` Qu Wenruo
2022-02-16  3:28     ` Zygo Blaxell
2022-02-16  4:43       ` Paul Jones
2022-02-25 20:18         ` Boris Burkov
2022-02-28 17:04 ` Josef Bacik
2022-02-28 21:01   ` Goffredo Baroncelli
2022-03-01 15:07     ` Josef Bacik
2022-03-01 18:55       ` Goffredo Baroncelli
2022-03-01 21:43         ` Josef Bacik
2022-03-02 19:30           ` Goffredo Baroncelli
2022-03-02 21:23             ` Josef Bacik
2022-03-03 19:01               ` Goffredo Baroncelli
2022-03-04 14:56                 ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).