linux-btrfs.vger.kernel.org archive mirror
* [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
@ 2021-10-24 15:31 Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give a hint to the chunk allocator Goffredo Baroncelli
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-10-24 15:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Hi all,

This patch set was born after some discussion among me, Zygo and Josef.
Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.

Some further information about a real use case can be found in
https://lore.kernel.org/linux-btrfs/20210116002533.GE31381@hungrycats.org/

Recently Shafeeq told me that he is interested too, due to the performance gain.

In this revision I switched away from an ioctl API in favor of a sysfs API
(see patches #2 and #3).

The idea behind this patch set is to dedicate some disks (the fastest ones)
to the metadata chunks. My initial idea was a "soft" hint. However, Zygo
asked for an option for a "strong" hint (== mandatory). The result is that
each disk can be "tagged" with one of the following flags:
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
- BTRFS_DEV_ALLOCATION_DATA_ONLY

When the chunk allocator searches for a disk on which to allocate a chunk,
it scans the disks in an order decided by these tags. For metadata, the
order is:
*_METADATA_ONLY
*_PREFERRED_METADATA
*_PREFERRED_DATA

Disks tagged *_DATA_ONLY are not eligible for metadata chunk allocation.

For data chunks, the order is reversed, and the *_METADATA_ONLY disks are
excluded.

The exact sort logic is to sort first by the "tag", and then by the
available space. If there is no space available, the next "tag" set of
disks is selected.

To set these tags, a new property called "allocation_hint" was created.
There is a dedicated btrfs-progs patch set [[PATCH V5] btrfs-progs:
allocation_hint disk property].

$ sudo mount /dev/loop0 /mnt/test-btrfs/
$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=PREFERRED_METADATA
devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=DATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:           1.34GiB
    Device unallocated:           1.41GiB
    Device missing:             0.00B
    Used:             400.89MiB
    Free (estimated):           1.04GiB    (min: 1.04GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB
   /dev/loop2     127.00MiB
   /dev/loop3     127.00MiB
   /dev/loop4     127.00MiB
   /dev/loop5     127.00MiB

Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
   /dev/loop1     256.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop0      32.00MiB

Unallocated:
   /dev/loop0     704.00MiB
   /dev/loop1     480.00MiB
   /dev/loop2       1.00MiB
   /dev/loop3       1.00MiB
   /dev/loop4       1.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7     128.00MiB

# change the tag of some disks

$ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
$ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY

$ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY

$ sudo btrfs bal start --full-balance /mnt/test-btrfs/
$ sudo ./btrfs fi us /mnt/test-btrfs/
Overall:
    Device size:           2.75GiB
    Device allocated:         735.00MiB
    Device unallocated:           2.03GiB
    Device missing:             0.00B
    Used:             400.72MiB
    Free (estimated):           1.10GiB    (min: 1.10GiB)
    Data ratio:                  2.00
    Metadata ratio:              1.00
    Global reserve:           3.25MiB    (used: 0.00B)
    Multiple profiles:                no

Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
   /dev/loop0     288.00MiB
   /dev/loop1     288.00MiB

Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
   /dev/loop5     127.00MiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/loop7      32.00MiB

Unallocated:
   /dev/loop0     736.00MiB
   /dev/loop1     736.00MiB
   /dev/loop2     128.00MiB
   /dev/loop3     128.00MiB
   /dev/loop4     128.00MiB
   /dev/loop5       1.00MiB
   /dev/loop6     128.00MiB
   /dev/loop7      96.00MiB


#As you can see, all the metadata was placed on disks loop5/loop7 even
#though the emptiest ones are loop0 and loop1.



TODO:
- more tests
- the tools which show the available space should consider the tagging (e.g.
  the disks tagged _METADATA_ONLY should be excluded from the data
  availability)


Comments are welcome
BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Revision:
V8:
- drop the ioctl API, instead use a sysfs one

V7:
- make more room in the struct btrfs_ioctl_dev_properties up to 1K
- leave only the constants in btrfs_tree.h
- removed the mount option (sic)
- correct a 'use before check' in the while loop (reported
  by Zygo)
- add a 2nd sort to be sure that the device_info array is in the
  expected order

V6:
- add further hint values: add the possibility to
  exclude a disk from a chunk type

Goffredo Baroncelli (4):
  btrfs: add flags to give a hint to the chunk allocator
  btrfs: export dev_item.type in
    /sys/fs/btrfs/<uuid>/devinfo/<devid>/type
  btrfs: change the DEV_ITEM 'type' field via sysfs
  btrfs: add allocator_hint mode

 fs/btrfs/sysfs.c                |  65 +++++++++++++++++++++
 fs/btrfs/volumes.c              | 100 +++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h              |   4 +-
 include/uapi/linux/btrfs_tree.h |  14 +++++
 4 files changed, 180 insertions(+), 3 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/4] btrfs: add flags to give a hint to the chunk allocator
  2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
@ 2021-10-24 15:31 ` Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-10-24 15:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the following flags to give a hint about which chunk type should be
allocated on which disk.
The following flags are created:

- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
  preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
  preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
  only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
  only data chunk allowed

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index ccdb40fe40dc..b45322b347c2 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -361,6 +361,20 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT	3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA	(0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA	(1ULL)
+/* only metadata chunks are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY	(2ULL)
+/* only data chunks are allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY		(3ULL)
+/* 4..7 are unused values */
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type
  2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give a hint to the chunk allocator Goffredo Baroncelli
@ 2021-10-24 15:31 ` Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs Goffredo Baroncelli
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-10-24 15:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/sysfs.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 9d1d140118ff..402b98acf2aa 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1510,6 +1510,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
 
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+					    struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, error_stats),
 	BTRFS_ATTR_PTR(devid, in_fs_metadata),
@@ -1517,6 +1527,7 @@ static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, scrub_speed_max),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, type),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs
  2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give a hint to the chunk allocator Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
@ 2021-10-24 15:31 ` Goffredo Baroncelli
  2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
  2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
  4 siblings, 0 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-10-24 15:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

---
 fs/btrfs/sysfs.c   | 56 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h |  3 ++-
 3 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 402b98acf2aa..2eb74656f61f 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1518,7 +1518,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
 
 	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
 }
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+				 struct kobj_attribute *a,
+				 const char *buf, size_t len)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_device *device;
+	int ret;
+	struct btrfs_trans_handle *trans;
+
+	u64 type, prev_type;
+
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+	fs_info = device->fs_info;
+	if (!fs_info)
+		return -EPERM;
+
+	root = fs_info->chunk_root;
+	if (sb_rdonly(fs_info->sb))
+		return -EROFS;
+
+	ret = kstrtou64(buf, 0, &type);
+	if (ret < 0)
+		return -EINVAL;
+
+	/* for now, allow to touch only the 'allocation hint' bits */
+	if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+		return -EINVAL;
+
+	trans = btrfs_start_transaction(root, 1);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = type;
+
+	ret = btrfs_update_device(trans, device);
+
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto abort;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		goto abort;
+
+	return len;
+abort:
+	device->type = prev_type;
+	return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
 
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, error_stats),
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19c780242e12..8ac99771f43c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2836,7 +2836,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f77f869dfd2c..b8250f29df6e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -608,5 +608,6 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
-
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+                                       struct btrfs_device *device);
 #endif
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/4] btrfs: add allocator_hint mode
  2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2021-10-24 15:31 ` [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs Goffredo Baroncelli
@ 2021-10-24 15:31 ` Goffredo Baroncelli
  2021-12-17 15:58   ` Hans van Kranenburg
  2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
  4 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-10-24 15:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

When this mode is enabled, the chunk allocation policy is modified as
follows.

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that the disk is preferred for chunks of
type X (chunks of the other type may be allowed when space is low)
- ALLOCATION_X_ONLY means that the disk is used *only* for chunks of type
X. This also implies that it is a preferred choice.

Each time the allocator allocates a chunk of type X, it first takes the
disks tagged as ALLOCATION_X_ONLY; if the space is not enough, it also uses
the disks tagged as ALLOCATION_PREFERRED_X; if the space is still not
enough, it also uses the disks tagged as ALLOCATION_PREFERRED_Y, where Y is
the other chunk type (i.e. not X). The disks tagged as ALLOCATION_Y_ONLY
are never used for chunks of type X.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  1 +
 2 files changed, 98 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8ac99771f43c..7ee9c6e7bd44 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	},
 };
 
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+	/* the other values are set to 0 */
+};
+
+
 const char *btrfs_bg_type_to_raid_name(u64 flags)
 {
 	const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 }
 
 /*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
  */
 static int btrfs_cmp_device_info(const void *a, const void *b)
 {
 	const struct btrfs_device_info *di_a = a;
 	const struct btrfs_device_info *di_b = b;
 
+	if (di_a->alloc_hint > di_b->alloc_hint)
+		return -1;
+	if (di_a->alloc_hint < di_b->alloc_hint)
+		return 1;
 	if (di_a->max_avail > di_b->max_avail)
 		return -1;
 	if (di_a->max_avail < di_b->max_avail)
@@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int hint;
+	int i;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+
+		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+			/*
+			 * for a mixed block group, set all the
+			 * alloc_hint fields to the same value, so
+			 * the sorting is not affected
+			 */
+			devices_info[ndevs].alloc_hint = 0;
+		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+			/*
+			 * skip BTRFS_DEV_ALLOCATION_METADATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+				continue;
+			/*
+			 * if a data chunk must be allocated,
+			 * sort also by hint (data disk
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+		} else { /* BTRFS_BLOCK_GROUP_METADATA */
+			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+			/*
+			 * skip BTRFS_DEV_ALLOCATION_DATA_ONLY disks
+			 */
+			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+				continue;
+			/*
+			 * if a metadata chunk must be allocated,
+			 * sort also by hint (metadata hint
+			 * higher priority)
+			 */
+			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+		}
+
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	/*
+	 * no devices available
+	 */
+	if (!ndevs)
+		return 0;
+
 	/*
 	 * now sort the devices by hole size / available space
 	 */
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * select the minimum set of disks grouped by hint that
+	 * can host the chunk
+	 */
+	ndevs = 0;
+	while (ndevs < ctl->ndevs) {
+		hint = devices_info[ndevs++].alloc_hint;
+		while (ndevs < ctl->ndevs &&
+		       devices_info[ndevs].alloc_hint == hint)
+				ndevs++;
+		if (ndevs >= ctl->devs_min)
+			break;
+	}
+
+	BUG_ON(ndevs > ctl->ndevs);
+	ctl->ndevs = ndevs;
+
+	/*
+	 * the next layers require the devices_info ordered by
+	 * max_avail. If we are returning two (or more) different
+	 * groups of alloc_hint, this is not always true. So sort
+	 * them again.
+	 */
+
+	for (i = 0 ; i < ndevs ; i++)
+		devices_info[i].alloc_hint = 0;
+
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b8250f29df6e..37eb37b533c5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -369,6 +369,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int alloc_hint;
 };
 
 struct btrfs_raid_attr {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* RE: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
@ 2021-12-13  9:39 ` Paul Jones
  2021-12-13 19:54   ` Goffredo Baroncelli
  4 siblings, 1 reply; 32+ messages in thread
From: Paul Jones @ 2021-12-13  9:39 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@tiscali.it>
> Sent: Monday, 25 October 2021 2:31 AM
> To: linux-btrfs@vger.kernel.org
> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik
> <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen
> Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli
> <kreijack@inwind.it>
> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
> 
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> [...]


I've been running this patch series since about V4; it works really well. It would be nice to have it merged eventually.

Tested-by: Paul Jones <paul@pauljones.id.au>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
@ 2021-12-13 19:54   ` Goffredo Baroncelli
  2021-12-13 21:15     ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-13 19:54 UTC (permalink / raw)
  To: Josef Bacik, David Sterba
  Cc: Sinnamohideen Shafeeq, Goffredo Baroncelli, Paul Jones,
	linux-btrfs, Zygo Blaxell

Gentle ping :-)

Is anyone among the main developers interested in supporting this patch?

I am open to improving it if required.

BR
G.Baroncelli

On 12/13/21 10:39, Paul Jones wrote:
>> -----Original Message-----
>> From: Goffredo Baroncelli <kreijack@tiscali.it>
>> Sent: Monday, 25 October 2021 2:31 AM
>> To: linux-btrfs@vger.kernel.org
>> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; Josef Bacik
>> <josef@toxicpanda.com>; David Sterba <dsterba@suse.cz>; Sinnamohideen
>> Shafeeq <shafeeqs@panasas.com>; Goffredo Baroncelli
>> <kreijack@inwind.it>
>> Subject: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
>>
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> Hi all,
>>
>> This patches set was born after some discussion between me, Zygo and
>> Josef.
>> Some details can be found in https://github.com/btrfs/btrfs-todo/issues/19.
>>
>> Some further information about a real use case can be found in
>> https://lore.kernel.org/linux-
>> btrfs/20210116002533.GE31381@hungrycats.org/
>>
>> Reently Shafeeq told me that he is interested too, due to the performance
>> gain.
>>
>> In this revision I switched away from an ioctl API in favor of a sysfs API ( see
>> patch #2 and #3).
>>
>> The idea behind this patches set, is to dedicate some disks (the fastest one)
>> to the metadata chunk. My initial idea was a "soft" hint. However Zygo asked
>> an option for a "strong" hint (== mandatory). The result is that each disk can
>> be "tagged" by one of the following flags:
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>>
>> When the chunk allocator search a disks to allocate a chunk, scans the disks in
>> an order decided by these tags. For metadata, the order is:
>> *_METADATA_ONLY
>> *_PREFERRED_METADATA
>> *_PREFERRED_DATA
>>
>> The *_DATA_ONLY are not eligible from metadata chunk allocation.
>>
>> For the data chunk, the order is reversed, and the *_METADATA_ONLY are
>> excluded.
>>
>> The exact sort logic is to sort first for the "tag", and then for the space
>> available. If there is no space available, the next "tag" disks set are selected.
>>
>> To set these tags, a new property called "allocation_hint" was created.
>> There is a dedicated btrfs-prog patches set [[PATCH V5] btrfs-progs:
>> allocation_hint disk property].
>>
>> $ sudo mount /dev/loop0 /mnt/test-btrfs/ $ for i in /dev/loop[0-9]; do sudo
>> ./btrfs prop get $i allocation_hint; done devid=1, path=/dev/loop0:
>> allocation_hint=PREFERRED_METADATA
>> devid=2, path=/dev/loop1: allocation_hint=PREFERRED_METADATA
>> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA devid=4,
>> path=/dev/loop3: allocation_hint=PREFERRED_DATA devid=5,
>> path=/dev/loop4: allocation_hint=PREFERRED_DATA devid=6,
>> path=/dev/loop5: allocation_hint=DATA_ONLY devid=7, path=/dev/loop6:
>> allocation_hint=METADATA_ONLY devid=8, path=/dev/loop7:
>> allocation_hint=METADATA_ONLY
>>
>> $ sudo ./btrfs fi us /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:           1.34GiB
>>      Device unallocated:           1.41GiB
>>      Device missing:             0.00B
>>      Used:             400.89MiB
>>      Free (estimated):           1.04GiB    (min: 1.04GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:542.00MiB, Used:200.25MiB (36.95%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>     /dev/loop2     127.00MiB
>>     /dev/loop3     127.00MiB
>>     /dev/loop4     127.00MiB
>>     /dev/loop5     127.00MiB
>>
>> Metadata,single: Size:256.00MiB, Used:384.00KiB (0.15%)
>>     /dev/loop1     256.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop0      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     704.00MiB
>>     /dev/loop1     480.00MiB
>>     /dev/loop2       1.00MiB
>>     /dev/loop3       1.00MiB
>>     /dev/loop4       1.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7     128.00MiB
>>
>> # change the tag of some disks
>>
>> $ sudo ./btrfs prop set /dev/loop0 allocation_hint DATA_ONLY
>> $ sudo ./btrfs prop set /dev/loop1 allocation_hint DATA_ONLY
>> $ sudo ./btrfs prop set /dev/loop5 allocation_hint METADATA_ONLY
>>
>> $ for i in /dev/loop[0-9]; do sudo ./btrfs prop get $i allocation_hint; done
>> devid=1, path=/dev/loop0: allocation_hint=DATA_ONLY
>> devid=2, path=/dev/loop1: allocation_hint=DATA_ONLY
>> devid=3, path=/dev/loop2: allocation_hint=PREFERRED_DATA
>> devid=4, path=/dev/loop3: allocation_hint=PREFERRED_DATA
>> devid=5, path=/dev/loop4: allocation_hint=PREFERRED_DATA
>> devid=6, path=/dev/loop5: allocation_hint=METADATA_ONLY
>> devid=7, path=/dev/loop6: allocation_hint=METADATA_ONLY
>> devid=8, path=/dev/loop7: allocation_hint=METADATA_ONLY
>>
>> $ sudo btrfs bal start --full-balance /mnt/test-btrfs/
>> $ sudo ./btrfs fi us /mnt/test-btrfs/
>> Overall:
>>      Device size:           2.75GiB
>>      Device allocated:         735.00MiB
>>      Device unallocated:           2.03GiB
>>      Device missing:             0.00B
>>      Used:             400.72MiB
>>      Free (estimated):           1.10GiB    (min: 1.10GiB)
>>      Data ratio:                  2.00
>>      Metadata ratio:              1.00
>>      Global reserve:           3.25MiB    (used: 0.00B)
>>      Multiple profiles:                no
>>
>> Data,RAID1: Size:288.00MiB, Used:200.19MiB (69.51%)
>>     /dev/loop0     288.00MiB
>>     /dev/loop1     288.00MiB
>>
>> Metadata,single: Size:127.00MiB, Used:336.00KiB (0.26%)
>>     /dev/loop5     127.00MiB
>>
>> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>>     /dev/loop7      32.00MiB
>>
>> Unallocated:
>>     /dev/loop0     736.00MiB
>>     /dev/loop1     736.00MiB
>>     /dev/loop2     128.00MiB
>>     /dev/loop3     128.00MiB
>>     /dev/loop4     128.00MiB
>>     /dev/loop5       1.00MiB
>>     /dev/loop6     128.00MiB
>>     /dev/loop7      96.00MiB
>>
>>
>> # As you can see, all the metadata was placed on disks loop5/loop7 even
>> # though the emptiest ones are loop0 and loop1.
>>
>>
>>
>> TODO:
>> - more tests
>> - the tool which shows the available space should consider the tagging (e.g.
>>    disks tagged _METADATA_ONLY should be excluded from the data
>>    availability)
>>
>>
>> Comments are welcome
>> BR
>> G.Baroncelli
> 
> 
> I've been running this patch series since about V4; it works really well. It would be nice to have it merged eventually.
> 
> Tested-by: Paul Jones <paul@pauljones.id.au>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 19:54   ` Goffredo Baroncelli
@ 2021-12-13 21:15     ` Josef Bacik
  2021-12-13 22:49       ` Zygo Blaxell
                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Josef Bacik @ 2021-12-13 21:15 UTC (permalink / raw)
  To: kreijack
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs,
	Zygo Blaxell

On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> Gentle ping :-)
> 
> Is anyone of the main developers interested in supporting this patch?
> 
> I am open to improving it if required.
> 

Sorry I missed this go by.  I like the interface, we don't have a use for
device->type yet, so this fits nicely.

I don't see the btrfs-progs patches in my inbox, and these don't apply, so
you'll definitely need to refresh for a proper review, but looking at these
patches they seem sane enough, and I like the interface.  I'd like to hear
Zygo's opinion as well.

If we're going to use device->type for this, and since we don't have a user of
device->type, I'd also like you to go ahead and re-name ->type to
->allocation_policy, that way it's clear what we're using it for now.

I'd also like some xfstests to validate the behavior so we're sure we're testing
this.  I'd want 1 test to just test the mechanics, like mkfs with different
policies and validate they're set right, change policies, add/remove disks with
different policies.

Then a second test to do something like fsstress with each set of allocation
policies to validate that we did actually allocate from the correct disks.  For
this please also test with compression on to make sure the test validation works
for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of
data == 5 GiB of data usage, as compression could give you a different value).
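
A sketch of the check such a test might do instead of byte accounting (the
sample "btrfs fi usage" device line and the allowed-device list here are made
up for illustration, not taken from any real test):

```shell
# With compression on, bytes written != bytes allocated, so rather than
# comparing sizes, verify that every data chunk sits on a device that is
# actually allowed to hold data.
usage='/dev/loop5 127.00MiB'      # hypothetical device line from a "Data,RAID1:" section
allowed='/dev/loop0 /dev/loop1'   # devices tagged for data (hypothetical)
dev=${usage%% *}                  # first field: the device path
case " $allowed " in
  *" $dev "*) echo "ok: data on $dev" ;;
  *)          echo "FAIL: data chunk on $dev" ;;
esac
```

The same loop would run over every device line of every data block group.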

With that in place I think this is the correct way to implement this feature.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 21:15     ` Josef Bacik
@ 2021-12-13 22:49       ` Zygo Blaxell
  2021-12-14 14:31         ` Josef Bacik
  2021-12-14 19:03         ` Goffredo Baroncelli
  2021-12-14  1:03       ` Sinnamohideen, Shafeeq
  2021-12-14 18:53       ` Goffredo Baroncelli
  2 siblings, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-13 22:49 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > Gentle ping :-)
> > 
> > Is anyone of the main developers interested in supporting this patch?
> > 
> > I am open to improving it if required.
> > 
> 
> Sorry I missed this go by.  I like the interface, we don't have a use for
> device->type yet, so this fits nicely.
> 
> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> you'll definitely need to refresh for a proper review, but looking at these
> patches they seem sane enough, and I like the interface.  I'd like to hear
> Zygo's opinion as well.

I've been running earlier versions with modifications since summer 2020,
and this version mostly unmodified (rebase changes only) since it was
posted.  It seems to work, even in corner cases like converting balances,
replacing drives, and running out of space.  The "running out of space"
experience on btrfs is weird at the best of times, and these patches
add some more new special cases, but it doesn't behave in ways that
would surprise a sysadmin familiar with how btrfs chunk allocation works.

One major piece that's missing is adjusting the statvfs (aka df)
available blocks field so that it doesn't include unallocated space on
any metadata-only devices.  Right now all the unallocated space on
metadata-only devices is counted as free even though it's impossible to
put a data block there, so anything that is triggered automatically
on "f_bavail < some_threshold" will be confused.

I don't think that piece has to block the rest of the patch series--if
you're not using the feature, df gives the right number (or at least the
same number it gave before), and if you are using the feature, you can
subtract the unavailable data space until a later patch comes along to
fix it.
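
The subtraction Zygo describes can be sketched with made-up numbers (the KiB
values and the two-device metadata-only split are hypothetical, chosen only to
show the arithmetic):

```shell
# statvfs f_bavail currently counts unallocated space on every device,
# including metadata-only ones where data chunks can never go. Until the
# kernel adjusts it, userspace can subtract that space itself.
f_bavail_kib=1100000                          # what df reports (hypothetical)
meta_only_unalloc_kib=$(( 131072 + 131072 ))  # "Unallocated" on each METADATA_ONLY device
data_avail_kib=$(( f_bavail_kib - meta_only_unalloc_kib ))
echo "usable for data: ${data_avail_kib} KiB"
```

In practice the per-device unallocated numbers would come from
"btrfs fi usage" or the sysfs devinfo entries.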

I like

	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type

more than patching btrfs-progs so I can use

	btrfs prop set /dev/... allocation_hint data_only

but I admit that might be because I'm weird.

> If we're going to use device->type for this, and since we don't have a user of
> device->type, I'd also like you to go ahead and re-name ->type to
> ->allocation_policy, that way it's clear what we're using it for now.
> 
> I'd also like some xfstests to validate the behavior so we're sure we're testing
> this.  I'd want 1 test to just test the mechanics, like mkfs with different
> policies and validate they're set right, change policies, add/remove disks with
> different policies.
> 
> Then a second test to do something like fsstress with each set of allocation
> policies to validate that we did actually allocate from the correct disks.  For
> this please also test with compression on to make sure the test validation works
> for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of
> data == 5 GiB of data usage, as compression could give you a different value).
> 
> With that in place I think this is the correct way to implement this feature.
> Thanks,
> 
> Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 21:15     ` Josef Bacik
  2021-12-13 22:49       ` Zygo Blaxell
@ 2021-12-14  1:03       ` Sinnamohideen, Shafeeq
  2021-12-14 18:53       ` Goffredo Baroncelli
  2 siblings, 0 replies; 32+ messages in thread
From: Sinnamohideen, Shafeeq @ 2021-12-14  1:03 UTC (permalink / raw)
  To: Josef Bacik, kreijack
  Cc: David Sterba, Paul Jones, linux-btrfs, Zygo Blaxell, Dmitriy Gorokh

Panasas would be very happy with this patch - it solves a problem we have when delete-only workloads thrash the disks. Because it is equivalent to "rm -rf" of an old directory, caching doesn't help since the metadata of the files being deleted will have fallen out of bcache long ago. So giving up a small amount of SSD space to always hold the metadata yields a big performance improvement.

We got as far as verifying that it doesn't break existing xfstests; we are open to writing the test cases you listed and also to checking that the device out-of-space cases behave as expected.

Thanks,
Shafeeq

-----Original Message-----
From: Josef Bacik <josef@toxicpanda.com> 
Sent: Monday, December 13, 2021 4:15 PM
To: kreijack@inwind.it
Cc: David Sterba <dsterba@suse.cz>; Sinnamohideen, Shafeeq <shafeeqs@panasas.com>; Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org; Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode

On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> Gentle ping :-)
> 
> Is anyone of the main developers interested in supporting this patch?
> 
> I am open to improving it if required.
> 

Sorry I missed this go by.  I like the interface, we don't have a use for
device->type yet, so this fits nicely.

I don't see the btrfs-progs patches in my inbox, and these don't apply, so you'll definitely need to refresh for a proper review, but looking at these patches they seem sane enough, and I like the interface.  I'd like to hear Zygo's opinion as well.

If we're going to use device->type for this, and since we don't have a user of
device->type, I'd also like you to go ahead and re-name ->type to
->allocation_policy, that way it's clear what we're using it for now.

I'd also like some xfstests to validate the behavior so we're sure we're testing this.  I'd want 1 test to just test the mechanics, like mkfs with different policies and validate they're set right, change policies, add/remove disks with different policies.

Then a second test to do something like fsstress with each set of allocation policies to validate that we did actually allocate from the correct disks.  For this please also test with compression on to make sure the test validation works for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of data == 5 GiB of data usage, as compression could give you a different value).

With that in place I think this is the correct way to implement this feature.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 22:49       ` Zygo Blaxell
@ 2021-12-14 14:31         ` Josef Bacik
  2021-12-14 19:03         ` Goffredo Baroncelli
  1 sibling, 0 replies; 32+ messages in thread
From: Josef Bacik @ 2021-12-14 14:31 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Mon, Dec 13, 2021 at 05:49:22PM -0500, Zygo Blaxell wrote:
> On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > Gentle ping :-)
> > > 
> > > Is anyone of the main developers interested in supporting this patch?
> > > 
> > > I am open to improving it if required.
> > > 
> > 
> > Sorry I missed this go by.  I like the interface, we don't have a use for
> > device->type yet, so this fits nicely.
> > 
> > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > you'll definitely need to refresh for a proper review, but looking at these
> > patches they seem sane enough, and I like the interface.  I'd like to hear
> > Zygo's opinion as well.
> 
> I've been running earlier versions with modifications since summer 2020,
> and this version mostly unmodified (rebase changes only) since it was
> posted.  It seems to work, even in corner cases like converting balances,
> replacing drives, and running out of space.  The "running out of space"
> experience on btrfs is weird at the best of times, and these patches
> add some more new special cases, but it doesn't behave in ways that
> would surprise a sysadmin familiar with how btrfs chunk allocation works.
> 
> One major piece that's missing is adjusting the statvfs (aka df)
> available blocks field so that it doesn't include unallocated space on
> any metadata-only devices.  Right now all the unallocated space on
> metadata-only devices is counted as free even though it's impossible to
> put a data block there, so anything that is triggered automatically
> on "f_bavail < some_threshold" will be confused.
> 
> I don't think that piece has to block the rest of the patch series--if
> you're not using the feature, df gives the right number (or at least the
> same number it gave before), and if you are using the feature, you can
> subtract the unavailable data space until a later patch comes along to
> fix it.
> 
> I like
> 
> 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> 
> more than patching btrfs-progs so I can use
> 
> 	btrfs prop set /dev/... allocation_hint data_only
> 
> but I admit that might be because I'm weird.
> 

Ok this is good enough for me.  Refresh the series Goffredo and I'll review it,
get some xfstests as I described, and fix up btrfs-progs to show the preferences
via either btrfs filesystem show or some other mechanism since the df thing will
be confusing.  We'll assume if you're using this feature you are aware of the
space gotchas and then fix up bugs as they show up.  With the testing and user
feedback though this looks good enough for me, I just want to make sure we have
tests and such in place before merging.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 21:15     ` Josef Bacik
  2021-12-13 22:49       ` Zygo Blaxell
  2021-12-14  1:03       ` Sinnamohideen, Shafeeq
@ 2021-12-14 18:53       ` Goffredo Baroncelli
  2021-12-14 20:35         ` Josef Bacik
  2 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-14 18:53 UTC (permalink / raw)
  To: Josef Bacik
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs,
	Zygo Blaxell

On 12/13/21 22:15, Josef Bacik wrote:
> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
>> Gentle ping :-)
>>
>> Is anyone of the main developers interested in supporting this patch?
>>
>> I am open to improving it if required.
>>
> 
> Sorry I missed this go by.  I like the interface, we don't have a use for
> device->type yet, so this fits nicely.
> 
> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> you'll definitely need to refresh for a proper review, but looking at these
> patches they seem sane enough, and I like the interface.  I'd like to hear
> Zygo's opinion as well.
> 
> If we're going to use device->type for this, and since we don't have a user of
> device->type, I'd also like you to go ahead and re-name ->type to
> ->allocation_policy, that way it's clear what we're using it for now.

The allocation policy requires only a few bits (3 or 4); "type", instead, is 64 bits wide.
We can reserve more bits for future extension, but I don't think it is necessary
to dedicate all the bits to the "allocation_policy".

This is to say that renaming the field to allocation_policy is too restrictive if in the
future we use the other bits for other purposes.

I don't like the term "type" either. Maybe "flags" is better?

> 
> I'd also like some xfstests to validate the behavior so we're sure we're testing
> this.  I'd want 1 test to just test the mechanics, like mkfs with different
> policies and validate they're set right, change policies, add/remove disks with
> different policies.
> 
> Then a second test to do something like fsstress with each set of allocation
> policies to validate that we did actually allocate from the correct disks.  For
> this please also test with compression on to make sure the test validation works
> for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of
> data == 5 GiB of data usage, as compression could give you a different value).
> 
> With that in place I think this is the correct way to implement this feature.
> Thanks,

Ok

> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-13 22:49       ` Zygo Blaxell
  2021-12-14 14:31         ` Josef Bacik
@ 2021-12-14 19:03         ` Goffredo Baroncelli
  2021-12-14 20:04           ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-14 19:03 UTC (permalink / raw)
  To: Zygo Blaxell, Josef Bacik
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On 12/13/21 23:49, Zygo Blaxell wrote:
> On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
>> On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
>>> Gentle ping :-)
>>>
>>> Is anyone of the main developers interested in supporting this patch?
>>>
>>> I am open to improving it if required.
>>>
>>
>> Sorry I missed this go by.  I like the interface, we don't have a use for
>> device->type yet, so this fits nicely.
>>
>> I don't see the btrfs-progs patches in my inbox, and these don't apply, so
>> you'll definitely need to refresh for a proper review, but looking at these
>> patches they seem sane enough, and I like the interface.  I'd like to hear
>> Zygo's opinion as well.
> 
> I've been running earlier versions with modifications since summer 2020,
> and this version mostly unmodified (rebase changes only) since it was
> posted.  It seems to work, even in corner cases like converting balances,
> replacing drives, and running out of space.  The "running out of space"
> experience on btrfs is weird at the best of times, and these patches
> add some more new special cases, but it doesn't behave in ways that
> would surprise a sysadmin familiar with how btrfs chunk allocation works.
> 
> One major piece that's missing is adjusting the statvfs (aka df)
> available blocks field so that it doesn't include unallocated space on
> any metadata-only devices.  Right now all the unallocated space on
> metadata-only devices is counted as free even though it's impossible to
> put a data block there, so anything that is triggered automatically
> on "f_bavail < some_threshold" will be confused.
> 
> I don't think that piece has to block the rest of the patch series--if
> you're not using the feature, df gives the right number (or at least the
> same number it gave before), and if you are using the feature, you can
> subtract the unavailable data space until a later patch comes along to
> fix it.
> 
> I like
> 
> 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type

Just to be clear: for now you can only pass a numeric value to "type", not
text like in your example.

However I want to put another option on the table: not to expose the whole
"type" field, but only the "allocation policy"; we can add a new sysfs field
called "allocation_policy" that internally changes the dev_item->type field.

It is not only a "cosmetic" change. If we want to change the allocation
policy, the correct way now is:
- read the type field
- change the "allocation policy" bits
- write the type field

which is race-prone.

For now it is not a problem, because type contains only the allocation bits.
But in the future, when the type field contains further properties, this could
be a problem.
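
The read-modify-write cycle above can be sketched against an ordinary file
standing in for the sysfs node (the bit layout, the 0x7 mask, and the values
are all hypothetical, just to show where the race window sits):

```shell
# Simulate the racy update: read the whole 64-bit "type" field, rewrite only
# the low allocation-policy bits, write the whole field back. Any bit a
# concurrent writer sets between the read and the write is silently lost.
f=$(mktemp)
echo 8 > "$f"                  # pretend bit 3 is some future, unrelated property
cur=$(cat "$f")                # step 1: read the type field
new=$(( (cur & ~0x7) | 0x2 ))  # step 2: replace the 3 allocation-policy bits
echo "$new" > "$f"             # step 3: write it back (race window spans 1..3)
cat "$f"                       # prints 10: bit 3 preserved only by luck
rm -f "$f"
```

A dedicated "allocation_policy" sysfs file lets the kernel do the masking
atomically, so userspace never touches the other bits.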

> 
> more than patching btrfs-progs so I can use
> 
> 	btrfs prop set /dev/... allocation_hint data_only
> 
> but I admit that might be because I'm weird.

I prefer the echo approach too; however it is not very ergonomic in conjunction
with sudo....

> 
>> If we're going to use device->type for this, and since we don't have a user of
>> device->type, I'd also like you to go ahead and re-name ->type to
>> ->allocation_policy, that way it's clear what we're using it for now.
>>
>> I'd also like some xfstests to validate the behavior so we're sure we're testing
>> this.  I'd want 1 test to just test the mechanics, like mkfs with different
>> policies and validate they're set right, change policies, add/remove disks with
>> different policies.
>>
>> Then a second test to do something like fsstress with each set of allocation
>> policies to validate that we did actually allocate from the correct disks.  For
>> this please also test with compression on to make sure the test validation works
>> for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of
>> data == 5 GiB of data usage, as compression could give you a different value).
>>
>> With that in place I think this is the correct way to implement this feature.
>> Thanks,
>>
>> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-14 19:03         ` Goffredo Baroncelli
@ 2021-12-14 20:04           ` Zygo Blaxell
  2021-12-14 20:34             ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-14 20:04 UTC (permalink / raw)
  To: kreijack
  Cc: Josef Bacik, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	linux-btrfs

On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> On 12/13/21 23:49, Zygo Blaxell wrote:
> > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > > Gentle ping :-)
> > > > 
> > > > Is anyone of the main developers interested in supporting this patch?
> > > > 
> > > > I am open to improving it if required.
> > > > 
> > > 
> > > Sorry I missed this go by.  I like the interface, we don't have a use for
> > > device->type yet, so this fits nicely.
> > > 
> > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > > you'll definitely need to refresh for a proper review, but looking at these
> > > patches they seem sane enough, and I like the interface.  I'd like to hear
> > > Zygo's opinion as well.
> > 
> > I've been running earlier versions with modifications since summer 2020,
> > and this version mostly unmodified (rebase changes only) since it was
> > posted.  It seems to work, even in corner cases like converting balances,
> > replacing drives, and running out of space.  The "running out of space"
> > experience on btrfs is weird at the best of times, and these patches
> > add some more new special cases, but it doesn't behave in ways that
> > would surprise a sysadmin familiar with how btrfs chunk allocation works.
> > 
> > One major piece that's missing is adjusting the statvfs (aka df)
> > available blocks field so that it doesn't include unallocated space on
> > any metadata-only devices.  Right now all the unallocated space on
> > metadata-only devices is counted as free even though it's impossible to
> > put a data block there, so anything that is triggered automatically
> > on "f_bavail < some_threshold" will be confused.
> > 
> > I don't think that piece has to block the rest of the patch series--if
> > you're not using the feature, df gives the right number (or at least the
> > same number it gave before), and if you are using the feature, you can
> > subtract the unavailable data space until a later patch comes along to
> > fix it.
> > 
> > I like
> > 
> > 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> 
> Just to be clear: for now you can only pass a numeric value to "type", not
> text like in your example.
> 
> However I want to put another option on the table: not to expose the whole
> "type" field, but only the "allocation policy"; we can add a new sysfs field
> called "allocation_policy" that internally changes the dev_item->type field.
> 
> It is not only a "cosmetic" change. If we want to change the allocation
> policy, the correct way now is:
> - read the type field
> - change the "allocation policy" bits
> - write the type field
> 
> which is race-prone.

> For now it is not a problem, because type contains only the allocation bits.
> But in the future, when the type field contains further properties, this could
> be a problem.

Yeah, keep the interface very narrow, don't hand out access to random bits.

If the kernel supports additional bits, it should support additional
sysfs filenames to go with them.  Or it could put all the supported
options in the sysfs field, like block IO schedulers do, so you could
find this in the file by reading it:

	[prefer_data] prefer_metadata metadata_only data_only
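
With that scheduler-style layout, the active policy is the bracketed word, and
a reader can pick it out with a one-liner (the sample line mirrors the example
above; it is not the content of any real sysfs file yet):

```shell
# Parse a block-IO-scheduler-style policy file: all supported values on one
# line, with the currently active one in square brackets.
line='[prefer_data] prefer_metadata metadata_only data_only'
active=$(printf '%s\n' "$line" | sed -n 's/.*\[\([^]]*\)\].*/\1/p')
echo "active policy: $active"
```

Setting a policy would then be a plain "echo prefer_metadata > ...", exactly
like "echo mq-deadline > /sys/block/sda/queue/scheduler".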

> > more than patching btrfs-progs so I can use
> > 
> > 	btrfs prop set /dev/... allocation_hint data_only
> > 
> > but I admit that might be because I'm weird.
> 
> I prefer the echo approach too; however it is not very ergonomic in conjunction
> with sudo....

For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl
vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'.  For some reason
we don't have this for sysfs (or maybe it's just Debian...?) so we have
to write things like 'echo foo | sudo tee /sys/fs/...'.

Of course btrfs-progs could always open the
/sys/fs/btrfs/.../allocation_policy file and write to it.  But if we're
modifying btrfs-progs then we could use the ioctl interface anyway.

I don't have a strong preference for either sysfs or ioctl, nor am I
opposed to simply implementing both.  I'll let someone who does have
such a preference make their case.

> > > If we're going to use device->type for this, and since we don't have a user of
> > > device->type, I'd also like you to go ahead and re-name ->type to
> > > ->allocation_policy, that way it's clear what we're using it for now.
> > > 
> > > I'd also like some xfstests to validate the behavior so we're sure we're testing
> > > this.  I'd want 1 test to just test the mechanics, like mkfs with different
> > > policies and validate they're set right, change policies, add/remove disks with
> > > different policies.
> > > 
> > > Then a second test to do something like fsstress with each set of allocation
> > > policies to validate that we did actually allocate from the correct disks.  For
> > > this please also test with compression on to make sure the test validation works
> > > for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB of
> > > data == 5 GiB of data usage, as compression could give you a different value).
> > > 
> > > With that in place I think this is the correct way to implement this feature.
> > > Thanks,
> > > 
> > > Josef
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-14 20:04           ` Zygo Blaxell
@ 2021-12-14 20:34             ` Josef Bacik
  2021-12-14 20:41               ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2021-12-14 20:34 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > On 12/13/21 23:49, Zygo Blaxell wrote:
> > > On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:
> > > > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > > > Gentle ping :-)
> > > > > 
> > > > > Is anyone of the main developers interested in supporting this patch?
> > > > > 
> > > > > I am open to improving it if required.
> > > > > 
> > > > 
> > > > Sorry I missed this go by.  I like the interface, we don't have a use for
> > > > device->type yet, so this fits nicely.
> > > > 
> > > > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > > > you'll definitely need to refresh for a proper review, but looking at these
> > > > patches they seem sane enough, and I like the interface.  I'd like to hear
> > > > Zygo's opinion as well.
> > > 
> > > I've been running earlier versions with modifications since summer 2020,
> > > and this version mostly unmodified (rebase changes only) since it was
> > > posted.  It seems to work, even in corner cases like converting balances,
> > > replacing drives, and running out of space.  The "running out of space"
> > > experience on btrfs is weird at the best of times, and these patches
> > > add some more new special cases, but it doesn't behave in ways that
> > > would surprise a sysadmin familiar with how btrfs chunk allocation works.
> > > 
> > > One major piece that's missing is adjusting the statvfs (aka df)
> > > available blocks field so that it doesn't include unallocated space on
> > > any metadata-only devices.  Right now all the unallocated space on
> > > metadata-only devices is counted as free even though it's impossible to
> > > put a data block there, so anything that is triggered automatically
> > > on "f_bavail < some_threshold" will be confused.
> > > 
> > > I don't think that piece has to block the rest of the patch series--if
> > > you're not using the feature, df gives the right number (or at least the
> > > same number it gave before), and if you are using the feature, you can
> > > subtract the unavailable data space until a later patch comes along to
> > > fix it.
> > > 
> > > I like
> > > 
> > > 	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type
> > 
> > Just to be clear: for now you can only pass a numeric value to "type", not
> > text like in your example.
> > 
> > However I want to put another option on the table: not to expose the whole
> > "type" field, but only the "allocation policy"; we can add a new sysfs field
> > called "allocation_policy" that internally changes the dev_item->type field.
> > 
> > This is not only a "cosmetic" change. If we want to change the allocation
> > policy, currently the correct way is:
> > - read the type field
> > - change the "allocation policy" bits
> > - write the type field
> > 
> > which is race-prone.
> 
> > For now it is not a problem, because type contains only the allocation bits.
> > But in the future, when the type field contains further properties, this
> > could be a problem.
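A minimal sketch of the lost update (the mask and bit values are hypothetical, purely to illustrate the race):

```python
ALLOC_MASK = 0x7  # hypothetical: low bits of "type" hold the allocation policy

def set_policy(read_type, write_type, policy):
    cur = read_type()                    # 1) read the type field
    new = (cur & ~ALLOC_MASK) | policy   # 2) change the allocation bits
    write_type(new)                      # 3) write the type field back
    # any change another writer makes between 1) and 3) is overwritten

# The lost update, interleaved by hand: writer B flips a (future)
# non-allocation property bit between writer A's read and write.
field = [0x0]
snapshot = field[0]                          # A: read
field[0] |= 0x8                              # B: sets some other property bit
field[0] = (snapshot & ~ALLOC_MASK) | 0x2    # A: modify + write
assert field[0] & 0x8 == 0                   # B's update was silently lost
```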
> 
> Yeah, keep the interface very narrow, don't hand out access to random bits.
> 
> If the kernel supports additional bits, it should support additional
> sysfs filenames to go with them.  Or it could put all the supported
> options in the sysfs field, like block IO schedulers do, so you could
> find this in the file by reading it:
> 
> 	[prefer_data] prefer_metadata metadata_only data_only
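That bracketed-current-value convention is easy to consume from scripts; a minimal parser sketch (the option names are taken from this discussion, not from an existing kernel file):

```python
def parse_policy(text):
    # block-IO-scheduler style: all options listed, current one bracketed
    current, options = None, []
    for tok in text.split():
        if tok.startswith('[') and tok.endswith(']'):
            tok = tok[1:-1]
            current = tok
        options.append(tok)
    return current, options
```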
> 
> > > more than patching btrfs-progs so I can use
> > > 
> > > 	btrfs prop set /dev/... allocation_hint data_only
> > > 
> > > but I admit that might be because I'm weird.
> > 
> > I prefer the echo approach too; however it is not very ergonomic in
> > combination with sudo....
> 
> For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl
> vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'.  For some reason
> we don't have this for sysfs (or maybe it's just Debian...?) so we have
> to write things like 'echo foo | sudo tee /sys/fs/...'.
> 
> Of course btrfs-progs could always open the
> /sys/fs/btrfs/.../allocation_policy file and write to it.  But if we're
> modifying btrfs-progs then we could use the ioctl interface anyway.
> 
> I don't have a strong preference for either sysfs or ioctl, nor am I
> opposed to simply implementing both.  I'll let someone who does have
> such a preference make their case.

I think echo'ing a name into sysfs is better than bits for sure.  However I want
the ability to set the device properties via a btrfs-progs command offline so I
can setup the storage and then mount the file system.  I want

1) The sysfs interface so you can change things on the fly.  This stays
   persistent of course, so the way it works is perfect.

2) The btrfs-progs command sets it on offline devices.  If you point it at a
   live mounted fs it can simply use the sysfs thing to do it live.
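The two points amount to one command with two backends; a sketch under the assumption that live and offline writers are available as callbacks (both names hypothetical):

```python
def set_allocation_hint(policy, is_mounted, sysfs_write, offline_write):
    # 1) live filesystem: go through sysfs, which is already persistent
    # 2) offline device: rewrite the dev item directly (open_ctree path)
    if is_mounted:
        sysfs_write(policy)
    else:
        offline_write(policy)
```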

Does this seem reasonable?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-14 18:53       ` Goffredo Baroncelli
@ 2021-12-14 20:35         ` Josef Bacik
  0 siblings, 0 replies; 32+ messages in thread
From: Josef Bacik @ 2021-12-14 20:35 UTC (permalink / raw)
  To: kreijack
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs,
	Zygo Blaxell

On Tue, Dec 14, 2021 at 07:53:34PM +0100, Goffredo Baroncelli wrote:
> On 12/13/21 22:15, Josef Bacik wrote:
> > On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:
> > > Gentle ping :-)
> > > 
> > > Is anyone among the main developers interested in supporting this patch?
> > > 
> > > I am open to improve it if required.
> > > 
> > 
> > Sorry I missed this go by.  I like the interface, we don't have a use for
> > device->type yet, so this fits nicely.
> > 
> > I don't see the btrfs-progs patches in my inbox, and these don't apply, so
> > you'll definitely need to refresh for a proper review, but looking at these
> > patches they seem sane enough, and I like the interface.  I'd like to hear
> > Zygo's opinion as well.
> > 
> > If we're going to use device->type for this, and since we don't have a user of
> > device->type, I'd also like you to go ahead and re-name ->type to
> > ->allocation_policy, that way it's clear what we're using it for now.
> 
> The allocation policy requires only a few bits (3 or 4), while "type" is 64
> bits wide. We can allocate more bits for future extension, but I don't think
> it is necessary to dedicate all of the bits to the "allocation_policy".
> 
> This is to say that renaming the field to allocation_policy is too
> restrictive if in the future we use the other bits for other purposes.
> 
> I don't like the term "type". Maybe "flags" is better?

Sure, rename it to flags and make the allocation policies flags.  The important
part is that "type" isn't accurate, and since it's not used by anything
currently we're free to rename it so it better represents the information it
holds.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-14 20:34             ` Josef Bacik
@ 2021-12-14 20:41               ` Goffredo Baroncelli
  2021-12-15 13:58                 ` Josef Bacik
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-14 20:41 UTC (permalink / raw)
  To: Josef Bacik, Zygo Blaxell
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On 12/14/21 21:34, Josef Bacik wrote:
> On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
>> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:

>>
>> I don't have a strong preference for either sysfs or ioctl, nor am I
>> opposed to simply implementing both.  I'll let someone who does have
>> such a preference make their case.
> 
> I think echo'ing a name into sysfs is better than bits for sure.  However I want
> the ability to set the device properties via a btrfs-progs command offline so I
> can setup the storage and then mount the file system.  I want
> 
> 1) The sysfs interface so you can change things on the fly.  This stays
>     persistent of course, so the way it works is perfect.
> 
> 2) The btrfs-progs command sets it on offline devices.  If you point it at a
>     live mounted fs it can simply use the sysfs thing to do it live.

#2 is currently not implemented. However I think that we should do it.

The problem is that we need to update both:

- the superblock		(simple)
- the dev_item item		(not so simple...)

What about using only bits from the superblock to store this property ?

> 
> Does this seem reasonable?  Thanks,
> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-14 20:41               ` Goffredo Baroncelli
@ 2021-12-15 13:58                 ` Josef Bacik
  2021-12-15 18:53                   ` Goffredo Baroncelli
  2021-12-16  2:30                   ` Paul Jones
  0 siblings, 2 replies; 32+ messages in thread
From: Josef Bacik @ 2021-12-15 13:58 UTC (permalink / raw)
  To: kreijack
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	linux-btrfs

On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> On 12/14/21 21:34, Josef Bacik wrote:
> > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> 
> > > 
> > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > opposed to simply implementing both.  I'll let someone who does have
> > > such a preference make their case.
> > 
> > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > the ability to set the device properties via a btrfs-progs command offline so I
> > can setup the storage and then mount the file system.  I want
> > 
> > 1) The sysfs interface so you can change things on the fly.  This stays
> >     persistent of course, so the way it works is perfect.
> > 
> > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> >     live mounted fs it can simply use the sysfs thing to do it live.
> 
> #2 is currently not implemented. However I think that we should do.
> 
> The problem is that we need to update both:
> 
> - the superblock		(simple)
> - the dev_item item		(not so simple...)
> 
> What about using only bits from the superblock to store this property ?

I'm looking at the patches and you are only updating the dev_item; am I missing
something for the super block?

For offline all you would need to do is do the normal open_ctree,
btrfs_search_slot to the item and update the device item type, that's
straightforward.

For online if you use btrfs prop you can see if the fs is mounted and just find
the sysfs file to modify and do it that way.

But this also brings up another point, we're going to want a compat bit for
this.  It doesn't make the fs unusable for old kernels, so just a normal
BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
compat flag.

You'll also need to modify the tree checker to check the dev item allocation
hints to make sure they're valid, via check_dev_item.  Thanks,
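A rough model of what such a check could verify (the encoding and mask are hypothetical, mirroring the four allocation hints plus a reserved-bits check):

```python
# Hypothetical encoding of the four allocation hints
DATA_ONLY, PREFERRED_DATA, PREFERRED_METADATA, METADATA_ONLY = range(4)
ALLOC_MASK = 0x7   # hypothetical width of the allocation-hint field

def check_dev_item_type(type_field):
    # reject unknown/reserved bits and out-of-range hint values
    if type_field & ~ALLOC_MASK:
        return False
    return (type_field & ALLOC_MASK) <= METADATA_ONLY
```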

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-15 13:58                 ` Josef Bacik
@ 2021-12-15 18:53                   ` Goffredo Baroncelli
  2021-12-16  0:56                     ` Josef Bacik
  2021-12-16  2:30                   ` Paul Jones
  1 sibling, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-15 18:53 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	linux-btrfs

On 12/15/21 14:58, Josef Bacik wrote:
> On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
>> On 12/14/21 21:34, Josef Bacik wrote:
>>> On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
>>>> On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
>>
>>>>
>>>> I don't have a strong preference for either sysfs or ioctl, nor am I
>>>> opposed to simply implementing both.  I'll let someone who does have
>>>> such a preference make their case.
>>>
>>> I think echo'ing a name into sysfs is better than bits for sure.  However I want
>>> the ability to set the device properties via a btrfs-progs command offline so I
>>> can setup the storage and then mount the file system.  I want
>>>
>>> 1) The sysfs interface so you can change things on the fly.  This stays
>>>      persistent of course, so the way it works is perfect.
>>>
>>> 2) The btrfs-progs command sets it on offline devices.  If you point it at a
>>>      live mounted fs it can simply use the sysfs thing to do it live.
>>
>> #2 is currently not implemented. However I think that we should do.
>>
>> The problem is that we need to update both:
>>
>> - the superblock		(simple)
>> - the dev_item item		(not so simple...)
>>
>> What about using only bits from the superblock to store this property ?
> 
> I'm looking at the patches and you only are updating the dev_item, am I missing
> something for the super block?

When btrfs writes the superblocks (see write_all_supers() in disk-io.c), it
copies the dev_item fields (contained in the fs_info->fs_devices->devices
list) into each superblock before updating it.

> 
> For offline all you would need to do is do the normal open_ctree,
> btrfs_search_slot to the item and update the device item type, that's
> straightforward.
> 
> For online if you use btrfs prop you can see if the fs is mounted and just find
> the sysfs file to modify and do it that way.
> 
> But this also brings up another point, we're going to want a compat bit for
> this.  It doesn't make the fs unusable for old kernels, so just a normal
> BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> compat flag.

Why do we need a "compat" bit? New kernels know how to treat the dev_item
type field, and old kernels ignore it. The worst that can happen is that a
filesystem may require a balance before reaching a good shape (i.e. the
metadata on SSD and the data on a spinning disk).

> You'll also need to modify the tree checker to check the dev item allocation
> hints to make sure they're valid, via check_dev_item.  Thanks,

Ok

> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-15 18:53                   ` Goffredo Baroncelli
@ 2021-12-16  0:56                     ` Josef Bacik
  2021-12-17  5:40                       ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2021-12-16  0:56 UTC (permalink / raw)
  To: kreijack
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, Paul Jones,
	linux-btrfs

On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> On 12/15/21 14:58, Josef Bacik wrote:
> > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > 
> > > > > 
> > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > such a preference make their case.
> > > > 
> > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > can setup the storage and then mount the file system.  I want
> > > > 
> > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > >      persistent of course, so the way it works is perfect.
> > > > 
> > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > 
> > > #2 is currently not implemented. However I think that we should do.
> > > 
> > > The problem is that we need to update both:
> > > 
> > > - the superblock		(simple)
> > > - the dev_item item		(not so simple...)
> > > 
> > > What about using only bits from the superblock to store this property ?
> > 
> > I'm looking at the patches and you only are updating the dev_item, am I missing
> > something for the super block?
> 
> When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> superblock before updating it.
> 

Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
that's a problem.

> > 
> > For offline all you would need to do is do the normal open_ctree,
> > btrfs_search_slot to the item and update the device item type, that's
> > straightforward.
> > 
> > For online if you use btrfs prop you can see if the fs is mounted and just find
> > the sysfs file to modify and do it that way.
> > 
> > But this also brings up another point, we're going to want a compat bit for
> > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > compat flag.
> 
> Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> The old kernels ignore it. The worst thing is that a filesystem may require a balance
> before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> 

So you can do the validation below, tho I'm thinking I care about it less, if we
just make sure that type is correct regardless of the compat bit then that's
fine.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-15 13:58                 ` Josef Bacik
  2021-12-15 18:53                   ` Goffredo Baroncelli
@ 2021-12-16  2:30                   ` Paul Jones
  1 sibling, 0 replies; 32+ messages in thread
From: Paul Jones @ 2021-12-16  2:30 UTC (permalink / raw)
  To: Josef Bacik, kreijack
  Cc: Zygo Blaxell, David Sterba, Sinnamohideen Shafeeq, linux-btrfs

> -----Original Message-----
> From: Josef Bacik <josef@toxicpanda.com>
> Sent: Thursday, 16 December 2021 12:58 AM
> To: kreijack@inwind.it
> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; David Sterba
> <dsterba@suse.cz>; Sinnamohideen Shafeeq <shafeeqs@panasas.com>;
> Paul Jones <paul@pauljones.id.au>; linux-btrfs@vger.kernel.org
> Subject: Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
> 
> On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > On 12/14/21 21:34, Josef Bacik wrote:
> > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> >
> > > >
> > > > I don't have a strong preference for either sysfs or ioctl, nor am
> > > > I opposed to simply implementing both.  I'll let someone who does
> > > > have such a preference make their case.
> > >
> > > I think echo'ing a name into sysfs is better than bits for sure.
> > > However I want the ability to set the device properties via a
> > > btrfs-progs command offline so I can setup the storage and then
> > > mount the file system.  I want
> > >
> > > 1) The sysfs interface so you can change things on the fly.  This stays
> > >     persistent of course, so the way it works is perfect.
> > >
> > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > >     live mounted fs it can simply use the sysfs thing to do it live.
> >
> > #2 is currently not implemented. However I think that we should do.
> >
> > The problem is that we need to update both:
> >
> > - the superblock		(simple)
> > - the dev_item item		(not so simple...)
> >
> > What about using only bits from the superblock to store this property ?
> 
> I'm looking at the patches and you only are updating the dev_item, am I
> missing something for the super block?
> 
> For offline all you would need to do is do the normal open_ctree,
> btrfs_search_slot to the item and update the device item type, that's
> straightforward.
> 
> For online if you use btrfs prop you can see if the fs is mounted and just find
> the sysfs file to modify and do it that way.
> 
> But this also brings up another point, we're going to want a compat bit for
> this.  It doesn't make the fs unusable for old kernels, so just a normal
> BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set
> the compat flag.

I don't think that is necessary - I've remounted my filesystems a few times with an unpatched kernel, and all that happens is new blocks get allocated without any hints - I like that simplicity and compatibility. As long as it's mentioned in the docs I think it's fine.

Having a balance filter that only rebalances wrongly allocated blocks would be a nice addition though.


Paul.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-16  0:56                     ` Josef Bacik
@ 2021-12-17  5:40                       ` Zygo Blaxell
  2021-12-17 14:48                         ` Josef Bacik
  2021-12-17 18:08                         ` Goffredo Baroncelli
  0 siblings, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-17  5:40 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > On 12/15/21 14:58, Josef Bacik wrote:
> > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > 
> > > > > > 
> > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > such a preference make their case.
> > > > > 
> > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > can setup the storage and then mount the file system.  I want
> > > > > 
> > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > >      persistent of course, so the way it works is perfect.
> > > > > 
> > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > 
> > > > #2 is currently not implemented. However I think that we should do.
> > > > 
> > > > The problem is that we need to update both:
> > > > 
> > > > - the superblock		(simple)
> > > > - the dev_item item		(not so simple...)
> > > > 
> > > > What about using only bits from the superblock to store this property ?
> > > 
> > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > something for the super block?
> > 
> > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > superblock before updating it.
> > 
> 
> Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> that's a problem.
> 
> > > 
> > > For offline all you would need to do is do the normal open_ctree,
> > > btrfs_search_slot to the item and update the device item type, that's
> > > straightforward.
> > > 
> > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > the sysfs file to modify and do it that way.
> > > 
> > > But this also brings up another point, we're going to want a compat bit for
> > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > compat flag.
> > 
> > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> 
> So you can do the validation below, tho I'm thinking I care about it less, if we
> just make sure that type is correct regardless of the compat bit then that's
> fine.  Thanks,

In theory if you get stuck in an impossible allocation situation (like all
your disks are data-only and you run out of metadata space) then one way
to recover from it is to mount with an old kernel which doesn't respect
the type bits.  Another way to recover would be to flip the type bits
while the filesystem is offline with btrfs-progs.  A third way would be to
have a mount option for newer kernels to ignore the allocation bits like
old kernels do (yes I know I already said I didn't like that idea).

If we have a bit that says "old kernels don't mount this filesystem any
more" then we lose one of those recovery options, and the other options
aren't implemented yet.

While I think of it, the metadata reservation system eventually needs
to know that it can't use data-only devices for metadata, the same way
that df eventually needs to know about metadata-only devices.
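That reservation-side rule can be sketched with hypothetical device records (a policy tag plus unallocated bytes per device):

```python
def metadata_overcommit_space(devices):
    # Unallocated space that may actually hold metadata: data-only
    # devices must be excluded from the overcommit estimate, since
    # the allocator will never place a metadata chunk there.
    return sum(unalloc for policy, unalloc in devices
               if policy != "data_only")
```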


> Josef
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-17  5:40                       ` Zygo Blaxell
@ 2021-12-17 14:48                         ` Josef Bacik
  2021-12-17 16:31                           ` Zygo Blaxell
  2021-12-17 18:08                         ` Goffredo Baroncelli
  1 sibling, 1 reply; 32+ messages in thread
From: Josef Bacik @ 2021-12-17 14:48 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote:
> On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > > On 12/15/21 14:58, Josef Bacik wrote:
> > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > > 
> > > > > > > 
> > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > > such a preference make their case.
> > > > > > 
> > > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > > can setup the storage and then mount the file system.  I want
> > > > > > 
> > > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > > >      persistent of course, so the way it works is perfect.
> > > > > > 
> > > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > > 
> > > > > #2 is currently not implemented. However I think that we should do.
> > > > > 
> > > > > The problem is that we need to update both:
> > > > > 
> > > > > - the superblock		(simple)
> > > > > - the dev_item item		(not so simple...)
> > > > > 
> > > > > What about using only bits from the superblock to store this property ?
> > > > 
> > > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > > something for the super block?
> > > 
> > > When btrfs write the superblocks (see write_all_supers() in disk-io.c), it copies
> > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > > superblock before updating it.
> > > 
> > 
> > Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> > that's a problem.
> > 
> > > > 
> > > > For offline all you would need to do is do the normal open_ctree,
> > > > btrfs_search_slot to the item and update the device item type, that's
> > > > straightforward.
> > > > 
> > > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > > the sysfs file to modify and do it that way.
> > > > 
> > > > But this also brings up another point, we're going to want a compat bit for
> > > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > > compat flag.
> > > 
> > > Why we need a "compact" bit ? The new kernels know how treat the dev_item_type field.
> > > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> > 
> > So you can do the validation below, tho I'm thinking I care about it less, if we
> > just make sure that type is correct regardless of the compat bit then that's
> > fine.  Thanks,
> 
> In theory if you get stuck in an impossible allocation situation (like all
> your disks are data-only and you run out of metadata space) then one way
> to recover from it is to mount with an old kernel which doesn't respect
> the type bits.  Another way to recover would be to flip the type bits
> while the filesystem is offline with btrfs-progs.  A third way would be to
> have a mount option for newer kernels to ignore the allocation bits like
> old kernels do (yes I know I already said I didn't like that idea).
> 

This is the "preferred" vs "only" category.  You set "preferred" if you want to
be able to handle failures cleanly, you set "only" if you'd rather reprovision a
box than lose performance.

> If we have a bit that says "old kernels don't mount this filesystem any
> more" then we lose one of those recovery options, and the other options
> aren't implemented yet.
> 
> While I think of it, the metadata reservation system eventually needs
> to know that it can't use data-only devices for metadata, the same way
> that df eventually needs to know about metadata-only devices.
> 

Yeah, I'm willing to throw that into the power user bucket.  It really only
affects overcommit, because once we've allocated all our chunks we'll stop
overcommitting.  There may be some corners that need to be addressed, but we can
burn those bridges when we get to them.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
@ 2021-12-17 15:58   ` Hans van Kranenburg
  2021-12-17 18:28     ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2021-12-17 15:58 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

Hi,

On 10/24/21 5:31 PM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled, the chunk allocation policy is modified as
> follows.
> 
> Each disk may have a different tag:
> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> 
> Where:
> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
> the X chunk type (the other type may be allowed when the space is low)
> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
> This means also that it is a preferred choice.
> 
> Each time the allocator allocates a chunk of type X, first it takes the
> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
> if the space is not enough, it uses also the other disks, with the
> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
> type of chunk (i.e. not X).

I have read this last paragraph 5 times now, I think, but I still have no
idea what's going on here. Can you rephrase it? It's more complex than
the actual code below.

What the above text tells me is that if I start filling up my disks, it
will happily write DATA to METADATA_ONLY disks when the DATA ones are
full? And if the METADATA_ONLY disks are full with DATA also, it will
not write DATA into PREFERRED_METADATA disks.

I don't think that's what the code actually does. At least, I hope not.

Hans
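For what it's worth, the ordering the hunk below implements can be modeled in a few lines (a sketch of my reading of alloc_hint_map and btrfs_cmp_device_info, with simplified tag names; not the kernel code itself):

```python
# Mirrors alloc_hint_map from the patch
ALLOC_HINT = {"data_only": -1, "prefer_data": 0,
              "prefer_metadata": 1, "metadata_only": 2}

def eligible_sorted(devices, chunk):
    """devices: (tag, max_avail) pairs; returns eligible tags in
    the order the allocator would consider them."""
    out = []
    for tag, max_avail in devices:
        if chunk == "data":
            if tag == "metadata_only":
                continue              # data never goes on metadata-only disks
            hint = -ALLOC_HINT[tag]   # negated: data-leaning disks sort first
        else:
            if tag == "data_only":
                continue              # metadata never goes on data-only disks
            hint = ALLOC_HINT[tag]    # metadata-leaning disks sort first
        out.append((hint, max_avail, tag))
    # descending by (alloc_hint, max_avail), as btrfs_cmp_device_info does
    out.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [tag for _, _, tag in out]
```

Under this reading, a data chunk never lands on a METADATA_ONLY disk (and vice versa); the "preferred" disks of the other type are only used last, not excluded.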

> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 98 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8ac99771f43c..7ee9c6e7bd44 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>  	},
>  };
>  
> +#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
> +#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
> +
> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
> +	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
> +	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
> +	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
> +	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
> +	/* the other values are set to 0 */
> +};
> +
> +
>  const char *btrfs_bg_type_to_raid_name(u64 flags)
>  {
>  	const int index = btrfs_bg_flags_to_raid_index(flags);
> @@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  }
>  
>  /*
> - * sort the devices in descending order by max_avail, total_avail
> + * sort the devices in descending order by alloc_hint,
> + * max_avail, total_avail
>   */
>  static int btrfs_cmp_device_info(const void *a, const void *b)
>  {
>  	const struct btrfs_device_info *di_a = a;
>  	const struct btrfs_device_info *di_b = b;
>  
> +	if (di_a->alloc_hint > di_b->alloc_hint)
> +		return -1;
> +	if (di_a->alloc_hint < di_b->alloc_hint)
> +		return 1;
>  	if (di_a->max_avail > di_b->max_avail)
>  		return -1;
>  	if (di_a->max_avail < di_b->max_avail)
> @@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int hint;
> +	int i;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +
> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> +			/*
> +			 * if mixed bg set all the alloc_hint
> +			 * fields to the same value, so the sorting
> +			 * is not affected
> +			 */
> +			devices_info[ndevs].alloc_hint = 0;
> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_METADATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
> +				continue;
> +			/*
> +			 * if a data chunk must be allocated,
> +			 * sort also by hint (data disk
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
> +
> +			/*
> +			 * skip BTRFS_DEV_DATA_ONLY disks
> +			 */
> +			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
> +				continue;
> +			/*
> +			 * if a metadata chunk must be allocated,
> +			 * sort also by hint (metadata hint
> +			 * higher priority)
> +			 */
> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> +		}
> +
>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	/*
> +	 * no devices available
> +	 */
> +	if (!ndevs)
> +		return 0;
> +
>  	/*
>  	 * now sort the devices by hole size / available space
>  	 */
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> +	/*
> +	 * select the minimum set of disks grouped by hint that
> +	 * can host the chunk
> +	 */
> +	ndevs = 0;
> +	while (ndevs < ctl->ndevs) {
> +		hint = devices_info[ndevs++].alloc_hint;
> +		while (ndevs < ctl->ndevs &&
> +		       devices_info[ndevs].alloc_hint == hint)
> +				ndevs++;
> +		if (ndevs >= ctl->devs_min)
> +			break;
> +	}
> +
> +	BUG_ON(ndevs > ctl->ndevs);
> +	ctl->ndevs = ndevs;
> +
> +	/*
> +	 * the next layers require the devices_info ordered by
> +	 * max_avail. If we are returning two (or more) different
> +	 * groups of alloc_hint, this is not always true. So sort
> +	 * them again.
> +	 */
> +
> +	for (i = 0 ; i < ndevs ; i++)
> +		devices_info[i].alloc_hint = 0;
> +
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index b8250f29df6e..37eb37b533c5 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -369,6 +369,7 @@ struct btrfs_device_info {
>  	u64 dev_offset;
>  	u64 max_avail;
>  	u64 total_avail;
> +	int alloc_hint;
>  };
>  
>  struct btrfs_raid_attr {
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-17 14:48                         ` Josef Bacik
@ 2021-12-17 16:31                           ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-17 16:31 UTC (permalink / raw)
  To: Josef Bacik
  Cc: kreijack, David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On Fri, Dec 17, 2021 at 09:48:40AM -0500, Josef Bacik wrote:
> On Fri, Dec 17, 2021 at 12:40:55AM -0500, Zygo Blaxell wrote:
> > On Wed, Dec 15, 2021 at 07:56:32PM -0500, Josef Bacik wrote:
> > > On Wed, Dec 15, 2021 at 07:53:40PM +0100, Goffredo Baroncelli wrote:
> > > > On 12/15/21 14:58, Josef Bacik wrote:
> > > > > On Tue, Dec 14, 2021 at 09:41:21PM +0100, Goffredo Baroncelli wrote:
> > > > > > On 12/14/21 21:34, Josef Bacik wrote:
> > > > > > > On Tue, Dec 14, 2021 at 03:04:32PM -0500, Zygo Blaxell wrote:
> > > > > > > > On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:
> > > > > > 
> > > > > > > > 
> > > > > > > > I don't have a strong preference for either sysfs or ioctl, nor am I
> > > > > > > > opposed to simply implementing both.  I'll let someone who does have
> > > > > > > > such a preference make their case.
> > > > > > > 
> > > > > > > I think echo'ing a name into sysfs is better than bits for sure.  However I want
> > > > > > > the ability to set the device properties via a btrfs-progs command offline so I
> > > > > > > can setup the storage and then mount the file system.  I want
> > > > > > > 
> > > > > > > 1) The sysfs interface so you can change things on the fly.  This stays
> > > > > > >      persistent of course, so the way it works is perfect.
> > > > > > > 
> > > > > > > 2) The btrfs-progs command sets it on offline devices.  If you point it at a
> > > > > > >      live mounted fs it can simply use the sysfs thing to do it live.
> > > > > > 
> > > > > > #2 is currently not implemented. However I think that we should do it.
> > > > > > 
> > > > > > The problem is that we need to update both:
> > > > > > 
> > > > > > - the superblock		(simple)
> > > > > > - the dev_item item		(not so simple...)
> > > > > > 
> > > > > > What about using only bits from the superblock to store this property ?
> > > > > 
> > > > > I'm looking at the patches and you only are updating the dev_item, am I missing
> > > > > something for the super block?
> > > > 
> > > > When btrfs writes the superblocks (see write_all_supers() in disk-io.c), it copies
> > > > the dev_item fields (contained in fs_info->fs_devices->devices lists) in each
> > > > superblock before updating it.
> > > > 
> > > 
> > > Oh right.  Still, I hope we're doing this correctly in btrfs-progs, if not
> > > that's a problem.
> > > 
> > > > > 
> > > > > For offline all you would need to do is do the normal open_ctree,
> > > > > btrfs_search_slot to the item and update the device item type, that's
> > > > > straightforward.
> > > > > 
> > > > > For online if you use btrfs prop you can see if the fs is mounted and just find
> > > > > the sysfs file to modify and do it that way.
> > > > > 
> > > > > But this also brings up another point, we're going to want a compat bit for
> > > > > this.  It doesn't make the fs unusable for old kernels, so just a normal
> > > > > BTRFS_FS_COMPAT_<whatever> flag is fine.  If the setting gets set you set the
> > > > > compat flag.
> > > > 
> > > > Why do we need a "compat" bit? The new kernels know how to treat the dev_item_type field.
> > > > The old kernels ignore it. The worst thing is that a filesystem may require a balance
> > > > before reaching a good shape (i.e. the metadata on ssd and the data on a spinning disk)
> > > 
> > > So you can do the validation below, tho I'm thinking I care about it less, if we
> > > just make sure that type is correct regardless of the compat bit then that's
> > > fine.  Thanks,
> > 
> > In theory if you get stuck in an impossible allocation situation (like all
> > your disks are data-only and you run out of metadata space) then one way
> > to recover from it is to mount with an old kernel which doesn't respect
> > the type bits.  Another way to recover would be to flip the type bits
> > while the filesystem is offline with btrfs-progs.  A third way would be to
> > have a mount option for newer kernels to ignore the allocation bits like
> > old kernels do (yes I know I already said I didn't like that idea).
> > 
> 
> This is the "preferred" vs "only" category.  You set "preferred" if you want to
> be able to handle failures cleanly, you set "only" if you'd rather reprovision a
> box than lose performance.

Yeah, you're right here...and I was right the first time.  ;)
In case I forget again, I'll write down why...

It's a corner of a corner of a corner case.  To get there, we have to 1)
set up the unsatisfiable allocation preferences, 2) run out of existing
metadata allocations so we're one big commit away from metadata ENOSPC,
and 3) in a small commit, write something to the filesystem that forces
big metadata allocations during the next mount (like a deleted but
uncleaned snapshot, or an orphan inode).  It's not a new case: steps 2
and 3 are sufficient to get into this corner, and allocation preferences
in step 1 only get us to step 2 faster.  At any time before step 3,
we can change the allocation preferences and avoid the whole thing.

mkfs changes this slightly, since it's possible in mkfs to set up
condition 1 without having any metadata on the filesystem; however,
the filesystem would be unmountable, and it would require a separate bug
in mkfs to create the filesystem at all.  So the filesystem blows up,
but it can only happen on an empty filesystem.  I can live with that.

It would be nice to have a way to avoid step 3 in general (like disable
orphan inode cleanup and snapshot delete on mount, so we can arrange
for some more metadata storage before those run) but that's completely
orthogonal to the allocation preferences patch, solving a problem that
existed before and after.

> > If we have a bit that says "old kernels don't mount this filesystem any
> > more" then we lose one of those recovery options, and the other options
> > aren't implemented yet.
> > 
> > While I think of it, the metadata reservation system eventually needs
> > to know that it can't use data-only devices for metadata, the same way
> > that df eventually needs to know about metadata-only devices.
> 
> Yeah, I'm willing to throw that into the power user bucket.  It really only
> affects overcommit, because once we've allocated all our chunks we'll stop
> overcommitting.  There may be some corners that need to be addressed, but we can
> burn those bridges when we get to them.  Thanks,

> Josef
> 


* Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode
  2021-12-17  5:40                       ` Zygo Blaxell
  2021-12-17 14:48                         ` Josef Bacik
@ 2021-12-17 18:08                         ` Goffredo Baroncelli
  1 sibling, 0 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:08 UTC (permalink / raw)
  To: Zygo Blaxell, Josef Bacik
  Cc: David Sterba, Sinnamohideen Shafeeq, Paul Jones, linux-btrfs

On 12/17/21 06:40, Zygo Blaxell wrote:

> 
> In theory if you get stuck in an impossible allocation situation (like all
> your disks are data-only and you run out of metadata space) then one way
> to recover from it is to mount with an old kernel which doesn't respect
> the type bits.  


> Another way to recover would be to flip the type bits
> while the filesystem is offline with btrfs-progs.  

Unfortunately this is not doable: the type is backed up in the dev_item,
which is stored in the metadata.

So regardless of whether the filesystem is mounted or not, changing
the "type" would require updating the metadata, which in turn may require
allocating a new metadata BG.


> A third way would be to
> have a mount option for newer kernels to ignore the allocation bits like
> old kernels do (yes I know I already said I didn't like that idea).

The patch is still available. However I have to point out that this is not
a new problem. There will always be some cases where the metadata space is
exhausted and it is not possible to make the changes needed to create space.

If the user uses the _ONLY flags, he is shooting himself in the foot. We
should warn in the manual to avoid that, but even if the user does so we
still have the problem. Frankly speaking, I struggle to see a use case where
the _ONLY flags are needed. IMHO _PREFERRED is enough; what is needed is
better reporting to the user about these cases: something like raising a
warning on each btrfs-progs invocation when the metadata space is near
exhaustion...

> 
> If we have a bit that says "old kernels don't mount this filesystem any
> more" then we lose one of those recovery options, and the other options
> aren't implemented yet.

Using an old kernel is not a solution. Sooner or later another
non-backward-compatible change will prevent that.

> 
> While I think of it, the metadata reservation system eventually needs
> to know that it can't use data-only devices for metadata, the same way
> that df eventually needs to know about metadata-only devices.
> 
> 
>> Josef
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-17 15:58   ` Hans van Kranenburg
@ 2021-12-17 18:28     ` Goffredo Baroncelli
  2021-12-17 19:41       ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-17 18:28 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs
  Cc: Zygo Blaxell, Josef Bacik, David Sterba, Sinnamohideen Shafeeq,
	Goffredo Baroncelli

On 12/17/21 16:58, Hans van Kranenburg wrote:
> Hi,
> 
> On 10/24/21 5:31 PM, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled, the chunk allocation policy is modified as
>> follow.
>>
>> Each disk may have a different tag:
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
>>
>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).
> 
> I have read this last paragraph five times now, I think, but I still have no
> idea what's going on here. Can you rephrase it? It's more complex than
> the actual code below.
> 
> What the above text tells me is that if I start filling up my disks, it
> will happily write DATA to METADATA_ONLY disks when the DATA ones are
> full? And if the METADATA_ONLY disks are full with DATA also, it will
> not write DATA into PREFERRED_METADATA disks.
> 
> I don't think that's what the code actually does. At least, I hope not.
> 
Thanks for reviewing it: my English is very bad :-(

Yes, I wrote ALLOCATION_METADATA_ONLY where I meant to write ALLOCATION_PREFERRED_Y.

Let me rewrite it as follows:

-----------------------------
The chunk allocation policy is modified as follows.

Each disk may have one of the following tags:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

During a *mixed data/metadata* chunk allocation, BTRFS works as
usual.

During a *data* chunk allocation, the space is searched first on the disks
tagged BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA.
If no space is found, or the space found is not enough (e.g. in raid5, when
only two such disks are available), the disks tagged
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are also evaluated. If even then
the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
for a data BG allocation.

During a *metadata* chunk allocation, the space is searched first on the
disks tagged BTRFS_DEV_ALLOCATION_METADATA_ONLY and
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA. If no space is found, or the space
found is not enough (e.g. in raid5, when only two such disks are available),
the disks tagged BTRFS_DEV_ALLOCATION_PREFERRED_DATA are also considered.
If even then the space is not sufficient, -ENOSPC is raised.
A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
for a metadata BG allocation.

By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
so the default behavior is unchanged. If the user prefers to store the
metadata on the faster disks (e.g. the SSDs), he can tag these with
BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: in this case the data BGs go to
the BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BGs to the
tagged ones, as long as there is enough space. Only when one disk set is
filled is the other occupied.

WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
this disk will never be used for allocating metadata, which increases
the likelihood of exhausting the metadata space.

---------------------------------



>> Where:
>> - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
>> the X chunk type (the other type may be allowed when the space is low)
>> - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
>> This means also that it is a preferred choice.
>>
>> Each time the allocator allocates a chunk of type X , first it takes the
>> disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
>> is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
>> if the space is not enough, it uses also the other disks, with the
>> exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
>> type of chunk (i.e. not X).

> Hans
> 
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/volumes.h |  1 +
>>   2 files changed, 98 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 8ac99771f43c..7ee9c6e7bd44 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>   	},
>>   };
>>   
>> +#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
>> +#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
>> +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
>> +
>> +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
>> +	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
>> +	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
>> +	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
>> +	/* the other values are set to 0 */
>> +};
>> +
>> +
>>   const char *btrfs_bg_type_to_raid_name(u64 flags)
>>   {
>>   	const int index = btrfs_bg_flags_to_raid_index(flags);
>> @@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>>   }
>>   
>>   /*
>> - * sort the devices in descending order by max_avail, total_avail
>> + * sort the devices in descending order by alloc_hint,
>> + * max_avail, total_avail
>>    */
>>   static int btrfs_cmp_device_info(const void *a, const void *b)
>>   {
>>   	const struct btrfs_device_info *di_a = a;
>>   	const struct btrfs_device_info *di_b = b;
>>   
>> +	if (di_a->alloc_hint > di_b->alloc_hint)
>> +		return -1;
>> +	if (di_a->alloc_hint < di_b->alloc_hint)
>> +		return 1;
>>   	if (di_a->max_avail > di_b->max_avail)
>>   		return -1;
>>   	if (di_a->max_avail < di_b->max_avail)
>> @@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int hint;
>> +	int i;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +
>> +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
>> +			/*
>> +			 * if mixed bg set all the alloc_hint
>> +			 * fields to the same value, so the sorting
>> +			 * is not affected
>> +			 */
>> +			devices_info[ndevs].alloc_hint = 0;
>> +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_METADATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a data chunk must be allocated,
>> +			 * sort also by hint (data disk
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
>> +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
>> +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
>> +
>> +			/*
>> +			 * skip BTRFS_DEV_DATA_ONLY disks
>> +			 */
>> +			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
>> +				continue;
>> +			/*
>> +			 * if a metadata chunk must be allocated,
>> +			 * sort also by hint (metadata hint
>> +			 * higher priority)
>> +			 */
>> +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
>> +		}
>> +
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	/*
>> +	 * no devices available
>> +	 */
>> +	if (!ndevs)
>> +		return 0;
>> +
>>   	/*
>>   	 * now sort the devices by hole size / available space
>>   	 */
>>   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>   	     btrfs_cmp_device_info, NULL);
>>   
>> +	/*
>> +	 * select the minimum set of disks grouped by hint that
>> +	 * can host the chunk
>> +	 */
>> +	ndevs = 0;
>> +	while (ndevs < ctl->ndevs) {
>> +		hint = devices_info[ndevs++].alloc_hint;
>> +		while (ndevs < ctl->ndevs &&
>> +		       devices_info[ndevs].alloc_hint == hint)
>> +				ndevs++;
>> +		if (ndevs >= ctl->devs_min)
>> +			break;
>> +	}
>> +
>> +	BUG_ON(ndevs > ctl->ndevs);
>> +	ctl->ndevs = ndevs;
>> +
>> +	/*
>> +	 * the next layers require the devices_info ordered by
>> +	 * max_avail. If we are returning two (or more) different
>> +	 * groups of alloc_hint, this is not always true. So sort
>> +	 * them again.
>> +	 */
>> +
>> +	for (i = 0 ; i < ndevs ; i++)
>> +		devices_info[i].alloc_hint = 0;
>> +
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +
>>   	return 0;
>>   }
>>   
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index b8250f29df6e..37eb37b533c5 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -369,6 +369,7 @@ struct btrfs_device_info {
>>   	u64 dev_offset;
>>   	u64 max_avail;
>>   	u64 total_avail;
>> +	int alloc_hint;
>>   };
>>   
>>   struct btrfs_raid_attr {
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-17 18:28     ` Goffredo Baroncelli
@ 2021-12-17 19:41       ` Zygo Blaxell
  2021-12-18  9:07         ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-17 19:41 UTC (permalink / raw)
  To: kreijack
  Cc: Hans van Kranenburg, linux-btrfs, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq

On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
> On 12/17/21 16:58, Hans van Kranenburg wrote:
> > Hi,
> > 
> > On 10/24/21 5:31 PM, Goffredo Baroncelli wrote:
> > > From: Goffredo Baroncelli <kreijack@inwind.it>
> > > 
> > > When this mode is enabled, the chunk allocation policy is modified as
> > > follow.
> > > 
> > > Each disk may have a different tag:
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> > > 
> > > Where:
> > > - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
> > > the X chunk type (the other type may be allowed when the space is low)
> > > - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
> > > This means also that it is a preferred choice.
> > > 
> > > Each time the allocator allocates a chunk of type X , first it takes the
> > > disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
> > > is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
> > > if the space is not enough, it uses also the other disks, with the
> > > exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
> > > type of chunk (i.e. not X).
> > 
> > I have read this last paragraph five times now, I think, but I still have no
> > idea what's going on here. Can you rephrase it? It's more complex than
> > the actual code below.
> > 
> > What the above text tells me is that if I start filling up my disks, it
> > will happily write DATA to METADATA_ONLY disks when the DATA ones are
> > full? And if the METADATA_ONLY disks are full with DATA also, it will
> > not write DATA into PREFERRED_METADATA disks.
> > 
> > I don't think that's what the code actually does. At least, I hope not.
> > 
> Thanks for reviewing it: my English is very bad :-(
> 
> Yes, I wrote ALLOCATION_METADATA_ONLY where I meant to write ALLOCATION_PREFERRED_Y.
> 
> Let me rewrite it as follows:
> 
> -----------------------------
> The chunk allocation policy is modified as follows.
> 
> Each disk may have one of the following tags:
> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Is it too late to rename these?  The order of the words is inconsistent
and the English usage is a bit odd.

I'd much rather have:

> - BTRFS_DEV_ALLOCATION_PREFER_METADATA
> - BTRFS_DEV_ALLOCATION_ONLY_METADATA
> - BTRFS_DEV_ALLOCATION_ONLY_DATA
> - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)

English speakers would say "[I/we/you] prefer X" or "X [is] preferred".

or

> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)

I keep typing "data_preferred" and "only_data" when it's really
"preferred_data" and "data_only" because they're not consistent.

> During a *mixed data/metadata* chunk allocation, BTRFS works as
> usual.
> 
> During a *data* chunk allocation, the space is searched first on the disks
> tagged BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA.
> If no space is found, or the space found is not enough (e.g. in raid5, when
> only two such disks are available), the disks tagged
> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are also evaluated. If even then
> the space is not sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
> for a data BG allocation.
> 
> During a *metadata* chunk allocation, the space is searched first on the
> disks tagged BTRFS_DEV_ALLOCATION_METADATA_ONLY and
> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA. If no space is found, or the space
> found is not enough (e.g. in raid5, when only two such disks are available),
> the disks tagged BTRFS_DEV_ALLOCATION_PREFERRED_DATA are also considered.
> If even then the space is not sufficient, -ENOSPC is raised.
> A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
> for a metadata BG allocation.
> 
> By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
> so the default behavior is unchanged. If the user prefers to store the
> metadata on the faster disks (e.g. the SSDs), he can tag these with
> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: in this case the data BGs go to
> the BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BGs to the
> tagged ones, as long as there is enough space. Only when one disk set is
> filled is the other occupied.
> 
> WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
> this disk will never be used for allocating metadata, which increases
> the likelihood of exhausting the metadata space.

This WARNING is not correct.  We use a combination of METADATA_ONLY and
DATA_ONLY preferences to exclude data allocations from metadata devices,
reducing the likelihood of exhausting the metadata space all the way
to zero.  We do have to provide correctly-sized metadata devices, but
SSDs come in powers-of-2 sizes, so we just bump up to the next power of
two or add another SSD to the filesystem every time a metadata device
goes over 50%.

Metadata-only devices completely eliminate our need to do other
workarounds like data balances to reclaim unallocated space for metadata.

_PREFERRED devices are the problematic case.  Since no space is
exclusively reserved for metadata, it means you have to do maintenance
data balances as the filesystem fills up because you will be constantly
getting data clogging up your metadata devices.

There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
respectively.  But this use case is not a very good match for what the
implementation does--we'd want to separate device selection ("can I use
this device for metadata, ever?") from ordering ("which devices should
I use for metadata first?").

To keep things simple I'd say that use case is out of scope, and recommend
not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
allocate everything with _ONLY, or mark every device _PREFERRED one way
or the other, but don't use both _ONLY and _PREFERRED at the same time
unless you really know what you're doing.

> ---------------------------------
> 
> 
> 
> > > Where:
> > > - ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
> > > the X chunk type (the other type may be allowed when the space is low)
> > > - ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
> > > This means also that it is a preferred choice.
> > > 
> > > Each time the allocator allocates a chunk of type X , first it takes the
> > > disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
> > > is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
> > > if the space is not enough, it uses also the other disks, with the
> > > exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
> > > type of chunk (i.e. not X).
> 
> > Hans
> > 
> > > Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> > > ---
> > >   fs/btrfs/volumes.c | 98 +++++++++++++++++++++++++++++++++++++++++++++-
> > >   fs/btrfs/volumes.h |  1 +
> > >   2 files changed, 98 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > > index 8ac99771f43c..7ee9c6e7bd44 100644
> > > --- a/fs/btrfs/volumes.c
> > > +++ b/fs/btrfs/volumes.c
> > > @@ -153,6 +153,20 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> > >   	},
> > >   };
> > > +#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
> > > +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
> > > +#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
> > > +		BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
> > > +
> > > +static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
> > > +	[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
> > > +	[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
> > > +	[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
> > > +	[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
> > > +	/* the other values are set to 0 */
> > > +};
> > > +
> > > +
> > >   const char *btrfs_bg_type_to_raid_name(u64 flags)
> > >   {
> > >   	const int index = btrfs_bg_flags_to_raid_index(flags);
> > > @@ -4997,13 +5011,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
> > >   }
> > >   /*
> > > - * sort the devices in descending order by max_avail, total_avail
> > > + * sort the devices in descending order by alloc_hint,
> > > + * max_avail, total_avail
> > >    */
> > >   static int btrfs_cmp_device_info(const void *a, const void *b)
> > >   {
> > >   	const struct btrfs_device_info *di_a = a;
> > >   	const struct btrfs_device_info *di_b = b;
> > > +	if (di_a->alloc_hint > di_b->alloc_hint)
> > > +		return -1;
> > > +	if (di_a->alloc_hint < di_b->alloc_hint)
> > > +		return 1;
> > >   	if (di_a->max_avail > di_b->max_avail)
> > >   		return -1;
> > >   	if (di_a->max_avail < di_b->max_avail)
> > > @@ -5166,6 +5185,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
> > >   	int ndevs = 0;
> > >   	u64 max_avail;
> > >   	u64 dev_offset;
> > > +	int hint;
> > > +	int i;
> > >   	/*
> > >   	 * in the first pass through the devices list, we gather information
> > > @@ -5218,16 +5239,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
> > >   		devices_info[ndevs].max_avail = max_avail;
> > >   		devices_info[ndevs].total_avail = total_avail;
> > >   		devices_info[ndevs].dev = device;
> > > +
> > > +		if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> > > +		     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
> > > +			/*
> > > +			 * if mixed bg set all the alloc_hint
> > > +			 * fields to the same value, so the sorting
> > > +			 * is not affected
> > > +			 */
> > > +			devices_info[ndevs].alloc_hint = 0;
> > > +		} else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> > > +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
> > > +
> > > +			/*
> > > +			 * skip BTRFS_DEV_METADATA_ONLY disks
> > > +			 */
> > > +			if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
> > > +				continue;
> > > +			/*
> > > +			 * if a data chunk must be allocated,
> > > +			 * sort also by hint (data disk
> > > +			 * higher priority)
> > > +			 */
> > > +			devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
> > > +		} else { /* BTRFS_BLOCK_GROUP_METADATA */
> > > +			hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
> > > +
> > > +			/*
> > > +			 * skip BTRFS_DEV_DATA_ONLY disks
> > > +			 */
> > > +			if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
> > > +				continue;
> > > +			/*
> > > +			 * if a metadata chunk must be allocated,
> > > +			 * sort also by hint (metadata hint
> > > +			 * higher priority)
> > > +			 */
> > > +			devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
> > > +		}
> > > +
> > >   		++ndevs;
> > >   	}
> > >   	ctl->ndevs = ndevs;
> > > +	/*
> > > +	 * no devices available
> > > +	 */
> > > +	if (!ndevs)
> > > +		return 0;
> > > +
> > >   	/*
> > >   	 * now sort the devices by hole size / available space
> > >   	 */
> > >   	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> > >   	     btrfs_cmp_device_info, NULL);
> > > +	/*
> > > +	 * select the minimum set of disks grouped by hint that
> > > +	 * can host the chunk
> > > +	 */
> > > +	ndevs = 0;
> > > +	while (ndevs < ctl->ndevs) {
> > > +		hint = devices_info[ndevs++].alloc_hint;
> > > +		while (ndevs < ctl->ndevs &&
> > > +		       devices_info[ndevs].alloc_hint == hint)
> > > +				ndevs++;
> > > +		if (ndevs >= ctl->devs_min)
> > > +			break;
> > > +	}
> > > +
> > > +	BUG_ON(ndevs > ctl->ndevs);
> > > +	ctl->ndevs = ndevs;
> > > +
> > > +	/*
> > > +	 * the next layers require the devices_info ordered by
> > > +	 * max_avail. If we are returning two (or more) different
> > > +	 * groups of alloc_hint, this is not always true. So sort
> > > +	 * them again.
> > > +	 */
> > > +
> > > +	for (i = 0 ; i < ndevs ; i++)
> > > +		devices_info[i].alloc_hint = 0;
> > > +
> > > +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> > > +	     btrfs_cmp_device_info, NULL);
> > > +
> > >   	return 0;
> > >   }
> > > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> > > index b8250f29df6e..37eb37b533c5 100644
> > > --- a/fs/btrfs/volumes.h
> > > +++ b/fs/btrfs/volumes.h
> > > @@ -369,6 +369,7 @@ struct btrfs_device_info {
> > >   	u64 dev_offset;
> > >   	u64 max_avail;
> > >   	u64 total_avail;
> > > +	int alloc_hint;
> > >   };
> > >   struct btrfs_raid_attr {
> > > 
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-17 19:41       ` Zygo Blaxell
@ 2021-12-18  9:07         ` Goffredo Baroncelli
  2021-12-18 22:48           ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2021-12-18  9:07 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Hans van Kranenburg, linux-btrfs, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq

On 12/17/21 20:41, Zygo Blaxell wrote:
> On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
>> On 12/17/21 16:58, Hans van Kranenburg wrote:
[...]
>> -----------------------------
>> The chunk allocation policy is modified as follows.
>>
>> Each disk may have one of the following tags:
>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> 
> Is it too late to rename these?  The order of the words is inconsistent
> and the English usage is a bit odd.
> 
> I'd much rather have:
> 
>> - BTRFS_DEV_ALLOCATION_PREFER_METADATA
>> - BTRFS_DEV_ALLOCATION_ONLY_METADATA
>> - BTRFS_DEV_ALLOCATION_ONLY_DATA
>> - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
> 
> English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
> 
> or
> 
>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> 
> I keep typing "data_preferred" and "only_data" when it's really
> "preferred_data" and "data_only" because they're not consistent.
> 

Sorry, but the last sentence is unclear to me :-)

Anyway I prefer
BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
BTRFS_DEV_ALLOCATION_METADATA_ONLY
[...]

Because it seems more consistent to me



>> During a *mixed data/metadata* chunk allocation, BTRFS works as
>> usual.
>>
>> During a *data* chunk allocation, space is searched first in the
>> BTRFS_DEV_ALLOCATION_DATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_DATA
>> tagged disks. If no space is found, or the space found is not enough (e.g.
>> in raid5, only two disks are available), then the disks tagged
>> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA are also evaluated. If even in this
>> case the space is not sufficient, -ENOSPC is raised.
>> A disk tagged with BTRFS_DEV_ALLOCATION_METADATA_ONLY is never considered
>> for a data BG allocation.
>>
>> During a *metadata* chunk allocation, space is searched first in the
>> BTRFS_DEV_ALLOCATION_METADATA_ONLY and BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>> tagged disks. If no space is found, or the space found is not enough (e.g.
>> in raid5, only two disks are available), then the disks tagged
>> BTRFS_DEV_ALLOCATION_PREFERRED_DATA are also considered. If even in this
>> case the space is not sufficient, -ENOSPC is raised.
>> A disk tagged with BTRFS_DEV_ALLOCATION_DATA_ONLY is never considered
>> for a metadata BG allocation.
>>
>> By default the disks are tagged as BTRFS_DEV_ALLOCATION_PREFERRED_DATA,
>> so the default behavior is preserved. If the user prefers to store the
>> metadata on the faster disks (e.g. the SSDs), he can tag these with
>> BTRFS_DEV_ALLOCATION_PREFERRED_METADATA: in this case the data BGs go to
>> the BTRFS_DEV_ALLOCATION_PREFERRED_DATA disks and the metadata BGs to the
>> others, as long as there is enough space. Only when one disk set is
>> filled does the other get used.
>>
>> WARNING: if the user tags a disk with BTRFS_DEV_ALLOCATION_DATA_ONLY,
>> this means that this disk will never be used for allocating metadata
>> increasing the likelihood of exhausting the metadata space.
> 
> This WARNING is not correct.  We use a combination of METADATA_ONLY and
> DATA_ONLY preferences to exclude data allocations from metadata devices,
> reducing the likelihood of exhausting the metadata space all the way
> to zero.  We do have to provide correctly-sized metadata devices, but
> SSDs come in powers-of-2 sizes, so we just bump up to the next power of
> two or add another SSD to the filesystem every time a metadata device
> goes over 50%.
> 
> Metadata-only devices completely eliminate our need to do other
> workarounds like data balances to reclaim unallocated space for metadata.
> 
> _PREFERRED devices are the problematic case.  Since no space is
> exclusively reserved for metadata, it means you have to do maintenance
> data balances as the filesystem fills up because you will be constantly
> getting data clogging up your metadata devices.


> 
> There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
> with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
> METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
> respectively.  But this use case is not a very good match for what the
> implementation does--we'd want to separate device selection ("can I use
> this device for metadata, ever?") from ordering ("which devices should
> I use for metadata first?").
> 
> To keep things simple I'd say that use case is out of scope, and recommend
> not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
> allocate everything with _ONLY, or mark every device _PREFERRED one way
> or the other, but don't use both _ONLY and _PREFERRED at the same time
> unless you really know what you're doing.

In what way would METADATA_ONLY + DATA_PREFERRED be more dangerous than
METADATA_ONLY + DATA_ONLY ?

In fact, I see two main, different use cases:
- I want to put my metadata on an SSD for performance reasons:
	METADATA_PREFERRED + DATA_PREFERRED
    as the most conservative approach
- I want to protect the metadata BG space from exhaustion (assuming that
   a "today's standard" disk is far larger than the total metadata BGs):
	METADATA_ONLY + X
   is a valid approach




[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-18  9:07         ` Goffredo Baroncelli
@ 2021-12-18 22:48           ` Zygo Blaxell
  2021-12-19  0:03             ` Graham Cobb
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-18 22:48 UTC (permalink / raw)
  To: kreijack
  Cc: Hans van Kranenburg, linux-btrfs, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq

On Sat, Dec 18, 2021 at 10:07:18AM +0100, Goffredo Baroncelli wrote:
> On 12/17/21 20:41, Zygo Blaxell wrote:
> > On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
> > > On 12/17/21 16:58, Hans van Kranenburg wrote:
> [...]
> > > -----------------------------
> > > The chunk allocation policy is modified as follow.
> > > 
> > > Each disk may have one of the following tags:
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> > 
> > Is it too late to rename these?  The order of the words is inconsistent
> > and the English usage is a bit odd.
> > 
> > I'd much rather have:
> > 
> > > - BTRFS_DEV_ALLOCATION_PREFER_METADATA
> > > - BTRFS_DEV_ALLOCATION_ONLY_METADATA
> > > - BTRFS_DEV_ALLOCATION_ONLY_DATA
> > > - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
> > 
> > English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
> > 
> > or
> > 
> > > - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> > > - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_ONLY
> > > - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> > 
> > I keep typing "data_preferred" and "only_data" when it's really
> > "preferred_data" and "data_only" because they're not consistent.
> > 
> 
> Sorry but it is unclear to me the last sentence :-)
> 
> Anyway I prefer
> BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> BTRFS_DEV_ALLOCATION_METADATA_ONLY
> [...]
> 
> Because it seems to me more consistent

Sounds good.

> > There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
> > with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
> > METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
> > respectively.  But this use case is not a very good match for what the
> > implementation does--we'd want to separate device selection ("can I use
> > this device for metadata, ever?") from ordering ("which devices should
> > I use for metadata first?").
> > 
> > To keep things simple I'd say that use case is out of scope, and recommend
> > not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
> > allocate everything with _ONLY, or mark every device _PREFERRED one way
> > or the other, but don't use both _ONLY and _PREFERRED at the same time
> > unless you really know what you're doing.
> 
> In what way would METADATA_ONLY + DATA_PREFERRED be more dangerous than
> METADATA_ONLY + DATA_ONLY ?

If capacity is our first priority, we use METADATA_PREFERRED
and DATA_PREFERRED (everything can be allocated everywhere, we try
the highest performance but fall back).

If performance is our first priority, we use METADATA_ONLY and DATA_ONLY
(so we never have to balance which would reduce performance) or
METADATA_PREFERRED and DATA_ONLY (so we have more capacity, but get
lower performance because we must balance data in some cases, but not
as low as any combination of options with DATA_PREFERRED).

If we have a complicated setup with 3 or more drives we might use 3 or
4 options at once to create multiple performance and capacity tiers.
But never exactly those two options.

What is METADATA_ONLY + DATA_PREFERRED for?

METADATA_ONLY + DATA_PREFERRED allows metadata to be allocated on data
drives, causing a performance crash.  It doesn't remove the need for
data balances since metadata and data can compete for space on the
DATA_PREFERRED devices.  It reduces data capacity (no data on the
METADATA_ONLY device) but doesn't guarantee a performance benefit
(metadata is allowed on the DATA_PREFERRED device).  Of all the
combinations of options, why would a user choose this one?

> In fact, I see two main, different use cases:
> - I want to put my metadata on an SSD for performance reasons:
> 	METADATA_PREFERRED + DATA_PREFERRED
>    as the most conservative approach

If you're using METADATA_PREFERRED, your first priority can only be
capacity, as performance will fail when some of the disks fill up.

We can't prioritize performance and capacity at the same time (at least
not with this code).  The user must choose which is most important
in cases when both are not available.

I see now why you keep missing this use case--it is because you are
thinking that PREFERRED is a valid option for performance use cases,
and therefore serves a superset of ONLY use cases.  ONLY and PREFERRED
serve use cases with opposite requirements, so one cannot be used to
serve the needs of the other.

PREFERRED is for capacity use cases, not for performance use cases.
PREFERRED only improves performance when over 96%(*) of the metadata
accesses hit the SSD; otherwise, the 4% of metadata on spinning devices
will be so slow that it will dominate the metadata access time.  When it
is forced to make a decision, PREFERRED will choose capacity, and drop
performance to pre-allocation-preference levels very quickly.  If the
user can tolerate the worst-case performance, then PREFERRED can provide
average performance above the worst case, but below the best case.
Only worst-case performance is guaranteed (within the limits of this
patch and the current btrfs allocators).

ONLY is for performance, not capacity (though it can also be used for one
of its side-effects on metadata allocation).  ONLY will not sacrifice
performance because one of the disks filled up.  ONLY will give us
ENOSPC immediately to tell us that we have no more space available with
acceptable performance.  ONLY guarantees best-case performance all the
time, since it doesn't allow any other case to arise.

(*) 24:1 is the raw ratio of access time between HDD data and NVMe
drives, but btrfs typically does 2-4x as many metadata iops as data iops,
which makes the effect on performance even worse when metadata leaks
onto HDD.  Also the most recently allocated block groups are most active,
which weights the HDD overflow even more in the average access time.
We typically get "node timeout" alarms immediately after the first
metadata block group on a big filesystem is misallocated on a spinning
disk, which is an effective ratio of 250:1 or 99.6%.  As a result,
we have zero tolerance for metadata on HDD.

> - I want to protect the metadata BG space from exhaustion (assuming that
>   a "today standard" disk is far larger than the total BG metadata)
> 	METADATA_ONLY + X
>   is a valid approach

Even if today's disks are too small, you can always add more of them.
It might even improve performance to make a raid10 metadata array.

> [...]
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-18 22:48           ` Zygo Blaxell
@ 2021-12-19  0:03             ` Graham Cobb
  2021-12-19  2:30               ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Graham Cobb @ 2021-12-19  0:03 UTC (permalink / raw)
  To: Zygo Blaxell, kreijack
  Cc: Hans van Kranenburg, linux-btrfs, Josef Bacik, David Sterba,
	Sinnamohideen Shafeeq

On 18/12/2021 22:48, Zygo Blaxell wrote:
> On Sat, Dec 18, 2021 at 10:07:18AM +0100, Goffredo Baroncelli wrote:
>> On 12/17/21 20:41, Zygo Blaxell wrote:
>>> On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
>>>> On 12/17/21 16:58, Hans van Kranenburg wrote:
>> [...]
>>>> -----------------------------
>>>> The chunk allocation policy is modified as follow.
>>>>
>>>> Each disk may have one of the following tags:
>>>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
>>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>>>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
>>>
>>> Is it too late to rename these?  The order of the words is inconsistent
>>> and the English usage is a bit odd.
>>>
>>> I'd much rather have:
>>>
>>>> - BTRFS_DEV_ALLOCATION_PREFER_METADATA
>>>> - BTRFS_DEV_ALLOCATION_ONLY_METADATA
>>>> - BTRFS_DEV_ALLOCATION_ONLY_DATA
>>>> - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
>>>
>>> English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
>>>
>>> or
>>>
>>>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
>>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
>>>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
>>>
>>> I keep typing "data_preferred" and "only_data" when it's really
>>> "preferred_data" and "data_only" because they're not consistent.
>>>
>>
>> Sorry but it is unclear to me the last sentence :-)
>>
>> Anyway I prefer
>> BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
>> BTRFS_DEV_ALLOCATION_METADATA_ONLY
>> [...]
>>
>> Because it seems to me more consistent
> 
> Sounds good.
> 
>>> There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
>>> with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
>>> METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
>>> respectively.  But this use case is not a very good match for what the
>>> implementation does--we'd want to separate device selection ("can I use
>>> this device for metadata, ever?") from ordering ("which devices should
>>> I use for metadata first?").
>>>
>>> To keep things simple I'd say that use case is out of scope, and recommend
>>> not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
>>> allocate everything with _ONLY, or mark every device _PREFERRED one way
>>> or the other, but don't use both _ONLY and _PREFERRED at the same time
>>> unless you really know what you're doing.
>>
>> In what METADATA_ONLY + DATA_PREFERRED would be more dangerous than
>> METADATA_ONLY + DATA_ONLY ?
> 
> If capacity is our first priority, we use METADATA_PREFERRED
> and DATA_PREFERRED (everything can be allocated everywhere, we try
> the highest performance but fall back).
> 
> If performance is our first priority, we use METADATA_ONLY and DATA_ONLY
> (so we never have to balance which would reduce performance) or
> METADATA_PREFERRED and DATA_ONLY (so we have more capacity, but get
> lower performance because we must balance data in some cases, but not
> as low as any combination of options with DATA_PREFERRED).

I think it would be a mistake to think that your performance and
capacity use cases are the only ones others will care about.

Your analysis misses a third option for priority: resilience. I have a
nearline backup server. It stores a lot of data but it is almost
entirely write-only. My priority is to be able to get at most of the
data quickly if I need it sometime - it isn't critical for any specific
piece of data as I have additional, slower backups, but I want to be
able to restore as much as possible from this server for speed and
convenience. To keep as much nearline backup as possible, I keep data in
SINGLE and metadata in RAID1. Fine - I can do that today.

However, in normal use the main activity is btrfs receive of many mostly
unchanged subvolumes every day. So, what I do today is have a large data
disk and a second small disk for the RAID1 copy of the metadata. I want
to keep data off that second disk. With this patch, I expect to set the
metadata disk as METADATA_ONLY and the data disk as DATA_PREFERRED.

Of course I would *also* like to be able to get btrfs to mostly read the
RAID1 copy from the fast metadata disk for reading metadata. This patch
does not address that, but I hope one day there will be a separate
option for that.

I think the proposed settings are a useful step and will allow some
experimentation and learning with different scenarios. They certainly
aren't the answer to all allocation problems but I would like to see
them available as soon as possible,

Graham

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/4] btrfs: add allocator_hint mode
  2021-12-19  0:03             ` Graham Cobb
@ 2021-12-19  2:30               ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2021-12-19  2:30 UTC (permalink / raw)
  To: Graham Cobb
  Cc: kreijack, Hans van Kranenburg, linux-btrfs, Josef Bacik,
	David Sterba, Sinnamohideen Shafeeq

On Sun, Dec 19, 2021 at 12:03:32AM +0000, Graham Cobb wrote:
> On 18/12/2021 22:48, Zygo Blaxell wrote:
> > On Sat, Dec 18, 2021 at 10:07:18AM +0100, Goffredo Baroncelli wrote:
> >> On 12/17/21 20:41, Zygo Blaxell wrote:
> >>> On Fri, Dec 17, 2021 at 07:28:28PM +0100, Goffredo Baroncelli wrote:
> >>>> On 12/17/21 16:58, Hans van Kranenburg wrote:
> >> [...]
> >>>> -----------------------------
> >>>> The chunk allocation policy is modified as follow.
> >>>>
> >>>> Each disk may have one of the following tags:
> >>>> - BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
> >>>
> >>> Is it too late to rename these?  The order of the words is inconsistent
> >>> and the English usage is a bit odd.
> >>>
> >>> I'd much rather have:
> >>>
> >>>> - BTRFS_DEV_ALLOCATION_PREFER_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_ONLY_METADATA
> >>>> - BTRFS_DEV_ALLOCATION_ONLY_DATA
> >>>> - BTRFS_DEV_ALLOCATION_PREFER_DATA (default)
> >>>
> >>> English speakers would say "[I/we/you] prefer X" or "X [is] preferred".
> >>>
> >>> or
> >>>
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> >>>> - BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_ONLY
> >>>> - BTRFS_DEV_ALLOCATION_DATA_PREFERRED (default)
> >>>
> >>> I keep typing "data_preferred" and "only_data" when it's really
> >>> "preferred_data" and "data_only" because they're not consistent.
> >>>
> >>
> >> Sorry but it is unclear to me the last sentence :-)
> >>
> >> Anyway I prefer
> >> BTRFS_DEV_ALLOCATION_METADATA_PREFERRED
> >> BTRFS_DEV_ALLOCATION_METADATA_ONLY
> >> [...]
> >>
> >> Because it seems to me more consistent
> > 
> > Sounds good.
> > 
> >>> There is a use case for a mix of _PREFERRED and _ONLY devices:  a system
> >>> with NVMe, SSD, and HDD might want to have the SSD use DATA_PREFERRED or
> >>> METADATA_PREFERRED while the NVMe and HDD use METADATA_ONLY and DATA_ONLY
> >>> respectively.  But this use case is not a very good match for what the
> >>> implementation does--we'd want to separate device selection ("can I use
> >>> this device for metadata, ever?") from ordering ("which devices should
> >>> I use for metadata first?").
> >>>
> >>> To keep things simple I'd say that use case is out of scope, and recommend
> >>> not mixing _PREFERRED and _ONLY in the same filesystem.  Either explicitly
> >>> allocate everything with _ONLY, or mark every device _PREFERRED one way
> >>> or the other, but don't use both _ONLY and _PREFERRED at the same time
> >>> unless you really know what you're doing.
> >>
> >> In what METADATA_ONLY + DATA_PREFERRED would be more dangerous than
> >> METADATA_ONLY + DATA_ONLY ?
> > 
> > If capacity is our first priority, we use METADATA_PREFERRED
> > and DATA_PREFERRED (everything can be allocated everywhere, we try
> > the highest performance but fall back).
> > 
> > If performance is our first priority, we use METADATA_ONLY and DATA_ONLY
> > (so we never have to balance which would reduce performance) or
> > METADATA_PREFERRED and DATA_ONLY (so we have more capacity, but get
> > lower performance because we must balance data in some cases, but not
> > as low as any combination of options with DATA_PREFERRED).
> 
> I think it would be a mistake to think that your performance and
> capacity use cases are the only ones others will care about.
> 
> Your analysis misses a third option for priority: resilience. I have a
> nearline backup server. It stores a lot of data but it is almost
> entirely write-only. My priority is to be able to get at most of the
> data quickly if I need it sometime - it isn't critical for any specific
> piece of data as I have additional, slower backups, but I want to be
> able to restore as much as possible from this server for speed and
> convenience. To keep as much nearline backup as possible, I keep data in
> SINGLE and metadata in RAID1. Fine - I can do that today.

We have a lot of servers built this way.  None with just two disks though.
Maybe that gives me a blind spot for the 2-disk corner cases...

> However, in normal use the main activity is btrfs receive of many mostly
> unchanged subvolumes every day. So, what I do today is have a large data
> disk and a second small disk for the RAID1 copy of the metadata. I want
> to keep data off that second disk. With this patch, I expect to set the
> metadata disk as METADATA_ONLY and the data disk as DATA_PREFERRED.

Thanks, that's exactly the missing use case I was looking for.

In this case, we are starting off by prioritizing performance over
capacity (otherwise we'd use METADATA_PREFERRED and DATA_PREFERRED),
so we would normally use METADATA_ONLY and DATA_ONLY.  If we had enough
disks for metadata, we'd be done and we'd stop there.

We can't use DATA_ONLY here, because we would then have only one disk
for raid1 metadata, and that doesn't work.  We could have DATA_ONLY
with dup metadata for best performance, but then we lose resilience
because SSD failure would wipe out the whole filesystem.  We can
add more data disks and then make some of them DATA_ONLY, but we'd
always need at least two devices to have one of the three allocation
preferences that allow metadata, or raid1 metadata won't work at all.
We could use the SSD for a block-layer writethrough cache instead,
but while the resiliency properties are similar (the SSD is expendable, the HDD
takes all the data with it when it fails), it's not completely equivalent:
it's prone to data evicting the metadata, which makes it less useful,
and it's a separate chunk of code with exposure to more kernel bugs.
So every other option is worse than METADATA_ONLY and DATA_PREFERRED.

With the read_policy patches also applied, and allocation preferences
set to METADATA_ONLY and DATA_PREFERRED, we'd have maximum performance
on reads (they would exclusively use SSD) and minimum performance on
writes (they would all necessarily use HDD) and get the best possible
result given the combination of performance, resilience, and available
device constraints.

> Of course I would *also* like to be able to get btrfs to mostly read the
> RAID1 copy from the fast metadata disk for reading metadata. This patch
> does not address that, but I hope one day there will be a separate
> option for that.

That's the read_policy patch set.  I've been running that for a while too,
but I dropped it when I started having problems with kernels after 5.10.
I'm still having problems without the read_policy patch set, so it was
probably fine?

> I think the proposed settings are a useful step and will allow some
> experimentation and learning with different scenarios. They certainly
> aren't the answer to all allocation problems but I would like to see
> them available as soon as possible,
> 
> Graham
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-12-19  2:31 UTC | newest]

Thread overview: 32+ messages
2021-10-24 15:31 [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 1/4] btrfs: add flags to give an hint to the chunk allocator Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 2/4] btrfs: export dev_item.type in /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs Goffredo Baroncelli
2021-10-24 15:31 ` [PATCH 4/4] btrfs: add allocator_hint mode Goffredo Baroncelli
2021-12-17 15:58   ` Hans van Kranenburg
2021-12-17 18:28     ` Goffredo Baroncelli
2021-12-17 19:41       ` Zygo Blaxell
2021-12-18  9:07         ` Goffredo Baroncelli
2021-12-18 22:48           ` Zygo Blaxell
2021-12-19  0:03             ` Graham Cobb
2021-12-19  2:30               ` Zygo Blaxell
2021-12-13  9:39 ` [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode Paul Jones
2021-12-13 19:54   ` Goffredo Baroncelli
2021-12-13 21:15     ` Josef Bacik
2021-12-13 22:49       ` Zygo Blaxell
2021-12-14 14:31         ` Josef Bacik
2021-12-14 19:03         ` Goffredo Baroncelli
2021-12-14 20:04           ` Zygo Blaxell
2021-12-14 20:34             ` Josef Bacik
2021-12-14 20:41               ` Goffredo Baroncelli
2021-12-15 13:58                 ` Josef Bacik
2021-12-15 18:53                   ` Goffredo Baroncelli
2021-12-16  0:56                     ` Josef Bacik
2021-12-17  5:40                       ` Zygo Blaxell
2021-12-17 14:48                         ` Josef Bacik
2021-12-17 16:31                           ` Zygo Blaxell
2021-12-17 18:08                         ` Goffredo Baroncelli
2021-12-16  2:30                   ` Paul Jones
2021-12-14  1:03       ` Sinnamohideen, Shafeeq
2021-12-14 18:53       ` Goffredo Baroncelli
2021-12-14 20:35         ` Josef Bacik
