linux-btrfs.vger.kernel.org archive mirror
* [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
@ 2020-05-28 18:34 Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrieve the device properties Goffredo Baroncelli
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-28 18:34 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell


[the previous patch sets called this mode ssd_metadata]

Hi all,

This is an RFC; I wrote this patch because I find the idea interesting
even though it adds more complexity to the chunk allocator.

The initial idea was to store the metadata on the ssd and to leave the data
on the rotational disks. The kind of disk was determined from the rotational
flag. However, looking only at the rotational flag is not flexible enough, so
I added a device property called "preferred_metadata" to mark a device
as preferred for metadata.

A separate patch set has been sent to extend the "btrfs property" command
to support the preferred_metadata device flag. The basic usage is:

    $ # set a new value
    $ sudo btrfs property set /dev/vde preferred_metadata 1
    
    $ # get the current value
    $ sudo btrfs property get /dev/vde preferred_metadata
    devid=4, path=/dev/vde: dedicated_metadata=1

This new mode is enabled by passing the option "preferred_metadata" at mount time.
This allocation policy is then the default one. However, if this doesn't permit
a chunk allocation, the "classic" one is used.
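
A mount invocation, for example, looks like this (the device and mount point
below are just placeholders):

    $ sudo mount -o preferred_metadata /dev/sdX /mnt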

Some examples: (/dev/sd[abc] are marked as preferred_metadata,
and /dev/sd[ef] are not)

Non striped profile: metadata->raid1, data->raid1
The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
When /dev/sd[ef] are full, then the data chunk is allocated also on
/dev/sd[abc].

Striped profile: metadata->raid6, data->raid6
raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
data profile raid6. To allow a data chunk allocation, the data profile raid6
will be stored on all the disks /dev/sd[abcdef].
Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
because these are enough to host this chunk.

The patch set is composed of four patches:

- The first patch adds the ioctl to update the btrfs_dev_item.type field.
The ioctl is generic enough to handle more fields; however, for now only the
"type" field is supported (a userspace usage sketch follows the patch list
below).

- The second patch adds the flag BTRFS_DEV_PREFERRED_METADATA which is
used to mark a device as "preferred_metadata"

- The third patch exports the btrfs_dev_item.type field via sysfs files
/sys/fs/btrfs/<UUID>/devinfo/<devid>/type

It is only possible to read the value. Updating the value is not implemented,
because in fs/btrfs/sysfs.c there is a comment that states:
"We don't want to do full transaction commit from inside sysfs".

- The fourth patch implements this new mode
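
To make the ioctl semantics more concrete, here is a minimal userspace
sketch (not part of this patch set, and not the btrfs-progs implementation)
that sets and then reads back the "type" field through
BTRFS_IOC_DEV_PROPERTIES. It assumes the patched uapi headers from the first
two patches are installed, must be run as root (CAP_SYS_ADMIN), and keeps
error handling to the bare minimum.

  /* usage: ./devprops <mountpoint> <devid> */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/btrfs.h>       /* BTRFS_IOC_DEV_PROPERTIES, BTRFS_DEV_PROPERTY_* */
  #include <linux/btrfs_tree.h>  /* BTRFS_DEV_PREFERRED_METADATA */

  int main(int argc, char **argv)
  {
          struct btrfs_ioctl_dev_properties props;
          int fd;

          if (argc != 3)
                  return 1;

          /* any fd on the mounted filesystem works, e.g. the mount point */
          fd = open(argv[1], O_RDONLY);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /* write path: mark the device as preferred for metadata */
          memset(&props, 0, sizeof(props));
          props.devid = strtoull(argv[2], NULL, 0);
          props.properties = BTRFS_DEV_PROPERTY_TYPE;
          props.type = BTRFS_DEV_PREFERRED_METADATA;
          if (ioctl(fd, BTRFS_IOC_DEV_PROPERTIES, &props) < 0)
                  perror("set type");

          /* read path: ask the 'type' property back */
          memset(&props, 0, sizeof(props));
          props.devid = strtoull(argv[2], NULL, 0);
          props.properties = BTRFS_DEV_PROPERTY_READ | BTRFS_DEV_PROPERTY_TYPE;
          if (ioctl(fd, BTRFS_IOC_DEV_PROPERTIES, &props) < 0)
                  perror("get type");
          else
                  printf("devid=%s type=0x%llx\n", argv[2],
                         (unsigned long long)props.type);

          return 0;
  }

The value written here is also what the sysfs file
/sys/fs/btrfs/<UUID>/devinfo/<devid>/type added by the third patch exposes
(read-only).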

Changelog:
v4: - renamed ssd_metadata to preferred_metadata
    - add the device property "preferred_metadata"
    - add the ioctl BTRFS_IOC_DEV_PROPERTIES
    - export the btrfs_dev_item.type values via sysfs
v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
      BTRFS_MOUNT_SSD_METADATA.
v2: - rebased to v5.6.2
    - correct the comparison about the rotational disks (>= instead of >)
    - add the flag rotational to the struct btrfs_device_info to
      simplify the comparison function (btrfs_cmp_device_info*() )
v1: - first issue

Below I collected some data to highlight the performance improvement.

Test setup:
As a test I performed a "dist-upgrade" of a Debian system from stretch to
buster. The test used an image of a Debian stretch [1] with the needed
packages already present under /var/cache/apt/archives/ (so no networking
was involved). For each test I formatted the filesystem from scratch,
un-tarred the image and then ran "apt-get dist-upgrade" [2]. For each
disk(s)/filesystem combination I measured the time of apt dist-upgrade with
and without the flag "force-unsafe-io", which reduces the use of
sync(2)/fsync(2). The ssd was 20GB, the hdd 230GB.

I considered the following scenarios:
- btrfs over ssd
- btrfs over ssd + hdd with my patch enabled
- btrfs over bcache over hdd+ssd
- btrfs over hdd (very, very slow....)
- ext4 over ssd
- ext4 over hdd

The test machine was an "AMD A6-6400K" with 4GB of RAM, of which about 3GB
was used as cache/buffers.

Data analysis:

Of course btrfs is slower than ext4 when a lot of sync/flush operations are
involved. Using apt on a rotational disk was a dramatic experience; IMHO this
should eventually be addressed by using the btrfs snapshot capabilities. But
this is another (not easy) story.

Unsurprisingly, bcache performs better than my patch. This is an expected
result, because bcache can also cache the data chunks (reads can go directly
to the ssd). bcache is about +60% slower than the ssd-only reference when
there are a lot of sync/flush operations, and only +20% slower in the other
case.

Regarding the test with force-unsafe-io (fewer sync/flush operations), my
patch reduces the gap from the ssd-only reference from +256% (hdd only) to
+113%, which I consider a good result considering how small the patch is.


Raw data:
The data below is the "real" time (as returned by the time command) consumed
by apt.


Test description         real (mmm:ss)   Delta %
--------------------     -------------   -------
btrfs hdd w/sync            142:38       +533%
btrfs ssd+hdd w/sync         81:04       +260%
ext4 hdd w/sync              52:39       +134%
btrfs bcache w/sync          35:59        +60%
btrfs ssd w/sync             22:31       reference
ext4 ssd w/sync              12:19        -45%



Test description         real (mmm:ss)   Delta %
--------------------     -------------   -------
btrfs hdd                    56:2        +256%
ext4 hdd                     51:32       +228%
btrfs ssd+hdd                33:30       +113%
btrfs bcache                 18:57        +20%
btrfs ssd                    15:44       reference
ext4 ssd                     11:49        -25%


[1] I created the image using "debootstrap stretch", then I installed a set
of packages using the commands:

  # debootstrap stretch test/
  # chroot test/
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # apt --option=Dpkg::Options::=--force-confold \
        --option=Dpkg::options::=--force-unsafe-io \
	install mate-desktop-environment* xserver-xorg vim \
        task-kde-desktop task-gnome-desktop

Then I updated the release from stretch to buster by changing the file
/etc/apt/sources.list. Then I downloaded the packages for the dist-upgrade:

  # apt-get update
  # apt-get --download-only dist-upgrade

Then I created a tar of this image.
Before the dist-upgrade, the space used was about 7GB with 2281 packages
installed. After the dist-upgrade, the space used was 9GB with 2870 packages.
The upgrade installed/updated about 2251 packages.


[2] The command was a bit more complex, to avoid an interactive session

  # mkfs.btrfs -m single -d single /dev/sdX
  # mount /dev/sdX test/
  # cd test
  # time tar xzf ../image.tgz
  # chroot .
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # export DEBIAN_FRONTEND=noninteractive
  # time apt-get -y --option=Dpkg::Options::=--force-confold \
	--option=Dpkg::options::=--force-unsafe-io dist-upgrade


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5






^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] Add an ioctl to set/retrieve the device properties
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
@ 2020-05-28 18:34 ` Goffredo Baroncelli
  2020-05-28 22:03   ` Hans van Kranenburg
  2020-05-28 18:34 ` [PATCH 2/4] Add flags for dedicated metadata disks Goffredo Baroncelli
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-28 18:34 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>

---
 fs/btrfs/ioctl.c           | 67 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c         |  2 +-
 fs/btrfs/volumes.h         |  2 ++
 include/uapi/linux/btrfs.h | 40 +++++++++++++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 40b729dce91c..cba3fa942e2f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4724,6 +4724,71 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
 	return ret;
 }
 
+static long btrfs_ioctl_dev_properties(struct file *file,
+						void __user *argp)
+{
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ioctl_dev_properties dev_props;
+	struct btrfs_device	*device;
+        struct btrfs_root *root = fs_info->chunk_root;
+        struct btrfs_trans_handle *trans;
+	int ret;
+	u64 prev_type;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&dev_props, argp, sizeof(dev_props)))
+		return -EFAULT;
+
+	device = btrfs_find_device(fs_info->fs_devices, dev_props.devid,
+				NULL, NULL, false);
+	if (!device) {
+		btrfs_info(fs_info, "change_dev_properties: unable to find device %llu",
+			   dev_props.devid);
+		return -ENODEV;
+	}
+
+	if (dev_props.properties & BTRFS_DEV_PROPERTY_READ) {
+		u64 props = dev_props.properties;
+		memset(&dev_props, 0, sizeof(dev_props));
+		if (props & BTRFS_DEV_PROPERTY_TYPE) {
+			dev_props.properties = BTRFS_DEV_PROPERTY_TYPE;
+			dev_props.type = device->type;
+		}
+		if(copy_to_user(argp, &dev_props, sizeof(dev_props)))
+			return -EFAULT;
+		return 0;
+	}
+
+	/* it is possible to set only BTRFS_DEV_PROPERTY_TYPE for now */
+	if (dev_props.properties & ~(BTRFS_DEV_PROPERTY_TYPE))
+		return -EPERM;
+
+	trans = btrfs_start_transaction(root, 0);
+        if (IS_ERR(trans))
+                return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = dev_props.type;
+	ret = btrfs_update_device(trans, device);
+
+        if (ret < 0) {
+                btrfs_abort_transaction(trans, ret);
+                btrfs_end_transaction(trans);
+		device->type = prev_type;
+		return  ret;
+        }
+
+        ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		device->type = prev_type;
+
+	return ret;
+
+}
+
 static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 {
 	struct btrfs_ioctl_send_args *arg;
@@ -4907,6 +4972,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case BTRFS_IOC_DEV_PROPERTIES:
+		return btrfs_ioctl_dev_properties(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index be1e047a489e..5265f54c2931 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2710,7 +2710,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f067b5934c46..0ac5bf2b95e6 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -577,5 +577,7 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+                                        struct btrfs_device *device);
 
 #endif
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index e6b6cb0f8bc6..bb096075677d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -842,6 +842,44 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 		__u8 align[7];
 };
 
+#define BTRFS_DEV_PROPERTY_TYPE		(1ULL << 0)
+#define BTRFS_DEV_PROPERTY_DEV_GROUP	(1ULL << 1)
+#define BTRFS_DEV_PROPERTY_SEEK_SPEED	(1ULL << 2)
+#define BTRFS_DEV_PROPERTY_BANDWIDTH	(1ULL << 3)
+#define BTRFS_DEV_PROPERTY_READ		(1ULL << 60)
+
+/*
+ * The ioctl BTRFS_IOC_DEV_PROPERTIES can read and write the device properties.
+ *
+ * The properties that the user want to write have to be set
+ * in the 'properties' field using the BTRFS_DEV_PROPERTY_xxxx constants.
+ *
+ * If the ioctl is used to read the device properties, the bit
+ * BTRFS_DEV_PROPERTY_READ has to be set in the 'properties' field.
+ * In this case the properties that the user want have to be set in the
+ * 'properties' field. The kernel doesn't return a property that was not
+ * required, however it may return a subset of the requested properties.
+ * The returned properties have the corrispondent BTRFS_DEV_PROPERTY_xxxx
+ * flag set in the 'properties' field.
+ *
+ * Up to 2020/05/11 the only properties that can be read/write is the 'type'
+ * one.
+ */
+struct btrfs_ioctl_dev_properties {
+	__u64	devid;
+	__u64	properties;
+	__u64	type;
+	__u32	dev_group;
+	__u8	seek_speed;
+	__u8	bandwidth;
+
+	/*
+	 * for future expansion
+	 */
+	__u8	unused1[2];
+	__u64	unused2[4];
+};
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
@@ -970,5 +1008,7 @@ enum btrfs_err_code {
 				struct btrfs_ioctl_ino_lookup_user_args)
 #define BTRFS_IOC_SNAP_DESTROY_V2 _IOW(BTRFS_IOCTL_MAGIC, 63, \
 				struct btrfs_ioctl_vol_args_v2)
+#define BTRFS_IOC_DEV_PROPERTIES _IOW(BTRFS_IOCTL_MAGIC, 64, \
+				struct btrfs_ioctl_dev_properties)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
2.27.0.rc2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/4] Add flags for dedicated metadata disks
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrieve the device properties Goffredo Baroncelli
@ 2020-05-28 18:34 ` Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 3/4] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-28 18:34 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>

---
 include/uapi/linux/btrfs_tree.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 8e322e2c7e78..a45d09591db8 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -355,6 +355,9 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+#define BTRFS_DEV_PREFERRED_METADATA	(1ULL << 0)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.27.0.rc2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrieve the device properties Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 2/4] Add flags for dedicated metadata disks Goffredo Baroncelli
@ 2020-05-28 18:34 ` Goffredo Baroncelli
  2020-05-28 18:34 ` [PATCH 4/4] btrfs: add preferred_metadata mode Goffredo Baroncelli
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-28 18:34 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>

---
 fs/btrfs/sysfs.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a39bff64ff24..c189fd7f9afd 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1244,11 +1244,22 @@ static ssize_t btrfs_devinfo_writeable_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, writeable, btrfs_devinfo_writeable_show);
 
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+					    struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n",device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, in_fs_metadata),
 	BTRFS_ATTR_PTR(devid, missing),
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, type),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.27.0.rc2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] btrfs: add preferred_metadata mode
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2020-05-28 18:34 ` [PATCH 3/4] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
@ 2020-05-28 18:34 ` Goffredo Baroncelli
  2020-05-28 22:02   ` Hans van Kranenburg
  2020-05-28 21:59 ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Hans van Kranenburg
  2021-01-08  1:05 ` Zygo Blaxell
  5 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-28 18:34 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

When this mode is enabled, the allocation policy of the chunk
is so modified:
- allocation of metadata chunk: priority is given to preferred_metadata
  disks.
- allocation of data chunk: priority is given to a non preferred_metadata
  disk.

When a striped profile is involved (like RAID0,5,6), the logic
is a bit more complex. If there are enough disks, the data profiles
are stored on the non preferred_metadata disks; instead the metadata
profiles are stored on the preferred_metadata disk.
If the disks are not enough, then the profile is allocated on all
the disks.

Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
non preferred_metadata ones.
A data profile raid6, will be stored on sda, sdb, sdc, sde, sdf (sde
and sdf are not enough to host a raid5 profile).
A metadata profile raid6, will be stored on sda, sdb, sdc (these
are enough to host a raid6 profile).

To enable this mode pass -o dedicated_metadata at mount time.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  8 +++++
 fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |  1 +
 4 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 03ea7370aea7..779760fd27b1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1239,6 +1239,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
+#define BTRFS_MOUNT_PREFERRED_METADATA	(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 438ecba26557..80700dc9dcf8 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -359,6 +359,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_preferred_metadata,
 	Opt_err,
 };
 
@@ -430,6 +431,7 @@ static const match_table_t tokens = {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_preferred_metadata, "preferred_metadata"},
 	{Opt_err, NULL},
 };
 
@@ -881,6 +883,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_preferred_metadata:
+			btrfs_set_and_info(info, PREFERRED_METADATA,
+					"enabling preferred_metadata");
+			break;
 		case Opt_err:
 			btrfs_err(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -1403,6 +1409,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 #endif
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
+	if (btrfs_test_opt(info, PREFERRED_METADATA))
+		seq_puts(seq, ",preferred_metadata");
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5265f54c2931..c68efb15e473 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4770,6 +4770,56 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
 	return 0;
 }
 
+/*
+ * sort the devices in descending order by preferred_metadata,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* metadata -> preferred_metadata first */
+	if (di_a->preferred_metadata && !di_b->preferred_metadata)
+		return -1;
+	if (!di_a->preferred_metadata && di_b->preferred_metadata)
+		return 1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * sort the devices in descending order by !preferred_metadata,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* data -> preferred_metadata last */
+	if (di_a->preferred_metadata && !di_b->preferred_metadata)
+		return 1;
+	if (!di_a->preferred_metadata && di_b->preferred_metadata)
+		return -1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4885,6 +4935,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int nr_preferred_metadata = 0;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -4937,15 +4988,49 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+		devices_info[ndevs].preferred_metadata = !!(device->type &
+			BTRFS_DEV_PREFERRED_METADATA);
+		if (devices_info[ndevs].preferred_metadata)
+			nr_preferred_metadata++;
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	BUG_ON(nr_preferred_metadata > ndevs);
 	/*
 	 * now sort the devices by hole size / available space
 	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
+	if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+	     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
+	    !btrfs_test_opt(info, PREFERRED_METADATA)) {
+		/* mixed bg or PREFERRED_METADATA not set */
+		sort(devices_info, ctl->ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info, NULL);
+	} else {
+		/*
+		 * if PREFERRED_METADATA is set, sort the device considering
+		 * also the kind (preferred_metadata or not). Limit the
+		 * availables devices to the ones of the same kind, to avoid
+		 * that a striped profile, like raid5, spreads to all kind of
+		 * devices.
+		 * It is allowed to use different kinds of devices if the ones
+		 * of the same kind are not enough alone.
+		 */
+		if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			int nr_data = ctl->ndevs - nr_preferred_metadata;
+			sort(devices_info, ctl->ndevs,
+				     sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_data, NULL);
+			if (nr_data >= ctl->devs_min)
+				ctl->ndevs = nr_data;
+		} else { /* non data -> metadata and system */
+			sort(devices_info, ctl->ndevs,
+				     sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_metadata, NULL);
+			if (nr_preferred_metadata >= ctl->devs_min)
+				ctl->ndevs = nr_preferred_metadata;
+		}
+	}
 
 	return 0;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0ac5bf2b95e6..d39c3b0e7569 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -347,6 +347,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int preferred_metadata:1;
 };
 
 struct btrfs_raid_attr {
-- 
2.27.0.rc2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2020-05-28 18:34 ` [PATCH 4/4] btrfs: add preferred_metadata mode Goffredo Baroncelli
@ 2020-05-28 21:59 ` Hans van Kranenburg
  2020-05-29 16:37   ` Goffredo Baroncelli
  2021-01-08  1:05 ` Zygo Blaxell
  5 siblings, 1 reply; 18+ messages in thread
From: Hans van Kranenburg @ 2020-05-28 21:59 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell

Hi!

On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
> 
> [the previous patch sets called this mode ssd_metadata]
> 
> Hi all,
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complexity to the chunk allocator.

Thanks for working on this. This is an often discussed feature request.
So, taking it to the next level by actually writing PoC code helps a lot
I guess.

> The initial idea was to store the metadata on the ssd and to leave the data
> on the rotational disks. The kind of disk was determined from the rotational
> flag. However, looking only at the rotational flag is not flexible enough, so
> I added a device property called "preferred_metadata" to mark a device
> as preferred for metadata.
> 
> A separate patch set has been sent to extend the "btrfs property" command
> to support the preferred_metadata device flag. The basic usage is:
> 
>     $ # set a new value
>     $ sudo btrfs property set /dev/vde preferred_metadata 1
>     
>     $ # get the current value
>     $ sudo btrfs property get /dev/vde preferred_metadata
>     devid=4, path=/dev/vde: dedicated_metadata=1
> 
> This new mode is enabled by passing the option "preferred_metadata" at mount time.
> This allocation policy is then the default one. However, if this doesn't permit
> a chunk allocation, the "classic" one is used.
> 
> Some examples: (/dev/sd[abc] are marked as preferred_metadata,
> and /dev/sd[ef] are not)
> 
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
> 
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.
> 
> The patch set is composed of four patches:
> 
> - The first patch adds the ioctl to update the btrfs_dev_item.type field.
> The ioctl is generic enough to handle more fields; however, for now only the
> "type" field is supported.

What are your thoughts about the chicken/egg situation of changing these
properties only when the filesystem is mounted?

E.g. mkfs puts metadata on the wrong disk, and then only after actually
mounting, I have to find out how to find out where metadata is actually
placed, and then play around with btrfs balance options until I get
everything moved to my preferred disks. Do you have any ideas about
improving the out of the box usability of this?

> - The second patch adds the flag BTRFS_DEV_PREFERRED_METADATA which is
> used to mark a device as "preferred_metadata"
> 
> - The third patch exports the btrfs_dev_item.type field via sysfs files
> /sys/fs/btrfs/<UUID>/devinfo/<devid>/type
> 
> It is only possible to read the value. Updating the value is not implemented,
> because in fs/btrfs/sysfs.c there is a comment that states:
> "We don't want to do full transaction commit from inside sysfs".
> 
> - The fourth patch implements this new mode
> 
> Changelog:
> v4: - renamed ssd_metadata to preferred_metadata
>     - add the device property "preferred_metadata"
>     - add the ioctl BTRFS_IOC_DEV_PROPERTIES
>     - export the btrfs_dev_item.type values via sysfs
> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>       BTRFS_MOUNT_SSD_METADATA.
> v2: - rebased to v5.6.2
>     - correct the comparison about the rotational disks (>= instead of >)
>     - add the flag rotational to the struct btrfs_device_info to
>       simplify the comparison function (btrfs_cmp_device_info*() )
> v1: - first issue
> 
> [...]
Another question: what is your opinion about device replace? Should it
leave properties of the destination device alone, or should it copy the
bit over?

If I'm replacing my ssd with metadata with a larger one, then what
should I expect to happen by default as user (already having forgotten
about that property command that I had to use to actually make it work
months ago)?

Thanks,
Hans

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] btrfs: add preferred_metadata mode
  2020-05-28 18:34 ` [PATCH 4/4] btrfs: add preferred_metadata mode Goffredo Baroncelli
@ 2020-05-28 22:02   ` Hans van Kranenburg
  2020-05-29 16:26     ` Goffredo Baroncelli
  0 siblings, 1 reply; 18+ messages in thread
From: Hans van Kranenburg @ 2020-05-28 22:02 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

Hi,

On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled,

The commit message does not mention whether this is only a convenience
during development and testing of the feature, to be able to quickly turn
it on/off, or whether you intend to keep this in the final change set.

> the allocation policy of the chunk
> is so modified:
> - allocation of metadata chunk: priority is given to preferred_metadata
>   disks.
> - allocation of data chunk: priority is given to a non preferred_metadata
>   disk.
> 
> When a striped profile is involved (like RAID0,5,6), the logic
> is a bit more complex. If there are enough disks, the data profiles
> are stored on the non preferred_metadata disks; instead the metadata
> profiles are stored on the preferred_metadata disk.
> If the disks are not enough, then the profile is allocated on all
> the disks.
> 
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
> non preferred_metadata ones.
> A data profile raid6, will be stored on sda, sdb, sdc, sde, sdf (sde
> and sdf are not enough to host a raid5 profile).
> A metadata profile raid6, will be stored on sda, sdb, sdc (these
> are enough to host a raid6 profile).
> 
> To enable this mode pass -o dedicated_metadata at mount time.

Is it dedicated_metadata or preferred_metadata?

> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/super.c   |  8 +++++
>  fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |  1 +
>  4 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 03ea7370aea7..779760fd27b1 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1239,6 +1239,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
>  #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
> +#define BTRFS_MOUNT_PREFERRED_METADATA	(1 << 30)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 438ecba26557..80700dc9dcf8 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -359,6 +359,7 @@ enum {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	Opt_ref_verify,
>  #endif
> +	Opt_preferred_metadata,
>  	Opt_err,
>  };
>  
> @@ -430,6 +431,7 @@ static const match_table_t tokens = {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	{Opt_ref_verify, "ref_verify"},
>  #endif
> +	{Opt_preferred_metadata, "preferred_metadata"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -881,6 +883,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  			btrfs_set_opt(info->mount_opt, REF_VERIFY);
>  			break;
>  #endif
> +		case Opt_preferred_metadata:
> +			btrfs_set_and_info(info, PREFERRED_METADATA,
> +					"enabling preferred_metadata");
> +			break;
>  		case Opt_err:
>  			btrfs_err(info, "unrecognized mount option '%s'", p);
>  			ret = -EINVAL;
> @@ -1403,6 +1409,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  #endif
>  	if (btrfs_test_opt(info, REF_VERIFY))
>  		seq_puts(seq, ",ref_verify");
> +	if (btrfs_test_opt(info, PREFERRED_METADATA))
> +		seq_puts(seq, ",preferred_metadata");
>  	seq_printf(seq, ",subvolid=%llu",
>  		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
>  	seq_puts(seq, ",subvol=");
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5265f54c2931..c68efb15e473 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4770,6 +4770,56 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
>  	return 0;
>  }
>  
> +/*
> + * sort the devices in descending order by preferred_metadata,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* metadata -> preferred_metadata first */
> +	if (di_a->preferred_metadata && !di_b->preferred_metadata)
> +		return -1;
> +	if (!di_a->preferred_metadata && di_b->preferred_metadata)
> +		return 1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * sort the devices in descending order by !preferred_metadata,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* data -> preferred_metadata last */
> +	if (di_a->preferred_metadata && !di_b->preferred_metadata)
> +		return 1;
> +	if (!di_a->preferred_metadata && di_b->preferred_metadata)
> +		return -1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
>  static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
>  {
>  	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> @@ -4885,6 +4935,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  	int ndevs = 0;
>  	u64 max_avail;
>  	u64 dev_offset;
> +	int nr_preferred_metadata = 0;
>  
>  	/*
>  	 * in the first pass through the devices list, we gather information
> @@ -4937,15 +4988,49 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +		devices_info[ndevs].preferred_metadata = !!(device->type &
> +			BTRFS_DEV_PREFERRED_METADATA);
> +		if (devices_info[ndevs].preferred_metadata)
> +			nr_preferred_metadata++;
>  		++ndevs;
>  	}
>  	ctl->ndevs = ndevs;
>  
> +	BUG_ON(nr_preferred_metadata > ndevs);
>  	/*
>  	 * now sort the devices by hole size / available space
>  	 */
> -	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> -	     btrfs_cmp_device_info, NULL);
> +	if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
> +	     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
> +	    !btrfs_test_opt(info, PREFERRED_METADATA)) {
> +		/* mixed bg or PREFERRED_METADATA not set */
> +		sort(devices_info, ctl->ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_info, NULL);
> +	} else {
> +		/*
> +		 * if PREFERRED_METADATA is set, sort the device considering
> +		 * also the kind (preferred_metadata or not). Limit the
> +		 * availables devices to the ones of the same kind, to avoid
> +		 * that a striped profile, like raid5, spreads to all kind of
> +		 * devices.
> +		 * It is allowed to use different kinds of devices if the ones
> +		 * of the same kind are not enough alone.
> +		 */
> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
> +			int nr_data = ctl->ndevs - nr_preferred_metadata;
> +			sort(devices_info, ctl->ndevs,
> +				     sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_data, NULL);
> +			if (nr_data >= ctl->devs_min)
> +				ctl->ndevs = nr_data;
> +		} else { /* non data -> metadata and system */
> +			sort(devices_info, ctl->ndevs,
> +				     sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_metadata, NULL);
> +			if (nr_preferred_metadata >= ctl->devs_min)
> +				ctl->ndevs = nr_preferred_metadata;
> +		}
> +	}
>  
>  	return 0;
>  }
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 0ac5bf2b95e6..d39c3b0e7569 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -347,6 +347,7 @@ struct btrfs_device_info {
>  	u64 dev_offset;
>  	u64 max_avail;
>  	u64 total_avail;
> +	int preferred_metadata:1;
>  };
>  
>  struct btrfs_raid_attr {
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] Add an ioctl to set/retrieve the device properties
  2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrieve the device properties Goffredo Baroncelli
@ 2020-05-28 22:03   ` Hans van Kranenburg
  0 siblings, 0 replies; 18+ messages in thread
From: Hans van Kranenburg @ 2020-05-28 22:03 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

Hi,

On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> 
> ---
>  fs/btrfs/ioctl.c           | 67 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.c         |  2 +-
>  fs/btrfs/volumes.h         |  2 ++
>  include/uapi/linux/btrfs.h | 40 +++++++++++++++++++++++
>  4 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 40b729dce91c..cba3fa942e2f 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4724,6 +4724,71 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
>  	return ret;
>  }
>  
> +static long btrfs_ioctl_dev_properties(struct file *file,
> +						void __user *argp)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_ioctl_dev_properties dev_props;
> +	struct btrfs_device	*device;
> +        struct btrfs_root *root = fs_info->chunk_root;
> +        struct btrfs_trans_handle *trans;

FYI, the code contains a big mix of spaces and tabs to indent lines.

> +	int ret;
> +	u64 prev_type;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (copy_from_user(&dev_props, argp, sizeof(dev_props)))
> +		return -EFAULT;
> +
> +	device = btrfs_find_device(fs_info->fs_devices, dev_props.devid,
> +				NULL, NULL, false);
> +	if (!device) {
> +		btrfs_info(fs_info, "change_dev_properties: unable to find device %llu",
> +			   dev_props.devid);
> +		return -ENODEV;
> +	}
> +
> +	if (dev_props.properties & BTRFS_DEV_PROPERTY_READ) {
> +		u64 props = dev_props.properties;
> +		memset(&dev_props, 0, sizeof(dev_props));
> +		if (props & BTRFS_DEV_PROPERTY_TYPE) {
> +			dev_props.properties = BTRFS_DEV_PROPERTY_TYPE;
> +			dev_props.type = device->type;
> +		}
> +		if(copy_to_user(argp, &dev_props, sizeof(dev_props)))
> +			return -EFAULT;
> +		return 0;
> +	}
> +
> +	/* it is possible to set only BTRFS_DEV_PROPERTY_TYPE for now */
> +	if (dev_props.properties & ~(BTRFS_DEV_PROPERTY_TYPE))
> +		return -EPERM;
> +
> +	trans = btrfs_start_transaction(root, 0);
> +        if (IS_ERR(trans))
> +                return PTR_ERR(trans);
> +
> +	prev_type = device->type;
> +	device->type = dev_props.type;
> +	ret = btrfs_update_device(trans, device);
> +
> +        if (ret < 0) {
> +                btrfs_abort_transaction(trans, ret);
> +                btrfs_end_transaction(trans);
> +		device->type = prev_type;
> +		return  ret;
> +        }
> +
> +        ret = btrfs_commit_transaction(trans);
> +	if (ret < 0)
> +		device->type = prev_type;
> +
> +	return ret;
> +
> +}
> +
>  static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
>  {
>  	struct btrfs_ioctl_send_args *arg;
> @@ -4907,6 +4972,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_get_subvol_rootref(file, argp);
>  	case BTRFS_IOC_INO_LOOKUP_USER:
>  		return btrfs_ioctl_ino_lookup_user(file, argp);
> +	case BTRFS_IOC_DEV_PROPERTIES:
> +		return btrfs_ioctl_dev_properties(file, argp);
>  	}
>  
>  	return -ENOTTY;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index be1e047a489e..5265f54c2931 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2710,7 +2710,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
>  	return ret;
>  }
>  
> -static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
>  					struct btrfs_device *device)
>  {
>  	int ret;
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index f067b5934c46..0ac5bf2b95e6 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -577,5 +577,7 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
>  int btrfs_bg_type_to_factor(u64 flags);
>  const char *btrfs_bg_type_to_raid_name(u64 flags);
>  int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
> +int btrfs_update_device(struct btrfs_trans_handle *trans,
> +                                        struct btrfs_device *device);
>  
>  #endif
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index e6b6cb0f8bc6..bb096075677d 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -842,6 +842,44 @@ struct btrfs_ioctl_get_subvol_rootref_args {
>  		__u8 align[7];
>  };
>  
> +#define BTRFS_DEV_PROPERTY_TYPE		(1ULL << 0)
> +#define BTRFS_DEV_PROPERTY_DEV_GROUP	(1ULL << 1)
> +#define BTRFS_DEV_PROPERTY_SEEK_SPEED	(1ULL << 2)
> +#define BTRFS_DEV_PROPERTY_BANDWIDTH	(1ULL << 3)
> +#define BTRFS_DEV_PROPERTY_READ		(1ULL << 60)
> +
> +/*
> + * The ioctl BTRFS_IOC_DEV_PROPERTIES can read and write the device properties.
> + *
> + * The properties that the user want to write have to be set
> + * in the 'properties' field using the BTRFS_DEV_PROPERTY_xxxx constants.
> + *
> + * If the ioctl is used to read the device properties, the bit
> + * BTRFS_DEV_PROPERTY_READ has to be set in the 'properties' field.
> + * In this case the properties that the user want have to be set in the
> + * 'properties' field. The kernel doesn't return a property that was not
> + * required, however it may return a subset of the requested properties.
> + * The returned properties have the corrispondent BTRFS_DEV_PROPERTY_xxxx
> + * flag set in the 'properties' field.
> + *
> + * Up to 2020/05/11 the only properties that can be read/write is the 'type'
> + * one.
> + */
> +struct btrfs_ioctl_dev_properties {
> +	__u64	devid;
> +	__u64	properties;
> +	__u64	type;
> +	__u32	dev_group;
> +	__u8	seek_speed;
> +	__u8	bandwidth;
> +
> +	/*
> +	 * for future expansion
> +	 */
> +	__u8	unused1[2];
> +	__u64	unused2[4];
> +};
> +
>  /* Error codes as returned by the kernel */
>  enum btrfs_err_code {
>  	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
> @@ -970,5 +1008,7 @@ enum btrfs_err_code {
>  				struct btrfs_ioctl_ino_lookup_user_args)
>  #define BTRFS_IOC_SNAP_DESTROY_V2 _IOW(BTRFS_IOCTL_MAGIC, 63, \
>  				struct btrfs_ioctl_vol_args_v2)
> +#define BTRFS_IOC_DEV_PROPERTIES _IOW(BTRFS_IOCTL_MAGIC, 64, \
> +				struct btrfs_ioctl_dev_properties)
>  
>  #endif /* _UAPI_LINUX_BTRFS_H */
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] btrfs: add preferred_metadata mode
  2020-05-28 22:02   ` Hans van Kranenburg
@ 2020-05-29 16:26     ` Goffredo Baroncelli
  0 siblings, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-29 16:26 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell, Goffredo Baroncelli

On 5/29/20 12:02 AM, Hans van Kranenburg wrote:
> Hi,
> 
> On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>
>> When this mode is enabled,
> 
> The commit message does not mention whether this is only a convenience
> during development and testing of the feature, to be able to quickly turn
> it on/off, or whether you intend to keep this in the final change set.

Good question. IMHO, for the initial development phase it is useful to have
the preferred_metadata disk flag as an opt-in. Then we could reverse the
logic and default to preferred_metadata; of course we would then have a
no-preferred_metadata flag (opt-out).
> 
>> the allocation policy of the chunk
>> is so modified:
>> - allocation of metadata chunk: priority is given to preferred_metadata
>>    disks.
>> - allocation of data chunk: priority is given to a non preferred_metadata
>>    disk.
>>
>> When a striped profile is involved (like RAID0,5,6), the logic
>> is a bit more complex. If there are enough disks, the data profiles
>> are stored on the non preferred_metadata disks; instead the metadata
>> profiles are stored on the preferred_metadata disk.
>> If the disks are not enough, then the profile is allocated on all
>> the disks.
>>
>> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
>> non preferred_metadata ones.
>> A data profile raid6, will be stored on sda, sdb, sdc, sde, sdf (sde
>> and sdf are not enough to host a raid5 profile).
>> A metadata profile raid6, will be stored on sda, sdb, sdc (these
>> are enough to host a raid6 profile).
>>
>> To enable this mode pass -o dedicated_metadata at mount time.
> 
> Is it dedicated_metadata or preferred_metadata?

It was a copy&paste error. It should be preferred_metadata.
> 
>> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
>> ---
>>   fs/btrfs/ctree.h   |  1 +
>>   fs/btrfs/super.c   |  8 +++++
>>   fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
>>   fs/btrfs/volumes.h |  1 +
>>   4 files changed, 97 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 03ea7370aea7..779760fd27b1 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1239,6 +1239,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>>   #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>>   #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
>>   #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
>> +#define BTRFS_MOUNT_PREFERRED_METADATA	(1 << 30)
>>   
>>   #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>>   #define BTRFS_DEFAULT_MAX_INLINE	(2048)
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 438ecba26557..80700dc9dcf8 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -359,6 +359,7 @@ enum {
>>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>   	Opt_ref_verify,
>>   #endif
>> +	Opt_preferred_metadata,
>>   	Opt_err,
>>   };
>>   
>> @@ -430,6 +431,7 @@ static const match_table_t tokens = {
>>   #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>>   	{Opt_ref_verify, "ref_verify"},
>>   #endif
>> +	{Opt_preferred_metadata, "preferred_metadata"},
>>   	{Opt_err, NULL},
>>   };
>>   
>> @@ -881,6 +883,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>   			btrfs_set_opt(info->mount_opt, REF_VERIFY);
>>   			break;
>>   #endif
>> +		case Opt_preferred_metadata:
>> +			btrfs_set_and_info(info, PREFERRED_METADATA,
>> +					"enabling preferred_metadata");
>> +			break;
>>   		case Opt_err:
>>   			btrfs_err(info, "unrecognized mount option '%s'", p);
>>   			ret = -EINVAL;
>> @@ -1403,6 +1409,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>>   #endif
>>   	if (btrfs_test_opt(info, REF_VERIFY))
>>   		seq_puts(seq, ",ref_verify");
>> +	if (btrfs_test_opt(info, PREFERRED_METADATA))
>> +		seq_puts(seq, ",preferred_metadata");
>>   	seq_printf(seq, ",subvolid=%llu",
>>   		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
>>   	seq_puts(seq, ",subvol=");
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 5265f54c2931..c68efb15e473 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4770,6 +4770,56 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * sort the devices in descending order by preferred_metadata,
>> + * max_avail, total_avail
>> + */
>> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +
>> +	/* metadata -> preferred_metadata first */
>> +	if (di_a->preferred_metadata && !di_b->preferred_metadata)
>> +		return -1;
>> +	if (!di_a->preferred_metadata && di_b->preferred_metadata)
>> +		return 1;
>> +	if (di_a->max_avail > di_b->max_avail)
>> +		return -1;
>> +	if (di_a->max_avail < di_b->max_avail)
>> +		return 1;
>> +	if (di_a->total_avail > di_b->total_avail)
>> +		return -1;
>> +	if (di_a->total_avail < di_b->total_avail)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +/*
>> + * sort the devices in descending order by !preferred_metadata,
>> + * max_avail, total_avail
>> + */
>> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
>> +{
>> +	const struct btrfs_device_info *di_a = a;
>> +	const struct btrfs_device_info *di_b = b;
>> +
>> +	/* data -> preferred_metadata last */
>> +	if (di_a->preferred_metadata && !di_b->preferred_metadata)
>> +		return 1;
>> +	if (!di_a->preferred_metadata && di_b->preferred_metadata)
>> +		return -1;
>> +	if (di_a->max_avail > di_b->max_avail)
>> +		return -1;
>> +	if (di_a->max_avail < di_b->max_avail)
>> +		return 1;
>> +	if (di_a->total_avail > di_b->total_avail)
>> +		return -1;
>> +	if (di_a->total_avail < di_b->total_avail)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>   static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
>>   {
>>   	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
>> @@ -4885,6 +4935,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   	int ndevs = 0;
>>   	u64 max_avail;
>>   	u64 dev_offset;
>> +	int nr_preferred_metadata = 0;
>>   
>>   	/*
>>   	 * in the first pass through the devices list, we gather information
>> @@ -4937,15 +4988,49 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
>>   		devices_info[ndevs].max_avail = max_avail;
>>   		devices_info[ndevs].total_avail = total_avail;
>>   		devices_info[ndevs].dev = device;
>> +		devices_info[ndevs].preferred_metadata = !!(device->type &
>> +			BTRFS_DEV_PREFERRED_METADATA);
>> +		if (devices_info[ndevs].preferred_metadata)
>> +			nr_preferred_metadata++;
>>   		++ndevs;
>>   	}
>>   	ctl->ndevs = ndevs;
>>   
>> +	BUG_ON(nr_preferred_metadata > ndevs);
>>   	/*
>>   	 * now sort the devices by hole size / available space
>>   	 */
>> -	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> -	     btrfs_cmp_device_info, NULL);
>> +	if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
>> +	     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
>> +	    !btrfs_test_opt(info, PREFERRED_METADATA)) {
>> +		/* mixed bg or PREFERRED_METADATA not set */
>> +		sort(devices_info, ctl->ndevs, sizeof(struct btrfs_device_info),
>> +			     btrfs_cmp_device_info, NULL);
>> +	} else {
>> +		/*
>> +		 * if PREFERRED_METADATA is set, sort the device considering
>> +		 * also the kind (preferred_metadata or not). Limit the
>> +		 * availables devices to the ones of the same kind, to avoid
>> +		 * that a striped profile, like raid5, spreads to all kind of
>> +		 * devices.
>> +		 * It is allowed to use different kinds of devices if the ones
>> +		 * of the same kind are not enough alone.
>> +		 */
>> +		if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
>> +			int nr_data = ctl->ndevs - nr_preferred_metadata;
>> +			sort(devices_info, ctl->ndevs,
>> +				     sizeof(struct btrfs_device_info),
>> +				     btrfs_cmp_device_info_data, NULL);
>> +			if (nr_data >= ctl->devs_min)
>> +				ctl->ndevs = nr_data;
>> +		} else { /* non data -> metadata and system */
>> +			sort(devices_info, ctl->ndevs,
>> +				     sizeof(struct btrfs_device_info),
>> +				     btrfs_cmp_device_info_metadata, NULL);
>> +			if (nr_preferred_metadata >= ctl->devs_min)
>> +				ctl->ndevs = nr_preferred_metadata;
>> +		}
>> +	}
>>   
>>   	return 0;
>>   }
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 0ac5bf2b95e6..d39c3b0e7569 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -347,6 +347,7 @@ struct btrfs_device_info {
>>   	u64 dev_offset;
>>   	u64 max_avail;
>>   	u64 total_avail;
>> +	int preferred_metadata:1;
>>   };
>>   
>>   struct btrfs_raid_attr {
>>
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2020-05-28 21:59 ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Hans van Kranenburg
@ 2020-05-29 16:37   ` Goffredo Baroncelli
  2020-05-30 11:44     ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-29 16:37 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Paul Jones,
	Adam Borowski, Zygo Blaxell

On 5/28/20 11:59 PM, Hans van Kranenburg wrote:
> Hi!
> 
> On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
>>
>> [the previous patch sets called this mode ssd_metadata]
>>
>> Hi all,
>>
>> This is an RFC; I wrote this patch because I find the idea interesting
>> even though it adds more complexity to the chunk allocator.
> 
> Thanks for working on this. This is an often discussed feature request.
> So, taking it to the next level by actually writing PoC code helps a lot
> I guess.
> 
>> The initial idea was to store the metadata on the ssd and to leave the data
>> on the rotational disks. The kind of disk was determined from the rotational
>> flag. However, looking only at the rotational flag is not flexible enough, so
>> I added a device property called "preferred_metadata" to mark a device
>> as preferred for metadata.
>>
>> A separate patch set has been sent to extend the "btrfs property" command
>> to support the preferred_metadata device flag. The basic usage is:
>>
>>      $ # set a new value
>>      $ sudo btrfs property set /dev/vde preferred_metadata 1
>>      
>>      $ # get the current value
>>      $ sudo btrfs property get /dev/vde preferred_metadata
>>      devid=4, path=/dev/vde: dedicated_metadata=1
>>
>> This new mode is enabled passing the option "preferred_metadata" at mount time.
>> This policy of allocation is the default one. However if this doesn't permit
>> a chunk allocation, the "classic" one is used.
>>
>> Some examples: (/dev/sd[abc] are marked as preferred_metadata,
>> and /dev/sd[ef] are not)
>>
>> Non striped profile: metadata->raid1, data->raid1
>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>> /dev/sd[abc].
>>
>> Striped profile: metadata->raid6, data->raid6
>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>> will be stored on all the disks /dev/sd[abcdef].
>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>> because these are enough to host this chunk.
>>
>> The patches set is composed by four patches:
>>
>> - The first patch adds the ioctl to update the btrfs_dev_item.type field.
>> The ioctl is generic to handle more fields, however now only the "type"
>> field is supported.
> 
> What are your thoughts about the chicken/egg situation of changing these
> properties only when the filesystem is mounted?

The logic only affects chunk allocation. I.e. if you have a non-empty
filesystem, after enabling the preferred_metadata "mode" a full balance
is required in order to get the benefit.

> 
> E.g. mkfs puts metadata on the wrong disk, and then only after actually
> mounting, I have to find out how to find out where metadata is actually
> placed, and then play around with btrfs balance options until I get
> everything moved to my preferred disks. Do you have any ideas about
> improving the out of the box usability of this?

In order to figure out where the (meta)data is placed, "btrfs fi us"
is your friend.
Of course, being able to set this at mkfs.btrfs time is a good suggestion.

> 
>> - The second patch adds the flag BTRFS_DEV_PREFERRED_METADATA which is
>> used to mark a device as "preferred_metadata"
>>
>> - The third patch exports the btrfs_dev_item.type field via sysfs files
>> /sys/fs/btrfs/<UUID>/devinfo/<devid>/type
>>
>> It is possible only to read the value. It is not implemented the updated
>> of the value because in btrfs/stsfs.c there is a comment that states:
>> "We don't want to do full transaction commit from inside sysfs".
>>
>> - The fourth patch implements this new mode
>>
>> Changelog:
>> v4: - renamed ssd_metadata to preferred_metadata
>>      - add the device property "preferred_metadata"
>>      - add the ioctl BTRFS_IOC_DEV_PROPERTIES
>>      - export the btrfs_dev_item.type values via sysfs
>> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>>        BTRFS_MOUNT_SSD_METADATA.
>> v2: - rebased to v5.6.2
>>      - correct the comparison about the rotational disks (>= instead of >)
>>      - add the flag rotational to the struct btrfs_device_info to
>>        simplify the comparison function (btrfs_cmp_device_info*() )
>> v1: - first issue
>>
>> [...]
> Another question: what is your opinion about device replace? Should it
> leave properties of the destination device alone, or should it copy the
> bit over?
>
> If I'm replacing my ssd with metadata with a larger one, then what
> should I expect to happen by default as user (already having forgotten
> about that property command that I had to use to actually make it work
> months ago)?

In the previous attempt I tried to detect automatically which disk is
faster by looking at the rotational flag. However someone pointed out
that even between a SATA ssd and a PCIe nvme there is a huge speed
difference (even though the latency matters more).
This is to say that an automatic logic is not the best possible choice
for all the cases.
So the next step was to add a flag to mark explicitly the devices
preferred for metadata.

I think that "replacing" and "adding" doesn't have a "sane" default. There will
be always a case where an user replace an ssd with an mechanical hdd or
a case where an ssd is added where there is already an pci nvme.

What would make sense is an additional option to btrfs add/replace
that allows to specify if the disk should be preferred for metadata or not.
> 
> Thanks,
> Hans
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2020-05-29 16:37   ` Goffredo Baroncelli
@ 2020-05-30 11:44     ` Zygo Blaxell
  2020-05-30 11:51       ` Goffredo Baroncelli
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2020-05-30 11:44 UTC (permalink / raw)
  To: kreijack
  Cc: Hans van Kranenburg, linux-btrfs, Michael, Hugo Mills,
	Martin Svec, Wang Yugui, Paul Jones, Adam Borowski

On Fri, May 29, 2020 at 06:37:27PM +0200, Goffredo Baroncelli wrote:
> On 5/28/20 11:59 PM, Hans van Kranenburg wrote:
> > Hi!
> > 
> > On 5/28/20 8:34 PM, Goffredo Baroncelli wrote:
[...]
> > > The patches set is composed by four patches:
> > > 
> > > - The first patch adds the ioctl to update the btrfs_dev_item.type field.
> > > The ioctl is generic to handle more fields, however now only the "type"
> > > field is supported.
> > 
> > What are your thoughts about the chicken/egg situation of changing these
> > properties only when the filesystem is mounted?
> 
> The logic only affects chunk allocation. I.e. if you have a non-empty
> filesystem, after enabling the preferred_metadata "mode" a full balance
> is required in order to get the benefit.

Ideally this feature comes with a balance filter that selects block
groups that don't match the current preferred allocation policy, so
you can do a balance on those block groups and leave the rest alone.
There are existing balance filters by devid and stripe count to work from,
so it shouldn't be a lot of new kernel code.
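
A rough sketch of the predicate such a filter could use, written here as
plain C for illustration; the helper name and its arguments are made up
and do not exist in the posted patches:

    /*
     * Illustrative only: a block group "matches" the preferred policy
     * when every one of its stripes lives on a device of the right
     * kind (metadata/system on preferred devices, data elsewhere).
     */
    static bool bg_matches_preferred_policy(u64 bg_flags,
                                            const bool *stripe_on_preferred_dev,
                                            int num_stripes)
    {
            bool want_preferred = !(bg_flags & BTRFS_BLOCK_GROUP_DATA);
            int i;

            for (i = 0; i < num_stripes; i++)
                    if (stripe_on_preferred_dev[i] != want_preferred)
                            return false;
            return true;
    }

A balance filter would then select only the block groups for which this
returns false, much like the existing devid and stripes filters select
by device and stripe count.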

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2020-05-30 11:44     ` Zygo Blaxell
@ 2020-05-30 11:51       ` Goffredo Baroncelli
  0 siblings, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2020-05-30 11:51 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Hans van Kranenburg, linux-btrfs, Michael, Hugo Mills,
	Martin Svec, Wang Yugui, Paul Jones, Adam Borowski

On 5/30/20 1:44 PM, Zygo Blaxell wrote:
> On Fri, May 29, 2020 at 06:37:27PM +0200, Goffredo Baroncelli wrote:
>> On 5/28/20 11:59 PM, Hans van Kranenbu

[...]
>>> What are your thoughts about the chicken/egg situation of changing these
>>> properties only when the filesystem is mounted?
>>
>> The logic only affects chunk allocation. I.e. if you have a non-empty
>> filesystem, after enabling the preferred_metadata "mode" a full balance
>> is required in order to get the benefit.
> 
> Ideally this feature comes with a balance filter that selects block
> groups that don't match the current preferred allocation policy, so
> you can do a balance on those block groups and leave the rest alone.
> There are existing balance filters by devid and stripe count to work from,
> so it shouldn't be a lot of new kernel code.
> 
Yes, this could be a further enhancement.

GB

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2020-05-28 21:59 ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Hans van Kranenburg
@ 2021-01-08  1:05 ` Zygo Blaxell
  2021-01-08 17:30   ` Goffredo Baroncelli
  5 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2021-01-08  1:05 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:
> 
> [the previous patches sets called this mode ssd_metadata]
> 
> Hi all,
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> The initial idea was to store the metadata on the ssd and to leave the data
> on the rotational disks. The kind of disk was determined from the rotational
> flag. However looking only at the rotational flags is not flexible enough. So
> I added a device property called "preferred_metadata" to mark a device
> as preferred for metadata.
> 
> A separate patches set is sent to extend the "btrfs property" command
> for supporting the preferred_metadata device flag. The basic usage is:
> 
>     $ # set a new value
>     $ sudo btrfs property set /dev/vde preferred_metadata 1
>     
>     $ # get the current value
>     $ sudo btrfs property get /dev/vde preferred_metadata
>     devid=4, path=/dev/vde: dedicated_metadata=1
> 
> This new mode is enabled passing the option "preferred_metadata" at mount time.
> This policy of allocation is the default one. However if this doesn't permit
> a chunk allocation, the "classic" one is used.
> 
> Some examples: (/dev/sd[abc] are marked as preferred_metadata,
> and /dev/sd[ef] are not)
> 
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
> 
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.
> 
> The patches set is composed by four patches:
> 
> - The first patch adds the ioctl to update the btrfs_dev_item.type field.
> The ioctl is generic to handle more fields, however now only the "type"
> field is supported.
> 
> - The second patch adds the flag BTRFS_DEV_PREFERRED_METADATA which is
> used to mark a device as "preferred_metadata"
> 
> - The third patch exports the btrfs_dev_item.type field via sysfs files
> /sys/fs/btrfs/<UUID>/devinfo/<devid>/type
> 
> It is possible only to read the value. It is not implemented the updated
> of the value because in btrfs/stsfs.c there is a comment that states:
> "We don't want to do full transaction commit from inside sysfs".
> 
> - The fourth patch implements this new mode
> 
> Changelog:
> v4: - renamed ssd_metadata to preferred_metadata
>     - add the device property "preferred_metadata"
>     - add the ioctl BTRFS_IOC_DEV_PROPERTIES
>     - export the btrfs_dev_item.type values via sysfs
> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>       BTRFS_MOUNT_SSD_METADATA.
> v2: - rebased to v5.6.2
>     - correct the comparison about the rotational disks (>= instead of >)
>     - add the flag rotational to the struct btrfs_device_info to
>       simplify the comparison function (btrfs_cmp_device_info*() )
> v1: - first issue

I've been testing these patches for a while now.  They enable an
interesting use case that can't otherwise be done safely, sanely or
cheaply with btrfs.

Normally if we have an array of, say, 10 spinning disks, and we want to
implement a writeback cache layer with SSD, we would need 10 distinct SSD
devices to avoid reducing btrfs's ability to recover from drive failures.
The writeback cache will be modified on both reads and writes, data and
metadata, so we need high endurance SSDs if we want them to make it to
the end of their warranty.  The SSD firmware has to not have crippling
performance bugs while under heavy write load, which means we are now
restricted to an expensive subset of high endurance SSDs targeted at
the enterprise/NAS/video production markets...and we need 10 of them!

NVME has fairly draconian restrictions on drive count, and getting
anything close to 10 of them into a btrfs filesystem can be an expensive
challenge.  (I'm not counting solutions that use USB-to-NVME bridges
because those don't count as "sane" or "safe").

We can share the cache between disks, but not safely in writeback mode,
because a failure in one SSD could affect multiple logical btrfs disks.
Strictly speaking we can't do it safely in any cache mode, but at least
with a writethrough cache we can recover the btrfs by throwing the SSDs
away.

In the current btrfs raid5 and dm-cache implementations, a scrub through a
SSD/HDD cache pair triggers an _astronomical_ number of SSD cache writes
(enough to burn through a low-end SSD's TBW in a weekend).  That can
presumably be fixed one day, but it's definitely unusable today.

If we have only 2 NVME drives, with this patch, we can put them to work
storing metadata, leave the data on the spinning rust, and keep all
the btrfs self-repair features working, and get most of the performance
gain of SSD caching at significantly lower cost.  The write loads are
reasonable (metadata only, no data, no reads) so we don't need a high
endurance SSD.  With raid1/10/c3/c4 redundancy, btrfs can fix silent
metadata corruption, so we can even use cheap SSDs as long as they
don't hang when they fail.  We can use btrfs raid5/6 for data since
preferred_metadata avoids the bugs that kill the SSD when we run a scrub.

Now all we have to do is keep porting these patches to new kernels until
some equivalent feature lands in mainline.  :-P

I do have some comments about these particular patches:

I dropped the "preferred_metadata" mount option very early in testing.
Apart from the merge conflicts with later kernels, the option is
redundant, since we could just as easily change all the drive type
properties back to 0 to get the same effect.  Adding an incompatible
mount option (and potentially removing it again after a failed test)
was a comparatively onerous requirement.

The fallback to the other allocation mode when all disks of one type
are full is not always a feature.  When an array gets full, sometimes
data gets on SSDs, or metadata gets on spinners, and we quickly lose
the benefit of separating them.  Balances in either direction to fix
this after it happens are time-consuming and waste SSD lifetime TBW.

I'd like an easy way to make the preference strict, i.e. if there are
two types of disks in the filesystem, and one type fills up, and we're
not in a weird state like degraded mode or deleting the last drive of
one type, then go directly to ENOSPC.  That requires a more complex
patch since we'd have to change the way free space is calculated for
'df' to exclude the metadata devices, and track whether there are two
types of device present or just one.

> Below I collected some data to highlight the performance increment.
> 
> Test setup:
> I performed as test a "dist-upgrade" of a Debian from stretch to buster.
> The test consisted in an image of a Debian stretch[1]  with the packages
> needed under /var/cache/apt/archives/ (so no networking was involved).
> For each test I formatted the filesystem from scratch, un-tar-red the
> image and the ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
> combination I measured the time of apt dist-upgrade with and
> without the flag "force-unsafe-io" which reduce the using of sync(2) and
> flush(2). The ssd was 20GB big, the hdd was 230GB big,
> 
> I considered the following scenarios:
> - btrfs over ssd
> - btrfs over ssd + hdd with my patch enabled
> - btrfs over bcache over hdd+ssd
> - btrfs over hdd (very, very slow....)
> - ext4 over ssd
> - ext4 over hdd
> 
> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> as cache/buff.
> 
> Data analysis:
> 
> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> by using the btrfs snapshot capabilities. But this is another (not easy) story.
> 
> Unsurprising bcache performs better than my patch. But this is an expected
> result because it can cache also the data chunk (the read can goes directly to
> the ssd). bcache perform about +60% slower when there are a lot of sync/flush
> and only +20% in the other case.
> 
> Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
> time from +256% to +113%  than the hdd-only . Which I consider a good
> results considering how small is the patch.
> 
> 
> Raw data:
> The data below is the "real" time (as return by the time command) consumed by
> apt
> 
> 
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd w/sync	   142:38	+533%
> btrfs ssd+hdd w/sync        81:04	+260%
> ext4 hdd w/sync	            52:39	+134%
> btrfs bcache w/sync	    35:59	 +60%
> btrfs ssd w/sync	    22:31	reference
> ext4 ssd w/sync	            12:19	 -45%
> 
> 
> 
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd	             56:2	+256%
> ext4 hdd	            51:32	+228%
> btrfs ssd+hdd	            33:30	+113%
> btrfs bcache	            18:57	 +20%
> btrfs ssd	            15:44	reference
> ext4 ssd	            11:49	 -25%
> 
> 
> [1] I created the image, using "debootrap stretch", then I installed a set
> of packages using the commands:
> 
>   # debootstrap stretch test/
>   # chroot test/
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # apt --option=Dpkg::Options::=--force-confold \
>         --option=Dpkg::options::=--force-unsafe-io \
> 	install mate-desktop-environment* xserver-xorg vim \
>         task-kde-desktop task-gnome-desktop
> 
> Then updated the release from stretch to buster changing the file /etc/apt/source.list
> Then I download the packages for the dist upgrade:
> 
>   # apt-get update
>   # apt-get --download-only dist-upgrade
> 
> Then I create a tar of this image.
> Before the dist upgrading the space used was about 7GB of space with 2281
> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
> The upgrade installed/updated about 2251 packages.
> 
> 
> [2] The command was a bit more complex, to avoid an interactive session
> 
>   # mkfs.btrfs -m single -d single /dev/sdX
>   # mount /dev/sdX test/
>   # cd test
>   # time tar xzf ../image.tgz
>   # chroot .
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # export DEBIAN_FRONTEND=noninteractive
>   # time apt-get -y --option=Dpkg::Options::=--force-confold \
> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
> 
> 
> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2021-01-08  1:05 ` Zygo Blaxell
@ 2021-01-08 17:30   ` Goffredo Baroncelli
  2021-01-08 17:43     ` BTRFS and *CACHE setup [was Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata] Goffredo Baroncelli
  2021-01-09 21:23     ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
  0 siblings, 2 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2021-01-08 17:30 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On 1/8/21 2:05 AM, Zygo Blaxell wrote:
> On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:
>>
[...]
> 
> I've been testing these patches for a while now.  They enable an
> interesting use case that can't otherwise be done safely, sanely or
> cheaply with btrfs.

Thanks Zygo for this feedback. As usual you are a source of very interesting considerations.
> 
> Normally if we have an array of, say, 10 spinning disks, and we want to
> implement a writeback cache layer with SSD, we would need 10 distinct SSD
> devices to avoid reducing btrfs's ability to recover from drive failures.
> The writeback cache will be modified on both reads and writes, data and
> metadata, so we need high endurance SSDs if we want them to make it to
> the end of their warranty.  The SSD firmware has to not have crippling
> performance bugs while under heavy write load, which means we are now
> restricted to an expensive subset of high endurance SSDs targeted at
> the enterprise/NAS/video production markets...and we need 10 of them!
> 
> NVME has fairly draconian restrictions on drive count, and getting
> anything close to 10 of them into a btrfs filesystem can be an expensive
> challenge.  (I'm not counting solutions that use USB-to-NVME bridges
> because those don't count as "sane" or "safe").
> 
> We can share the cache between disks, but not safely in writeback mode,
> because a failure in one SSD could affect multiple logical btrfs disks.
> Strictly speaking we can't do it safely in any cache mode, but at least
> with a writethrough cache we can recover the btrfs by throwing the SSDs
> away.

I will reply in a separate thread, because I found your considerations very
interesting but OT.

> 
> In the current btrfs raid5 and dm-cache implementations, a scrub through a
> SSD/HDD cache pair triggers an _astronomical_ number of SSD cache writes
> (enough to burn through a low-end SSD's TBW in a weekend).  That can
> presumably be fixed one day, but it's definitely unusable today.

Again, this is another point to discuss...

> 
> If we have only 2 NVME drives, with this patch, we can put them to work
> storing metadata, leave the data on the spinning rust, and keep all
> the btrfs self-repair features working, and get most of the performance
> gain of SSD caching at significantly lower cost.  The write loads are
> reasonable (metadata only, no data, no reads) so we don't need a high
> endurance SSD.  With raid1/10/c3/c4 redundancy, btrfs can fix silent
> metadata corruption, so we can even use cheap SSDs as long as they
> don't hang when they fail.  We can use btrfs raid5/6 for data since
> preferred_metadata avoids the bugs that kill the SSD when we run a scrub.
> 
> Now all we have to do is keep porting these patches to new kernels until
> some equivalent feature lands in mainline.  :-P

Ok, this would be doable.
> 
> I do have some comments about these particular patches:
> 
> I dropped the "preferred_metadata" mount option very early in testing.
> Apart from the merge conflicts with later kernels, the option is
> redundant, since we could just as easily change all the drive type
> properties back to 0 to get the same effect.  Adding an incompatible
> mount option (and potentially removing it again after a failed test)
> was a comparatively onerous requirement.

Could you clarify this point? I understood that you removed some pieces
of patch #4, but that the other parts are still needed.

Anyway I have some doubts about removing this option entirely.
The option enables the new policy. The disk property instead marks the
disk as eligible as preferred storage for metadata. These are two very
different concepts and I prefer to keep them separate.

The behavior that you suggested is something like:
- if the "preferred metadata" property is set on any disk, the
"preferred metadata" behavior is enabled.

My main concern is that setting this disk flag is not atomic, while
the mount option is. Would this help the maintenance of the filesystem
when the user is replacing (dropping) disks?

Anyway I am open to hear other opinions.

> 
> The fallback to the other allocation mode when all disks of one type
> are full is not always a feature.  When an array gets full, sometimes
> data gets on SSDs, or metadata gets on spinners, and we quickly lose
> the benefit of separating them.  Balances in either direction to fix
> this after it happens are time-consuming and waste SSD lifetime TBW.
> 
> I'd like an easy way to make the preference strict, i.e. if there are
> two types of disks in the filesystem, and one type fills up, and we're
> not in a weird state like degraded mode or deleting the last drive of
> one type, then go directly to ENOSPC.  That requires a more complex
> patch since we'd have to change the way free space is calculated for
> 'df' to exclude the metadata devices, and track whether there are two
> types of device present or just one.
> 

I fear another issue: what happens when you fill the metadata disks?
Does BTRFS allow updating the "preferred metadata" property on a
filled disk? I would say yes, but I am not sure what happens when
BTRFS forces the filesystem RO.

Frankly speaking, I think that a slower filesystem is far better than a
locked one. A better way would be to add a warning to btrfs-progs which
says: "WARNING: your preferred metadata disks are filled !!!"


>> Below I collected some data to highlight the performance increment.
>>
>> Test setup:
>> I performed as test a "dist-upgrade" of a Debian from stretch to buster.
>> The test consisted in an image of a Debian stretch[1]  with the packages
>> needed under /var/cache/apt/archives/ (so no networking was involved).
>> For each test I formatted the filesystem from scratch, un-tar-red the
>> image and the ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
>> combination I measured the time of apt dist-upgrade with and
>> without the flag "force-unsafe-io" which reduce the using of sync(2) and
>> flush(2). The ssd was 20GB big, the hdd was 230GB big,
>>
>> I considered the following scenarios:
>> - btrfs over ssd
>> - btrfs over ssd + hdd with my patch enabled
>> - btrfs over bcache over hdd+ssd
>> - btrfs over hdd (very, very slow....)
>> - ext4 over ssd
>> - ext4 over hdd
>>
>> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
>> as cache/buff.
>>
>> Data analysis:
>>
>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
>>
>> Unsurprising bcache performs better than my patch. But this is an expected
>> result because it can cache also the data chunk (the read can goes directly to
>> the ssd). bcache perform about +60% slower when there are a lot of sync/flush
>> and only +20% in the other case.
>>
>> Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
>> time from +256% to +113%  than the hdd-only . Which I consider a good
>> results considering how small is the patch.
>>
>>
>> Raw data:
>> The data below is the "real" time (as return by the time command) consumed by
>> apt
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd w/sync	   142:38	+533%
>> btrfs ssd+hdd w/sync        81:04	+260%
>> ext4 hdd w/sync	            52:39	+134%
>> btrfs bcache w/sync	    35:59	 +60%
>> btrfs ssd w/sync	    22:31	reference
>> ext4 ssd w/sync	            12:19	 -45%
>>
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd	             56:2	+256%
>> ext4 hdd	            51:32	+228%
>> btrfs ssd+hdd	            33:30	+113%
>> btrfs bcache	            18:57	 +20%
>> btrfs ssd	            15:44	reference
>> ext4 ssd	            11:49	 -25%
>>
>>
>> [1] I created the image, using "debootrap stretch", then I installed a set
>> of packages using the commands:
>>
>>    # debootstrap stretch test/
>>    # chroot test/
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # apt --option=Dpkg::Options::=--force-confold \
>>          --option=Dpkg::options::=--force-unsafe-io \
>> 	install mate-desktop-environment* xserver-xorg vim \
>>          task-kde-desktop task-gnome-desktop
>>
>> Then updated the release from stretch to buster changing the file /etc/apt/source.list
>> Then I download the packages for the dist upgrade:
>>
>>    # apt-get update
>>    # apt-get --download-only dist-upgrade
>>
>> Then I create a tar of this image.
>> Before the dist upgrading the space used was about 7GB of space with 2281
>> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
>> The upgrade installed/updated about 2251 packages.
>>
>>
>> [2] The command was a bit more complex, to avoid an interactive session
>>
>>    # mkfs.btrfs -m single -d single /dev/sdX
>>    # mount /dev/sdX test/
>>    # cd test
>>    # time tar xzf ../image.tgz
>>    # chroot .
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # export DEBIAN_FRONTEND=noninteractive
>>    # time apt-get -y --option=Dpkg::Options::=--force-confold \
>> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
>>
>>
>> BR
>> G.Baroncelli
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>
>>
>>
>>
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* BTRFS and *CACHE setup [was Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata]
  2021-01-08 17:30   ` Goffredo Baroncelli
@ 2021-01-08 17:43     ` Goffredo Baroncelli
  2021-01-09 21:23     ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
  1 sibling, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2021-01-08 17:43 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On 1/8/21 6:30 PM, Goffredo Baroncelli wrote:
> On 1/8/21 2:05 AM, Zygo Blaxell wrote:
>> On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:
>>>
> [...]
>>
>> I've been testing these patches for a while now.  They enable an
>> interesting use case that can't otherwise be done safely, sanely or
>> cheaply with btrfs.
> 
> Thanks Zygo for this feedback. As usual you are a source of very interesting considerations.
>>
>> Normally if we have an array of, say, 10 spinning disks, and we want to
>> implement a writeback cache layer with SSD, we would need 10 distinct SSD
>> devices to avoid reducing btrfs's ability to recover from drive failures.
>> The writeback cache will be modified on both reads and writes, data and
>> metadata, so we need high endurance SSDs if we want them to make it to
>> the end of their warranty.  The SSD firmware has to not have crippling
>> performance bugs while under heavy write load, which means we are now
>> restricted to an expensive subset of high endurance SSDs targeted at
>> the enterprise/NAS/video production markets...and we need 10 of them!
>>
>> NVME has fairly draconian restrictions on drive count, and getting
>> anything close to 10 of them into a btrfs filesystem can be an expensive
>> challenge.  (I'm not counting solutions that use USB-to-NVME bridges
>> because those don't count as "sane" or "safe").
>>
>> We can share the cache between disks, but not safely in writeback mode,
>> because a failure in one SSD could affect multiple logical btrfs disks.
>> Strictly speaking we can't do it safely in any cache mode, but at least
>> with a writethrough cache we can recover the btrfs by throwing the SSDs
>> away.
[...]

Hi Zygo,

could you elaborate on the last sentence? What I understood is that in
write-through mode, the ordering (and the barriers) are preserved.
So this mode should be safe (bugs apart).

If this is true, it would be possible to have a btrfs multi (spinning)
disk setup with only one SSD acting as cache. Of course, it would work
only in write-through mode, and the main benefit would be caching the
data for subsequent reads.

Does anyone have further experience? Has anyone tried to
recover a BTRFS filesystem when the cache disk died?

Oh.. wait... Now I understand: if the caching disk reads badly (but
without returning an error), the bad data may be written to the other
disks: in this case a single failure (the cache disk) may affect
all the other disks and the redundancy is lost ...

BR
G.Baroncelli

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2021-01-08 17:30   ` Goffredo Baroncelli
  2021-01-08 17:43     ` BTRFS and *CACHE setup [was Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata] Goffredo Baroncelli
@ 2021-01-09 21:23     ` Zygo Blaxell
  2021-01-10 19:55       ` Goffredo Baroncelli
  1 sibling, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2021-01-09 21:23 UTC (permalink / raw)
  To: kreijack
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On Fri, Jan 08, 2021 at 06:30:49PM +0100, Goffredo Baroncelli wrote:
> On 1/8/21 2:05 AM, Zygo Blaxell wrote:
> > On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:
> > > 
> [...]
> > 
> > I've been testing these patches for a while now.  They enable an
> > interesting use case that can't otherwise be done safely, sanely or
> > cheaply with btrfs.
> 
> Thanks Zygo for this feedback. As usual you are a source of very interesting considerations.
> > 
> > Normally if we have an array of, say, 10 spinning disks, and we want to
> > implement a writeback cache layer with SSD, we would need 10 distinct SSD
> > devices to avoid reducing btrfs's ability to recover from drive failures.
> > The writeback cache will be modified on both reads and writes, data and
> > metadata, so we need high endurance SSDs if we want them to make it to
> > the end of their warranty.  The SSD firmware has to not have crippling
> > performance bugs while under heavy write load, which means we are now
> > restricted to an expensive subset of high endurance SSDs targeted at
> > the enterprise/NAS/video production markets...and we need 10 of them!
> > 
> > NVME has fairly draconian restrictions on drive count, and getting
> > anything close to 10 of them into a btrfs filesystem can be an expensive
> > challenge.  (I'm not counting solutions that use USB-to-NVME bridges
> > because those don't count as "sane" or "safe").
> > 
> > We can share the cache between disks, but not safely in writeback mode,
> > because a failure in one SSD could affect multiple logical btrfs disks.
> > Strictly speaking we can't do it safely in any cache mode, but at least
> > with a writethrough cache we can recover the btrfs by throwing the SSDs
> > away.
> 
> I will reply in a separate thread, because I found your considerations very
> interesting but OT.
> 
> > 
> > In the current btrfs raid5 and dm-cache implementations, a scrub through a
> > SSD/HDD cache pair triggers an _astronomical_ number of SSD cache writes
> > (enough to burn through a low-end SSD's TBW in a weekend).  That can
> > presumably be fixed one day, but it's definitely unusable today.
> 
> Again, this is another point to discuss...
> 
> > 
> > If we have only 2 NVME drives, with this patch, we can put them to work
> > storing metadata, leave the data on the spinning rust, and keep all
> > the btrfs self-repair features working, and get most of the performance
> > gain of SSD caching at significantly lower cost.  The write loads are
> > reasonable (metadata only, no data, no reads) so we don't need a high
> > endurance SSD.  With raid1/10/c3/c4 redundancy, btrfs can fix silent
> > metadata corruption, so we can even use cheap SSDs as long as they
> > don't hang when they fail.  We can use btrfs raid5/6 for data since
> > preferred_metadata avoids the bugs that kill the SSD when we run a scrub.
> > 
> > Now all we have to do is keep porting these patches to new kernels until
> > some equivalent feature lands in mainline.  :-P
> 
> Ok, this would be doable.
> > 
> > I do have some comments about these particular patches:
> > 
> > I dropped the "preferred_metadata" mount option very early in testing.
> > Apart from the merge conflicts with later kernels, the option is
> > redundant, since we could just as easily change all the drive type
> > properties back to 0 to get the same effect.  Adding an incompatible
> > mount option (and potentially removing it again after a failed test)
> > was a comparatively onerous requirement.
> 
> Could you clarify this point? I understood that you removed some pieces
> of patch #4, but that the other parts are still needed.

I just removed the preferred_metadata mount option, and removed the check
for preferred_metadata in the if statements (i.e. the behavior is as if
preferred_metadata was always set in the mount options).

> Anyway I have some doubts about removing this option entirely.
> The option enables the new policy. The disk property instead marks the
> disk as eligible as preferred storage for metadata. These are two very
> different concepts and I prefer to keep them separate.
> 
> The behavior that you suggested is something like:
> - if a "preferred metadata" properties is set on any disks, the
> "preferred metadata" behavior is enabled.

Yes, that way we can turn it on once with btrfs prop set just after mkfs,
and not have to make changes to userspace on top to change the mount
options (which would require spinning a whole different image just for
this experiment).  In this way it behaves more like btrfstune or raid
profile settings.

I could have used mount -oremount, but there was a merge conflict in
that part of the code already, so I already had to rewrite those parts
of the patch.  I saw no need for the mount option to be there, so
I just resolved all the conflicts by keeping the code without it.

> My main concern is that setting this disk flag is not atomic, while
> the mount option is. Would this help the maintenance of the filesystem
> when the user is replacing (dropping) disks?

We would have to adjust the check for minimum number of disks in the
RAID profile to take disks that refuse to accept data/metadata into
account (unless that check is already calling into the patched code?).
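
As a sketch, the strict variant could simply count the devices that are
eligible for the chunk type before the existing sort, and compare the
result against ctl->devs_min. The helper below is illustrative only; it
reuses the preferred_metadata bit added to struct btrfs_device_info by
this series:

    /*
     * Illustrative only: number of devices on which a chunk of the
     * given kind may be placed under a strict policy.
     */
    static int count_eligible_devices(const struct btrfs_device_info *info,
                                      int ndevs, bool is_metadata)
    {
            int i, eligible = 0;

            for (i = 0; i < ndevs; i++)
                    if (!!info[i].preferred_metadata == is_metadata)
                            eligible++;
            return eligible;
    }

If the count is below devs_min, a strict mode would return -ENOSPC
instead of widening the candidate set to the other kind of device.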

Also we should disallow setting all of the devices data-only or
metadata-only at once--though this isn't strictly necessary.  It would
work for a time, as long as no new chunk allocations occur, and then
get a normal ENOSPC when the next chunk allocation is attempted.
The user can then say "oh right I need to enable some data drives" and
fix it, and the filesystem would be effectively full in the interim.

Failing disks behave the same way as before--an offline raid1 mirror is
offline whether it prefers metadata or not.

> Anyway I am open to hear other opinions.
> 
> > 
> > The fallback to the other allocation mode when all disks of one type
> > are full is not always a feature.  When an array gets full, sometimes
> > data gets on SSDs, or metadata gets on spinners, and we quickly lose
> > the benefit of separating them.  Balances in either direction to fix
> > this after it happens are time-consuming and waste SSD lifetime TBW.
> > 
> > I'd like an easy way to make the preference strict, i.e. if there are
> > two types of disks in the filesystem, and one type fills up, and we're
> > not in a weird state like degraded mode or deleting the last drive of
> > one type, then go directly to ENOSPC.  That requires a more complex
> > patch since we'd have to change the way free space is calculated for
> > 'df' to exclude the metadata devices, and track whether there are two
> > types of device present or just one.
> 
> I fear another issue: what happens when you fill the metadata disks?

Exactly the same thing that happens now when you run out of unallocated
space for the metadata raid profile, or configure an unsatisfiable set of
disk sizes and raid profiles.  This situation can already occur on every
btrfs filesystem under some sequence of allocations and deallocations.

You can replicate the behavior of strict preferred_metadata by doing
btrfs fi resize operations at exactly the right times.  Nothing new here.

> Does BTRFS allow updating the "preferred metadata" property on a
> filled disk? I would say yes, but I am not sure what happens when
> BTRFS forces the filesystem RO.

No, on btrfs you must always avoid running out of metadata space.
It is too late to do anything after you have run out, other than to roll
back to the end of the previous transaction (i.e. umount and mount the
filesystem again).  That is why there is a global reserve to force at
least half a GB free metadata space, and code sprinkled all over the
kernel that tries, mostly successfully, to avoid filling it.  On our
filesystems we try to ensure that metadata has room to expand by 50%
at all times (rounded up to the next GB on small filesystems).

There are current bugs where you can get into a "groundhog day" situation:
btrfs's mount kernel thread starts a transaction that fails with ENOSPC
in metadata.  The transaction isn't completed, so as far as the filesystem
on disk is concerned, it never happened.  The next mount does exactly
the same thing in every detail.  Repeat until you upgrade the kernel
with a bug fix (or downgrade to avoid a regression).  _Maybe_ you can
get a lvextend/btrfs fi resize or dev add into the same transaction
and get out of the loop that way, but I had a case not too long ago
where I was forced to roll back to 4.19 to complete a snapshot delete
(keep those old LTS kernels handy, you might need one!).

I hadn't noticed this before, but strict preferred metadata provides us
a novel way to avoid the risk of running out of metadata space on btrfs.
We effectively reserve all of the space on the SSDs for metadata, because
strict preferred metadata won't allow any data to be allocated on the
same disks.  Non-strict preferred metadata is _more_ at risk of running
out of metadata space, as data block groups are not prevented from using
the reserved space on SSDs.

> Frankly speaking, I think that a slower filesystem is far better than a
> locked one. 

No, slower is a _failure mode_ in our use case.  We are competing
with slower, and fail if we do not win.

We already have all-spinning arrays that are cheap, all-SSD arrays that
are fast, and arrays of SSD/spinner pairs that are expensive.  We are
considering spending additional money on SSDs and host interfaces to
improve the performance of the all-spinners, but want to spend less money
than the all-SSD or SSD/spinner pair configurations.  We expect strictly
improved performance for any money we spend.  We can haggle over whether
the improvement was worth the cost, but clearly that's not the case if
the improvement is zero.  Equal performance is not acceptable: if there's
no gain, then we keep our money and go back to the cheaper option.

On a loaded test server, I observed 90th percentile fsync times
drop from 7 seconds without preferred_metadata to 0.7 seconds with
preferred_metadata when all the metadata is on the SSDs.  If some metadata
ever lands on a spinner, we go back to almost 7 seconds latency again
(it sometimes only gets up to 5 or 6 seconds, but it's still very bad).
We lost our performance gain, so our test resulted in failure.

If we try to fix this, by running a metadata balance to get the metadata
off of the spinner, fsync's 90th percentile latency goes up past 600
seconds (i.e.  10 minutes) during the balance, and the balance runs for
6 to 10 hours.  The fileserver is effectively unusable for at least
one working day.  There's no way we can spin that event to our users
as anything less than a service outage.

If data lands on a SSD, we are at greater risk of running out of
metadata space.  We normally (without preferred_metadata) do a lot of
work to force btrfs to maintain spare metadata space--in extreme cases,
with the metadata_ratio mount option--and run daily data balances
to maintain reserves of unallocated space for metadata to grow into.
If we can completely disallow data allocations on SSD, we get an easily
manageable, fast, huge chunk of unallocated space that only metadata can
use, without all the balancing iops and continuous metadata free space
monitoring.  We can control the size of this reserved space by choosing
the SSD size.  This is a valuable capability all by itself.  If we have
non-strict preferred_metadata then we get none of these benefits.

The extra write load for data changes the pricing model for the SSDs.
Failure rates go up (to 100% 5-year failure rate on some SSD models).
Performance goes down (to significantly slower than spinners on some
SSD models).  If we push the price up by over 40%, it's cheaper to buy
high-endurance SSDs instead.  This isn't a failure--we still need fewer
SSDs for preferred_metadata than we would need for dm-cache or bcache,
so we're still winning--but we could easily do better, and that's not
a good story to take into is-it-worth-the-cost discussions.

In the cases I tested, we have about 100:1 SSD to spinner capacity ratio
(e.g. 6x8TB in spinners with 2x512GB in SSD, or 3x12TB in spinners
with 2x128GB in SSD).  Only a trivial amount of data space is lost if
we do not allow data allocation on SSD, and we were doing a lot of
work to avoid ever using that data space before preferred_metadata.
As the array gets bigger, we swap out the spinners for bigger drives,
and we can swap in bigger SSDs at the same time to maintain the ratio.

Here are two other scenarios:

1.  The smallest available SSDs are more than 2% of the total spinner
size.  In this case, we would never fill the SSD with metadata, so we
could allow some data to be placed on the SSD in order to make use of the
space that would otherwise be unused.  We must forbid metadata allocation
on the spinner to keep the performance gain.

We have to monitor unallocated space on the SSD and watch out for data
chunks stealing all the space we need for metadata like we do on any
other btrfs filesystem; however, with data preferentially allocated
on the spinner, in most cases all the necessary metadata chunk
allocations will be completed before the last unallocated space on
the SSD becomes a data chunk.  We don't need to monitor the spinner
because its metadata usage will always be zero.

2.  We have one small fast SSD and one large slow spinner, and we already
paid for them, so we cannot reduce cost by omitting one of the devices
(this is a very popular use case on #btrfs, it's common to fall into
this situation with laptops).  This is a worst-case setup for the stock
btrfs allocator.  Even random allocation would be about 450% better.
In this situation, there is no hardware cost penalty for having
performance equal to the stock btrfs allocator, so we no longer have
to have a performance win--tying is good enough.  With this competitor,
we could only lose the performance race intentionally.

We still have to monitor unallocated space and watch out for data chunks
stealing all the space we need for metadata, but it's no different than
stock btrfs (the order in which devices are filled is different, but
all the devices are still filled).

We could handle all these use cases with two bits:

	bit 0:  0 = prefer data, 1 = prefer metadata
	bit 1:  0 = allow other types, 1 = exclude other types

which gives 4 encoded values:

	0 = prefer data, allow metadata (default)
	1 = prefer metadata, allow data (same as v4 patch)
	2 = prefer data, disallow metadata
	3 = prefer metadata, disallow data

In the allocator, first try allocation using all devices where
data/metadata matches in bit 0, and bit 1 is set.  If that fails,
try again with all those devices, plus devices where bit 1 is clear.
If that fails too, go to ENOSPC.

'df' reports bavail counting unallocated space only when bit 0 is 0
(data preferred) or bit 1 is 0 (all types allowed) in the device type.
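
A minimal sketch of that encoding, with made-up flag names (nothing
below exists in the v4 patch):

    /* Hypothetical per-device allocation hint, stored in two bits. */
    #define DEV_ALLOC_PREFER_METADATA  (1 << 0)  /* 0 = prefer data */
    #define DEV_ALLOC_EXCLUSIVE        (1 << 1)  /* 1 = disallow the other type */

    /* May a chunk of this kind be placed on the device at all? */
    static bool dev_allows_type(u8 hint, bool is_metadata)
    {
            if (!(hint & DEV_ALLOC_EXCLUSIVE))
                    return true;            /* encoded values 0 and 1 */
            return !!(hint & DEV_ALLOC_PREFER_METADATA) == is_metadata;
    }

    /* Should the device be tried in the first allocation pass? */
    static bool dev_prefers_type(u8 hint, bool is_metadata)
    {
            return !!(hint & DEV_ALLOC_PREFER_METADATA) == is_metadata;
    }

The allocation passes described above would then be expressed in terms
of these two helpers, and the 'df' accounting would look at the same
two bits.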

> A better way would be to add a warning to btrfs-progs which
> says: "WARNING: your preferred metadata disks are filled !!!"

This is something that already happens on every btrfs, so it is not
in any way specific to preferred_metadata.  I'd rather minimize new
warning messages, especially messages where it's so difficult to
determine the condition accurately.

The best way I have found to predict metadata ENOSPC is to monitor
filesystem usage stats over time and watch for indications the trend
line will cross zero free space soon.  Without the trend data (or
very deep metadata analysis to reconstruct the trend data), we can't
know how fast metadata space is growing (or steady, or even shrinking).
Every workload consumes and releases metadata space at a different rate,
so one server's 1% free space is a boring non-event while another server's
1% is an imminent critical emergency.

The calculations for monitoring tools will be a little different with
strict preferred metadata.

> > > Below I collected some data to highlight the performance increment.
> > > 
> > > Test setup:
> > > I performed as test a "dist-upgrade" of a Debian from stretch to buster.
> > > The test consisted in an image of a Debian stretch[1]  with the packages
> > > needed under /var/cache/apt/archives/ (so no networking was involved).
> > > For each test I formatted the filesystem from scratch, un-tar-red the
> > > image and the ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
> > > combination I measured the time of apt dist-upgrade with and
> > > without the flag "force-unsafe-io" which reduce the using of sync(2) and
> > > flush(2). The ssd was 20GB big, the hdd was 230GB big,
> > > 
> > > I considered the following scenarios:
> > > - btrfs over ssd
> > > - btrfs over ssd + hdd with my patch enabled
> > > - btrfs over bcache over hdd+ssd
> > > - btrfs over hdd (very, very slow....)
> > > - ext4 over ssd
> > > - ext4 over hdd
> > > 
> > > The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> > > as cache/buff.
> > > 
> > > Data analysis:
> > > 
> > > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > > by using the btrfs snapshot capabilities. But this is another (not easy) story.
> > > 
> > > Unsurprising bcache performs better than my patch. But this is an expected
> > > result because it can cache also the data chunk (the read can goes directly to
> > > the ssd). bcache perform about +60% slower when there are a lot of sync/flush
> > > and only +20% in the other case.
> > > 
> > > Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
> > > time from +256% to +113%  than the hdd-only . Which I consider a good
> > > results considering how small is the patch.
> > > 
> > > 
> > > Raw data:
> > > The data below is the "real" time (as return by the time command) consumed by
> > > apt
> > > 
> > > 
> > > Test description         real (mmm:ss)	Delta %
> > > --------------------     -------------  -------
> > > btrfs hdd w/sync	   142:38	+533%
> > > btrfs ssd+hdd w/sync        81:04	+260%
> > > ext4 hdd w/sync	            52:39	+134%
> > > btrfs bcache w/sync	    35:59	 +60%
> > > btrfs ssd w/sync	    22:31	reference
> > > ext4 ssd w/sync	            12:19	 -45%
> > > 
> > > 
> > > 
> > > Test description         real (mmm:ss)	Delta %
> > > --------------------     -------------  -------
> > > btrfs hdd	             56:2	+256%
> > > ext4 hdd	            51:32	+228%
> > > btrfs ssd+hdd	            33:30	+113%
> > > btrfs bcache	            18:57	 +20%
> > > btrfs ssd	            15:44	reference
> > > ext4 ssd	            11:49	 -25%
> > > 
> > > 
> > > [1] I created the image, using "debootrap stretch", then I installed a set
> > > of packages using the commands:
> > > 
> > >    # debootstrap stretch test/
> > >    # chroot test/
> > >    # mount -t proc proc proc
> > >    # mount -t sysfs sys sys
> > >    # apt --option=Dpkg::Options::=--force-confold \
> > >          --option=Dpkg::options::=--force-unsafe-io \
> > > 	install mate-desktop-environment* xserver-xorg vim \
> > >          task-kde-desktop task-gnome-desktop
> > > 
> > > Then updated the release from stretch to buster changing the file /etc/apt/source.list
> > > Then I download the packages for the dist upgrade:
> > > 
> > >    # apt-get update
> > >    # apt-get --download-only dist-upgrade
> > > 
> > > Then I create a tar of this image.
> > > Before the dist upgrading the space used was about 7GB of space with 2281
> > > packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
> > > The upgrade installed/updated about 2251 packages.
> > > 
> > > 
> > > [2] The command was a bit more complex, to avoid an interactive session
> > > 
> > >    # mkfs.btrfs -m single -d single /dev/sdX
> > >    # mount /dev/sdX test/
> > >    # cd test
> > >    # time tar xzf ../image.tgz
> > >    # chroot .
> > >    # mount -t proc proc proc
> > >    # mount -t sysfs sys sys
> > >    # export DEBIAN_FRONTEND=noninteractive
> > >    # time apt-get -y --option=Dpkg::Options::=--force-confold \
> > > 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
> > > 
> > > 
> > > BR
> > > G.Baroncelli
> > > 
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> > > 
> > > 
> > > 
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2021-01-09 21:23     ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
@ 2021-01-10 19:55       ` Goffredo Baroncelli
  2021-01-16  0:25         ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2021-01-10 19:55 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On 1/9/21 10:23 PM, Zygo Blaxell wrote:

> On a loaded test server, I observed 90th percentile fsync times
> drop from 7 seconds without preferred_metadata to 0.7 seconds with
> preferred_metadata when all the metadata is on the SSDs.  If some metadata
> ever lands on a spinner, we go back to almost 7 seconds latency again
> (it sometimes only gets up to 5 or 6 seconds, but it's still very bad).
> We lost our performance gain, so our test resulted in failure.

Wow, this is very interesting information: a use case where there is a
10x increase in speed!

Could you share more details about this server? With more data supporting
this patch, we can convince David to include it.

[...]
> 
> We could handle all these use cases with two bits:
> 
> 	bit 0:  0 = prefer data, 1 = prefer metadata
> 	bit 1:  0 = allow other types, 1 = exclude other types
> 
> which gives 4 encoded values:
> 
> 	0 = prefer data, allow metadata (default)
> 	1 = prefer metadata, allow data (same as v4 patch)
> 	2 = prefer data, disallow metadata
> 	3 = prefer metadata, disallow data

What you are suggesting allows the maximum flexibility. However I still
fear that we are mixing two discussions that are unrelated except that
the solution *may* be the same:

1) the first discussion is related to the increasing of performance
because we put the metadata in the faster disks and the data in
the slower one.

2) the second discussion is how to avoid that data chunks consume the
space needed for metadata.

Regarding 2), I think that a more generic approach is something like:
- don't allocate a *data* chunk if the unallocated space is less than <X>,
where <X> is the maximum size of a metadata chunk (IIRC 1GB ?), possibly
multiplied by 2x or 3x.
The metadata allocation policy instead is still constrained only to having
enough space. As a further step (to allow a metadata balance command to
succeed), we could constrain the metadata allocation policy to allocate up
to half of the available space (or 1 GB, whichever is smaller).
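
A minimal sketch of that rule, with a hypothetical reserve constant and
helpers that are not part of any posted patch:

    /* Keep roughly 2-3 maximum-sized metadata chunks' worth of
     * unallocated space free before accepting a new data chunk. */
    #define METADATA_CHUNK_RESERVE  (3ULL * SZ_1G)

    static bool data_chunk_allowed(u64 unallocated_bytes)
    {
            return unallocated_bytes >= METADATA_CHUNK_RESERVE;
    }

    /* Metadata chunks: at most half the available space, capped at 1GiB. */
    static u64 metadata_chunk_size(u64 available_bytes)
    {
            return min_t(u64, available_bytes / 2, SZ_1G);
    }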

Regarding 1), I prefer to keep the patch as simple as possible to increase
the likelihood of inclusion. We can always add further constraints later.

Anyway I am rebasing the patch to the latest kernel. Let me check how
complex it would be to implement your algorithm (the two-bit one).

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata
  2021-01-10 19:55       ` Goffredo Baroncelli
@ 2021-01-16  0:25         ` Zygo Blaxell
  0 siblings, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2021-01-16  0:25 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Wang Yugui,
	Paul Jones, Adam Borowski

On Sun, Jan 10, 2021 at 08:55:36PM +0100, Goffredo Baroncelli wrote:
> On 1/9/21 10:23 PM, Zygo Blaxell wrote:
> 
> > On a loaded test server, I observed 90th percentile fsync times
> > drop from 7 seconds without preferred_metadata to 0.7 seconds with
> > preferred_metadata when all the metadata is on the SSDs.  If some metadata
> > ever lands on a spinner, we go back to almost 7 seconds latency again
> > (it sometimes only gets up to 5 or 6 seconds, but it's still very bad).
> > We lost our performance gain, so our test resulted in failure.
> 
> Wow, this is very interesting information: a use case where there is a
> 10x increase in speed!

It's a decrease of latency for the highest-latency fsyncs.  Not quite the
same thing as a 10x speed increase, or even an average latency decrease.
The gain comes from two places:

1.  There's much less variation in latency.  Seek times on spinners
are proportional to address distance, while seek times on SSD are not.
This is a 90th percentile observation, and reducing the variance a little
will reduce the 90th percentile a lot.

2.  Most of the latency on these servers comes from btrfs commit's giant
delayed_ref queue purges--hundreds of thousands of seek-heavy metadata
updates.  Those go _much_ faster on SSD than on HDD.  Previously, the
disks would fall behind write requests, so commits kept getting new data
to write all the time, making IO queues longer and longer, absorbing
all the free RAM that could smooth out spikes, and the transactions
would never finish until the writers stopped providing new data.  With SSD
metadata it's harder to hit this case--the SSDs can keep up with the
metadata, and the HDDs can flush data faster with less seeking, so we
keep IO queues short and RAM available.  The system-wide gains are very
nonlinear.

> Could you share more details about this server? With more data supporting
> this patch, we can convince David to include it.

The one I pulled the numbers from has 3 spinning disks, 7200 RPM NAS or
enterprise drives by WD, HGST, and Seagate.  SSDs are 1TB WD Reds--we
initially tried them for caching, but they were burning out too quickly,
so we found a new home for them in preferred_metadata test boxes.

The workload is backups:  rsyncs, btrfs receives, snapshot creates and
deletes, and dedupe.  Occasionally some data is retrieved.  I excluded
data collected while scrubs are running.

> [...]
> > 
> > We could handle all these use cases with two bits:
> > 
> > 	bit 0:  0 = prefer data, 1 = prefer metadata
> > 	bit 1:  0 = allow other types, 1 = exclude other types
> > 
> > which gives 4 encoded values:
> > 
> > 	0 = prefer data, allow metadata (default)
> > 	1 = prefer metadata, allow data (same as v4 patch)
> > 	2 = prefer data, disallow metadata
> > 	3 = prefer metadata, disallow data
> 
> What you are suggesting allows maximum flexibility. However, I still
> fear that we are mixing two discussions that are unrelated except that
> the solution *may* be the same:

I don't fear that.  I state it explicitly.

> 1) the first discussion is about increasing performance by putting the
> metadata on the faster disks and the data on the slower ones.

The current btrfs allocator gives close to the worst possible performance
with a mix of drive speeds inversely proportional to size.  This is an
important problem with an easily implemented solution that has been
reinvented several times by different individuals over a period of
some years.  People ask about this on IRC every other week, and in the
mailing list at least once this month.  I've stopped waiting for it to
happen and started testing patches of my own.

If we are going to implement a more sophisticated solution in the future,
we can presume it will be able to emulate preferred_metadata and import
the legacy configuration data, so we can go ahead and implement these
capabilities now.

These future solutions are probably all going to try to create
chunks on useful subsets of drives, and allocate useful subsets of
IO to the available chunks, because any other structure will require
pretty deep changes to the way btrfs works that will not plausibly
be accepted (e.g. not using chunks as allocation units any more).
The likely differences between one proposal and another are the number
of chunk types supported and the number of subsets of IO supported.
"Put metadata on device X, put data on device Y" should be a trivial
capability of any future implementation worth considering.

One source of pushback against the current patch set I would expect is
the use of btrfs_dev_item::type to store the type field.  That field was
probably intended for a different purpose, and space in btrfs_dev_item
is a scarce resource.  We don't intend to come up with a general device
type list here, or sort out when to use a NVME device and when to use
a SSD when both are present--currently we are interested in manually
directing the allocator, not enumerating all possible device information
for the allocator to make good decisions itself (and even if we were, we'd
want to be able to override those anyway).  Maybe the preferred_metadata
"type" can be moved to a separate PERSISTENT_ITEM key (see DEV_STATS_KEY)
with a preferred_metadata objectid.  This will make extensions easier
later on (there are still 2^64-ish objectid values available after this,
so we can have multiple versions of the configuration).
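
Something along these lines is what I have in mind -- only an untested
sketch, and the objectid constant below is invented for the example:

        /*
         * Hypothetical lookup of a per-device allocation preference stored
         * as a PERSISTENT_ITEM in the device tree, keyed by devid, similar
         * to how the dev stats item is stored today.
         */
        struct btrfs_key key = {
                .objectid = BTRFS_DEV_ALLOC_HINT_OBJECTID,  /* invented value */
                .type     = BTRFS_PERSISTENT_ITEM_KEY,
                .offset   = device->devid,
        };

        ret = btrfs_search_slot(NULL, fs_info->dev_root, &key, path, 0, 0);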

> 2) the second discussion is how to avoid data chunks consuming the space
> needed by metadata.

This is an important problem too, but a much larger and more general
one.  It's also one that is relatively well understood and managed now,
so there is less urgent need to fix it.  Also, unlike the performance
issue, there seems to be a large number of ways to change this for the
worse, or make changes which will make more sophisticated solutions
unnecessarily complex.

My strong opinion about this is that I don't have a strong opinion.
There are other problems to solve first, let's work on those.

> Regarding 2), I think that a more generic approach is something like:
> - don't allocate a *data* chunk if the unallocated free space is less than <X>,
> where <X> is the maximum size of a metadata chunk (IIRC 1GB?), possibly
> multiplied by 2x or 3x.

That's a constant value, orders of magnitude too big for small
filesystems and orders of magnitude too small for big filesystems.
e.g. on a filesystem with 150GB of metadata, we try to keep at least
75GB unused in allocated metadata chunks or in unallocated space.  On a
filesystem with 50GB of total space, we'd probably never need more than
4GB of metadata--or less.

We need metadata space for at least the following:

	- operations that don't throttle properly (e.g. snapshot delete,
	balance) - worst cases in the field are about 10% of the metadata
	size.  These are bugs and will presumably be fixed one day.

	- scrub (locks up to one metadata chunk per device, removing that
	chunk temporarily from usable space).  Seems to work as designed,
	but maybe someone someday will decide it's a bug and fix it?

	- global reserve (capped at 0.5GB) - by design.  In theory we
	don't need anything else--we should be able to gracefully throw
	ENOSPC any time this runs out, without being forced read-only.

	- snapshot subvol metadata page unsharing and big reflink copies.
	We can plan for worst case subvol unsharing easily enough, but
	big reflink copy is a user action we cannot predict.  The value is
	determined by application workload, so it should be configurable.

There are some constant terms (1GB per disk, 0.5GB reserve) and some
proportional terms (multiple of existing metadata size) and some
configurable terms in the equation (user says she needs 40GB on top of
what there is now, we don't know why but assume she knows what she's
doing) for determining how much metadata we need.  And this maybe isn't
even an exhaustive list.  But it shows the size of the problem.
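
As a back-of-the-envelope illustration, the terms might be combined like
this (every value below is a placeholder, not a recommendation):

        /*
         * Hypothetical estimate of how much metadata space to keep free:
         * a proportional term for unthrottled operations, one chunk per
         * device for scrub, the capped global reserve, plus whatever the
         * admin asks for.  All values are placeholders.
         */
        static u64 metadata_space_target(u64 meta_used, int num_devices, u64 user_extra)
        {
                u64 unthrottled_ops = meta_used / 10;           /* ~10% worst case */
                u64 scrub_locked = (u64)num_devices * SZ_1G;    /* 1 chunk per device */
                u64 global_rsv = SZ_512M;                       /* capped reserve */

                return unthrottled_ops + scrub_locked + global_rsv + user_extra;
        }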

> The metadata allocation policy instead is still constrained only to have
> enough space. As a further step (to allow a metadata balance command to succeed), we
> could constrain the metadata allocation to use up to half of the
> available space (or 1 GB, whichever is smaller).

That's way more complicated than what we do now.  Now, we never balance
metadata, occasionally balance a few chunks of data, and btrfs usually
ends up with the right mix of block groups by itself over time.

During RAID reshapes one must plan for where metadata is going to
live after all the devices are added and removed.  Sometimes we use
metadata_ratio after these operations to bring the metadata chunks back
up again.

I do like the "never allocate the last N GB except during balance" idea.
Currently we do something similar with fi resize--after mkfs, we resize
the btrfs devices about 1% smaller than the physical device, to provide
emergency free space.  We've only ever needed it once, to recover a
filesystem where someone accidentally ran a metadata balance a few months
earlier, and the ratios got all, for want of a better word, unbalanced.

Arguably this 2) stuff is all fine and we don't need to change anything.
Users have all the tools they need, so we might only need better
manuals and maybe more automation for common cases.

> Regarding 1), I prefer to keep the patch as simple as possible to increase
> the likelihood of inclusion. We can add further constraints later.

I'd prefer...otherwise?  I've already had to implement the strict mode to
make simple test cases work.  The requirement seems uncontroversial to me.

I set up one test server last week and tried to do a metadata balance
with non-strict preferred metadata.  Before metadata balance could
relocate any metadata to the SSDs, the SSDs were all filled with data.
I had to implement strict preference before preferred_metadata would
work at all on that server.

At the moment I'm running this (written before my earlier email, it doesn't
consider the possibility of non-strict preferred metadata):

@@ -4976,15 +5030,47 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
                devices_info[ndevs].max_avail = max_avail;
                devices_info[ndevs].total_avail = total_avail;
                devices_info[ndevs].dev = device;
+               devices_info[ndevs].preferred_metadata = !!(device->type &
+                       BTRFS_DEV_PREFERRED_METADATA);
+               if (devices_info[ndevs].preferred_metadata)
+                       nr_preferred_metadata++;
                ++ndevs;
        }
        ctl->ndevs = ndevs;

+       BUG_ON(nr_preferred_metadata > ndevs);
        /*
         * now sort the devices by hole size / available space
         */
-       sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-            btrfs_cmp_device_info, NULL);
+       if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+           (ctl->type & BTRFS_BLOCK_GROUP_METADATA)
+           ) {
+               /* mixed bg or PREFERRED_METADATA not set */
+               sort(devices_info, ctl->ndevs, sizeof(struct btrfs_device_info),
+                            btrfs_cmp_device_info, NULL);
+       } else {
+               /*
+                * If PREFERRED_METADATA is set, sort the device
+                * considering also the kind (preferred_metadata or
+                * not). Limit the availables devices to the ones of the
+                * requested kind, to prevent metadata appearing on a
+                * non-preferred device, or data appearing on a preferred
+                * device.
+                */
+               if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+                       int nr_data = ctl->ndevs - nr_preferred_metadata;
+                       sort(devices_info, ctl->ndevs,
+                                    sizeof(struct btrfs_device_info),
+                                    btrfs_cmp_device_info_data, NULL);
+                       ctl->ndevs = nr_data;
+               } else { /* non data -> metadata and system */
+                       sort(devices_info, ctl->ndevs,
+                                    sizeof(struct btrfs_device_info),
+                                    btrfs_cmp_device_info_metadata, NULL);
+                       if (nr_preferred_metadata)
+                               ctl->ndevs = nr_preferred_metadata;
+               }
+       }

        return 0;
 }

and it works OK, except df misreports free space because it's counting
unallocated space on metadata drives as space available for data.
I'll need to fix that eventually, though the error is less than 0.5% on
the filesystems where preferred_metadata is most important--and far less
than the errors that already happen during btrfs raid conversions anyway.

> Anyway, I am rebasing the patch onto the latest kernel. Let me check how complex
> it would be to implement your algorithm (the two-bit one).

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-01-16  0:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-28 18:34 [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 1/4] Add an ioctl to set/retrive the device properties Goffredo Baroncelli
2020-05-28 22:03   ` Hans van Kranenburg
2020-05-28 18:34 ` [PATCH 2/4] Add flags for dedicated metadata disks Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 3/4] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2020-05-28 18:34 ` [PATCH 4/4] btrfs: add preferred_metadata mode Goffredo Baroncelli
2020-05-28 22:02   ` Hans van Kranenburg
2020-05-29 16:26     ` Goffredo Baroncelli
2020-05-28 21:59 ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Hans van Kranenburg
2020-05-29 16:37   ` Goffredo Baroncelli
2020-05-30 11:44     ` Zygo Blaxell
2020-05-30 11:51       ` Goffredo Baroncelli
2021-01-08  1:05 ` Zygo Blaxell
2021-01-08 17:30   ` Goffredo Baroncelli
2021-01-08 17:43     ` BTRFS and *CACHE setup [was Re: [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata] Goffredo Baroncelli
2021-01-09 21:23     ` [RFC][PATCH V4] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
2021-01-10 19:55       ` Goffredo Baroncelli
2021-01-16  0:25         ` Zygo Blaxell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).