* [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
@ 2021-01-17 18:54 Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 1/5] Add an ioctl to set the device properties Goffredo Baroncelli
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell


Hi all,

This is an RFC; I wrote this patch because I find the idea interesting
even though it adds more complication to the chunk allocator.

The basic idea is to store the metadata chunks on the fastest disks.
The fastest disks are marked by the "preferred_metadata" flag.

When BTRFS allocates a new metadata/system chunk, it selects the
"preferred_metadata" disks; for data chunks it selects the non
"preferred_metadata" disks. The initial patch allowed the other kind
of disk to be used in case one set was full.

This patch set is based on v5.11-rc2.

For now, the only user of this patch that I am aware of is Zygo.
However, he asked to further constrain the allocation, i.e. to avoid
allocating metadata on a non "preferred_metadata" disk. So I extended
the patch, adding 4 modes of operation.

This is enabled by passing the option "preferred_metadata=<mode>" at
mount time.

There are 4 modes:
- preferred_metadata=disabled
  The standard allocator is used.

- preferred_metadata=soft
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  If the space isn't enough, the other kind of disks may be used.

- preferred_metadata=hard
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  If the space isn't enough, a "no space left" error is raised; the
  other kind of disks is never used.

- preferred_metadata=metadata
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  For metadata, if the space isn't enough, the other kind of disks
  may be used.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  For data, if the space isn't enough, a "no space left" error is
  raised; the other kind of disks is never used.
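
For reference, here is a minimal user-space sketch (in C) of the
selection logic that the four modes above imply. It is an illustration
only, not code from the patches: ndevs, nr_preferred and devs_min
mirror names used in patch 5/5, while the enum and the helper are
hypothetical.

/*
 * How many of the devices may host this chunk, assuming the device
 * list was already sorted so that the "right kind" for the chunk
 * (preferred_metadata for metadata, the others for data) comes first.
 */
enum pm_mode { PM_DISABLED, PM_SOFT, PM_HARD, PM_METADATA };

int usable_devs(enum pm_mode mode, int is_data,
		int ndevs, int nr_preferred, int devs_min)
{
	/* devices of the same kind as the chunk sort first */
	int same_kind = is_data ? ndevs - nr_preferred : nr_preferred;

	switch (mode) {
	case PM_DISABLED:
		return ndevs;			/* standard allocator */
	case PM_HARD:
		return same_kind;		/* never spill over */
	case PM_METADATA:
		if (is_data)
			return same_kind;	/* "hard" for data */
		/* fall through: "soft" for metadata */
	case PM_SOFT:
	default:
		/* spill over only if the same kind is not enough */
		return same_kind >= devs_min ? same_kind : ndevs;
	}
}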


A separate patch set has been sent to extend the "btrfs property"
command to support the preferred_metadata device flag. The basic usage is:

    $ # set a new value
    $ sudo btrfs property set /dev/vde preferred_metadata 1

    $ # get the current value
    $ sudo btrfs property get /dev/vde preferred_metadata
    devid=4, path=/dev/vde: dedicated_metadata=1

Some examples: (/dev/sd[abc] are marked as preferred_metadata,
and /dev/sd[ef] are not)

Non striped profile: metadata->raid1, data->raid1
The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].

If the mode is "soft" or "disabled", when /dev/sd[ef] are full the data
chunks are also allocated on /dev/sd[abc]. Otherwise -ENOSPC is raised.

Striped profile: metadata->raid6, data->raid6
raid6 requires at minimum 3 disks, so /dev/sd[ef] are not enough for a
raid6 data profile. If the mode is "hard" or "metadata", an error is
returned. Otherwise, to allow data chunk allocation, the raid6 data
profile will be spread over all the disks /dev/sd[abcdef].

The raid6 metadata profile, instead, will be allocated on /dev/sd[abc],
because these disks are enough to host the chunk.

When the disks /dev/sd[abc] are full, if the mode is "hard" an error is
raised; otherwise the other disks may be used.

The patch set is composed of five patches:

- The first patch adds the ioctl to update the btrfs_dev_item.type field.
The ioctl is generic enough to handle more fields, but for now only the
"type" field is supported.

- The second patch adds the flag BTRFS_DEV_PREFERRED_METADATA, which is
used to mark a device as "preferred_metadata".

- The third patch exports the btrfs_dev_item.type field via the sysfs
file /sys/fs/btrfs/<UUID>/devinfo/<devid>/type

The value is read-only. Updating it via sysfs is not implemented,
because in fs/btrfs/sysfs.c there is a comment that states:
"We don't want to do full transaction commit from inside sysfs".

- The fourth patch implements the mount option handling.

- The fifth patch implements the different allocation strategies.

Changelog:
v5: - add several modes to preferred_metadata mount option
v4: - renamed ssd_metadata to preferred_metadata
    - add the device property "preferred_metadata"
    - add the ioctl BTRFS_IOC_DEV_PROPERTIES
    - export the btrfs_dev_item.type values via sysfs
v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
      BTRFS_MOUNT_SSD_METADATA.
v2: - rebased to v5.6.2
    - correct the comparison about the rotational disks (>= instead of >)
    - add the flag rotational to the struct btrfs_device_info to
      simplify the comparison function (btrfs_cmp_device_info*() )
v1: - first issue

Below I collected some data to highlight the performance increment.

Test setup:
As a test, I performed a "dist-upgrade" of a Debian system from stretch
to buster. The test used an image of a Debian stretch [1] with the
needed packages already under /var/cache/apt/archives/ (so no
networking was involved). For each test I formatted the filesystem from
scratch, un-tarred the image and then ran "apt-get dist-upgrade" [2].
For each disk(s)/filesystem combination I measured the time of apt
dist-upgrade with and without the flag "force-unsafe-io", which reduces
the use of sync(2)/fsync(2). The ssd was 20GB, the hdd 230GB.

I considered the following scenarios:
- btrfs over ssd
- btrfs over ssd + hdd with my patch enabled
- btrfs over bcache over hdd+ssd
- btrfs over hdd (very, very slow....)
- ext4 over ssd
- ext4 over hdd

The test machine was an "AMD A6-6400K" with 4GB of RAM, of which 3GB
was used as cache/buffers.

Data analysis:

Of course btrfs is slower than ext4 when a lot of sync/flush operations
are involved. Using apt on a rotational disk was a dramatic experience,
and IMHO this could be avoided by using the btrfs snapshot capabilities
instead. But this is another (not easy) story.

Unsurprisingly, bcache performs better than my patch. But this is an
expected result, because it can also cache the data chunks (reads can
go directly to the ssd). bcache is about +60% slower than ssd-only when
there are a lot of sync/flush operations, and only +20% slower in the
other case.

Regarding the test with force-unsafe-io (fewer sync/flush operations),
my patch reduces the overhead over ssd-only from +256% (hdd only) to
+113%, which I consider a good result considering how small the patch
is.


Raw data:
The data below is the "real" time (as returned by the time command)
consumed by apt.


Test description          real (mmm:ss)   Delta %
--------------------      -------------   -------
btrfs hdd w/sync             142:38        +533%
btrfs ssd+hdd w/sync          81:04        +260%
ext4 hdd w/sync               52:39        +134%
btrfs bcache w/sync           35:59         +60%
btrfs ssd w/sync              22:31     reference
ext4 ssd w/sync               12:19         -45%



Test description          real (mmm:ss)   Delta %
--------------------      -------------   -------
btrfs hdd                     56:02        +256%
ext4 hdd                      51:32        +228%
btrfs ssd+hdd                 33:30        +113%
btrfs bcache                  18:57         +20%
btrfs ssd                     15:44     reference
ext4 ssd                      11:49         -25%


[1] I created the image using "debootstrap stretch", then installed a
set of packages using the commands:

  # debootstrap stretch test/
  # chroot test/
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # apt --option=Dpkg::Options::=--force-confold \
        --option=Dpkg::options::=--force-unsafe-io \
	install mate-desktop-environment* xserver-xorg vim \
        task-kde-desktop task-gnome-desktop

Then I updated the release from stretch to buster by changing the file
/etc/apt/sources.list, and downloaded the packages for the dist-upgrade:

  # apt-get update
  # apt-get --download-only dist-upgrade

Then I created a tar of this image.
Before the dist-upgrade, the space used was about 7GB, with 2281
packages installed. After the dist-upgrade, the space used was 9GB,
with 2870 packages. The upgrade installed/updated about 2251 packages.


[2] The actual command sequence was a bit more complex, to avoid an
interactive session:

  # mkfs.btrfs -m single -d single /dev/sdX
  # mount /dev/sdX test/
  # cd test
  # time tar xzf ../image.tgz
  # chroot .
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # export DEBIAN_FRONTEND=noninteractive
  # time apt-get -y --option=Dpkg::Options::=--force-confold \
	--option=Dpkg::options::=--force-unsafe-io dist-upgrade


BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




* [PATCH 1/5] Add an ioctl to set the device properties
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
@ 2021-01-17 18:54 ` Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 2/5] Add flags for dedicated metadata disks Goffredo Baroncelli
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

---
 fs/btrfs/ioctl.c           | 67 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c         |  2 +-
 fs/btrfs/volumes.h         |  2 ++
 include/uapi/linux/btrfs.h | 40 +++++++++++++++++++++++
 4 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 703212ff50a5..9e67741fa966 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4842,6 +4842,71 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
 	return ret;
 }
 
+static long btrfs_ioctl_dev_properties(struct file *file,
+						void __user *argp)
+{
+	struct inode *inode = file_inode(file);
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_ioctl_dev_properties dev_props;
+	struct btrfs_device *device;
+	struct btrfs_root *root = fs_info->chunk_root;
+	struct btrfs_trans_handle *trans;
+	int ret;
+	u64 prev_type;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&dev_props, argp, sizeof(dev_props)))
+		return -EFAULT;
+
+	device = btrfs_find_device(fs_info->fs_devices, dev_props.devid,
+				NULL, NULL);
+	if (!device) {
+		btrfs_info(fs_info, "change_dev_properties: unable to find device %llu",
+			   dev_props.devid);
+		return -ENODEV;
+	}
+
+	if (dev_props.properties & BTRFS_DEV_PROPERTY_READ) {
+		u64 props = dev_props.properties;
+		memset(&dev_props, 0, sizeof(dev_props));
+		if (props & BTRFS_DEV_PROPERTY_TYPE) {
+			dev_props.properties = BTRFS_DEV_PROPERTY_TYPE;
+			dev_props.type = device->type;
+		}
+		if (copy_to_user(argp, &dev_props, sizeof(dev_props)))
+			return -EFAULT;
+		return 0;
+	}
+
+	/* it is possible to set only BTRFS_DEV_PROPERTY_TYPE for now */
+	if (dev_props.properties & ~(BTRFS_DEV_PROPERTY_TYPE))
+		return -EPERM;
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	prev_type = device->type;
+	device->type = dev_props.type;
+	ret = btrfs_update_device(trans, device);
+
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		device->type = prev_type;
+		return ret;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		device->type = prev_type;
+
+	return ret;
+
+}
+
 static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
 {
 	struct btrfs_ioctl_send_args *arg;
@@ -5025,6 +5090,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_subvol_rootref(file, argp);
 	case BTRFS_IOC_INO_LOOKUP_USER:
 		return btrfs_ioctl_ino_lookup_user(file, argp);
+	case BTRFS_IOC_DEV_PROPERTIES:
+		return btrfs_ioctl_dev_properties(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ee086fc56c30..68b346c5465d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2744,7 +2744,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	return ret;
 }
 
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
 	int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..d776b7f55d56 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -595,5 +595,7 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+                                        struct btrfs_device *device);
 
 #endif
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5df73001aad4..e6caef42837a 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -860,6 +860,44 @@ struct btrfs_ioctl_get_subvol_rootref_args {
 		__u8 align[7];
 };
 
+#define BTRFS_DEV_PROPERTY_TYPE		(1ULL << 0)
+#define BTRFS_DEV_PROPERTY_DEV_GROUP	(1ULL << 1)
+#define BTRFS_DEV_PROPERTY_SEEK_SPEED	(1ULL << 2)
+#define BTRFS_DEV_PROPERTY_BANDWIDTH	(1ULL << 3)
+#define BTRFS_DEV_PROPERTY_READ		(1ULL << 60)
+
+/*
+ * The ioctl BTRFS_IOC_DEV_PROPERTIES can read and write the device properties.
+ *
+ * The properties that the user wants to write have to be set
+ * in the 'properties' field using the BTRFS_DEV_PROPERTY_xxxx constants.
+ *
+ * If the ioctl is used to read the device properties, the bit
+ * BTRFS_DEV_PROPERTY_READ has to be set in the 'properties' field.
+ * In this case the properties that the user wants have to be set in the
+ * 'properties' field. The kernel doesn't return a property that was not
+ * requested, however it may return a subset of the requested properties.
+ * The returned properties have the corresponding BTRFS_DEV_PROPERTY_xxxx
+ * flag set in the 'properties' field.
+ *
+ * As of 2020/05/11 the only property that can be read/written is the
+ * 'type' one.
+ */
+struct btrfs_ioctl_dev_properties {
+	__u64	devid;
+	__u64	properties;
+	__u64	type;
+	__u32	dev_group;
+	__u8	seek_speed;
+	__u8	bandwidth;
+
+	/*
+	 * for future expansion
+	 */
+	__u8	unused1[2];
+	__u64	unused2[4];
+};
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
 	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
@@ -988,5 +1026,7 @@ enum btrfs_err_code {
 				struct btrfs_ioctl_ino_lookup_user_args)
 #define BTRFS_IOC_SNAP_DESTROY_V2 _IOW(BTRFS_IOCTL_MAGIC, 63, \
 				struct btrfs_ioctl_vol_args_v2)
+#define BTRFS_IOC_DEV_PROPERTIES _IOW(BTRFS_IOCTL_MAGIC, 64, \
+				struct btrfs_ioctl_dev_properties)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
2.30.0
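
For illustration, a minimal user-space caller of the new ioctl could
look like the sketch below. It assumes a kernel and uapi headers
carrying this patch set; the mount point /mnt/btrfs and devid 4 are
placeholders.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>	/* BTRFS_IOC_DEV_PROPERTIES (this patch) */
#include <linux/btrfs_tree.h>	/* BTRFS_DEV_PREFERRED_METADATA (patch 2/5) */

int main(void)
{
	struct btrfs_ioctl_dev_properties props;
	int fd = open("/mnt/btrfs", O_RDONLY);	/* any path on the fs */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* read the 'type' property of devid 4 */
	memset(&props, 0, sizeof(props));
	props.devid = 4;
	props.properties = BTRFS_DEV_PROPERTY_READ | BTRFS_DEV_PROPERTY_TYPE;
	if (ioctl(fd, BTRFS_IOC_DEV_PROPERTIES, &props) < 0) {
		perror("BTRFS_IOC_DEV_PROPERTIES (read)");
		return 1;
	}
	printf("devid 4 type: 0x%llx\n", (unsigned long long)props.type);

	/*
	 * set the preferred_metadata bit; the kernel zeroes the struct
	 * on read, so devid has to be filled in again
	 */
	props.devid = 4;
	props.properties = BTRFS_DEV_PROPERTY_TYPE;
	props.type |= BTRFS_DEV_PREFERRED_METADATA;
	if (ioctl(fd, BTRFS_IOC_DEV_PROPERTIES, &props) < 0) {
		perror("BTRFS_IOC_DEV_PROPERTIES (write)");
		return 1;
	}
	close(fd);
	return 0;
}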



* [PATCH 2/5] Add flags for dedicated metadata disks
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 1/5] Add an ioctl to set the device properties Goffredo Baroncelli
@ 2021-01-17 18:54 ` Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 3/5] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

---
 include/uapi/linux/btrfs_tree.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 58d7cff9afb1..701ad550d596 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -361,6 +361,9 @@ struct btrfs_key {
 	__u64 offset;
 } __attribute__ ((__packed__));
 
+/* dev_item.type */
+#define BTRFS_DEV_PREFERRED_METADATA	(1ULL << 0)
+
 struct btrfs_dev_item {
 	/* the internal btrfs device id */
 	__le64 devid;
-- 
2.30.0



* [PATCH 3/5] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 1/5] Add an ioctl to set the device properties Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 2/5] Add flags for dedicated metadata disks Goffredo Baroncelli
@ 2021-01-17 18:54 ` Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 4/5] btrfs: add preferred_metadata option Goffredo Baroncelli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

---
 fs/btrfs/sysfs.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..594e8445fe21 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1416,11 +1416,22 @@ static ssize_t btrfs_devinfo_writeable_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, writeable, btrfs_devinfo_writeable_show);
 
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+					    struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, in_fs_metadata),
 	BTRFS_ATTR_PTR(devid, missing),
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, type),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
-- 
2.30.0
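
A trivial reader of the new attribute from user space could look like
this sketch (the filesystem UUID and the devid in the path are
placeholders):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/btrfs/"
		"11111111-2222-3333-4444-555555555555/devinfo/4/type";
	unsigned long long type;
	FILE *f = fopen(path, "r");

	/* the attribute prints "0x%08llx\n"; %llx accepts the 0x prefix */
	if (!f || fscanf(f, "%llx", &type) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("type = 0x%08llx, preferred_metadata = %llu\n",
	       type, type & 1ULL);
	return 0;
}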



* [PATCH 4/5] btrfs: add preferred_metadata option.
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2021-01-17 18:54 ` [PATCH 3/5] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
@ 2021-01-17 18:54 ` Goffredo Baroncelli
  2021-01-17 18:54 ` [PATCH 5/5] btrfs: add preferred_metadata mode mount option Goffredo Baroncelli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

Add the preferred_metadata mount option. This option accepts the
following values:

- disabled (default):  the disk preferred metadata flag is ignored
- soft:                the disk preferred metadata flag is a suggestion
                       about which disks the system may use
- metadata:            the disk preferred metadata flag is a suggestion
                       about which disks the system uses for metadata.
                       It is a mandatory requirement for data
- hard:                the disk preferred metadata flag is a mandatory
                       requirement for both metadata and data


Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h   | 14 ++++++++++++++
 fs/btrfs/disk-io.c |  2 ++
 fs/btrfs/super.c   | 27 +++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1d3c1e479f3d..af2e549ef1ab 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -570,6 +570,17 @@ enum btrfs_exclusive_operation {
 	BTRFS_EXCLOP_SWAP_ACTIVATE,
 };
 
+/*
+ * 'preferred_metadata' mode
+ */
+
+enum btrfs_preferred_metadata_mode {
+	BTRFS_PM_DISABLED,
+	BTRFS_PM_HARD,
+	BTRFS_PM_METADATA,
+	BTRFS_PM_SOFT
+};
+
 struct btrfs_fs_info {
 	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
 	unsigned long flags;
@@ -961,6 +972,9 @@ struct btrfs_fs_info {
 		u64 zoned;
 	};
 
+	/* preferred_metadata mode */
+	int preferred_metadata_mode;
+
 	/* Max size to emit ZONE_APPEND write command */
 	u64 max_zone_append_size;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 765deefda92b..c0d9e6572e63 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2794,6 +2794,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	fs_info->swapfile_pins = RB_ROOT;
 
 	fs_info->send_in_progress = 0;
+
+	fs_info->preferred_metadata_mode = BTRFS_PM_DISABLED;
 }
 
 static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block *sb)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 022f20810089..b7abc25c9637 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -359,6 +359,7 @@ enum {
 	Opt_thread_pool,
 	Opt_treelog, Opt_notreelog,
 	Opt_user_subvol_rm_allowed,
+	Opt_preferred_metadata,
 
 	/* Rescue options */
 	Opt_rescue,
@@ -432,6 +433,7 @@ static const match_table_t tokens = {
 	{Opt_treelog, "treelog"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_preferred_metadata, "preferred_metadata=%s"},
 
 	/* Rescue options */
 	{Opt_rescue, "rescue=%s"},
@@ -889,6 +891,23 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_user_subvol_rm_allowed:
 			btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
 			break;
+		case Opt_preferred_metadata:
+			if (!strcmp(args[0].from, "hard")) {
+				info->preferred_metadata_mode = BTRFS_PM_HARD;
+			} else if (!strcmp(args[0].from, "metadata")) {
+				info->preferred_metadata_mode = BTRFS_PM_METADATA;
+			} else if (!strcmp(args[0].from, "soft")) {
+				info->preferred_metadata_mode = BTRFS_PM_SOFT;
+			} else if (!strcmp(args[0].from, "disabled")) {
+				info->preferred_metadata_mode = BTRFS_PM_DISABLED;
+			} else  {
+				btrfs_err(info,
+					"unknown option '%s' in preferred_metadata",
+					args[0].from);
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
 		case Opt_enospc_debug:
 			btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
 			break;
@@ -1495,6 +1514,14 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",clear_cache");
 	if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
 		seq_puts(seq, ",user_subvol_rm_allowed");
+	if (info->preferred_metadata_mode == BTRFS_PM_HARD)
+		seq_puts(seq, ",preferred_metadata=hard");
+	else if (info->preferred_metadata_mode == BTRFS_PM_METADATA)
+		seq_puts(seq, ",preferred_metadata=metadata");
+	else if (info->preferred_metadata_mode == BTRFS_PM_SOFT)
+		seq_puts(seq, ",preferred_metadata=soft");
+	else if (info->preferred_metadata_mode == BTRFS_PM_DISABLED)
+		seq_puts(seq, ",preferred_metadata=disabled");
 	if (btrfs_test_opt(info, ENOSPC_DEBUG))
 		seq_puts(seq, ",enospc_debug");
 	if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.30.0
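
A hedged sketch of enabling the option programmatically via mount(2);
the device, mount point and mode below are placeholders (the shell
equivalent is "mount -o preferred_metadata=soft <dev> <dir>"):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* the option string is parsed by btrfs_parse_options() above */
	if (mount("/dev/vdb", "/mnt/btrfs", "btrfs", 0,
		  "preferred_metadata=soft") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}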



* [PATCH 5/5] btrfs: add preferred_metadata mode mount option
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (3 preceding siblings ...)
  2021-01-17 18:54 ` [PATCH 4/5] btrfs: add preferred_metadata option Goffredo Baroncelli
@ 2021-01-17 18:54 ` Goffredo Baroncelli
  2021-01-18  3:05   ` kernel test robot
  2021-01-19 23:12 ` [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-17 18:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Zygo Blaxell, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

When this mode is enabled, the chunk allocation policy is modified,
giving the disks different precedence depending on the chunk type.
A disk may be marked with the preferred_metadata flag to have a higher
chance of hosting metadata.

There are 4 modes:
- preferred_metadata=disabled
  The standard allocator is used.

- preferred_metadata=soft
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  If the space isn't enough, the other kind of disks may be used.

- preferred_metadata=hard
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  If the space isn't enough, a "no space left" error is raised; the
  other kind of disks is never used.

- preferred_metadata=metadata
  Metadata chunks are allocated on the disks marked with the
  "preferred_metadata" flag.
  For metadata, if the space isn't enough, the other kind of disks
  may be used.
  Data chunks are allocated on the disks not marked with the
  "preferred_metadata" flag.
  For data, if the space isn't enough, a "no space left" error is
  raised; the other kind of disks is never used.

To mark a disk as "preferred_metadata", use the command
# btrfs property set <disk> preferred_metadata 1

To remove the "preferred_metadata" flag from a disk, use the command
# btrfs property set <disk> preferred_metadata 0

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/volumes.c | 105 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h |   1 +
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 68b346c5465d..9dc3f2473805 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4824,6 +4824,56 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
 	return 0;
 }
 
+/*
+ * sort the devices in descending order by preferred_metadata,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* metadata -> preferred_metadata first */
+	if (di_a->preferred_metadata && !di_b->preferred_metadata)
+		return -1;
+	if (!di_a->preferred_metadata && di_b->preferred_metadata)
+		return 1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * sort the devices in descending order by !preferred_metadata,
+ * max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* data -> preferred_metadata last */
+	if (di_a->preferred_metadata && !di_b->preferred_metadata)
+		return 1;
+	if (!di_a->preferred_metadata && di_b->preferred_metadata)
+		return -1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4939,6 +4989,7 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 	int ndevs = 0;
 	u64 max_avail;
 	u64 dev_offset;
+	int nr_preferred_metadata = 0;
 
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -4991,15 +5042,65 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+		devices_info[ndevs].preferred_metadata = !!(device->type &
+			BTRFS_DEV_PREFERRED_METADATA);
+		if (devices_info[ndevs].preferred_metadata)
+			nr_preferred_metadata++;
 		++ndevs;
 	}
 	ctl->ndevs = ndevs;
 
+	BUG_ON(nr_preferred_metadata > ndevs);
 	/*
 	 * now sort the devices by hole size / available space
 	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
+	if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+	     (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
+	    info->preferred_metadata_mode == BTRFS_PM_DISABLED) {
+		/* mixed bg or PREFERRED_METADATA not set */
+		sort(devices_info, ctl->ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info, NULL);
+	} else {
+		/*
+		 * if PREFERRED_METADATA is set, sort the devices considering
+		 * also the kind (preferred_metadata or not). Limit the
+		 * available devices to the ones of the same kind, to avoid
+		 * that a striped profile, like raid5, spreads to all kinds
+		 * of devices.
+		 * It is allowed to use different kinds of devices (if the
+		 * ones of the same kind are not enough alone) in the
+		 * following cases:
+		 * - preferred_metadata_mode == BTRFS_PM_SOFT:
+		 *               use the devices of the same kind while these
+		 *               are enough. Otherwise it is allowed to
+		 *               use all the devices
+		 * - preferred_metadata_mode == BTRFS_PM_HARD
+		 *               use the devices of the same kind; if these
+		 *               are not enough, then an error will be raised
+		 * - preferred_metadata_mode == BTRFS_PM_METADATA
+		 *               metadata/system -> as BTRFS_PM_SOFT
+		 *               data -> as BTRFS_PM_HARD
+		 */
+		if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+			int nr_data = ctl->ndevs - nr_preferred_metadata;
+			sort(devices_info, ctl->ndevs,
+				     sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_data, NULL);
+			if (info->preferred_metadata_mode == BTRFS_PM_HARD ||
+			    info->preferred_metadata_mode == BTRFS_PM_METADATA)
+				ctl->ndevs = nr_data;
+			else if (nr_data >= ctl->devs_min)
+				ctl->ndevs = nr_data;
+		} else { /* non data -> metadata and system */
+			sort(devices_info, ctl->ndevs,
+				     sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_metadata, NULL);
+			if (info->preferred_metadata_mode == BTRFS_PM_HARD)
+				ctl->ndevs = nr_preferred_metadata;
+			else if (nr_preferred_metadata >= ctl->devs_min)
+				ctl->ndevs = nr_preferred_metadata;
+		}
+	}
 
 	return 0;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d776b7f55d56..fc8da51e739b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -364,6 +364,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int preferred_metadata:1;
 };
 
 struct btrfs_raid_attr {
-- 
2.30.0



* Re: [PATCH 5/5] btrfs: add preferred_metadata mode mount option
  2021-01-17 18:54 ` [PATCH 5/5] btrfs: add preferred_metadata mode mount option Goffredo Baroncelli
@ 2021-01-18  3:05   ` kernel test robot
  0 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2021-01-18  3:05 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: kbuild-all, Zygo Blaxell, Goffredo Baroncelli

[-- Attachment #1: Type: text/plain, Size: 2812 bytes --]

Hi Goffredo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on v5.11-rc3 next-20210115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Goffredo-Baroncelli/Add-an-ioctl-to-set-the-device-properties/20210118-035442
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: arm64-randconfig-s032-20210118 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/6e3781fa16beb22201a3dac33326357fc0a0b7b9
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Goffredo-Baroncelli/Add-an-ioctl-to-set-the-device-properties/20210118-035442
        git checkout 6e3781fa16beb22201a3dac33326357fc0a0b7b9
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <rong.a.chen@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   fs/btrfs/tests/btrfs-tests.c: note: in included file:
>> fs/btrfs/tests/../volumes.h:367:33: sparse: sparse: dubious one-bit signed bitfield
--
   fs/btrfs/tests/inode-tests.c: note: in included file:
>> fs/btrfs/tests/../volumes.h:367:33: sparse: sparse: dubious one-bit signed bitfield
--
   fs/btrfs/tests/extent-map-tests.c: note: in included file:
>> fs/btrfs/tests/../volumes.h:367:33: sparse: sparse: dubious one-bit signed bitfield

vim +367 fs/btrfs/tests/../volumes.h

cea9e4452ebaf18d Chris Mason         2008-04-09  361  
b2117a39fa96cf48 Miao Xie            2011-01-05  362  struct btrfs_device_info {
b2117a39fa96cf48 Miao Xie            2011-01-05  363  	struct btrfs_device *dev;
b2117a39fa96cf48 Miao Xie            2011-01-05  364  	u64 dev_offset;
b2117a39fa96cf48 Miao Xie            2011-01-05  365  	u64 max_avail;
73c5de0051533cbd Arne Jansen         2011-04-12  366  	u64 total_avail;
6e3781fa16beb222 Goffredo Baroncelli 2021-01-17 @367  	int preferred_metadata:1;
b2117a39fa96cf48 Miao Xie            2011-01-05  368  };
b2117a39fa96cf48 Miao Xie            2011-01-05  369  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 41661 bytes --]



* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (4 preceding siblings ...)
  2021-01-17 18:54 ` [PATCH 5/5] btrfs: add preferred_metadata mode mount option Goffredo Baroncelli
@ 2021-01-19 23:12 ` Zygo Blaxell
  2021-01-21  8:31   ` Martin Svec
  2021-01-20 16:02 ` Josef Bacik
  2021-01-25 15:21 ` Josef Bacik
  7 siblings, 1 reply; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-19 23:12 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sun, Jan 17, 2021 at 07:54:30PM +0100, Goffredo Baroncelli wrote:
> 
> Hi all,
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> The basic idea is to store the metadata chunks on the fastest disks.
> The fastest disks are marked by the "preferred_metadata" flag.
> 
> When BTRFS allocates a new metadata/system chunk, it selects the
> "preferred_metadata" disks; for data chunks it selects the non
> "preferred_metadata" disks. The initial patch allowed the other kind
> of disk to be used in case one set was full.
> 
> This patch set is based on v5.11-rc2.
> 
> For now, the only user of this patch that I am aware of is Zygo.
> However, he asked to further constrain the allocation, i.e. to avoid
> allocating metadata on a non "preferred_metadata" disk. So I extended
> the patch, adding 4 modes of operation.
> 
> This is enabled by passing the option "preferred_metadata=<mode>" at
> mount time.
> 
> There are 4 modes:
> - preferred_metadata=disabled
>   The standard allocator is used.
> 
> - preferred_metadata=soft
>   Metadata chunks are allocated on the disks marked with the
>   "preferred_metadata" flag.
>   Data chunks are allocated on the disks not marked with the
>   "preferred_metadata" flag.
>   If the space isn't enough, the other kind of disks may be used.
> 
> - preferred_metadata=hard
>   Metadata chunks are allocated on the disks marked with the
>   "preferred_metadata" flag.
>   Data chunks are allocated on the disks not marked with the
>   "preferred_metadata" flag.
>   If the space isn't enough, a "no space left" error is raised; the
>   other kind of disks is never used.
> 
> - preferred_metadata=metadata
>   Metadata chunks are allocated on the disks marked with the
>   "preferred_metadata" flag.
>   For metadata, if the space isn't enough, the other kind of disks
>   may be used.
>   Data chunks are allocated on the disks not marked with the
>   "preferred_metadata" flag.
>   For data, if the space isn't enough, a "no space left" error is
>   raised; the other kind of disks is never used.

I like this form of the preferred_metadata mount option even less than
the last one.  Now we have 4 different mount option cases, and we still
can't specify some things that are possible with 4 device properties.
Ideally there's no mount option at all, and we handle everything with
properties.

e.g. if I had one fast but small NVME, one slow but large SATA SSD, and
one big spinning HDD, I can't set up the following with the mount option:

	- metadata only on NVME (type 3)
	- metadata preferred, data allowed on SSD (type 1)
	- data only on HDD (type 2)

The relationship between NVME and SSD is "metadata" but between NVME and
HDD it is "hard", which are conflicting mount options.  I can't specify
the relationship between SSD and HDD (metadata must not overflow, but
data is allowed to overflow) with any of the mount options, because you
have not provided a mount option where data may overflow onto preferred
(fast) disks while metadata must not use slow disks.

With the 4 device types we can trivially specify this arrangement.

The sorted device lists would be:

        Metadata sort order             Data sort order
        metadata only (3)               data only (2)
        metadata preferred (1)          data preferred (0)
        data preferred (0)              metadata preferred (1)
        other devices (2 or other)      other devices (3 or other)

We keep 3 device counts for the first 3 sort orders.  If the number of all
preferred devices (type != 0) is zero, we just return ndevs; otherwise,
we pick the first device count that is >= mindevs.  If all the device
counts are < mindevs then we return the 3rd count (metadata only +
metadata preferred + data preferred) and the caller will find ENOSPC.

More sophisticated future implementations can alter the sort order, or
operate in entirely separate parts of btrfs, without conflicting with
this scheme.  If there is no mount option, then future implementations
can't conflict with it.

> [...]

* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (5 preceding siblings ...)
  2021-01-19 23:12 ` [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
@ 2021-01-20 16:02 ` Josef Bacik
  2021-01-20 16:15   ` Johannes Thumshirn
                     ` (2 more replies)
  2021-01-25 15:21 ` Josef Bacik
  7 siblings, 3 replies; 26+ messages in thread
From: Josef Bacik @ 2021-01-20 16:02 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs; +Cc: Zygo Blaxell

On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> 
> Hi all,
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> > The basic idea is to store the metadata chunks on the fastest disks.
> > The fastest disks are marked by the "preferred_metadata" flag.
> > 
> > When BTRFS allocates a new metadata/system chunk, it selects the
> > "preferred_metadata" disks; for data chunks it selects the non
> > "preferred_metadata" disks. The initial patch allowed the other kind
> > of disk to be used in case one set was full.
> > 
> > This patch set is based on v5.11-rc2.
> > 
> > For now, the only user of this patch that I am aware of is Zygo.
> > However, he asked to further constrain the allocation, i.e. to avoid
> > allocating metadata on a non "preferred_metadata" disk. So I extended
> > the patch, adding 4 modes of operation.
> > 
> > This is enabled by passing the option "preferred_metadata=<mode>" at
> > mount time.
> 

I'll echo Zygo's hatred for mount options.  The more complicated policy 
decisions belong in properties and sysfs knobs, not mount options.

And then for the properties themselves, presumably we'll want to add other FS 
wide properties in the future.  I'm not against adding new actual keys and items 
to the tree itself, but is there a way we could use our existing property 
infrastructure that we use for compression, and simply store the xattrs in the 
tree root?  It looks like we're just toggling a policy decision, and we don't 
actually need the other properties in the item you've created, so why not just a 
btrfs.preferred_metadata property with the value stored in it, dropped into the 
tree_root so it can be read on mount?  Thanks,

Josef


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-20 16:02 ` Josef Bacik
@ 2021-01-20 16:15   ` Johannes Thumshirn
  2021-01-20 16:17     ` Josef Bacik
  2021-01-20 16:20   ` Zygo Blaxell
  2021-01-21 18:16   ` Goffredo Baroncelli
  2 siblings, 1 reply; 26+ messages in thread
From: Johannes Thumshirn @ 2021-01-20 16:15 UTC (permalink / raw)
  To: Josef Bacik, Goffredo Baroncelli, linux-btrfs; +Cc: Zygo Blaxell

On 20/01/2021 17:09, Josef Bacik wrote:
> I'm not against adding new actual keys and items 
> to the tree itself, but is there a way we could use our existing property 
> infrastructure that we use for compression, and simply store the xattrs in the 
> tree root?  It looks like we're just toggling a policy decision, and we don't 
> actually need the other properties in the item you've created, so why not just a 
> btrfs.preferred_metadata property with the value stored in it, dropped into the 
> tree_root so it can be read on mount?

+1 from my side as well.

I have a need for this in a planned future series, and I'd be more than
happy if someone else had already implemented it so I can jump on the
train there.


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-20 16:15   ` Johannes Thumshirn
@ 2021-01-20 16:17     ` Josef Bacik
  0 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2021-01-20 16:17 UTC (permalink / raw)
  To: Johannes Thumshirn, Goffredo Baroncelli, linux-btrfs; +Cc: Zygo Blaxell

On 1/20/21 11:15 AM, Johannes Thumshirn wrote:
> On 20/01/2021 17:09, Josef Bacik wrote:
>> I'm not against adding new actual keys and items
>> to the tree itself, but is there a way we could use our existing property
>> infrastructure that we use for compression, and simply store the xattrs in the
>> tree root?  It looks like we're just toggling a policy decision, and we don't
>> actually need the other properties in the item you've created, so why not just a
>> btrfs.preferred_metadata property with the value stored in it, dropped into the
>> tree_root so it can be read on mount?
> 
> +1 from my side as well.
> 
> I have the need for this in a planned/future series and I'd be more than happy if
> someone else would've already implemented it so I can jump on the train there.
> 

Same, hoping to encourage a grenade jumper,

Josef


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-20 16:02 ` Josef Bacik
  2021-01-20 16:15   ` Johannes Thumshirn
@ 2021-01-20 16:20   ` Zygo Blaxell
  2021-01-21 18:16   ` Goffredo Baroncelli
  2 siblings, 0 replies; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-20 16:20 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Goffredo Baroncelli, linux-btrfs

On Wed, Jan 20, 2021 at 11:02:41AM -0500, Josef Bacik wrote:
> On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> > 
> > Hi all,
> > 
> > This is an RFC; I wrote this patch because I find the idea interesting
> > even though it adds more complication to the chunk allocator.
> > 
> > The basic idea is to store the metadata chunks on the fastest disks.
> > The fastest disks are marked by the "preferred_metadata" flag.
> > 
> > When BTRFS allocates a new metadata/system chunk, it selects the
> > "preferred_metadata" disks; for data chunks it selects the non
> > "preferred_metadata" disks. The initial patch allowed the other kind
> > of disk to be used in case one set was full.
> > 
> > This patch set is based on v5.11-rc2.
> > 
> > For now, the only user of this patch that I am aware of is Zygo.
> > However, he asked to further constrain the allocation, i.e. to avoid
> > allocating metadata on a non "preferred_metadata" disk. So I extended
> > the patch, adding 4 modes of operation.
> > 
> > This is enabled by passing the option "preferred_metadata=<mode>" at
> > mount time.
> > 
> 
> I'll echo Zygo's hatred for mount options.  The more complicated policy
> decisions belong in properties and sysfs knobs, not mount options.

Well, I hated it for two distinct reasons.  Only one of those was the
mount option.

The feature is really a per-_device_ policy:  whether we put data or
metadata or both (*) on the device, and in what order we fill devices
with each.

There's nothing filesystem-wide about the feature, other than we might
want to ensure there are enough devices to do allocations with our
raid profiles (e.g. if there are 2 data-only disks and raid1 data,
don't allow one of those disks to become metadata-only).

I had considered a couple of other ideas but dropped them:

	- keep a filesystem-wide "off" feature.  If that makes sense at
	all, it would be as a mount option, e.g. if you had painted your
	filesystem into a corner with data-only devices and didn't have
	enough metadata space to do a device property set to fix it.
	I dropped the idea because we don't have an existing feature
	for any other scenario where that happens (e.g. adding one disk
	to a full raid1 array) and it should be possible to recover
	immediately after the next mount using reserved metadata space.

	- for each device, specify a priority for metadata and data
	separately, with the lowest priority meaning "never use this
	device for that kind of chunk".  That is a more general form
	of the 4-device-type form (metadata only, metadata preferred,
	data preferred, data only), but all of the additional device
	orders that arbitrary sorting order permits will tend to make
	data and metadata fight with each other, to the point that the
	ordering isn't useful.  On the other hand, maybe this idea does
	have a future as a kind of advanced "move the data where I tell
	you to" balancing tool for complicated RAID array reshapes.

> And then for the properties themselves, presumably we'll want to add other
> FS wide properties in the future.  I'm not against adding new actual keys
> and items to the tree itself, but is there a way we could use our existing
> property infrastructure that we use for compression, and simply store the
> xattrs in the tree root?  It looks like we're just toggling a policy
> decision, and we don't actually need the other properties in the item you've
> created, so why not just a btrfs.preferred_metadata property with the value
> stored in it, dropped into the tree_root so it can be read on mount?
> Thanks,
> 
> Josef

(*) I suppose logically there could be a "neither" option, but I'm not
sure what it would be for.


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-19 23:12 ` [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
@ 2021-01-21  8:31   ` Martin Svec
  0 siblings, 0 replies; 26+ messages in thread
From: Martin Svec @ 2021-01-21  8:31 UTC (permalink / raw)
  To: Zygo Blaxell, Goffredo Baroncelli, Josef Bacik; +Cc: linux-btrfs

Hi all,

Dne 20.01.2021 v 0:12 Zygo Blaxell napsal(a):
> With the 4 device types we can trivially specify this arrangement.
>
> The sorted device lists would be:
>
>         Metadata sort order             Data sort order
>         metadata only (3)               data only (2)
>         metadata preferred (1)          data preferred (0)
> 	data preferred (0)		metadata preferred (1)
>         other devices (2 or other)      other devices (3 or other)
>
> We keep 3 device counts for the first 3 sort orders.  If the number of all
> preferred devices (type != 0) is zero, we just return ndevs; otherwise,
> we pick the first device count that is >= mindevs.  If all the device
> counts are < mindevs then we return the 3rd count (metadata only +
> metadata preferred + data preferred) and the caller will find ENOSPC.
>
> More sophisticated future implementations can alter the sort order, or
> operate in entirely separate parts of btrfs, without conflicting with
> this scheme.  If there is no mount option, then future implementations
> can't conflict with it.
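
(For illustration only: a minimal C sketch of the selection logic quoted
above.  The function name, its parameters, and the cumulative counts[]
array are assumptions of this sketch, not code from the patch.)

	/*
	 * counts[] holds cumulative device counts for the first three
	 * tiers of the sort order (e.g. for metadata: metadata-only,
	 * + metadata-preferred, + data-preferred).
	 */
	static int pick_ndevs(int ndevs, int mindevs, int npreferred,
			      const int counts[3])
	{
		int i;

		/* no device carries a preference: legacy behaviour */
		if (npreferred == 0)
			return ndevs;

		/* pick the smallest tier set that satisfies mindevs */
		for (i = 0; i < 3; i++)
			if (counts[i] >= mindevs)
				return counts[i];

		/* nothing is big enough; the caller will hit ENOSPC */
		return counts[2];
	}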

I agree with Zygo and Josef that the mount option is ugly and needless. This should be a
_per-device_ setting as suggested by Zygo (metadata only, metadata preferred, data preferred, data
only, unspecified). Maybe in the future it might be useful to generalize this setting to something
like a 0..255 priority, but the 4 device types look like a sufficient solution for now. I would
personally prefer a read-write sysfs option to change the device preference, but the btrfs-progs
approach is fine for me too.

Anyway, I'm REALLY happy that there's finally a patchset being actively discussed. I have maintained
a naive patch implementing the "preferred_metadata=metadata" option for years, and the impact, e.g.
for rsync backups, is huge.

Martin




* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-20 16:02 ` Josef Bacik
  2021-01-20 16:15   ` Johannes Thumshirn
  2021-01-20 16:20   ` Zygo Blaxell
@ 2021-01-21 18:16   ` Goffredo Baroncelli
  2021-01-21 18:54     ` Zygo Blaxell
  2 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-21 18:16 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs; +Cc: Zygo Blaxell

On 1/20/21 5:02 PM, Josef Bacik wrote:
> On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
>>
>> Hi all,
>>
>> This is an RFC; I wrote this patch because I find the idea interesting
>> even though it adds more complication to the chunk allocator.
>>
>> The basic idea is to store the metadata chunk in the fasters disks.
>> The fasters disk are marked by the "preferred_metadata" flag.
>>
>> BTRFS when allocate a new metadata/system chunk, selects the
>> "preferred_metadata" disks, otherwise it selectes the non
>> "preferred_metadata" disks. The intial patch allowed to use the other
>> kind of disk in case a set is full.
>>
>> This patches set is based on v5.11-rc2.
>>
>> For now, the only user of this patch that I am aware is Zygo.
>> However he asked to further constraint the allocation: i.e. avoid to
>> allocated metadata on a not "preferred_metadata"
>> disk. So I extended the patch adding 4 modes to operate.
>>
>> This is enabled passing the option "preferred_metadata=<mode>" at
>> mount time.
>>
> 
> I'll echo Zygo's hatred for mount options.  The more complicated policy decisions belong in properties and sysfs knobs, not mount options.
> 
I tend to agree. However, adding a filesystem property can be done at a later stage; I don't think that is a problem. For now I prefer to keep the patch smaller.

Anyway I have to point out that we need a way to change the allocation policy without having to write metadata, otherwise we risk getting stuck in a loop of exhausted metadata space:
- how can we increase the space for metadata if we have no metadata space left, yet need to allocate a few metadata blocks to record the change?

What I mean is that even if we store the setting as a filesystem property (and we definitely should), we need a way to override it in an emergency scenario.

> And then for the properties themselves, presumably we'll want to add other FS wide properties in the future.  I'm not against adding new actual keys and items to the tree itself, but is there a way we could use our existing property infrastructure that we use for compression, and simply store the xattrs in the tree root?  It looks like we're just toggling a policy decision, and we don't actually need the other properties in the item you've created, so why not just a btrfs.preferred_metadata property with the value stored in it, dropped into the tree_root so it can be read on mount?  Thanks,

What if the root subvolume is not mounted? Yes, we could create a further API to store/retrieve this kind of metadata without mounting the root subvolume, but then how would that be different from adding a key to the root of the filesystem, as the default-subvolume ioctl does?

> 
> Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-21 18:16   ` Goffredo Baroncelli
@ 2021-01-21 18:54     ` Zygo Blaxell
  2021-01-22 18:31       ` Goffredo Baroncelli
  0 siblings, 1 reply; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-21 18:54 UTC (permalink / raw)
  To: kreijack; +Cc: Josef Bacik, linux-btrfs

On Thu, Jan 21, 2021 at 07:16:05PM +0100, Goffredo Baroncelli wrote:
> On 1/20/21 5:02 PM, Josef Bacik wrote:
> > On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> > > 
> > > Hi all,
> > > 
> > > This is an RFC; I wrote this patch because I find the idea interesting
> > > even though it adds more complication to the chunk allocator.
> > > 
> > > The basic idea is to store the metadata chunk in the fasters disks.
> > > The fasters disk are marked by the "preferred_metadata" flag.
> > > 
> > > BTRFS when allocate a new metadata/system chunk, selects the
> > > "preferred_metadata" disks, otherwise it selectes the non
> > > "preferred_metadata" disks. The intial patch allowed to use the other
> > > kind of disk in case a set is full.
> > > 
> > > This patches set is based on v5.11-rc2.
> > > 
> > > For now, the only user of this patch that I am aware is Zygo.
> > > However he asked to further constraint the allocation: i.e. avoid to
> > > allocated metadata on a not "preferred_metadata"
> > > disk. So I extended the patch adding 4 modes to operate.
> > > 
> > > This is enabled passing the option "preferred_metadata=<mode>" at
> > > mount time.
> > > 
> > 
> > I'll echo Zygo's hatred for mount options.  The more complicated policy decisions belong in properties and sysfs knobs, not mount options.
> > 
> I tend to agree. However adding a filesystem property can be done in a second time. I don't think that this a problem. However I prefer to make the patch smaller.
> 
> Anyway I have to point out that we need a way to change the allocation
> policy without changing the metadata otherwise we risk to be in the
> loop of exhausting metadata space: - how we can increase the space for
> metadata if we don't have space for metadata but I need to allocate
> few block of metadata....
> 
> What I mean is that even if we store the setting as filesystem
> properties (and definitely we have to do), we need a way to override
> in an emergency scenario.

There are no new issues introduced by this change, thus no requirement
for a mount option to deal with new issues.

The same issue comes up when changing RAID profile, or removing devices,
or when existing devices simply fill up.  Part of the solution is the
global reserve, which ensures we can always create a transaction to modify
a few metadata pages.

Part of the solution is a run-time check to ensure we have min_devs for
active RAID profiles whenever we change a device policy to reject data
or metadata (see btrfs_check_raid_min_devices).  This is currently
implemented for the device remove ioctl, and a similar check will
be needed for the device property set ioctl for preferred_metadata.
That part is missing in v5 of this patch and will have to be added,
though even now it works most of the time without it.

v5 is also missing changes to the df statvfs code to deal with metadata
only devices.  At this stage it's an RFC patch, so that's OK, but it
will also need to be fixed.  We presume these will be addressed in future
versions.  Again, it works now, but 'df' will give the wrong number.

None of the above requirements is addressed by a mount option, and
the mount option adds new requirements that we don't want.

> > And then for the properties themselves, presumably we'll want to
> add other FS wide properties in the future.  I'm not against adding
> new actual keys and items to the tree itself, but is there a way
> we could use our existing property infrastructure that we use for
> compression, and simply store the xattrs in the tree root?  It looks
> like we're just toggling a policy decision, and we don't actually
> need the other properties in the item you've created, so why not
> just a btrfs.preferred_metadata property with the value stored in it,
> dropped into the tree_root so it can be read on mount?  Thanks,
> 
> What if the root subvolume is not mounted ? 

Same as device add or remove--if the filesystem isn't mounted, you can't
make any changes.

Note that all the required properties are per-device, so really you just
need any open FD on the filesystem.  (I think Josef didn't read that far
down).

The per-device policy storage can go in dev_root (tree 4) along with the
device stats data, if we don't want to use btrfs_device::type.  You'd still
need an ioctl to get to it.

Or maybe I'm misreading Josef here, and his idea is to make the per-device
configuration a string blob that can be set by putting an xattr on the
root subvol?  I'm not sure that's better, but it'll work.

> Yes we can create a further
> api to store/retrive this kind of metadata without mounting the root
> subvolume, but doing so in what it would be different than adding a
> key to the root fs like the default subvolume ioctl does ?

> > 
> > Josef
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-21 18:54     ` Zygo Blaxell
@ 2021-01-22 18:31       ` Goffredo Baroncelli
  2021-01-22 22:42         ` Zygo Blaxell
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-22 18:31 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Josef Bacik, linux-btrfs

On 1/21/21 7:54 PM, Zygo Blaxell wrote:
> On Thu, Jan 21, 2021 at 07:16:05PM +0100, Goffredo Baroncelli wrote:
>> On 1/20/21 5:02 PM, Josef Bacik wrote:
>>> On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
>>>>
>>>> Hi all,
>>>>
>>>> This is an RFC; I wrote this patch because I find the idea interesting
>>>> even though it adds more complication to the chunk allocator.
>>>>
>>>> The basic idea is to store the metadata chunk in the fasters disks.
>>>> The fasters disk are marked by the "preferred_metadata" flag.
>>>>
>>>> BTRFS when allocate a new metadata/system chunk, selects the
>>>> "preferred_metadata" disks, otherwise it selectes the non
>>>> "preferred_metadata" disks. The intial patch allowed to use the other
>>>> kind of disk in case a set is full.
>>>>
>>>> This patches set is based on v5.11-rc2.
>>>>
>>>> For now, the only user of this patch that I am aware is Zygo.
>>>> However he asked to further constraint the allocation: i.e. avoid to
>>>> allocated metadata on a not "preferred_metadata"
>>>> disk. So I extended the patch adding 4 modes to operate.
>>>>
>>>> This is enabled passing the option "preferred_metadata=<mode>" at
>>>> mount time.
>>>>
>>>
>>> I'll echo Zygo's hatred for mount options.  The more complicated policy decisions belong in properties and sysfs knobs, not mount options.
>>>
>> I tend to agree. However adding a filesystem property can be done in a second time. I don't think that this a problem. However I prefer to make the patch smaller.
>>
>> Anyway I have to point out that we need a way to change the allocation
>> policy without changing the metadata otherwise we risk to be in the
>> loop of exhausting metadata space: - how we can increase the space for
>> metadata if we don't have space for metadata but I need to allocate
>> few block of metadata....
>>
>> What I mean is that even if we store the setting as filesystem
>> properties (and definitely we have to do), we need a way to override
>> in an emergency scenario.
> 
> There are no new issues introduced by this change, thus no requirement
> for a mount option to deal with new issues.
> 
> The same issue comes up when changing RAID profile, or removing devices,
> or when existing devices simply fill up.  Part of the solution is the
> global reserve, which ensures we can always create a transaction to modify
> a few metadata pages.
> 
> Part of the solution is a run-time check to ensure we have min_devs for
> active RAID profiles whenever we change a device policy to reject data
> or metadata (see btrfs_check_raid_min_devices).  This is currently
> implemented for the device remove ioctl, and a similar check will
> be needed for the device property set ioctl for preferred_metadata.
> That part is missing in v5 of this patch and will have to be added,
> though even now it works most of the time without.
> 
> v5 is also missing changes to the df statvfs code to deal with metadata
> only devices.  At this stage it's an RFC patch, so that's OK, but it
> will also need to be fixed.  We presume these will be addressed in future
> versions.  Again, it works now, but 'df' will give the wrong number.
> 
> None of the above requirements is addressed by a mount option, and
> the mount option adds new requirements that we don't want.
> 
>>> And then for the properties themselves, presumably we'll want to
>> add other FS wide properties in the future.  I'm not against adding
>> new actual keys and items to the tree itself, but is there a way
>> we could use our existing property infrastructure that we use for
>> compression, and simply store the xattrs in the tree root?  It looks
>> like we're just toggling a policy decision, and we don't actually
>> need the other properties in the item you've created, so why not
>> just a btrfs.preferred_metadata property with the value stored in it,
>> dropped into the tree_root so it can be read on mount?  Thanks,
>>
>> What if the root subvolume is not mounted ?
> 
> Same as device add or remove--if the filesystem isn't mounted, you can't
> make any changes.

I am referring to the case where a subvolume (id != 5) is mounted but the root one (id = 5) is not.
The point is that (e.g.) in the current implementation you can use

$ sudo btrfs property set /dev/vde preferred_metadata 1

whenever the filesystem is mounted (it doesn't matter which
subvolume).

I like Josef's idea: instead of developing a new API to retrieve/change/list/store/delete/create
these settings, we could use the xattr API, which already exists.

# adding properties
$ sudo setfattr -n "btrfs.metadata_preferred_disk=aabbcc" -v "mode=1"  /
$ sudo setfattr -n "btrfs.metadata_preferred_disk=aabbee" -v "mode=1"  /
$ sudo setfattr -n "btrfs.metadata_preferred" -v "1"  /

# listing properties and their values
$ sudo getfattr -d /
getfattr: Removing leading '/' from absolute path names
# file: .
btrfs.metadata_preferred_disk\075aabbcc="mode=1"
btrfs.metadata_preferred_disk\075aabbee="mode=1"
btrfs.metadata_preferred="1"

# removing a property
$ sudo setfattr -x "btrfs.metadata_preferred_disk=aabbcc" /

However, an xattr requires an inode to attach to. But when we are talking about
filesystem properties, which is the right inode? The only stable inode is the '.' of
the root subvolume. The other inodes may be deleted, so they are not suitable for storing
per-filesystem properties.

So the point is: what happens if the root subvolume is not mounted?

Of course we could tweak the xattr API to handle this case, but doing so would not be
so different from developing a new API, which (I think) is not what Josef suggested.

> 
> Note that all the required properties are per-device, so really you just
> need any open FD on the filesystem.  (I think Josef didn't read that far
> down).
> 
> The per-device policy storage can go in dev_root (tree 4) along with the
> device stats data, if we don't want to use btrfs_device::type.  You'd still
> need an ioctl to get to it.
> 
> Or maybe I'm misreading Josef here, and his idea is to make the per-device
> configuration a string blob that can be set by putting an xattr on the
> root subvol?  I'm not sure that's better, but it'll work.
> 
>> Yes we can create a further
>> api to store/retrive this kind of metadata without mounting the root
>> subvolume, but doing so in what it would be different than adding a
>> key to the root fs like the default subvolume ioctl does ?
> 
>>>
>>> Josef
>>
>>
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-22 18:31       ` Goffredo Baroncelli
@ 2021-01-22 22:42         ` Zygo Blaxell
  2021-01-23 14:55           ` Graham Cobb
  0 siblings, 1 reply; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-22 22:42 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Josef Bacik, linux-btrfs

On Fri, Jan 22, 2021 at 07:31:18PM +0100, Goffredo Baroncelli wrote:
> On 1/21/21 7:54 PM, Zygo Blaxell wrote:
> > On Thu, Jan 21, 2021 at 07:16:05PM +0100, Goffredo Baroncelli wrote:
> > > On 1/20/21 5:02 PM, Josef Bacik wrote:
> > > > On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> > > > > 
> > > > > Hi all,
> > > > > 
> > > > > This is an RFC; I wrote this patch because I find the idea interesting
> > > > > even though it adds more complication to the chunk allocator.
> > > > > 
> > > > > The basic idea is to store the metadata chunk in the fasters disks.
> > > > > The fasters disk are marked by the "preferred_metadata" flag.
> > > > > 
> > > > > BTRFS when allocate a new metadata/system chunk, selects the
> > > > > "preferred_metadata" disks, otherwise it selectes the non
> > > > > "preferred_metadata" disks. The intial patch allowed to use the other
> > > > > kind of disk in case a set is full.
> > > > > 
> > > > > This patches set is based on v5.11-rc2.
> > > > > 
> > > > > For now, the only user of this patch that I am aware is Zygo.
> > > > > However he asked to further constraint the allocation: i.e. avoid to
> > > > > allocated metadata on a not "preferred_metadata"
> > > > > disk. So I extended the patch adding 4 modes to operate.
> > > > > 
> > > > > This is enabled passing the option "preferred_metadata=<mode>" at
> > > > > mount time.
> > > > > 
> > > > 
> > > > I'll echo Zygo's hatred for mount options.  The more complicated policy decisions belong in properties and sysfs knobs, not mount options.
> > > > 
> > > I tend to agree. However adding a filesystem property can be done in a second time. I don't think that this a problem. However I prefer to make the patch smaller.
> > > 
> > > Anyway I have to point out that we need a way to change the allocation
> > > policy without changing the metadata otherwise we risk to be in the
> > > loop of exhausting metadata space: - how we can increase the space for
> > > metadata if we don't have space for metadata but I need to allocate
> > > few block of metadata....
> > > 
> > > What I mean is that even if we store the setting as filesystem
> > > properties (and definitely we have to do), we need a way to override
> > > in an emergency scenario.
> > 
> > There are no new issues introduced by this change, thus no requirement
> > for a mount option to deal with new issues.
> > 
> > The same issue comes up when changing RAID profile, or removing devices,
> > or when existing devices simply fill up.  Part of the solution is the
> > global reserve, which ensures we can always create a transaction to modify
> > a few metadata pages.
> > 
> > Part of the solution is a run-time check to ensure we have min_devs for
> > active RAID profiles whenever we change a device policy to reject data
> > or metadata (see btrfs_check_raid_min_devices).  This is currently
> > implemented for the device remove ioctl, and a similar check will
> > be needed for the device property set ioctl for preferred_metadata.
> > That part is missing in v5 of this patch and will have to be added,
> > though even now it works most of the time without.
> > 
> > v5 is also missing changes to the df statvfs code to deal with metadata
> > only devices.  At this stage it's an RFC patch, so that's OK, but it
> > will also need to be fixed.  We presume these will be addressed in future
> > versions.  Again, it works now, but 'df' will give the wrong number.
> > 
> > None of the above requirements is addressed by a mount option, and
> > the mount option adds new requirements that we don't want.
> > 
> > > > And then for the properties themselves, presumably we'll want to
> > > add other FS wide properties in the future.  I'm not against adding
> > > new actual keys and items to the tree itself, but is there a way
> > > we could use our existing property infrastructure that we use for
> > > compression, and simply store the xattrs in the tree root?  It looks
> > > like we're just toggling a policy decision, and we don't actually
> > > need the other properties in the item you've created, so why not
> > > just a btrfs.preferred_metadata property with the value stored in it,
> > > dropped into the tree_root so it can be read on mount?  Thanks,
> > > 
> > > What if the root subvolume is not mounted ?
> > 
> > Same as device add or remove--if the filesystem isn't mounted, you can't
> > make any changes.
> 
> I am referring to a case where a subvolume (id != 5) is mounted but not the root one (id=5).
> The point is that (e.g.) in the current implementation you can use
> 
> $ sudo btrfs property set /dev/vde preferred_metadata 1
> 
> in any case where the filesystem is mounted (doesn't matter which
> subvvolume).
> 
> I like the Josef idea: instead to develop a new api to retrive/change/list/store/delete/create
> some setting, we could use the xattr api which already exists.

Yeah, I've been thinking about that since yesterday, and I now think
it's a good idea if we can wave away some problems.  We don't need special
tools to set an xattr, and we can define new schema versions by changing
the attribute name.

My one lingering concern is what happens when the xattr is propagated
to other filesystems, e.g. by backups.  You might have very different
devices on the backup server than on the origin server, and you wouldn't
want this attribute to suddenly make the backup server unusably slow,
nor would you want the entire backup to fail because it can't set this
attribute to the new value on a backup filesystem because some of the
devids don't exist there.

I'm tempted to say "well you shouldn't back up the root subvol of a
btrfs anyway, and if you do, you should explicitly exclude this attribute"
and leave it at that, but there's probably someone out there with
working backup scripts that will be broken by this change.

Maybe a compromise solution is to keep the kernel-parsed string encoding,
but store that string in a persistent dev tree item so it doesn't show
up in any subvol's metadata.

> # adding properties
> $ sudo setfattr -n "btrfs.metadata_preferred_disk=aabbcc" -v "mode=1"  /
> $ sudo setfattr -n "btrfs.metadata_preferred_disk=aabbee" -v "mode=1"  /
> $ sudo setfattr -n "btrfs.metadata_preferred" -v "1"  /
> 
> # listing properties and their values
> $ sudo getfattr -d /
> getfattr: Removing leading '/' from absolute path names
> # file: .
> btrfs.metadata_preferred_disk\075aabbcc="mode=1"
> btrfs.metadata_preferred_disk\075aabbee="mode=1"
> btrfs.metadata_preferred="1"
> 
> # removing a properties
> $ sudo setfattr -x "btrfs.metadata_preferred_disk=aabbcc" /

If the specification is a single string blob, then we get atomic updates
for free--no need to figure out the right order of operations if you
suddenly want to flip metadata from devices 1,2,3 to 4,5,6 with raid1c3
metadata.  The kernel would just ACK or NAK the entire new string, and
tools could manipulate the whole string without trying to guess what
might make one order of changes work while another order fails.

Of course we can haggle for a few more weeks about what that string
should look like, but the mechanism of "configure this thing through a
kernel-parsed single string" seems OK.

I was thinking something like:

	-n btrfs.chunk_policy -v only_metadata=1:only_data=3:prefer_metadata=2

for device 1 = NVME, 2 = SSD, 3 = HDD, where the SSD is shared between
data and metadata, or

	-n btrfs.chunk_policy -v only_metadata=7,8:only_data=1,2,3,4,5,6

for a big NVME (7, 8) and HDD (1-6) array with no device sharing.

Any devid not listed in the string gets "prefer_data".

Metadata is allocated on devices listed with "only_metadata" first,
followed by "prefer_metadata" devices, followed by "prefer_data" devices.
No metadata is allocated on "only_data" devices.

Data is allocated on devices listed with "only_data" first, followed by
"prefer_data" devices, followed by "prefer_metadata" devices.  No data
is allocated on "only_metadata" devices.

The string may mention devids that do not exist in the filesystem.
This allows us to do 'dev remove' without changing the xattr string at
the same time (though we still would have to prevent 'dev remove' if this
makes it impossible to allocate new chunks of any existing raid profile).

A special devid "default" specifies the policy for devices that are not
otherwise listed.  This can be used to define what happens to new devices
before they are added, e.g.

	-n btrfs.chunk_policy -v only_metadata=1,2:only_data=default

Or, we move away from the "preferred" concept, and specify the device
order explicitly.

In the following syntax, "," separates device IDs, and ":" defines a
boundary between ordering tiers, where we fill every device before ":"
before any device after ":", but we let btrfs do its normal equalizing
fill algorithm with any device between ":".

The special device id "default" can be used to specify where unlisted
devices appear in the order, which might not be at the end.  If a
device is listed by ID for either data or metadata, it is excluded from
"default" in either string.  We might also define other shorthands like
"rotational" later, if some day the btrfs_device::type field ever does
mean anything useful.
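
(A toy user-space parser for the proposed syntax, purely to make the
grammar concrete: ":" separates ordering tiers, "," separates devids
within a tier, and "default" stands for all unlisted devices.  A real
in-kernel parser would obviously look different.)

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		char spec[] = "default:5,6:3,4:1,2";	/* example from below */
		char *save_tier, *tier = strtok_r(spec, ":", &save_tier);
		int n = 0;

		while (tier) {
			char *save_id, *id = strtok_r(tier, ",", &save_id);

			while (id) {
				if (strcmp(id, "default") == 0)
					printf("tier %d: all unlisted devices\n", n);
				else
					printf("tier %d: devid %ld\n", n,
					       strtol(id, NULL, 10));
				id = strtok_r(NULL, ",", &save_id);
			}
			n++;
			tier = strtok_r(NULL, ":", &save_tier);
		}
		return 0;
	}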

The NVME (1), SSD (2), HDD (3) example is:

	-n btrfs.device_fill_order.data     -v 3:2
	-n btrfs.device_fill_order.metadata -v 1:2

i.e. we put data on devid 3 first, then if that is full put data on
devid 2, while we put metadata on devid 1 first, then if that is full
put metadata on devid 2.  We never put metadata on devid 3 or data on
devid 1.  Any devices added to the filesystem don't get chunks at all
until someone changes the device_fill_order policy to say what to do
with the new devices.

The 2xNVME (7, 8) and 6xHDD (1-6) example could be:

	-n btrfs.device_fill_order.data     -v 1,2,3,4,5,6:default
	-n btrfs.device_fill_order.metadata -v 7,8

Note that since devids 7 and 8 are explicitly listed in metadata, they
are not covered by 'default' in data, and no data will be allocated
on devices 7 and 8.  The "," operator does not specify an order, so
btrfs will pick devices from 1-6 according to its equal-fill algorithm.
Any new devices added to the array will be filled with data only after
the existing devices are all full.

The following order might be useful if we add disks 2 at a time, and want
to rotate out the old ones.  We want the data to be preferentially
located on the devices that are added last, starting with devices that
have not been added yet:

	-n btrfs.device_fill_order.data     -v default:5,6:3,4:1,2
	-n btrfs.device_fill_order.metadata -v 7,8

This could also be (ab)used to get narrower stripes than the entire
array in striping profiles.  I'm not sure if that's good or bad.

This is equivalent to the absence of each xattr:

	-n btrfs.device_fill_order.data     -v default
	-n btrfs.device_fill_order.metadata -v default

This is preferred_metadata=soft with metadata on devid 1:

	-n btrfs.device_fill_order.data     -v default:1
	-n btrfs.device_fill_order.metadata -v 1:default

This is preferred_metadata=hard with metadata on devid 1:

	-n btrfs.device_fill_order.data     -v default
	-n btrfs.device_fill_order.metadata -v 1

This is preferred_metadata=metadata with metadata on devid 1, allowing
metadata to overflow onto data devices but not the opposite:

	-n btrfs.device_fill_order.data     -v default
	-n btrfs.device_fill_order.metadata -v 1:default

This is the "laptop with one SSD and one HDD" example, where we want only
metadata on the SSD but we will allow data to overflow from HDD to SSD:

	-n btrfs.device_fill_order.data     -v default:1
	-n btrfs.device_fill_order.metadata -v 1

Obviously we'd want to compile the ordering preferences into an attribute
attached to each device in the in-kernel device list, so the sort can just
order devices by the value of the attribute.  The attribute would be
recomputed on mount, device add, and changes to the xattr string (dev
remove and dev replace don't change the order, though they might require
special cases in the code for other reasons).
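
(Sketch of that arrangement, again with illustrative names only: the
rank is computed once when the policy changes, so the allocator just
sorts by it.)

	#include <stdlib.h>

	struct dev_order {
		unsigned long long devid;
		int rank;	/* precomputed from the policy string */
	};

	static int cmp_rank(const void *a, const void *b)
	{
		const struct dev_order *da = a, *db = b;

		return da->rank - db->rank;
	}

	/* at allocation time: qsort(devs, ndevs, sizeof(*devs), cmp_rank); */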

> However the xattr requires an inode to attach the .. xattrs. But when we are talking about
> filesystem properties, which is the right inode ? The only stable inode is the '.' of
> the root subvolume. The other inodes may be deleted so are not suitable to store per
> filesystem properties.
> 
> So the point is: what happens if the root subvolume is not mounted ?

It's not an onerous requirement to mount the root subvol.  You can do (*)

	tmp="$(mktemp -d)"
	mount -osubvolid=5 /dev/btrfs "$tmp"
	setfattr -n 'btrfs...' -v... "$tmp"
	umount "$tmp"
	rmdir "$tmp"

any time you need to change the policy, regardless of whether the root
subvol is mounted anywhere else at the time.

> Of course we can tweak the xattr api to deal this case, but doing so is not
> so different than developing a new api, which (I think) is not what Josef suggested.
> 
> > 
> > Note that all the required properties are per-device, so really you just
> > need any open FD on the filesystem.  (I think Josef didn't read that far
> > down).
> > 
> > The per-device policy storage can go in dev_root (tree 4) along with the
> > device stats data, if we don't want to use btrfs_device::type.  You'd still
> > need an ioctl to get to it.
> > 
> > Or maybe I'm misreading Josef here, and his idea is to make the per-device
> > configuration a string blob that can be set by putting an xattr on the
> > root subvol?  I'm not sure that's better, but it'll work.
> > 
> > > Yes we can create a further
> > > api to store/retrive this kind of metadata without mounting the root
> > > subvolume, but doing so in what it would be different than adding a
> > > key to the root fs like the default subvolume ioctl does ?
> > 
> > > > 
> > > > Josef
> > > 
> > > 
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

(*) but don't do exactly that, it will seriously and dangerously confuse
anything that thinks it owns /tmp.  Use some private directory under
/var or /mnt where there is no possibility of some cron job waking up
and deciding it needs to delete everything.


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-22 22:42         ` Zygo Blaxell
@ 2021-01-23 14:55           ` Graham Cobb
  2021-01-23 17:21             ` Zygo Blaxell
  0 siblings, 1 reply; 26+ messages in thread
From: Graham Cobb @ 2021-01-23 14:55 UTC (permalink / raw)
  To: Zygo Blaxell, Goffredo Baroncelli; +Cc: Josef Bacik, linux-btrfs

On 22/01/2021 22:42, Zygo Blaxell wrote:
...
>> So the point is: what happens if the root subvolume is not mounted ?
> 
> It's not an onerous requirement to mount the root subvol.  You can do (*)
> 
> 	tmp="$(mktemp -d)"
> 	mount -osubvolid=5 /dev/btrfs "$tmp"
> 	setfattr -n 'btrfs...' -v... "$tmp"
> 	umount "$tmp"
> 	rmdir "$tmp"

No! I may have other data on that disk which I do NOT want to become
accessible to users on this system (even for a short time). As a simple
example, imagine a disk I carry around to take emergency backups of
other systems, but I need to change this attribute to make the emergency
backup of this system run as quickly as possible before the system dies.
Or a disk used for audit trails, where users should not be able to
modify their earlier data. Or where I suspect a security breach has
occurred. I need to be able to be confident that the only data
accessible on this system is data in the specific subvolume I have mounted.

Also, the backup problem is a very real problem - abusing xattrs for
filesystem controls really screws up writing backup procedures to
correctly back up xattrs used to describe or manage data (or for any
other purpose).

I suppose btrfs can internally store it in an xattr if it wants, as long
as any values set are just ignored and changes happen through some other
operation (e.g. sysfs). It still might confuse programs like rsync which
would try to reset the values each time a sync is done.



* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-23 14:55           ` Graham Cobb
@ 2021-01-23 17:21             ` Zygo Blaxell
  2021-01-23 17:44               ` Graham Cobb
  0 siblings, 1 reply; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-23 17:21 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Goffredo Baroncelli, Josef Bacik, linux-btrfs

On Sat, Jan 23, 2021 at 02:55:52PM +0000, Graham Cobb wrote:
> On 22/01/2021 22:42, Zygo Blaxell wrote:
> ...
> >> So the point is: what happens if the root subvolume is not mounted ?
> > 
> > It's not an onerous requirement to mount the root subvol.  You can do (*)
> > 
> > 	tmp="$(mktemp -d)"
> > 	mount -osubvolid=5 /dev/btrfs "$tmp"
> > 	setfattr -n 'btrfs...' -v... "$tmp"
> > 	umount "$tmp"
> > 	rmdir "$tmp"
> 
> No! I may have other data on that disk which I do NOT want to become
> accessible to users on this system (even for a short time). As a simple
> example, imagine, a disk I carry around to take emergency backups of
> other systems, but I need to change this attribute to make the emergency
> backup of this system run as quickly as possible before the system dies.
> Or a disk used for audit trails, where users should not be able to
> modify their earlier data. Or where I suspect a security breach has
> occurred. I need to be able to be confident that the only data
> accessible on this system is data in the specific subvolume I have mounted.

Those are worthy goals, but to enforce them you'll have to block or filter
the mount syscall with one of the usual sandboxing/container methods.

If you have that already set up, you can change root subvol xattrs from
the supervisor side.  No users will have access if you don't give them
access to the root subvol or the mount syscall on the restricted side
(they might also need a block device FD belonging to the filesystem).

If you don't have the sandbox already set up, then root subvol access
is a thing your users can already do, and it may be time to revisit the
assumptions behind your security architecture.

> Also, the backup problem is a very real problem - abusing xattrs for
> filesystem controls really screws up writing backup procedures to
> correctly backup xattrs used to describe or manage data (or for any
> other purpose).
> 
> I suppose btrfs can internally store it in an xattr if it wants, as long
> as any values set are just ignored and changes happen through some other
> operation (e.g. sysfs). It still might confuse programs like rsync which
> would try to reset the values each time a sync is done.

I want to upgrade my "one lingering concern" about xattr to a "we really
shouldn't use an inode's xattr to describe device layout."

The layout configuration can still be a string (I kind of like the string,
it's more extensible than a binary blob, though we don't do balance that
way so I'm not married to the string idea) but that string should live
elsewhere, like in a dev tree persistent item where we currently keep
replace progress and dev stats.

It shouldn't be attached to an inode, even if it's easier to manage there.
Unlike all of btrfs's previous per-inode flags, per-device storage
preferences are not the kind of metadata you could move to another btrfs
filesystem and expect to work.

Even some of the previous btrfs inode attributes cause issues and need
to be filtered or translated when they are copied, e.g. nodatacow needs
to be _stored_ in backups so it can be restored to the original system,
but must not _affect_ backups so the backups have csums, dedupe, and/or
compression.  We don't want to add a new xattr that ends up having to
be filtered out of everything.


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-23 17:21             ` Zygo Blaxell
@ 2021-01-23 17:44               ` Graham Cobb
  2021-01-24  4:00                 ` Zygo Blaxell
  2021-01-24 20:05                 ` Goffredo Baroncelli
  0 siblings, 2 replies; 26+ messages in thread
From: Graham Cobb @ 2021-01-23 17:44 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Goffredo Baroncelli, Josef Bacik, linux-btrfs

On 23/01/2021 17:21, Zygo Blaxell wrote:
> On Sat, Jan 23, 2021 at 02:55:52PM +0000, Graham Cobb wrote:
>> On 22/01/2021 22:42, Zygo Blaxell wrote:
>> ...
>>>> So the point is: what happens if the root subvolume is not mounted ?
>>>
>>> It's not an onerous requirement to mount the root subvol.  You can do (*)
>>>
>>> 	tmp="$(mktemp -d)"
>>> 	mount -osubvolid=5 /dev/btrfs "$tmp"
>>> 	setfattr -n 'btrfs...' -v... "$tmp"
>>> 	umount "$tmp"
>>> 	rmdir "$tmp"
>>
>> No! I may have other data on that disk which I do NOT want to become
>> accessible to users on this system (even for a short time). As a simple
>> example, imagine, a disk I carry around to take emergency backups of
>> other systems, but I need to change this attribute to make the emergency
>> backup of this system run as quickly as possible before the system dies.
>> Or a disk used for audit trails, where users should not be able to
>> modify their earlier data. Or where I suspect a security breach has
>> occurred. I need to be able to be confident that the only data
>> accessible on this system is data in the specific subvolume I have mounted.
> 
> Those are worthy goals, but to enforce them you'll have to block or filter
> the mount syscall with one of the usual sandboxing/container methods.
> 
> If you have that already set up, you can change root subvol xattrs from
> the supervisor side.  No users will have access if you don't give them
> access to the root subvol or the mount syscall on the restricted side
> (they might also need a block device FD belonging to the filesystem).
> 
> If you don't have the sandbox already set up, then root subvol access
> is a thing your users can already do, and it may be time to revisit the
> assumptions behind your security architecture.

I'm not talking about root. I mean unpriv'd users (who can't use mount)!
If I (as root) mount the whole disk, those users may be able to read or
modify files from parts of that disk to which they should not have
access. That may be why I am not mounting the whole disk in the first
place.

I gave a few very simple examples, but I can think of many more cases
where a disk may contain files which users might be able to access if
the disk was mounted (maybe the disk has subvols used by many different
systems but UIDs are not coordinated, or ...).  And, of course, if they
can open a FD during the brief time it is mounted, they can stop it
being unmounted again.

No. If I have chosen to mount just a subvol, it is because I don't want
to mount the whole disk.


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-23 17:44               ` Graham Cobb
@ 2021-01-24  4:00                 ` Zygo Blaxell
  2021-01-24 20:05                 ` Goffredo Baroncelli
  1 sibling, 0 replies; 26+ messages in thread
From: Zygo Blaxell @ 2021-01-24  4:00 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Goffredo Baroncelli, Josef Bacik, linux-btrfs

On Sat, Jan 23, 2021 at 05:44:48PM +0000, Graham Cobb wrote:
> On 23/01/2021 17:21, Zygo Blaxell wrote:
> > On Sat, Jan 23, 2021 at 02:55:52PM +0000, Graham Cobb wrote:
> >> On 22/01/2021 22:42, Zygo Blaxell wrote:
> >> ...
> >>>> So the point is: what happens if the root subvolume is not mounted ?
> >>>
> >>> It's not an onerous requirement to mount the root subvol.  You can do (*)
> >>>
> >>> 	tmp="$(mktemp -d)"
> >>> 	mount -osubvolid=5 /dev/btrfs "$tmp"
> >>> 	setfattr -n 'btrfs...' -v... "$tmp"
> >>> 	umount "$tmp"
> >>> 	rmdir "$tmp"
> >>
> >> No! I may have other data on that disk which I do NOT want to become
> >> accessible to users on this system (even for a short time). As a simple
> >> example, imagine, a disk I carry around to take emergency backups of
> >> other systems, but I need to change this attribute to make the emergency
> >> backup of this system run as quickly as possible before the system dies.
> >> Or a disk used for audit trails, where users should not be able to
> >> modify their earlier data. Or where I suspect a security breach has
> >> occurred. I need to be able to be confident that the only data
> >> accessible on this system is data in the specific subvolume I have mounted.
> > 
> > Those are worthy goals, but to enforce them you'll have to block or filter
> > the mount syscall with one of the usual sandboxing/container methods.
> > 
> > If you have that already set up, you can change root subvol xattrs from
> > the supervisor side.  No users will have access if you don't give them
> > access to the root subvol or the mount syscall on the restricted side
> > (they might also need a block device FD belonging to the filesystem).
> > 
> > If you don't have the sandbox already set up, then root subvol access
> > is a thing your users can already do, and it may be time to revisit the
> > assumptions behind your security architecture.
> 
> I'm not talking about root. I mean unpriv'd users (who can't use mount)!
> If I (as root) mount the whole disk, those users may be able to read or
> modify files from parts of that disk to which they should not have
> access. That may be  why I am not mounting the whole disk in the first
> place.

So put the temporary mount point behind a mode 700 directory owned
by root, e.g.

	umask 077
	tmp="$(mktemp -d)"
	mkdir "$tmp/mp"
	mount -osubvolid=5 /dev/btrfs "$tmp/mp"
	setfattr -n 'btrfs...' -v... "$tmp/mp"
	umount "$tmp/mp"
	rmdir "$tmp/mp" "$tmp"

This doesn't seem like a new problem, or a problem that doesn't already
have lots of good solutions.

Anyway, I'm no longer in favor of using an xattr this way, for a number of
better reasons, so the point is moot.

> I gave a few very simple examples, but I can think of many more cases
> where a disk may contain files which users might be able to access if
> the disk was mounted (maybe the disk has subvols used by many different
> systems but UIDs are not coordinated, or ...).  And, of course, if they
> can open a FD during the brief time it is mounted, they can stop it
> being unmounted again.
> 
> No. If I have chosen to mount just a subvol, it is because I don't want
> to mount the whole disk.
> 


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-23 17:44               ` Graham Cobb
  2021-01-24  4:00                 ` Zygo Blaxell
@ 2021-01-24 20:05                 ` Goffredo Baroncelli
  1 sibling, 0 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2021-01-24 20:05 UTC (permalink / raw)
  To: Graham Cobb, Zygo Blaxell; +Cc: Josef Bacik, linux-btrfs

On 1/23/21 6:44 PM, Graham Cobb wrote:
[...]
> I gave a few very simple examples, but I can think of many more cases
> where a disk may contain files which users might be able to access if
> the disk was mounted (maybe the disk has subvols used by many different
> systems but UIDs are not coordinated, or ...).  And, of course, if they
> can open a FD during the brief time it is mounted, they can stop it
> being unmounted again.
> 
> No. If I have chosen to mount just a subvol, it is because I don't want
> to mount the whole disk.

I agree with Graham: if we have to mount the root subvolume, it means
that the API is not so good.
Moreover, as you also explained, the xattrs are "exposed" to the risk
of being copied by a simple rsync -X (or cp --preserve=all ...).

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
                   ` (6 preceding siblings ...)
  2021-01-20 16:02 ` Josef Bacik
@ 2021-01-25 15:21 ` Josef Bacik
  2023-01-15 17:00   ` Goffredo Baroncelli
  7 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2021-01-25 15:21 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs; +Cc: Zygo Blaxell

On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> 
> Hi all,
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> The basic idea is to store the metadata chunk in the fasters disks.
> The fasters disk are marked by the "preferred_metadata" flag.
> 
> BTRFS when allocate a new metadata/system chunk, selects the
> "preferred_metadata" disks, otherwise it selectes the non
> "preferred_metadata" disks. The intial patch allowed to use the other
> kind of disk in case a set is full.
> 
> This patches set is based on v5.11-rc2.
> 
> For now, the only user of this patch that I am aware is Zygo.
> However he asked to further constraint the allocation: i.e. avoid to
> allocated metadata on a not "preferred_metadata"
> disk. So I extended the patch adding 4 modes to operate.
> 

Ok this discussion is going in a few different directions, and there's a lot of 
moving parts here.  I don't want Goffredo to wander off and do V6 only to have 
us go off into the weeds on random particulars of how we think this thing is 
supposed to work.  To that end, I've opened a design issue in github

https://github.com/btrfs/btrfs-todo/issues/19

and filled out what I think are all the questions and points we've all brought 
up throughout these discussions.  Everybody please weigh in on the task, laying 
out what they think is the best way forward for some/all of these questions. 
Once we have an agreed upon design then Goffredo can go and do V6, and then we 
only have to argue about the code and not the design.  Thanks,

Josef


* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2021-01-25 15:21 ` Josef Bacik
@ 2023-01-15 17:00   ` Goffredo Baroncelli
  2023-01-15 17:05     ` Goffredo Baroncelli
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2023-01-15 17:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Zygo Blaxell, linux-btrfs

On 25/01/2021 16.21, Josef Bacik wrote:
[...]
> Ok this discussion is going in a few different directions, and there's a lot of moving parts here.  I don't want Goffredo to wander off and do V6 only to have us go off into the weeds on random particulars of how we think this thing is supposed to work.  To that end, I've opened a design issue in github
> 
> https://github.com/btrfs/btrfs-todo/issues/19
> 
> and filled out what I think are all the questions and points we've all brought up throughout these discussions.  Everybody please weigh in on the task, laying out what they think is the best way forward for some/all of these questions. Once we have an agreed upon design then Goffredo can go and do V6, and then we only have to argue about the code and not the design.  Thanks,
> 

Hi Josef,
do you think there will be interest in resurrecting these patches?

I saw some interest from MarkRose (see github https://github.com/btrfs/btrfs-todo/issues/19#issuecomment-1368377759)

> Josef

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2023-01-15 17:00   ` Goffredo Baroncelli
@ 2023-01-15 17:05     ` Goffredo Baroncelli
  2023-01-16  8:20       ` Paul Jones
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2023-01-15 17:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Zygo Blaxell, linux-btrfs

On 15/01/2023 18.00, Goffredo Baroncelli wrote:
> On 25/01/2021 16.21, Josef Bacik wrote:
> [...]
>> Ok this discussion is going in a few different directions, and there's a lot of moving parts here.  I don't want Goffredo to wander off and do V6 only to have us go off into the weeds on random particulars of how we think this thing is supposed to work.  To that end, I've opened a design issue in github
>>
>> https://github.com/btrfs/btrfs-todo/issues/19
>>
>> and filled out what I think are all the questions and points we've all brought up throughout these discussions.  Everybody please weigh in on the task, laying out what they think is the best way forward for some/all of these questions. Once we have an agreed upon design then Goffredo can go and do V6, and then we only have to argue about the code and not the design.  Thanks,
>>
> 
> Hi Josef,
> do you think that there will be interest in resurrecting this patches ?
> 
> I saw some form of interesting by MarkRose (see github https://github.com/btrfs/btrfs-todo/issues/19#issuecomment-1368377759)
> 
>> Josef

Ok, the latest update was the V12 version:
[PATCH 0/5][V12] btrfs: allocation_hint

https://lore.kernel.org/linux-btrfs/cover.1646589622.git.kreijack@inwind.it/

> 

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



* RE: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
  2023-01-15 17:05     ` Goffredo Baroncelli
@ 2023-01-16  8:20       ` Paul Jones
  0 siblings, 0 replies; 26+ messages in thread
From: Paul Jones @ 2023-01-16  8:20 UTC (permalink / raw)
  To: kreijack, Josef Bacik; +Cc: Zygo Blaxell, linux-btrfs

> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@libero.it>
> Sent: Monday, 16 January 2023 4:05 AM
> To: Josef Bacik <josef@toxicpanda.com>
> Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>; linux-
> btrfs@vger.kernel.org
> Subject: Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device
> for metadata
> 
> On 15/01/2023 18.00, Goffredo Baroncelli wrote:
> > On 25/01/2021 16.21, Josef Bacik wrote:
> > [...]
> >> Ok this discussion is going in a few different directions, and
> >> there's a lot of moving parts here.  I don't want Goffredo to wander
> >> off and do V6 only to have us go off into the weeds on random
> >> particulars of how we think this thing is supposed to work.  To that
> >> end, I've opened a design issue in github
> >>
> >> https://github.com/btrfs/btrfs-todo/issues/19
> >>
> >> and filled out what I think are all the questions and points we've
> >> all brought up throughout these discussions.  Everybody please weigh
> >> in on the task, laying out what they think is the best way forward
> >> for some/all of these questions. Once we have an agreed upon design
> >> then Goffredo can go and do V6, and then we only have to argue about
> >> the code and not the design.  Thanks,
> >>
> >
> > Hi Josef,
> > do you think that there will be interest in resurrecting this patches ?
> >
> > I saw some form of interesting by MarkRose (see github
> > https://github.com/btrfs/btrfs-todo/issues/19#issuecomment-
> 1368377759)
> >
> >> Josef
> 
> Ok, the latest updates was a V12 version:
> [PATCH 0/5][V12] btrfs: allocation_hint
> 
> https://lore.kernel.org/linux-btrfs/cover.1646589622.git.kreijack@inwind.it/


Just adding that I have been using this patchset since about V4 and it works beautifully to keep metadata on ssd and data on hdd (on raid1). Would be great to see it finally merged!


Paul.



Thread overview: 26+ messages
2021-01-17 18:54 [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Goffredo Baroncelli
2021-01-17 18:54 ` [PATCH 1/5] Add an ioctl to set the device properties Goffredo Baroncelli
2021-01-17 18:54 ` [PATCH 2/5] Add flags for dedicated metadata disks Goffredo Baroncelli
2021-01-17 18:54 ` [PATCH 3/5] Export dev_item.type in sysfs /sys/fs/btrfs/<uuid>/devinfo/<devid>/type Goffredo Baroncelli
2021-01-17 18:54 ` [PATCH 4/5] btrfs: add preferred_metadata option Goffredo Baroncelli
2021-01-17 18:54 ` [PATCH 5/5] btrfs: add preferred_metadata mode mount option Goffredo Baroncelli
2021-01-18  3:05   ` kernel test robot
2021-01-19 23:12 ` [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata Zygo Blaxell
2021-01-21  8:31   ` Martin Svec
2021-01-20 16:02 ` Josef Bacik
2021-01-20 16:15   ` Johannes Thumshirn
2021-01-20 16:17     ` Josef Bacik
2021-01-20 16:20   ` Zygo Blaxell
2021-01-21 18:16   ` Goffredo Baroncelli
2021-01-21 18:54     ` Zygo Blaxell
2021-01-22 18:31       ` Goffredo Baroncelli
2021-01-22 22:42         ` Zygo Blaxell
2021-01-23 14:55           ` Graham Cobb
2021-01-23 17:21             ` Zygo Blaxell
2021-01-23 17:44               ` Graham Cobb
2021-01-24  4:00                 ` Zygo Blaxell
2021-01-24 20:05                 ` Goffredo Baroncelli
2021-01-25 15:21 ` Josef Bacik
2023-01-15 17:00   ` Goffredo Baroncelli
2023-01-15 17:05     ` Goffredo Baroncelli
2023-01-16  8:20       ` Paul Jones
