* [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
@ 2020-04-05  8:26 Goffredo Baroncelli
  2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05  8:26 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui


Hi all,

This is an RFC; I wrote this patch because I find the idea interesting,
even though it adds more complexity to the chunk allocator.

The core idea is to store the metadata on the ssd and to leave the data
on the rotational disks. BTRFS looks at the rotational flag to
determine the kind of disk.

This new mode is enabled by passing the option ssd_metadata at mount time.
This allocation policy is the "preferred" one; if it doesn't permit a
chunk allocation, the "classic" one is used.

Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)

Non-striped profile: metadata->raid1, data->raid1
The data is stored on /dev/sd[ef], the metadata on /dev/sd[abc].
When /dev/sd[ef] are full, the data chunks are allocated also on
/dev/sd[abc].

Striped profile: metadata->raid6, data->raid6
raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
raid6 data profile. To allow a data chunk allocation, the raid6 data
chunks will be stored on all the disks /dev/sd[abcdef].
The raid6 metadata chunks, instead, will be allocated on /dev/sd[abc],
because these three disks are enough to host such a chunk.
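
For reference, a minimal way to try this mode on a kernel with this patch
applied is something like the following (the device names are only
placeholders):

  # mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sde
  # mount -o ssd_metadata /dev/sda /mnt
  # grep ssd_metadata /proc/mounts    # the option should be listed here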

Changelog:
v1: - first version
v2: - rebased to v5.6.2
    - corrected the comparison of the rotational disks (>= instead of >)
    - added the rotational flag to struct btrfs_device_info to
      simplify the comparison functions (btrfs_cmp_device_info*())
v3: - fixed the collision between BTRFS_MOUNT_DISCARD_ASYNC and
      BTRFS_MOUNT_SSD_METADATA.

Below I collected some data to highlight the performance increment.

Test setup:
As a test I performed a "dist-upgrade" of a Debian system from stretch to buster.
The test used an image of a Debian stretch installation [1] with the packages
needed already under /var/cache/apt/archives/ (so no networking was involved).
For each test I formatted the filesystem from scratch, un-tarred the
image and then ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
combination I measured the time of apt dist-upgrade with and
without the flag "force-unsafe-io", which reduces the number of sync/flush
calls. The ssd was 20GB, the hdd was 230GB.

I considered the following scenarios:
- btrfs over ssd
- btrfs over ssd + hdd with my patch enabled
- btrfs over bcache over hdd+ssd
- btrfs over hdd (very, very slow....)
- ext4 over ssd
- ext4 over hdd

The test machine was an "AMD A6-6400K" with 4GB of ram, of which 3GB was used
as cache/buffers.

Data analysis:

Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
Using apt on a rotational disk was a dramatic experience. IMHO this sync-heavy
approach should be replaced by using the btrfs snapshot capabilities, but this
is another (not easy) story.

Unsurprisingly, bcache performs better than my patch. But this is an expected
result, because it can also cache the data chunks (reads can go directly to
the ssd). bcache is about +60% slower than the ssd-only case when there are a
lot of sync/flush calls, and only +20% slower in the other case.

Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch
reduces the time from +256% (hdd only) to +113% relative to the ssd-only
reference, which I consider a good result considering how small the patch is.


Raw data:
The data below is the "real" time (as returned by the time command) consumed by
apt.


Test description          real (mmm:ss)   Delta %
--------------------      -------------   -------
btrfs hdd w/sync            142:38        +533%
btrfs ssd+hdd w/sync         81:04        +260%
ext4 hdd w/sync              52:39        +134%
btrfs bcache w/sync          35:59         +60%
btrfs ssd w/sync             22:31        reference
ext4 ssd w/sync              12:19         -45%



Test description          real (mmm:ss)   Delta %
--------------------      -------------   -------
btrfs hdd                    56:02        +256%
ext4 hdd                     51:32        +228%
btrfs ssd+hdd                33:30        +113%
btrfs bcache                 18:57         +20%
btrfs ssd                    15:44        reference
ext4 ssd                     11:49         -25%


[1] I created the image using "debootstrap stretch", then I installed a set
of packages using the commands:

  # debootstrap stretch test/
  # chroot test/
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # apt --option=Dpkg::Options::=--force-confold \
        --option=Dpkg::options::=--force-unsafe-io \
	install mate-desktop-environment* xserver-xorg vim \
        task-kde-desktop task-gnome-desktop

Then I updated the release from stretch to buster by changing the file /etc/apt/sources.list.
Then I downloaded the packages for the dist-upgrade:

  # apt-get update
  # apt-get --download-only dist-upgrade

Then I created a tar of this image.
Before the dist-upgrade, the space used was about 7GB with 2281
packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
The upgrade installed/updated about 2251 packages.


[2] The actual command sequence was a bit more complex, to avoid an interactive session:

  # mkfs.btrfs -m single -d single /dev/sdX
  # mount /dev/sdX test/
  # cd test
  # time tar xzf ../image.tgz
  # chroot .
  # mount -t proc proc proc
  # mount -t sysfs sys sys
  # export DEBIAN_FRONTEND=noninteractive
  # time apt-get -y --option=Dpkg::Options::=--force-confold \
	--option=Dpkg::options::=--force-unsafe-io dist-upgrade


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] btrfs: add ssd_metadata mode
  2020-04-05  8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
@ 2020-04-05  8:26 ` Goffredo Baroncelli
  2020-04-14  5:24   ` Paul Jones
  2020-10-23  7:23   ` Wang Yugui
  2020-04-05 10:57 ` [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Graham Cobb
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05  8:26 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui, Goffredo Baroncelli

From: Goffredo Baroncelli <kreijack@inwind.it>

When this mode is enabled, the chunk allocation policy is modified as
follows:
- allocation of a metadata chunk: priority is given to the ssd disks.
- allocation of a data chunk: priority is given to the rotational disks.

When a striped profile is involved (like RAID0,5,6), the logic
is a bit more complex. If there are enough disks, the data profiles
are stored on the rotational disks only, while the metadata profiles
are stored on the non-rotational disks only.
If there are not enough disks of one kind, then the profile is stored
on all the disks.

Example: assume that sda, sdb, sdc are ssd disks, and sde, sdf are
rotational ones.
A raid6 data profile will be stored on sda, sdb, sdc, sde, sdf (sde
and sdf alone are not enough to host a raid6 profile).
A raid6 metadata profile will be stored on sda, sdb, sdc (these
are enough to host a raid6 profile).

To enable this mode pass -o ssd_metadata at mount time.

Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  8 +++++
 fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h |  1 +
 4 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36df977b64d9..773c7f8b0b0d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
+#define BTRFS_MOUNT_SSD_METADATA	(1 << 30)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 67c63858812a..4ad14b0a57b3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -350,6 +350,7 @@ enum {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	Opt_ref_verify,
 #endif
+	Opt_ssd_metadata,
 	Opt_err,
 };
 
@@ -421,6 +422,7 @@ static const match_table_t tokens = {
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	{Opt_ref_verify, "ref_verify"},
 #endif
+	{Opt_ssd_metadata, "ssd_metadata"},
 	{Opt_err, NULL},
 };
 
@@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, REF_VERIFY);
 			break;
 #endif
+		case Opt_ssd_metadata:
+			btrfs_set_and_info(info, SSD_METADATA,
+					"enabling ssd_metadata");
+			break;
 		case Opt_err:
 			btrfs_info(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 #endif
 	if (btrfs_test_opt(info, REF_VERIFY))
 		seq_puts(seq, ",ref_verify");
+	if (btrfs_test_opt(info, SSD_METADATA))
+		seq_puts(seq, ",ssd_metadata");
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	seq_puts(seq, ",subvol=");
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..ffb2bc912c43 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
 	return 0;
 }
 
+/*
+ * sort the devices so that the non-rotational ones come first,
+ * then by descending max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* metadata -> non rotational first */
+	if (!di_a->rotational && di_b->rotational)
+		return -1;
+	if (di_a->rotational && !di_b->rotational)
+		return 1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * sort the devices so that the rotational ones come first,
+ * then by descending max_avail, total_avail
+ */
+static int btrfs_cmp_device_info_data(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	/* data -> non rotational last */
+	if (!di_a->rotational && di_b->rotational)
+		return 1;
+	if (di_a->rotational && !di_b->rotational)
+		return -1;
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	int nr_rotational;
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * about the available holes on each device.
 	 */
 	ndevs = 0;
+	nr_rotational = 0;
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 max_avail;
 		u64 dev_offset;
@@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		devices_info[ndevs].max_avail = max_avail;
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
+		devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+				&(bdev_get_queue(device->bdev)->queue_flags));
+		if (devices_info[ndevs].rotational)
+			nr_rotational++;
 		++ndevs;
 	}
 
+	BUG_ON(nr_rotational > ndevs);
 	/*
 	 * now sort the devices by hole size / available space
 	 */
-	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
-	     btrfs_cmp_device_info, NULL);
+	if (((type & BTRFS_BLOCK_GROUP_DATA) &&
+	     (type & BTRFS_BLOCK_GROUP_METADATA)) ||
+	    !btrfs_test_opt(info, SSD_METADATA)) {
+		/* mixed bg or SSD_METADATA not set */
+		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+			     btrfs_cmp_device_info, NULL);
+	} else {
+		/*
+		 * if SSD_METADATA is set, sort the devices considering also
+		 * their kind (ssd or not). Limit the available devices to the
+		 * ones of the same kind, to avoid a striped profile like raid5
+		 * spreading across all kinds of devices (ssd and rotational).
+		 * It is allowed to use different kinds of devices if the ones
+		 * of the same kind are not enough alone.
+		 */
+		if (type & BTRFS_BLOCK_GROUP_DATA) {
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_data, NULL);
+			if (nr_rotational >= devs_min)
+				ndevs = nr_rotational;
+		} else {
+			int nr_norot = ndevs - nr_rotational;
+			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+				     btrfs_cmp_device_info_metadata, NULL);
+			if (nr_norot >= devs_min)
+				ndevs = nr_norot;
+		}
+	}
 
 	/*
 	 * Round down to number of usable stripes, devs_increment can be any
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index f01552a0785e..285d71d54a03 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -343,6 +343,7 @@ struct btrfs_device_info {
 	u64 dev_offset;
 	u64 max_avail;
 	u64 total_avail;
+	int rotational:1;
 };
 
 struct btrfs_raid_attr {
-- 
2.26.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
  2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
@ 2020-04-05 10:57 ` Graham Cobb
  2020-04-05 18:47   ` Goffredo Baroncelli
  2020-04-06  2:24   ` Zygo Blaxell
  2020-05-29 16:06 ` Hans van Kranenburg
  2020-05-30  4:59 ` Qu Wenruo
  3 siblings, 2 replies; 28+ messages in thread
From: Graham Cobb @ 2020-04-05 10:57 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs

On 05/04/2020 09:26, Goffredo Baroncelli wrote:
...

> I considered the following scenarios:
> - btrfs over ssd
> - btrfs over ssd + hdd with my patch enabled
> - btrfs over bcache over hdd+ssd
> - btrfs over hdd (very, very slow....)
> - ext4 over ssd
> - ext4 over hdd
> 
> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> as cache/buff.
> 
> Data analysis:
> 
> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> by using the btrfs snapshot capabilities. But this is another (not easy) story.
> 
> Unsurprising bcache performs better than my patch. But this is an expected
> result because it can cache also the data chunk (the read can goes directly to
> the ssd). bcache perform about +60% slower when there are a lot of sync/flush
> and only +20% in the other case.
> 
> Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
> time from +256% to +113%  than the hdd-only . Which I consider a good
> results considering how small is the patch.
> 
> 
> Raw data:
> The data below is the "real" time (as return by the time command) consumed by
> apt
> 
> 
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd w/sync	   142:38	+533%
> btrfs ssd+hdd w/sync        81:04	+260%
> ext4 hdd w/sync	            52:39	+134%
> btrfs bcache w/sync	    35:59	 +60%
> btrfs ssd w/sync	    22:31	reference
> ext4 ssd w/sync	            12:19	 -45%

Interesting data but it seems to be missing the case of btrfs ssd+hdd
w/sync without your patch in order to tell what difference your patch
made. Or am I confused?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 10:57 ` [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Graham Cobb
@ 2020-04-05 18:47   ` Goffredo Baroncelli
  2020-04-05 21:58     ` Adam Borowski
  2020-04-06  2:24   ` Zygo Blaxell
  1 sibling, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-05 18:47 UTC (permalink / raw)
  To: Graham Cobb, linux-btrfs

On 4/5/20 12:57 PM, Graham Cobb wrote:
> On 05/04/2020 09:26, Goffredo Baroncelli wrote:
[...]
>>
>>
>> Test description      real (mmm:ss)	Delta %
>> --------------------  -------------  -------
>> btrfs hdd w/sync	   142:38	+533%
>> btrfs ssd+hdd w/sync     81:04	+260%
>> ext4 hdd w/sync          52:39	+134%
>> btrfs bcache w/sync      35:59	 +60%
>> btrfs ssd w/sync         22:31	reference
>> ext4 ssd w/syn           12:19	 -45%
> 
> Interesting data but it seems to be missing the case of btrfs ssd+hdd
> w/sync without your patch in order to tell what difference your patch
> made. Or am I confused?
> 
Currently BTRFS allocates the chunks on the basis of the free space.

For my tests I have a smaller ssd (20GB) and a bigger hdd (230GB).
This means that the latter has higher priority for the allocation,
until the free space becomes equal.

The rationale behind my patch is the following:
- it is quite simple (even though in 3 iterations I managed to put in two errors :-) )
- BTRFS already has two kinds of information to store: data and metadata.
   The former is (a lot) bigger than the latter. Having two kinds of storage,
   one faster (and more expensive) than the other, it is natural to put the metadata
   on the faster one and the data on the slower one.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 18:47   ` Goffredo Baroncelli
@ 2020-04-05 21:58     ` Adam Borowski
  0 siblings, 0 replies; 28+ messages in thread
From: Adam Borowski @ 2020-04-05 21:58 UTC (permalink / raw)
  To: kreijack; +Cc: Graham Cobb, linux-btrfs

On Sun, Apr 05, 2020 at 08:47:15PM +0200, Goffredo Baroncelli wrote:
> Currently BTRFS allocates the chunk on the basis of the free space.
> 
> For my tests I have a smaller ssd (20GB) and a bigger hdd (230GB).
> This means that the latter has higher priority for the allocation,
> until the free space became equal.
> 
> The rationale behind my patch is the following:
> - is quite simple (even tough in 3 iteration I put two errors :-) )
> - BTRFS has already two kind of information to store: data and metadata.
>   The former is (a lot ) bigger, than the latter. Having two kind of storage,
>   one faster (and expensive) than the other, it is natural to put the metadata
>   in the faster one, and the data in the slower one.

But why do you assume that SSD means fast?  Even with traditional disks
only, you can have a SATA-connected array for data and NVMe for metadata,
legacy NVMe for data and NVMe Optane for metadata -- but the real fun starts
if you put metadata on Optane pmem.

There are many storage tiers, and your patch hard-codes the lowest one as
the only determinant.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀                                       -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05 10:57 ` [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Graham Cobb
  2020-04-05 18:47   ` Goffredo Baroncelli
@ 2020-04-06  2:24   ` Zygo Blaxell
  2020-04-06 16:43     ` Goffredo Baroncelli
  1 sibling, 1 reply; 28+ messages in thread
From: Zygo Blaxell @ 2020-04-06  2:24 UTC (permalink / raw)
  To: Graham Cobb; +Cc: Goffredo Baroncelli, linux-btrfs

On Sun, Apr 05, 2020 at 11:57:49AM +0100, Graham Cobb wrote:
> On 05/04/2020 09:26, Goffredo Baroncelli wrote:
> ...
> 
> > I considered the following scenarios:
> > - btrfs over ssd
> > - btrfs over ssd + hdd with my patch enabled
> > - btrfs over bcache over hdd+ssd
> > - btrfs over hdd (very, very slow....)
> > - ext4 over ssd
> > - ext4 over hdd
> > 
> > The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> > as cache/buff.
> > 
> > Data analysis:
> > 
> > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > by using the btrfs snapshot capabilities. But this is another (not easy) story.

flushoncommit and eatmydata work reasonably well...once you patch out the
noise warnings from fs-writeback.

> > Unsurprising bcache performs better than my patch. But this is an expected
> > result because it can cache also the data chunk (the read can goes directly to
> > the ssd). bcache perform about +60% slower when there are a lot of sync/flush
> > and only +20% in the other case.
> > 
> > Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
> > time from +256% to +113%  than the hdd-only . Which I consider a good
> > results considering how small is the patch.
> > 
> > 
> > Raw data:
> > The data below is the "real" time (as return by the time command) consumed by
> > apt
> > 
> > 
> > Test description         real (mmm:ss)	Delta %
> > --------------------     -------------  -------
> > btrfs hdd w/sync	   142:38	+533%
> > btrfs ssd+hdd w/sync        81:04	+260%
> > ext4 hdd w/sync	            52:39	+134%
> > btrfs bcache w/sync	    35:59	 +60%
> > btrfs ssd w/sync	    22:31	reference
> > ext4 ssd w/sync	            12:19	 -45%
> 
> Interesting data but it seems to be missing the case of btrfs ssd+hdd
> w/sync without your patch in order to tell what difference your patch
> made. Or am I confused?

Goffredo's test was using profile 'single' for both data and metadata,
so the unpatched allocator would use the biggest device (hdd) for all
block groups and ignore the smaller one (ssd).  The result should be
the same as plain btrfs hdd, give or take a few superblock updates.

Of course, no one should ever use 'single' profile for metadata, except
on disposable filesystems like the ones people use for benchmarks.  ;)
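
(As a side note, one way to check on which devices the data and metadata
block groups actually ended up in such a test is something like

  # btrfs device usage /mnt

with the path being a placeholder; it prints the allocated space per device,
split by chunk type.)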

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06  2:24   ` Zygo Blaxell
@ 2020-04-06 16:43     ` Goffredo Baroncelli
  2020-04-06 17:21       ` Zygo Blaxell
  0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-06 16:43 UTC (permalink / raw)
  To: Zygo Blaxell, Graham Cobb; +Cc: linux-btrfs

On 4/6/20 4:24 AM, Zygo Blaxell wrote:
>>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
> flushoncommit and eatmydata work reasonably well...once you patch out the
> noise warnings from fs-writeback.
> 

You wrote flushoncommit, but did you mean "noflushoncommit" ?

Regarding eatmydata, I used it too. However I was never happy with it. Below is my script:
----------------------------------
ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf

DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
Dpkg::options {"--force-unsafe-io";};
---------------------------------
ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh

btrfsroot=/var/btrfs/debian
btrfsrollback=/var/btrfs/debian-rollback


do_snapshot() {
	if [ -d "$btrfsrollback" ]; then
		btrfs subvolume delete "$btrfsrollback"
	fi

	# wait up to ~2 seconds for the subvolume deletion to complete
	i=20
	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
		i=$(( $i - 1 ))
		sleep 0.1
	done
	if [ $i -eq 0 ]; then
		exit 100
	fi

	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
	
}

do_removerollback() {
	if [ -d "$btrfsrollback" ]; then
		btrfs subvolume delete "$btrfsrollback"
	fi
}

if [ "$1" = "snapshot" ]; then
	do_snapshot
elif [ "$1" = "clean" ]; then
	do_removerollback
else
	echo "usage: $0  snapshot|clean"
fi
--------------------------------------------------------------

Suggestions are welcome on how to detect automatically where the btrfs root (subvol=/) is mounted, and what my root subvolume name is (debian in my case), so that I can avoid hard-coding them in my script.

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 16:43     ` Goffredo Baroncelli
@ 2020-04-06 17:21       ` Zygo Blaxell
  2020-04-06 17:33         ` Goffredo Baroncelli
  0 siblings, 1 reply; 28+ messages in thread
From: Zygo Blaxell @ 2020-04-06 17:21 UTC (permalink / raw)
  To: kreijack; +Cc: Graham Cobb, linux-btrfs

On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
> On 4/6/20 4:24 AM, Zygo Blaxell wrote:
> > > > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > > > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > > > by using the btrfs snapshot capabilities. But this is another (not easy) story.
> > flushoncommit and eatmydata work reasonably well...once you patch out the
> > noise warnings from fs-writeback.
> > 
> 
> You wrote flushoncommit, but did you mean "noflushoncommit" ?

No.  "noflushoncommit" means applications have to call fsync() all the
time, or their files get trashed on a crash.  I meant flushoncommit
and eatmydata.

While dpkg runs, it must never call fsync, or it breaks the write
ordering provided by flushoncommit (or you have to zero-log on boot).
btrfs effectively does a point-in-time snapshot at every commit interval.
dpkg's ordering of write operations and renames does the rest.

dpkg runs much faster, so the window for interruption is smaller, and
if it is interrupted, then the result is more or less the same as if
you had run with fsync() on noflushoncommit.  The difference is that
the filesystem might roll back to an earlier state after a crash, which
could be a problem e.g. if your maintainer scripts are manipulating data
on multiple filesystems.
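
(Just to illustrate the combination described above, not as a recommendation
for any particular setup; device and path are placeholders:

  # mount -o flushoncommit /dev/sdX /mnt
  # eatmydata apt-get dist-upgrade

that is, the filesystem is mounted with flushoncommit and apt/dpkg runs under
the eatmydata wrapper, which turns its fsync()/sync() calls into no-ops.)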


> Regarding eatmydata, I used it too. However I was never happy. Below my script:
> ----------------------------------
> ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
> 
> DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
> DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
> Dpkg::options {"--force-unsafe-io";};
> ---------------------------------
> ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
> 
> btrfsroot=/var/btrfs/debian
> btrfsrollback=/var/btrfs/debian-rollback
> 
> 
> do_snapshot() {
> 	if [ -d "$btrfsrollback" ]; then
> 		btrfs subvolume delete "$btrfsrollback"
> 	fi
> 
> 	i=20
> 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
> 		i=$(( $i + 1 ))
> 		sleep 0.1
> 	done
> 	if [ $i -eq 0 ]; then
> 		exit 100
> 	fi
> 
> 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
> 	
> }
> 
> do_removerollback() {
> 	if [ -d "$btrfsrollback" ]; then
> 		btrfs subvolume delete "$btrfsrollback"
> 	fi
> }
> 
> if [ "$1" = "snapshot" ]; then
> 	do_snapshot
> elif [ "$1" = "clean" ]; then
> 	do_removerollback
> else
> 	echo "usage: $0  snapshot|clean"
> fi
> --------------------------------------------------------------
> 
> Suggestion are welcome how detect automatically where is mount the
> btrfs root (subvolume=/) and  my root subvolume name (debian in my
> case). So I will avoid to wrote directly in my script.

You can figure out where "/" is within a btrfs filesystem by recursively
looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
(FS_ROOT), sort of like the way pwd works on traditional Unix; however,
root can be a bind mount, so "path from fs_root to /" is not guaranteed
to end at a subvol root.

Also, sometimes people put /var on its own subvol, so you'd need to
find "the set of all subvols relevant to dpkg" and that's definitely
not trivial in the general case.

It's not as easy to figure out if there's an existing fs_root mount
point (partly because namespacing mangles every path in /proc/mounts
and mountinfo), but if you know the btrfs device (and can access it
from your namespace) you can just mount it somewhere and then you do
know where it is.
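
(A rough sketch of that last approach, assuming the root filesystem is the
btrfs in question; the mount point is a placeholder and the bracket-stripping
relies on findmnt's "SOURCE[/subvol]" output format:

  # dev=$(findmnt -n -o SOURCE / | sed 's/\[.*//')
  # mkdir -p /mnt/btrfs-top
  # mount -o subvolid=5 "$dev" /mnt/btrfs-top

After that, the top-level subvolume, and thus every subvolume below it, is
reachable under /mnt/btrfs-top.)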

> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 17:21       ` Zygo Blaxell
@ 2020-04-06 17:33         ` Goffredo Baroncelli
  2020-04-06 17:40           ` Zygo Blaxell
  0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-06 17:33 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Graham Cobb, linux-btrfs

On 4/6/20 7:21 PM, Zygo Blaxell wrote:
> On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
>> On 4/6/20 4:24 AM, Zygo Blaxell wrote:
>>>>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>>>>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>>>>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
>>> flushoncommit and eatmydata work reasonably well...once you patch out the
>>> noise warnings from fs-writeback.
>>>
>>
>> You wrote flushoncommit, but did you mean "noflushoncommit" ?
> 
> No.  "noflushoncommit" means applications have to call fsync() all the
> time, or their files get trashed on a crash.  I meant flushoncommit
> and eatmydata.

Is it a tristate value (default, flushoncommit, noflushoncommit), or
IS flushoncommit the default?
> 
> While dpkg runs, it must never call fsync, or it breaks the write
> ordering provided by flushoncommit (or you have to zero-log on boot).
> btrfs effectively does a point-in-time snapshot at every commit interval.
> dpkg's ordering of write operations and renames does the rest.
> 
> dpkg runs much faster, so the window for interruption is smaller, and
> if it is interrupted, then the result is more or less the same as if
> you had run with fsync() on noflushoncommit.  The difference is that
> the filesystem might roll back to an earlier state after a crash, which
> could be a problem e.g. if your maintainer scripts are manipulating data
> on multiple filesystems.
> 
> 
>> Regarding eatmydata, I used it too. However I was never happy. Below my script:
>> ----------------------------------
>> ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
>>
>> DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
>> DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
>> Dpkg::options {"--force-unsafe-io";};
>> ---------------------------------
>> ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
>>
>> btrfsroot=/var/btrfs/debian
>> btrfsrollback=/var/btrfs/debian-rollback
>>
>>
>> do_snapshot() {
>> 	if [ -d "$btrfsrollback" ]; then
>> 		btrfs subvolume delete "$btrfsrollback"
>> 	fi
>>
>> 	i=20
>> 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
>> 		i=$(( $i + 1 ))
>> 		sleep 0.1
>> 	done
>> 	if [ $i -eq 0 ]; then
>> 		exit 100
>> 	fi
>>
>> 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
>> 	
>> }
>>
>> do_removerollback() {
>> 	if [ -d "$btrfsrollback" ]; then
>> 		btrfs subvolume delete "$btrfsrollback"
>> 	fi
>> }
>>
>> if [ "$1" = "snapshot" ]; then
>> 	do_snapshot
>> elif [ "$1" = "clean" ]; then
>> 	do_removerollback
>> else
>> 	echo "usage: $0  snapshot|clean"
>> fi
>> --------------------------------------------------------------
>>
>> Suggestion are welcome how detect automatically where is mount the
>> btrfs root (subvolume=/) and  my root subvolume name (debian in my
>> case). So I will avoid to wrote directly in my script.
> 
> You can figure out where "/" is within a btrfs filesystem by recusively
> looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
> FS_ROOT (sort of like the way pwd works on traditional Unix); however,
> root can be a bind mount, so "path from fs_root to /" is not guaranteed
> to end at a subvol root.

Maybe a use case for a new ioctl :-) ? Snapshot a subvolume without
mounting the root subvolume....

> 
> Also, sometimes people put /var on its own subvol, so you'd need to
> find "the set of all subvols relevant to dpkg" and that's definitely
> not trivial in the general case.

I know that as a general rule it is not easy. Anyway, I would also put /boot
and /home in dedicated subvolumes.
If the "rollback" is done at boot, /boot should be an invariant...
However I think that there are a lot of corner cases even here (what happens
if the boot kernel doesn't have its modules in the root subvolume?)

It is not an easy job. It must be performed at the distribution level...

> 
> It's not as easy to figure out if there's an existing fs_root mount
> point (partly because namespacing mangles every path in /proc/mounts
> and mountinfo), but if you know the btrfs device (and can access it
> from your namespace) you can just mount it somewhere and then you do
> know where it is.

I agree: look up the "root device" starting from root, then mount the
root subvolume in a known place, where it is possible to snapshot
the root subvolume.

> 
>> BR
>> G.Baroncelli
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-06 17:33         ` Goffredo Baroncelli
@ 2020-04-06 17:40           ` Zygo Blaxell
  0 siblings, 0 replies; 28+ messages in thread
From: Zygo Blaxell @ 2020-04-06 17:40 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Graham Cobb, linux-btrfs

On Mon, Apr 06, 2020 at 07:33:16PM +0200, Goffredo Baroncelli wrote:
> On 4/6/20 7:21 PM, Zygo Blaxell wrote:
> > On Mon, Apr 06, 2020 at 06:43:04PM +0200, Goffredo Baroncelli wrote:
> > > On 4/6/20 4:24 AM, Zygo Blaxell wrote:
> > > > > > Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> > > > > > apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> > > > > > by using the btrfs snapshot capabilities. But this is another (not easy) story.
> > > > flushoncommit and eatmydata work reasonably well...once you patch out the
> > > > noise warnings from fs-writeback.
> > > > 
> > > 
> > > You wrote flushoncommit, but did you mean "noflushoncommit" ?
> > 
> > No.  "noflushoncommit" means applications have to call fsync() all the
> > time, or their files get trashed on a crash.  I meant flushoncommit
> > and eatmydata.
> 
> It is a tristate value (default, flushoncommit, noflushoncommit), or
> flushoncommit IS the default ?

noflushoncommit is the default.  flushoncommit is sort of terrible--it
used to have deadlock bugs up to 4.15, and spams the kernel log with
warnings since 4.15.

> > While dpkg runs, it must never call fsync, or it breaks the write
> > ordering provided by flushoncommit (or you have to zero-log on boot).
> > btrfs effectively does a point-in-time snapshot at every commit interval.
> > dpkg's ordering of write operations and renames does the rest.
> > 
> > dpkg runs much faster, so the window for interruption is smaller, and
> > if it is interrupted, then the result is more or less the same as if
> > you had run with fsync() on noflushoncommit.  The difference is that
> > the filesystem might roll back to an earlier state after a crash, which
> > could be a problem e.g. if your maintainer scripts are manipulating data
> > on multiple filesystems.
> > 
> > 
> > > Regarding eatmydata, I used it too. However I was never happy. Below my script:
> > > ----------------------------------
> > > ghigo@venice:/etc/apt/apt.conf.d$ cat 10btrfs.conf
> > > 
> > > DPkg::Pre-Invoke {"bash /var/btrfs/btrfs-apt.sh snapshot";};
> > > DPkg::Post-Invoke {"bash /var/btrfs/btrfs-apt.sh clean";};
> > > Dpkg::options {"--force-unsafe-io";};
> > > ---------------------------------
> > > ghigo@venice:/etc/apt/apt.conf.d$ cat /var/btrfs/btrfs-apt.sh
> > > 
> > > btrfsroot=/var/btrfs/debian
> > > btrfsrollback=/var/btrfs/debian-rollback
> > > 
> > > 
> > > do_snapshot() {
> > > 	if [ -d "$btrfsrollback" ]; then
> > > 		btrfs subvolume delete "$btrfsrollback"
> > > 	fi
> > > 
> > > 	i=20
> > > 	while [ $i -gt 0 -a -d "$btrfsrollback" ]; do
> > > 		i=$(( $i + 1 ))
> > > 		sleep 0.1
> > > 	done
> > > 	if [ $i -eq 0 ]; then
> > > 		exit 100
> > > 	fi
> > > 
> > > 	btrfs subvolume snapshot "$btrfsroot" "$btrfsrollback"
> > > 	
> > > }
> > > 
> > > do_removerollback() {
> > > 	if [ -d "$btrfsrollback" ]; then
> > > 		btrfs subvolume delete "$btrfsrollback"
> > > 	fi
> > > }
> > > 
> > > if [ "$1" = "snapshot" ]; then
> > > 	do_snapshot
> > > elif [ "$1" = "clean" ]; then
> > > 	do_removerollback
> > > else
> > > 	echo "usage: $0  snapshot|clean"
> > > fi
> > > --------------------------------------------------------------
> > > 
> > > Suggestion are welcome how detect automatically where is mount the
> > > btrfs root (subvolume=/) and  my root subvolume name (debian in my
> > > case). So I will avoid to wrote directly in my script.
> > 
> > You can figure out where "/" is within a btrfs filesystem by recusively
> > looking up parent subvol IDs with TREE_SEARCH_V2 until you get to 5
> > FS_ROOT (sort of like the way pwd works on traditional Unix); however,
> > root can be a bind mount, so "path from fs_root to /" is not guaranteed
> > to end at a subvol root.
> 
> May be an use case for a new ioctl :-) ? Snapshot a subvolume without
> mounting the root subvolume....

That would make access control mechanisms like chroot...challenging.
;)  But I hear we have a delete-by-id ioctl now, so might as well have
snap-by-id too.

> > Also, sometimes people put /var on its own subvol, so you'd need to
> > find "the set of all subvols relevant to dpkg" and that's definitely
> > not trivial in the general case.
> 
> I know that a general rule it is not easy. Anyway I also would put /boot
> and /home in a dedicated subvolume.
> If the "roolback" is done at boot, /boot should be an invariant...
> However I think that there are a lot of corner case even here (what happens
> if the boot kernel doesn't have modules in the root subvolume ?)
> 
> It is not an easy job. It must be performed at distribution level...
> 
> > 
> > It's not as easy to figure out if there's an existing fs_root mount
> > point (partly because namespacing mangles every path in /proc/mounts
> > and mountinfo), but if you know the btrfs device (and can access it
> > from your namespace) you can just mount it somewhere and then you do
> > know where it is.
> 
> I agree, looking from root to the "root device", then mount the
> root subvolume in a know place, where it is possible to snapshot
> the root subvolume.
> 
> > 
> > > BR
> > > G.Baroncelli
> > > -- 
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> > > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] btrfs: add ssd_metadata mode
  2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
@ 2020-04-14  5:24   ` Paul Jones
  2020-10-23  7:23   ` Wang Yugui
  1 sibling, 0 replies; 28+ messages in thread
From: Paul Jones @ 2020-04-14  5:24 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Goffredo Baroncelli
> Sent: Sunday, 5 April 2020 6:27 PM
> To: linux-btrfs@vger.kernel.org
> Cc: Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Wang Yugui <wangyugui@e16-
> tech.com>; Goffredo Baroncelli <kreijack@inwind.it>
> Subject: [PATCH] btrfs: add ssd_metadata mode
> 
> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled, the allocation policy of the chunk is so modified:
> - allocation of metadata chunk: priority is given to ssd disk.
> - allocation of data chunk: priority is given to a rotational disk.
> 
> When a striped profile is involved (like RAID0,5,6), the logic is a bit more
> complex. If there are enough disks, the data profiles are stored on the
> rotational disks only; instead the metadata profiles are stored on the non
> rotational disk only.
> If the disks are not enough, then the profiles is stored on all the disks.
> 
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are rotational
> ones.
> A data profile raid6, will be stored on sda, sdb, sdc, sde, sdf (sde and sdf are
> not enough to host a raid5 profile).
> A metadata profile raid6, will be stored on sda, sdb, sdc (these are enough to
> host a raid6 profile).
> 
> To enable this mode pass -o ssd_metadata at mount time.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>

Tested-By: Paul Jones <paul@pauljones.id.au>

Using raid 1. Makes a surprising difference in speed, nearly as fast as when I was using dm-cache


Paul.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
  2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
  2020-04-05 10:57 ` [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Graham Cobb
@ 2020-05-29 16:06 ` Hans van Kranenburg
  2020-05-29 16:40   ` Goffredo Baroncelli
  2020-05-30  4:59 ` Qu Wenruo
  3 siblings, 1 reply; 28+ messages in thread
From: Hans van Kranenburg @ 2020-05-29 16:06 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

Hi Goffredo,

On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
> 
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
> 
> The core idea is to store the metadata on the ssd and to leave the data
> on the rotational disks. BTRFS looks at the rotational flags to
> understand the kind of disks.

Like I said yesterday, thanks for working on these kind of proof of
concepts. :)

Even though this can't be a final solution, it's still very useful in the
meantime for users for whom this is sufficient right now.

I simply did not realize before that it was possible to just set that
rotational flag myself using an udev rule... How convenient.

-# cat /etc/udev/rules.d/99-yolo.rules
ACTION=="add|change", ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc", ATTR{queue/rotational}="1"
ACTION=="add|change", ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b", ATTR{queue/rotational}="0"
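
(For anyone reproducing this: after dropping such a rules file in place, it
can be applied without a reboot with something along these lines

  # udevadm control --reload
  # udevadm trigger --subsystem-match=block

assuming the usual udev tooling is available.)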

> This new mode is enabled passing the option ssd_metadata at mount time.
> This policy of allocation is the "preferred" one. If this doesn't permit
> a chunk allocation, the "classic" one is used.
> 
> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
> 
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
> 
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.

Yes, and while the explanation above focuses on multi-disk profiles, it
might be useful (for the similar section in later versions) to
explicitly mention that for single profile, the same algorithm will just
cause it to overflow to a less preferred disk if the preferred one is
completely full. Neat!

I've been testing this change on top of my 4.19 kernel, and also tried
to come up with some edge cases, doing ridiculous things to generate
metadata usage and do stuff like btrfs fi resize to push metadata away
from the preferred device etc... No weird things happened.

I guess there will be no further work on this V3; the only comment I
would have now is that an Opt_no_ssd_metadata would be nice for testing,
but I can hack that in myself.

Thanks,
Hans

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-29 16:06 ` Hans van Kranenburg
@ 2020-05-29 16:40   ` Goffredo Baroncelli
  2020-05-29 18:37     ` Hans van Kranenburg
  0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-05-29 16:40 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/29/20 6:06 PM, Hans van Kranenburg wrote:
> Hi Goffredo,
> 
> On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
>>
>> This is an RFC; I wrote this patch because I find the idea interesting
>> even though it adds more complication to the chunk allocator.
>>
>> The core idea is to store the metadata on the ssd and to leave the data
>> on the rotational disks. BTRFS looks at the rotational flags to
>> understand the kind of disks.
> 
> Like I said yesterday, thanks for working on these kind of proof of
> concepts. :)
> 
> Even while this can't be a final solution, it's still very useful in the
> meantime for users for which this is sufficient right now.
> 
> I simply did not realize before that it was possible to just set that
> rotational flag myself using an udev rule... How convenient.
> 
> -# cat /etc/udev/rules.d/99-yolo.rules
> ACTION=="add|change",
> ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc",
> ATTR{queue/rotational}="1"
> ACTION=="add|change",
> ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b",
> ATTR{queue/rotational}="0"

Yes, but of course this should be an exception rather than the default.

> 
>> This new mode is enabled passing the option ssd_metadata at mount time.
>> This policy of allocation is the "preferred" one. If this doesn't permit
>> a chunk allocation, the "classic" one is used.
>>
>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>
>> Non striped profile: metadata->raid1, data->raid1
>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>> /dev/sd[abc].
>>
>> Striped profile: metadata->raid6, data->raid6
>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>> will be stored on all the disks /dev/sd[abcdef].
>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>> because these are enough to host this chunk.
> 
> Yes, and while the explanation above focuses on multi-disk profiles, it
> might be useful (for the similar section in later versions) to
> explicitly mention that for single profile, the same algorithm will just
> cause it to overflow to a less preferred disk if the preferred one is
> completely full. Neat!
> 
> I've been testing this change on top of my 4.19 kernel, and also tried
> to come up with some edge cases, doing ridiculous things to generate
> metadata usage en do stuff like btrfs fi resize to push metadata away
> from the prefered device etc... No weird things happened.
> 
> I guess there will be no further work on this V3, the only comment I
> would have now is that an Opt_no_ssd_metadata would be nice for testing,
> but I can hack that in myself.

Since ssd_metadata is not the default, what would be the purpose of
Opt_no_ssd_metadata?

> 
> Thanks,
> Hans
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-29 16:40   ` Goffredo Baroncelli
@ 2020-05-29 18:37     ` Hans van Kranenburg
  0 siblings, 0 replies; 28+ messages in thread
From: Hans van Kranenburg @ 2020-05-29 18:37 UTC (permalink / raw)
  To: kreijack, linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/29/20 6:40 PM, Goffredo Baroncelli wrote:
> On 5/29/20 6:06 PM, Hans van Kranenburg wrote:
>> Hi Goffredo,
>>
>> On 4/5/20 10:26 AM, Goffredo Baroncelli wrote:
>>>
>>> This is an RFC; I wrote this patch because I find the idea interesting
>>> even though it adds more complication to the chunk allocator.
>>>
>>> The core idea is to store the metadata on the ssd and to leave the data
>>> on the rotational disks. BTRFS looks at the rotational flags to
>>> understand the kind of disks.
>>
>> Like I said yesterday, thanks for working on these kind of proof of
>> concepts. :)
>>
>> Even while this can't be a final solution, it's still very useful in the
>> meantime for users for which this is sufficient right now.
>>
>> I simply did not realize before that it was possible to just set that
>> rotational flag myself using an udev rule... How convenient.
>>
>> -# cat /etc/udev/rules.d/99-yolo.rules
>> ACTION=="add|change",
>> ENV{ID_FS_UUID_SUB_ENC}=="4139fb4c-e7c4-49c7-a4ce-5c86f683ffdc",
>> ATTR{queue/rotational}="1"
>> ACTION=="add|change",
>> ENV{ID_FS_UUID_SUB_ENC}=="192139f4-1618-4089-95fd-4a863db9416b",
>> ATTR{queue/rotational}="0"
> 
> Yes but of course this should be an exception than the default

For non-local storage this rotational value is completely bogus by
default anyway.

What I mean is that I like that this PoC patch (ab)uses existing stuff,
and does not rely on changing the filesystem (yet) in any way, so it can
be thrown out at any time later without consequences.

>>> This new mode is enabled passing the option ssd_metadata at mount time.
>>> This policy of allocation is the "preferred" one. If this doesn't permit
>>> a chunk allocation, the "classic" one is used.
>>>
>>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>>
>>> Non striped profile: metadata->raid1, data->raid1
>>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>>> /dev/sd[abc].
>>>
>>> Striped profile: metadata->raid6, data->raid6
>>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>>> will be stored on all the disks /dev/sd[abcdef].
>>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>>> because these are enough to host this chunk.
>>
>> Yes, and while the explanation above focuses on multi-disk profiles, it
>> might be useful (for the similar section in later versions) to
>> explicitly mention that for single profile, the same algorithm will just
>> cause it to overflow to a less preferred disk if the preferred one is
>> completely full. Neat!
>>
>> I've been testing this change on top of my 4.19 kernel, and also tried
>> to come up with some edge cases, doing ridiculous things to generate
>> metadata usage en do stuff like btrfs fi resize to push metadata away
>> from the prefered device etc... No weird things happened.
>>
>> I guess there will be no further work on this V3, the only comment I
>> would have now is that an Opt_no_ssd_metadata would be nice for testing,
>> but I can hack that in myself.
> 
> Because ssd_metadata is not a default, what would be the purpouse of
> Opt_no_ssd_metadata ?

While testing, mount -o remount,no_ssd_metadata without having to umount
/ mount and stop data generating/removing test processes, so that data
gets written to the "wrong" disks again.

Hans


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-05  8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
                   ` (2 preceding siblings ...)
  2020-05-29 16:06 ` Hans van Kranenburg
@ 2020-05-30  4:59 ` Qu Wenruo
  2020-05-30  6:48   ` Goffredo Baroncelli
  3 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2020-05-30  4:59 UTC (permalink / raw)
  To: Goffredo Baroncelli, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui



On 2020/4/5 4:26 PM, Goffredo Baroncelli wrote:
>
> Hi all,
>
> This is an RFC; I wrote this patch because I find the idea interesting
> even though it adds more complication to the chunk allocator.
>
> The core idea is to store the metadata on the ssd and to leave the data
> on the rotational disks. BTRFS looks at the rotational flags to
> understand the kind of disks.
>
> This new mode is enabled passing the option ssd_metadata at mount time.
> This policy of allocation is the "preferred" one. If this doesn't permit
> a chunk allocation, the "classic" one is used.

One thing to improve here: in fact we can use existing members to
store the device-related info:
- btrfs_dev_item::seek_speed
- btrfs_dev_item::bandwidth (I tend to rename it to IOPS)

In fact, what you're trying to do is to provide a policy to allocate
chunks based on each device's performance characteristics.

I believe it would be super awesome, but to get it upstream, I guess we
would prefer a more flexible framework, thus it would be pretty slow to merge.

But still, thanks for your awesome idea.

Thanks,
Qu


>
> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>
> Non striped profile: metadata->raid1, data->raid1
> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
> When /dev/sd[ef] are full, then the data chunk is allocated also on
> /dev/sd[abc].
>
> Striped profile: metadata->raid6, data->raid6
> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
> data profile raid6. To allow a data chunk allocation, the data profile raid6
> will be stored on all the disks /dev/sd[abcdef].
> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
> because these are enough to host this chunk.
>
> Changelog:
> v1: - first issue
> v2: - rebased to v5.6.2
>     - correct the comparison about the rotational disks (>= instead of >)
>     - add the flag rotational to the struct btrfs_device_info to
>       simplify the comparison function (btrfs_cmp_device_info*() )
> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>       BTRFS_MOUNT_SSD_METADATA.
>
> Below I collected some data to highlight the performance increment.
>
> Test setup:
> I performed as test a "dist-upgrade" of a Debian from stretch to buster.
> The test consisted in an image of a Debian stretch[1]  with the packages
> needed under /var/cache/apt/archives/ (so no networking was involved).
> For each test I formatted the filesystem from scratch, un-tar-red the
> image and the ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
> combination I measured the time of apt dist-upgrade with and
> without the flag "force-unsafe-io" which reduce the using of sync(2) and
> flush(2). The ssd was 20GB big, the hdd was 230GB big,
>
> I considered the following scenarios:
> - btrfs over ssd
> - btrfs over ssd + hdd with my patch enabled
> - btrfs over bcache over hdd+ssd
> - btrfs over hdd (very, very slow....)
> - ext4 over ssd
> - ext4 over hdd
>
> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
> as cache/buff.
>
> Data analysis:
>
> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
> by using the btrfs snapshot capabilities. But this is another (not easy) story.
>
> Unsurprising bcache performs better than my patch. But this is an expected
> result because it can cache also the data chunk (the read can goes directly to
> the ssd). bcache perform about +60% slower when there are a lot of sync/flush
> and only +20% in the other case.
>
> Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
> time from +256% to +113%  than the hdd-only . Which I consider a good
> results considering how small is the patch.
>
>
> Raw data:
> The data below is the "real" time (as return by the time command) consumed by
> apt
>
>
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd w/sync	   142:38	+533%
> btrfs ssd+hdd w/sync        81:04	+260%
> ext4 hdd w/sync	            52:39	+134%
> btrfs bcache w/sync	    35:59	 +60%
> btrfs ssd w/sync	    22:31	reference
> ext4 ssd w/sync	            12:19	 -45%
>
>
>
> Test description         real (mmm:ss)	Delta %
> --------------------     -------------  -------
> btrfs hdd	             56:2	+256%
> ext4 hdd	            51:32	+228%
> btrfs ssd+hdd	            33:30	+113%
> btrfs bcache	            18:57	 +20%
> btrfs ssd	            15:44	reference
> ext4 ssd	            11:49	 -25%
>
>
> [1] I created the image, using "debootrap stretch", then I installed a set
> of packages using the commands:
>
>   # debootstrap stretch test/
>   # chroot test/
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # apt --option=Dpkg::Options::=--force-confold \
>         --option=Dpkg::options::=--force-unsafe-io \
> 	install mate-desktop-environment* xserver-xorg vim \
>         task-kde-desktop task-gnome-desktop
>
> Then updated the release from stretch to buster changing the file /etc/apt/source.list
> Then I download the packages for the dist upgrade:
>
>   # apt-get update
>   # apt-get --download-only dist-upgrade
>
> Then I create a tar of this image.
> Before the dist upgrading the space used was about 7GB of space with 2281
> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
> The upgrade installed/updated about 2251 packages.
>
>
> [2] The command was a bit more complex, to avoid an interactive session
>
>   # mkfs.btrfs -m single -d single /dev/sdX
>   # mount /dev/sdX test/
>   # cd test
>   # time tar xzf ../image.tgz
>   # chroot .
>   # mount -t proc proc proc
>   # mount -t sysfs sys sys
>   # export DEBIAN_FRONTEND=noninteractive
>   # time apt-get -y --option=Dpkg::Options::=--force-confold \
> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
>
>
> BR
> G.Baroncelli
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-30  4:59 ` Qu Wenruo
@ 2020-05-30  6:48   ` Goffredo Baroncelli
  2020-05-30  8:57     ` Paul Jones
  0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-05-30  6:48 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

On 5/30/20 6:59 AM, Qu Wenruo wrote:
[...]
>> This new mode is enabled passing the option ssd_metadata at mount time.
>> This policy of allocation is the "preferred" one. If this doesn't permit
>> a chunk allocation, the "classic" one is used.
> 
> One thing to improve here, in fact we can use existing members to
> restore the device related info:
> - btrfs_dev_item::seek_speed
> - btrfs_dev_item::bandwidth (I tend to rename it to IOPS)

Hi Qu,

this patch was an older version; the current one (sent 2 days ago) stores the setting
of which disks have to be considered as "preferred_metadata".
> 
> In fact, what you're trying to do is to provide a policy to allocate
> chunks based on each device performance characteristics.
> 
> I believe it would be super awesome, but to get it upstream, I guess we
> would prefer a more flex framework, thus it would be pretty slow to merge.

I agree. And considering that in the near future SSDs will become more
widespread, I don't know if the effort (and the time required) are worth it.

> 
> But still, thanks for your awesome idea.
> 
> Thanks,
> Qu
> 
> 
>>
>> Some examples: (/dev/sd[abc] are ssd, and /dev/sd[ef] are rotational)
>>
>> Non striped profile: metadata->raid1, data->raid1
>> The data is stored on /dev/sd[ef], metadata is stored on /dev/sd[abc].
>> When /dev/sd[ef] are full, then the data chunk is allocated also on
>> /dev/sd[abc].
>>
>> Striped profile: metadata->raid6, data->raid6
>> raid6 requires 3 disks at minimum, so /dev/sd[ef] are not enough for a
>> data profile raid6. To allow a data chunk allocation, the data profile raid6
>> will be stored on all the disks /dev/sd[abcdef].
>> Instead the metadata profile raid6 will be allocated on /dev/sd[abc],
>> because these are enough to host this chunk.
>>
>> Changelog:
>> v1: - first issue
>> v2: - rebased to v5.6.2
>>      - correct the comparison about the rotational disks (>= instead of >)
>>      - add the flag rotational to the struct btrfs_device_info to
>>        simplify the comparison function (btrfs_cmp_device_info*() )
>> v3: - correct the collision between BTRFS_MOUNT_DISCARD_ASYNC and
>>        BTRFS_MOUNT_SSD_METADATA.
>>
>> Below I collected some data to highlight the performance increment.
>>
>> Test setup:
>> I performed as test a "dist-upgrade" of a Debian from stretch to buster.
>> The test consisted in an image of a Debian stretch[1]  with the packages
>> needed under /var/cache/apt/archives/ (so no networking was involved).
>> For each test I formatted the filesystem from scratch, un-tar-red the
>> image and the ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
>> combination I measured the time of apt dist-upgrade with and
>> without the flag "force-unsafe-io" which reduce the using of sync(2) and
>> flush(2). The ssd was 20GB big, the hdd was 230GB big,
>>
>> I considered the following scenarios:
>> - btrfs over ssd
>> - btrfs over ssd + hdd with my patch enabled
>> - btrfs over bcache over hdd+ssd
>> - btrfs over hdd (very, very slow....)
>> - ext4 over ssd
>> - ext4 over hdd
>>
>> The test machine was an "AMD A6-6400K" with 4GB of ram, where 3GB was used
>> as cache/buff.
>>
>> Data analysis:
>>
>> Of course btrfs is slower than ext4 when a lot of sync/flush are involved. Using
>> apt on a rotational was a dramatic experience. And IMHO  this should be replaced
>> by using the btrfs snapshot capabilities. But this is another (not easy) story.
>>
>> Unsurprising bcache performs better than my patch. But this is an expected
>> result because it can cache also the data chunk (the read can goes directly to
>> the ssd). bcache perform about +60% slower when there are a lot of sync/flush
>> and only +20% in the other case.
>>
>> Regarding the test with force-unsafe-io (fewer sync/flush), my patch reduce the
>> time from +256% to +113%  than the hdd-only . Which I consider a good
>> results considering how small is the patch.
>>
>>
>> Raw data:
>> The data below is the "real" time (as return by the time command) consumed by
>> apt
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd w/sync	   142:38	+533%
>> btrfs ssd+hdd w/sync        81:04	+260%
>> ext4 hdd w/sync	            52:39	+134%
>> btrfs bcache w/sync	    35:59	 +60%
>> btrfs ssd w/sync	    22:31	reference
>> ext4 ssd w/sync	            12:19	 -45%
>>
>>
>>
>> Test description         real (mmm:ss)	Delta %
>> --------------------     -------------  -------
>> btrfs hdd	             56:2	+256%
>> ext4 hdd	            51:32	+228%
>> btrfs ssd+hdd	            33:30	+113%
>> btrfs bcache	            18:57	 +20%
>> btrfs ssd	            15:44	reference
>> ext4 ssd	            11:49	 -25%
>>
>>
>> [1] I created the image, using "debootrap stretch", then I installed a set
>> of packages using the commands:
>>
>>    # debootstrap stretch test/
>>    # chroot test/
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # apt --option=Dpkg::Options::=--force-confold \
>>          --option=Dpkg::options::=--force-unsafe-io \
>> 	install mate-desktop-environment* xserver-xorg vim \
>>          task-kde-desktop task-gnome-desktop
>>
>> Then updated the release from stretch to buster changing the file /etc/apt/source.list
>> Then I download the packages for the dist upgrade:
>>
>>    # apt-get update
>>    # apt-get --download-only dist-upgrade
>>
>> Then I create a tar of this image.
>> Before the dist upgrading the space used was about 7GB of space with 2281
>> packages. After the dist-upgrade, the space used was 9GB with 2870 packages.
>> The upgrade installed/updated about 2251 packages.
>>
>>
>> [2] The command was a bit more complex, to avoid an interactive session
>>
>>    # mkfs.btrfs -m single -d single /dev/sdX
>>    # mount /dev/sdX test/
>>    # cd test
>>    # time tar xzf ../image.tgz
>>    # chroot .
>>    # mount -t proc proc proc
>>    # mount -t sysfs sys sys
>>    # export DEBIAN_FRONTEND=noninteractive
>>    # time apt-get -y --option=Dpkg::Options::=--force-confold \
>> 	--option=Dpkg::options::=--force-unsafe-io dist-upgrade
>>
>>
>> BR
>> G.Baroncelli
>>


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-05-30  6:48   ` Goffredo Baroncelli
@ 2020-05-30  8:57     ` Paul Jones
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jones @ 2020-05-30  8:57 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs
  Cc: Michael, Hugo Mills, Martin Svec, Wang Yugui

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 30 May 2020 4:48 PM
> To: Qu Wenruo <quwenruo.btrfs@gmx.com>; linux-btrfs@vger.kernel.org
> Cc: Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Wang Yugui <wangyugui@e16-
> tech.com>
> Subject: Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
> 
> On 5/30/20 6:59 AM, Qu Wenruo wrote:
> [...]
> >> This new mode is enabled passing the option ssd_metadata at mount
> time.
> >> This policy of allocation is the "preferred" one. If this doesn't
> >> permit a chunk allocation, the "classic" one is used.
> >
> > One thing to improve here, in fact we can use existing members to
> > restore the device related info:
> > - btrfs_dev_item::seek_speed
> > - btrfs_dev_item::bandwidth (I tend to rename it to IOPS)
> 
> Hi Qu,
> 
> this path was an older version,the current one (sent 2 days ago) store the
> setting of which disks has to be considered as "preferred_metadata".
> >
> > In fact, what you're trying to do is to provide a policy to allocate
> > chunks based on each device performance characteristics.
> >
> > I believe it would be super awesome, but to get it upstream, I guess
> > we would prefer a more flex framework, thus it would be pretty slow to
> merge.
> 
> I agree. And considering that in the near future the SSD will become more
> widespread, I don't know if the effort (and the time required) are worth.

I think it will be. Consider a large 10TB+ filesystem that runs on cheap unbuffered SSDs: metadata will still be a bottleneck like it is now, just everything will happen much faster. Archival storage will likely be rotational-based for a long time yet for cost reasons, and this is where SSD metadata shines. I've been running your ssd_metadata patch for over a month now and it's flipping fantastic! The responsiveness it brings to networked archival storage is amazing.

Paul.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
  2020-04-14  5:24   ` Paul Jones
@ 2020-10-23  7:23   ` Wang Yugui
  2020-10-23 10:11     ` Adam Borowski
  1 sibling, 1 reply; 28+ messages in thread
From: Wang Yugui @ 2020-10-23  7:23 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: linux-btrfs, Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli

Hi, Goffredo Baroncelli 

We can move the 'rotational' flag of struct btrfs_device_info to a 'bool
rotating' in struct btrfs_device.

1, it would be closer to the 'bool rotating' of struct btrfs_fs_devices.

2, it may be used to enhance the patch '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t

Best Regards
Wang Yugui
2020/10/23

> From: Goffredo Baroncelli <kreijack@inwind.it>
> 
> When this mode is enabled, the allocation policy of the chunk
> is so modified:
> - allocation of metadata chunk: priority is given to ssd disk.
> - allocation of data chunk: priority is given to a rotational disk.
> 
> When a striped profile is involved (like RAID0,5,6), the logic
> is a bit more complex. If there are enough disks, the data profiles
> are stored on the rotational disks only; instead the metadata profiles
> are stored on the non rotational disk only.
> If the disks are not enough, then the profiles is stored on all
> the disks.
> 
> Example: assuming that sda, sdb, sdc are ssd disks, and sde, sdf are
> rotational ones.
> A data profile raid6, will be stored on sda, sdb, sdc, sde, sdf (sde
> and sdf are not enough to host a raid5 profile).
> A metadata profile raid6, will be stored on sda, sdb, sdc (these
> are enough to host a raid6 profile).
> 
> To enable this mode pass -o ssd_metadata at mount time.
> 
> Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/super.c   |  8 +++++
>  fs/btrfs/volumes.c | 89 ++++++++++++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h |  1 +
>  4 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 36df977b64d9..773c7f8b0b0d 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1236,6 +1236,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
>  #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
>  #define BTRFS_MOUNT_DISCARD_ASYNC	(1 << 29)
> +#define BTRFS_MOUNT_SSD_METADATA	(1 << 30)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
>  #define BTRFS_DEFAULT_MAX_INLINE	(2048)
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 67c63858812a..4ad14b0a57b3 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -350,6 +350,7 @@ enum {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	Opt_ref_verify,
>  #endif
> +	Opt_ssd_metadata,
>  	Opt_err,
>  };
>  
> @@ -421,6 +422,7 @@ static const match_table_t tokens = {
>  #ifdef CONFIG_BTRFS_FS_REF_VERIFY
>  	{Opt_ref_verify, "ref_verify"},
>  #endif
> +	{Opt_ssd_metadata, "ssd_metadata"},
>  	{Opt_err, NULL},
>  };
>  
> @@ -872,6 +874,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>  			btrfs_set_opt(info->mount_opt, REF_VERIFY);
>  			break;
>  #endif
> +		case Opt_ssd_metadata:
> +			btrfs_set_and_info(info, SSD_METADATA,
> +					"enabling ssd_metadata");
> +			break;
>  		case Opt_err:
>  			btrfs_info(info, "unrecognized mount option '%s'", p);
>  			ret = -EINVAL;
> @@ -1390,6 +1396,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
>  #endif
>  	if (btrfs_test_opt(info, REF_VERIFY))
>  		seq_puts(seq, ",ref_verify");
> +	if (btrfs_test_opt(info, SSD_METADATA))
> +		seq_puts(seq, ",ssd_metadata");
>  	seq_printf(seq, ",subvolid=%llu",
>  		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
>  	seq_puts(seq, ",subvol=");
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9cfc668f91f4..ffb2bc912c43 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4761,6 +4761,58 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
>  	return 0;
>  }
>  
> +/*
> + * sort the devices in descending order by rotational,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_metadata(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* metadata -> non rotational first */
> +	if (!di_a->rotational && di_b->rotational)
> +		return -1;
> +	if (di_a->rotational && !di_b->rotational)
> +		return 1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * sort the devices in descending order by !rotational,
> + * max_avail, total_avail
> + */
> +static int btrfs_cmp_device_info_data(const void *a, const void *b)
> +{
> +	const struct btrfs_device_info *di_a = a;
> +	const struct btrfs_device_info *di_b = b;
> +
> +	/* data -> non rotational last */
> +	if (!di_a->rotational && di_b->rotational)
> +		return 1;
> +	if (di_a->rotational && !di_b->rotational)
> +		return -1;
> +	if (di_a->max_avail > di_b->max_avail)
> +		return -1;
> +	if (di_a->max_avail < di_b->max_avail)
> +		return 1;
> +	if (di_a->total_avail > di_b->total_avail)
> +		return -1;
> +	if (di_a->total_avail < di_b->total_avail)
> +		return 1;
> +	return 0;
> +}
> +
> +
> +
>  static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
>  {
>  	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> @@ -4808,6 +4860,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	int i;
>  	int j;
>  	int index;
> +	int nr_rotational;
>  
>  	BUG_ON(!alloc_profile_is_valid(type, 0));
>  
> @@ -4863,6 +4916,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	 * about the available holes on each device.
>  	 */
>  	ndevs = 0;
> +	nr_rotational = 0;
>  	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
>  		u64 max_avail;
>  		u64 dev_offset;
> @@ -4914,14 +4968,45 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  		devices_info[ndevs].max_avail = max_avail;
>  		devices_info[ndevs].total_avail = total_avail;
>  		devices_info[ndevs].dev = device;
> +		devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> +				&(bdev_get_queue(device->bdev)->queue_flags));
> +		if (devices_info[ndevs].rotational)
> +			nr_rotational++;
>  		++ndevs;
>  	}
>  
> +	BUG_ON(nr_rotational > ndevs);
>  	/*
>  	 * now sort the devices by hole size / available space
>  	 */
> -	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> -	     btrfs_cmp_device_info, NULL);
> +	if (((type & BTRFS_BLOCK_GROUP_DATA) &&
> +	     (type & BTRFS_BLOCK_GROUP_METADATA)) ||
> +	    !btrfs_test_opt(info, SSD_METADATA)) {
> +		/* mixed bg or SSD_METADATA not set */
> +		sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +			     btrfs_cmp_device_info, NULL);
> +	} else {
> +		/*
> +		 * if SSD_METADATA is set, sort the device considering also the
> +		 * kind (ssd or not). Limit the availables devices to the ones
> +		 * of the same kind, to avoid that a striped profile like raid5
> +		 * spread to all kind of devices (ssd and rotational).
> +		 * It is allowed to use different kinds of devices if the ones
> +		 * of the same kind are not enough alone.
> +		 */
> +		if (type & BTRFS_BLOCK_GROUP_DATA) {
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_data, NULL);
> +			if (nr_rotational >= devs_min)
> +				ndevs = nr_rotational;
> +		} else {
> +			int nr_norot = ndevs - nr_rotational;
> +			sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +				     btrfs_cmp_device_info_metadata, NULL);
> +			if (nr_norot >= devs_min)
> +				ndevs = nr_norot;
> +		}
> +	}
>  
>  	/*
>  	 * Round down to number of usable stripes, devs_increment can be any
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index f01552a0785e..285d71d54a03 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -343,6 +343,7 @@ struct btrfs_device_info {
>  	u64 dev_offset;
>  	u64 max_avail;
>  	u64 total_avail;
> +	int rotational:1;
>  };
>  
>  struct btrfs_raid_attr {
> -- 
> 2.26.0

--------------------------------------
北京京垓科技有限公司
Wang Yugui	wangyugui@e16-tech.com
Tel: +86-136-71123776


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23  7:23   ` Wang Yugui
@ 2020-10-23 10:11     ` Adam Borowski
  2020-10-23 11:25       ` Qu Wenruo
  0 siblings, 1 reply; 28+ messages in thread
From: Adam Borowski @ 2020-10-23 10:11 UTC (permalink / raw)
  To: Wang Yugui
  Cc: Goffredo Baroncelli, linux-btrfs, Michael, Hugo Mills,
	Martin Svec, Goffredo Baroncelli

On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
> Hi, Goffredo Baroncelli 
> 
> We can move 'rotational of struct btrfs_device_info' to  'bool rotating
> of struct btrfs_device'.
> 
> 1, it will be more close to 'bool rotating of struct btrfs_fs_devices'.
> 
> 2, it maybe used to enhance the path of '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t

I don't think it should be a bool -- or at least, turned into a bool
late in the processing.

There are many storage tiers; rotational applies only to one of the
coldest.  In my use case, at least, I've added the following patchlet:

-               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
+               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,

Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.

The tiers look like:
* DIMM-connected Optane (dax=1)
* NVMe-connected Optane
* NVMe-connected flash
* SATA-connected flash
* SATA-connected spinning rust (rotational=1)
* IDE-connected spinning rust (rotational=1)
* SD cards
* floppies?

And even that is just for local storage only.

Thus, please don't hardcode the notion of "rotational"; what we want is
"faster but smaller" vs "slower but bigger".

> > From: Goffredo Baroncelli <kreijack@inwind.it>
> > 
> > When this mode is enabled, the allocation policy of the chunk
> > is so modified:
> > - allocation of metadata chunk: priority is given to ssd disk.
> > - allocation of data chunk: priority is given to a rotational disk.
> > 
> > When a striped profile is involved (like RAID0,5,6), the logic
> > is a bit more complex. If there are enough disks, the data profiles
> > are stored on the rotational disks only; instead the metadata profiles
> > are stored on the non rotational disk only.
> > If the disks are not enough, then the profiles is stored on all
> > the disks.

And, a newer version of Goffredo's patchset already had
"preferred_metadata".  It did not assign the preference automatically,
but if we want good defaults, they should be smarter than just rotationality.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven giant trumpets are playing in the
⠈⠳⣄⠀⠀⠀⠀ sky.  Your cat demands food.  The priority should be obvious...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 10:11     ` Adam Borowski
@ 2020-10-23 11:25       ` Qu Wenruo
  2020-10-23 12:37         ` Wang Yugui
  0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2020-10-23 11:25 UTC (permalink / raw)
  To: Adam Borowski, Wang Yugui
  Cc: Goffredo Baroncelli, linux-btrfs, Michael, Hugo Mills,
	Martin Svec, Goffredo Baroncelli



On 2020/10/23 6:11 PM, Adam Borowski wrote:
> On Fri, Oct 23, 2020 at 03:23:30PM +0800, Wang Yugui wrote:
>> Hi, Goffredo Baroncelli
>>
>> We can move 'rotational of struct btrfs_device_info' to  'bool rotating
>> of struct btrfs_device'.
>>
>> 1, it will be more close to 'bool rotating of struct btrfs_fs_devices'.
>>
>> 2, it maybe used to enhance the path of '[PATCH] btrfs: balance RAID1/RAID10 mirror selection'.
>> https://lore.kernel.org/linux-btrfs/3bddd73e-cb60-b716-4e98-61ff24beb570@oracle.com/T/#t
>
> I don't think it should be a bool -- or at least, turned into a bool
> late in the processing.
>
> There are many storage tiers; rotational applies only to one of the
> coldest.  In my use case, at least, I've added the following patchlet:
>
> -               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_NONROT,
> +               devices_info[ndevs].rotational = !test_bit(QUEUE_FLAG_DAX,
>
> Or, you may want Optane NVMe vs legacy (ie, NAND) NVMe.

A little off topic here: btrfs in fact has better ways to model a
storage device, and definitely not simply rotational or not.

In btrfs_dev_item, we have bandwidth and seek_speed to describe the
device characteristics, although they're never really utilized.

So if we're really going to dig deeper into the rabbit hole, we need
more characteristics to describe a device: from basic bandwidth for
large-block-size IO, to things like IOPS for small random block sizes,
and even possible multi-level performance characteristics for cases like
the multi-level cache used in current NVMe SSDs, and future SMR + CMR
mixed devices.

Although computers are binary, performance characteristics are never
binary. :)
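
Purely to illustrate "more characteristics" (none of these fields exist
in btrfs today; the names are made up):

/*
 * Hypothetical per-device performance description, only to show the
 * kind of data a more flexible allocation policy would need.
 */
struct btrfs_dev_perf_hint {
	u32 seq_bandwidth_mbps;	/* large sequential IO */
	u32 rand_iops_4k;	/* small random IO */
	u8  tier;		/* coarse class: pmem/nvme/ssd/hdd */
	u8  smr:1;		/* SMR device with a CMR region */
};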

Thanks,
Qu
>
> The tiers look like:
> * DIMM-connected Optane (dax=1)
> * NVMe-connected Optane
> * NVMe-connected flash
> * SATA-connected flash
> * SATA-connected spinning rust (rotational=1)
> * IDE-connected spinning rust (rotational=1)
> * SD cards
> * floppies?
>
> And even that is just for local storage only.
>
> Thus, please don't hardcode the notion of "rotational", what we want is
> "faster but smaller" vs "slower but bigger".
>
>>> From: Goffredo Baroncelli <kreijack@inwind.it>
>>>
>>> When this mode is enabled, the allocation policy of the chunk
>>> is so modified:
>>> - allocation of metadata chunk: priority is given to ssd disk.
>>> - allocation of data chunk: priority is given to a rotational disk.
>>>
>>> When a striped profile is involved (like RAID0,5,6), the logic
>>> is a bit more complex. If there are enough disks, the data profiles
>>> are stored on the rotational disks only; instead the metadata profiles
>>> are stored on the non rotational disk only.
>>> If the disks are not enough, then the profiles is stored on all
>>> the disks.
>
> And, a newer version of Goffredo's patchset already had
> "preferred_metadata".  It did not assign the preference automatically,
> but if we want god defaults, they should be smarter than just rotationality.
>
>
> Meow!
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 11:25       ` Qu Wenruo
@ 2020-10-23 12:37         ` Wang Yugui
  2020-10-23 12:45           ` Qu Wenruo
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Wang Yugui @ 2020-10-23 12:37 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Adam Borowski, Goffredo Baroncelli, linux-btrfs, Michael,
	Hugo Mills, Martin Svec, Goffredo Baroncelli

Hi,

Can we add the feature of 'Storage Tiering' to btrfs for these use cases?

1) use faster tier firstly for metadata

2) only the subvol with higher tier can save data to 
    the higher tier disk?

3) use faster tier firstly for mirror selection of RAID1/RAID10

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 12:37         ` Wang Yugui
@ 2020-10-23 12:45           ` Qu Wenruo
  2020-10-23 13:10           ` Steven Davies
  2020-10-23 18:03           ` Goffredo Baroncelli
  2 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2020-10-23 12:45 UTC (permalink / raw)
  To: Wang Yugui
  Cc: Adam Borowski, Goffredo Baroncelli, linux-btrfs, Michael,
	Hugo Mills, Martin Svec, Goffredo Baroncelli



On 2020/10/23 8:37 PM, Wang Yugui wrote:
> Hi,
>
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?

Feel free to contribute.

It already seems pretty hard, though: too many factors are involved,
from the extent allocator to the chunk allocator.

Just consider how complex things are for bcache; it won't be a simple
feature at all.

Thanks,
Qu
>
> 1) use faster tier firstly for metadata
>
> 2) only the subvol with higher tier can save data to
>     the higher tier disk?
>
> 3) use faster tier firstly for mirror selection of RAID1/RAID10
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 12:37         ` Wang Yugui
  2020-10-23 12:45           ` Qu Wenruo
@ 2020-10-23 13:10           ` Steven Davies
  2020-10-23 13:49             ` Wang Yugui
  2020-10-23 18:03           ` Goffredo Baroncelli
  2 siblings, 1 reply; 28+ messages in thread
From: Steven Davies @ 2020-10-23 13:10 UTC (permalink / raw)
  To: Wang Yugui
  Cc: Qu Wenruo, Adam Borowski, Goffredo Baroncelli, linux-btrfs,
	Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli

On 2020-10-23 13:37, Wang Yugui wrote:
> Hi,
> 
> Can we add the feature of 'Storage Tiering' to btrfs for these use 
> cases?
> 
> 1) use faster tier firstly for metadata
> 
> 2) only the subvol with higher tier can save data to
>     the higher tier disk?
> 
> 3) use faster tier firstly for mirror selection of RAID1/RAID10

I'd support user-configurable tiering by specifying which device IDs are
allowed to be used for:

a) storing metadata
b) reading data from RAID1/RAID10

That would fit into both this patch and Anand's read policy patchset. It
could be a mount option, a sysfs tunable and/or a btrfs-device command.

e.g. for sysfs
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store [0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read [0..15]
/sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read [0..15]

Getting the user to specify the devices' priorities would be more 
reliable than looking at the rotational attribute.

-- 
Steven Davies

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 13:10           ` Steven Davies
@ 2020-10-23 13:49             ` Wang Yugui
  0 siblings, 0 replies; 28+ messages in thread
From: Wang Yugui @ 2020-10-23 13:49 UTC (permalink / raw)
  To: Steven Davies
  Cc: Qu Wenruo, Adam Borowski, Goffredo Baroncelli, linux-btrfs,
	Michael, Hugo Mills, Martin Svec, Goffredo Baroncelli

Hi,

We define an 'int tier' for the different devices, such as
* dax=1
* NVMe
* SSD
* rotational=1

but in most cases just two of these tiers are used at the same time,
so we use them as just two tiers (faster tier, slower tier).
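
A sketch of how that collapse could look (the 'tier' member on
btrfs_device_info is an assumption, lower value meaning faster):

/*
 * Sketch only: treat the fastest tier actually present in the
 * filesystem as the "faster tier" and everything else as "slower".
 */
static bool is_faster_tier(const struct btrfs_device_info *info,
			   int ndevs, int i)
{
	int best = INT_MAX;
	int j;

	for (j = 0; j < ndevs; j++)
		best = min(best, info[j].tier);
	return info[i].tier == best;
}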

For phase 1, we support:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10

and in phase 2 (TODO), we support:
2) let some subvols save data to the faster tier disk

for metadata, the policy is
1) faster-tier-firstly (default)
2) faster-tier-only

for data, the policy is
4) slower-tier-firstly(default)
3) slower-tier-only
1) faster-tier-firstly(phase2 TODO)
2) faster-tier-only(phase2 TODO)

Now we are using a metadata profile (RAID type) and a data profile (RAID type);
in phase 2 we would rename them to 'faster tier profile' and
'slower tier profile'.

The key is that 'in most cases, just two tiers are used at the same time',
so we can support the following cases in phase 1 without user config:
1) use the faster tier first for metadata
3) use the faster tier first for mirror selection of RAID1/RAID10

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2020/10/23


> On 2020-10-23 13:37, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use
> > cases?
> >
> > 1) use faster tier firstly for metadata
> >
> > 2) only the subvol with higher tier can save data to
> >     the higher tier disk?
> >
> > 3) use faster tier firstly for mirror selection of RAID1/RAID10
> 
> I'd support user-configurable tiering by specifying which device IDs are allowed to be used for
> 
> a) storing metadata
> b) reading data from RAID1/RAID10
> 
> that would fit into both this patch and Anand's read policy patchset. It could be a mount option, sysfs tunable and/or btrfs-device command.
> 
> e.g. for sysfs
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_store [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_metadata_read [0..15]
> /sys/fs/btrfs/6e2797f3-d0ab-4aa1-b262-c2395fd0626e/devices/sdb2/prio_data_read [0..15]
> 
> Getting the user to specify the devices' priorities would be more reliable than looking at the rotational attribute.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 12:37         ` Wang Yugui
  2020-10-23 12:45           ` Qu Wenruo
  2020-10-23 13:10           ` Steven Davies
@ 2020-10-23 18:03           ` Goffredo Baroncelli
  2020-10-24  3:26             ` Paul Jones
  2 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-10-23 18:03 UTC (permalink / raw)
  To: Wang Yugui, Qu Wenruo
  Cc: Adam Borowski, linux-btrfs, Michael, Hugo Mills, Martin Svec,
	Goffredo Baroncelli

On 10/23/20 2:37 PM, Wang Yugui wrote:
> Hi,
> 
> Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
> 
> 1) use faster tier firstly for metadata

My tests revealed that a BTRFS filesystem stacked over bcache has better performance, so I am not sure that putting the metadata on dedicated storage is a good thing.

> 
> 2) only the subvol with higher tier can save data to
>      the higher tier disk?
> 
> 3) use faster tier firstly for mirror selection of RAID1/RAID10

If you want to put a subvolume in a "higher tier", it is simpler to use two filesystems: one on the "higher tier" and one on the "slower" one.

Also, what should be the semantics of cp --reflink between subvolumes of different tiers? From a technical point of view it is defined, but the expectations of the users can vary...

The same is true about assigning different raid profiles to different subvolumes...

Let's keep things simple.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2020/10/23
> 

BR
GB
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] btrfs: add ssd_metadata mode
  2020-10-23 18:03           ` Goffredo Baroncelli
@ 2020-10-24  3:26             ` Paul Jones
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Jones @ 2020-10-24  3:26 UTC (permalink / raw)
  To: kreijack, Wang Yugui, Qu Wenruo
  Cc: Adam Borowski, linux-btrfs, Michael, Hugo Mills, Martin Svec

> -----Original Message-----
> From: Goffredo Baroncelli <kreijack@libero.it>
> Sent: Saturday, 24 October 2020 5:04 AM
> To: Wang Yugui <wangyugui@e16-tech.com>; Qu Wenruo
> <quwenruo.btrfs@gmx.com>
> Cc: Adam Borowski <kilobyte@angband.pl>; linux-btrfs@vger.kernel.org;
> Michael <mclaud@roznica.com.ua>; Hugo Mills <hugo@carfax.org.uk>;
> Martin Svec <martin.svec@zoner.cz>; Goffredo Baroncelli
> <kreijack@inwind.it>
> Subject: Re: [PATCH] btrfs: add ssd_metadata mode
> 
> On 10/23/20 2:37 PM, Wang Yugui wrote:
> > Hi,
> >
> > Can we add the feature of 'Storage Tiering' to btrfs for these use cases?
> >
> > 1) use faster tier firstly for metadata
> 
> My tests revealed that a BTRFS filesystem stacked over bcache has better
> performance. So I am not sure that putting the metadata in a dedicated
> storage is a good thing.

There is a balance between ultimate speed and simplicity. I used to use dm-cache under btrfs, which worked very well but is complicated and error-prone to set up, fragile, and slow to restore after an error. Now I'm using your ssd_metadata patch, which is almost as fast but far more robust and quick/easy to restore after errors. It's night-and-day better imho, especially for a production system.

Paul.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
  2020-04-27 15:06 Torstein Eide
@ 2020-04-28 19:31 ` Goffredo Baroncelli
  0 siblings, 0 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2020-04-28 19:31 UTC (permalink / raw)
  To: Torstein Eide; +Cc: hugo, linux-btrfs, martin.svec, mclaud, wangyugui

On 4/27/20 5:06 PM, Torstein Eide wrote:
> How will affect sleep of disk? will it reduce the number of wake up
> call, to the HDD?
No; this patch puts the metadata on the SSD, leaving the data on the HDD;
this means that if you add data to a file, both the HDD and the SSD will be used.
  
BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD
@ 2020-04-27 15:06 Torstein Eide
  2020-04-28 19:31 ` Goffredo Baroncelli
  0 siblings, 1 reply; 28+ messages in thread
From: Torstein Eide @ 2020-04-27 15:06 UTC (permalink / raw)
  To: kreijack; +Cc: hugo, linux-btrfs, martin.svec, mclaud, wangyugui

How will this affect disk sleep? Will it reduce the number of wake-up
calls to the HDD?

-- 
Torstein Eide
Torsteine@gmail.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-10-24  3:38 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-05  8:26 [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Goffredo Baroncelli
2020-04-05  8:26 ` [PATCH] btrfs: add ssd_metadata mode Goffredo Baroncelli
2020-04-14  5:24   ` Paul Jones
2020-10-23  7:23   ` Wang Yugui
2020-10-23 10:11     ` Adam Borowski
2020-10-23 11:25       ` Qu Wenruo
2020-10-23 12:37         ` Wang Yugui
2020-10-23 12:45           ` Qu Wenruo
2020-10-23 13:10           ` Steven Davies
2020-10-23 13:49             ` Wang Yugui
2020-10-23 18:03           ` Goffredo Baroncelli
2020-10-24  3:26             ` Paul Jones
2020-04-05 10:57 ` [RFC][PATCH V3] btrfs: ssd_metadata: storing metadata on SSD Graham Cobb
2020-04-05 18:47   ` Goffredo Baroncelli
2020-04-05 21:58     ` Adam Borowski
2020-04-06  2:24   ` Zygo Blaxell
2020-04-06 16:43     ` Goffredo Baroncelli
2020-04-06 17:21       ` Zygo Blaxell
2020-04-06 17:33         ` Goffredo Baroncelli
2020-04-06 17:40           ` Zygo Blaxell
2020-05-29 16:06 ` Hans van Kranenburg
2020-05-29 16:40   ` Goffredo Baroncelli
2020-05-29 18:37     ` Hans van Kranenburg
2020-05-30  4:59 ` Qu Wenruo
2020-05-30  6:48   ` Goffredo Baroncelli
2020-05-30  8:57     ` Paul Jones
2020-04-27 15:06 Torstein Eide
2020-04-28 19:31 ` Goffredo Baroncelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).