linux-btrfs.vger.kernel.org archive mirror
* [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
@ 2019-12-25 13:39 Qu Wenruo
  2019-12-25 13:39 ` [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Qu Wenruo @ 2019-12-25 13:39 UTC (permalink / raw)
  To: linux-btrfs

There are several bug reports of ENOSPC error in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() uses the wrong method to calculate allocatable metadata
space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In the above case, we can allocate at most a 1G chunk for metadata, due to
the unbalanced free space across disks.
But the current can_overcommit() uses a factor-based calculation, which
never considers how free space is distributed across disks.


To address this problem, here comes the per-profile available space
array, which gets updated every time a chunk gets allocated/removed or a
device gets grown or shrunk.

This provides a quick way for hot paths like can_overcommit() to grab
an estimate of how many bytes can be over-committed.

The per-profile available space calculation tries to follow the behavior
of the chunk allocator, thus it can handle uneven disks pretty well.

The RFC tag is here because I'm not yet confident enough about the
implementation.
I'm not sure whether this is the proper way to go, or just an
over-engineered mess.

Qu Wenruo (3):
  btrfs: Introduce per-profile available space facility
  btrfs: Update per-profile available space when device size/used space
    get updated
  btrfs: space-info: Use per-profile available space in can_overcommit()

 fs/btrfs/space-info.c |  15 ++--
 fs/btrfs/volumes.c    | 205 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  10 +++
 3 files changed, 203 insertions(+), 27 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility
  2019-12-25 13:39 [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
@ 2019-12-25 13:39 ` Qu Wenruo
  2019-12-30 16:14   ` Josef Bacik
  2019-12-25 13:39 ` [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2019-12-25 13:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[PROBLEM]
There are some locations in btrfs that require an accurate estimate of
how many new bytes can be allocated from unallocated space.

We have two types of estimation:
- Factor-based calculation
  Just use all unallocated space divided by the profile factor.
  One obvious user is can_overcommit().

- Chunk-allocator-like calculation
  This emulates the chunk allocator behavior to get a proper
  estimate.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic enough and can't
  handle things like RAID5/6.

The current factor-based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

Using the factor, we would report (1T + 10T) / 2 = 5.5T of free space for
metadata.
But in fact we can only get 1T of free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce the skeleton of the per-profile available space
calculation, which more or less matches what the chunk allocator would do.

The difference between it and the chunk allocator is mostly in rounding
and the [0, 1M) reserved space handling, which shouldn't have any
practical impact.

The newly introduced facility calculates the available space for each
profile, using a chunk-allocator-like calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for each profile, we do a chunk-allocator-level calculation.
The code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * Like the chunk allocator, besides used space we also take
  	 * virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike the chunk allocator, we don't need to bother with
  	 * hole/stripe size, so we use the smallest device to make sure
  	 * we can allocate as many stripes as the regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;

  	/*
  	 * Allocate a virtual chunk; an allocated virtual chunk will
  	 * increase virtual used space, allowing the next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0);

As we always select the device with the least free space (just like the
chunk allocator), for the above 1T + 10T devices, we will allocate a 1T
virtual chunk in the first iteration, then run out of devices in the next
iteration.

Thus we only get 1T of free space for the RAID1 type, just like what the
chunk allocator would do.

This patch is just the skeleton; we only do the per-profile chunk
calculation at mount time.

Later commits will update the per-profile available space at the other
relevant timings.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 190 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  10 +++
 2 files changed, 181 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..8d4be1d48f2f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -349,6 +349,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2628,6 +2629,169 @@ static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated a virtual(*) chunk, and return its size via
+ * @allocated
+ * Return -ENOSPC if we have no more space to allocate a virtual chunk
+ *
+ * *: a virtual chunk is a space holder for the per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affect the per-profile available space calculation.
+ */
+static int allocate_virtual_chunk(struct btrfs_fs_info *fs_info,
+				  const struct btrfs_raid_attr *raid_attr,
+				  struct btrfs_device_info *devices_info,
+				  u64 *allocated)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail == 0)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocated space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = devices_info[ndevs - 1].total_avail;
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type)
+{
+	const struct btrfs_raid_attr *raid_attr;
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+	lockdep_assert_held(&fs_devices->device_list_mutex);
+
+	raid_attr = &btrfs_raid_array[type];
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < raid_attr->devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = allocate_virtual_chunk(fs_info, raid_attr, devices_info,
+					     &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	spin_lock(&fs_devices->per_profile_lock);
+	fs_devices->per_profile_avail[type] = result;
+	spin_unlock(&fs_devices->per_profile_lock);
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded in updating the array.
+ * Return <0 if something went wrong (-ENOMEM).
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	int i;
+	int ret;
+
+	lockdep_assert_held(&fs_devices->device_list_mutex);
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i);
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -4690,25 +4854,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -7629,6 +7774,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..81cdab0d864a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -138,6 +138,12 @@ struct btrfs_device {
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
 	struct extent_io_tree alloc_state;
+
+	/*
+	 * The "virtual" allocated space tracked by the per-profile available
+	 * space calculator. Doesn't affect the chunk allocator at all.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -257,6 +263,10 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.24.1



* [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated
  2019-12-25 13:39 [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2019-12-25 13:39 ` [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2019-12-25 13:39 ` Qu Wenruo
  2019-12-30 16:17   ` Josef Bacik
  2019-12-25 13:39 ` [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
  2019-12-27 18:32 ` [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Josef Bacik
  3 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2019-12-25 13:39 UTC (permalink / raw)
  To: linux-btrfs

There are 4 locations where device size or used space gets updated:
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Now also update per-profile available space at those timings.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8d4be1d48f2f..92d45a406d98 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2799,6 +2799,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2827,6 +2828,11 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	if (ret < 0)
+		return ret;
 	return btrfs_update_device(trans, device);
 }
 
@@ -3005,7 +3011,12 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 			goto out;
 		}
 	}
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_devices->device_list_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
 
 	ret = btrfs_free_chunk(trans, chunk_offset);
 	if (ret) {
@@ -4821,6 +4832,10 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 			device->fs_devices->total_rw_bytes += diff;
 		atomic64_add(diff, &fs_info->free_chunk_space);
 		mutex_unlock(&fs_info->chunk_mutex);
+	} else {
+		mutex_lock(&fs_info->fs_devices->device_list_mutex);
+		ret = calc_per_profile_avail(fs_info);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 	}
 	return ret;
 }
-- 
2.24.1



* [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit()
  2019-12-25 13:39 [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2019-12-25 13:39 ` [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
  2019-12-25 13:39 ` [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
@ 2019-12-25 13:39 ` Qu Wenruo
  2019-12-30 16:17   ` Josef Bacik
  2019-12-27 18:32 ` [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Josef Bacik
  3 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2019-12-25 13:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc Lehmann

For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

This is because can_overcommit() simply divides unallocated space by the
profile factor to calculate the allocatable metadata chunk size.

can_overcommit() believes we still have 5.5T for metadata chunks, while
in truth we only have 1T available for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause transaction
abort.

Since the factor-based calculation can't distinguish RAID1/RAID10 from DUP
at all, we need proper chunk-allocator-level awareness to do such an
estimate.

Thankfully, we already have the per-profile available space calculated, so
just use that facility to avoid such false confidence.

Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/space-info.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..c26aba9e7124 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -164,10 +164,10 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 			  enum btrfs_reserve_flush_enum flush,
 			  bool system_chunk)
 {
+	enum btrfs_raid_types index;
 	u64 profile;
 	u64 avail;
 	u64 used;
-	int factor;
 
 	/* Don't overcommit when in mixed mode. */
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
@@ -179,16 +179,15 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = atomic64_read(&fs_info->free_chunk_space);
 
 	/*
-	 * If we have dup, raid1 or raid10 then only half of the free
-	 * space is actually usable.  For raid56, the space info used
-	 * doesn't include the parity drive, so we don't have to
-	 * change the math
+	 * Grab avail space from per-profile array which should be as accurate
+	 * as chunk allocator.
 	 */
-	factor = btrfs_bg_type_to_factor(profile);
-	avail = div_u64(avail, factor);
+	index = btrfs_bg_flags_to_raid_index(profile);
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	avail = fs_info->fs_devices->per_profile_avail[index];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
-- 
2.24.1



* Re: [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
  2019-12-25 13:39 [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
                   ` (2 preceding siblings ...)
  2019-12-25 13:39 ` [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2019-12-27 18:32 ` Josef Bacik
  2019-12-28  1:09   ` Qu Wenruo
  3 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2019-12-27 18:32 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 12/25/19 8:39 AM, Qu Wenruo wrote:
> There are several bug reports of ENOSPC error in
> btrfs_run_delalloc_range().
> 
> With some extra info from one reporter, it turns out that
> can_overcommit() is using a wrong way to calculate allocatable metadata
> space.
> 
> The most typical case would look like:
>    devid 1 unallocated:	1G
>    devid 2 unallocated:  10G
>    metadata profile:	RAID1
> 
> In above case, we can at most allocate 1G chunk for metadata, due to
> unbalanced disk free space.
> But current can_overcommit() uses factor based calculation, which never
> consider the disk free space balance.
> 
> 
> To address this problem, here comes the per-profile available space
> array, which gets updated every time a chunk get allocated/removed or a
> device get grown or shrunk.
> 
> This provides a quick way for hotter place like can_overcommit() to grab
> an estimation on how many bytes it can over-commit.
> 
> The per-profile available space calculation tries to keep the behavior
> of chunk allocator, thus it can handle uneven disks pretty well.
> 
> The RFC tag is here because I'm not yet confident enough about the
> implementation.
> I'm not sure this is the proper to go, or just a over-engineered mess.
> 

In general I like the approach, however re-doing the whole calculation once we 
add or remove a chunk seems like overkill.  Can we get away with just doing the 
big expensive calculation on mount, and then adjust available up and down as we 
add and remove chunks?  We know the chunk allocator is not going to allow us to 
allocate more chunks than our smallest device, so we can be sure we're never 
going to go negative, and removing chunks is easy again because of the same reason.

Other than that the approach seems sound to me.  Thanks,

Josef


* Re: [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
  2019-12-27 18:32 ` [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Josef Bacik
@ 2019-12-28  1:09   ` Qu Wenruo
  2019-12-30 14:29     ` Josef Bacik
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2019-12-28  1:09 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs





On 2019/12/28 上午2:32, Josef Bacik wrote:
> On 12/25/19 8:39 AM, Qu Wenruo wrote:
>> There are several bug reports of ENOSPC error in
>> btrfs_run_delalloc_range().
>>
>> With some extra info from one reporter, it turns out that
>> can_overcommit() is using a wrong way to calculate allocatable metadata
>> space.
>>
>> The most typical case would look like:
>>    devid 1 unallocated:    1G
>>    devid 2 unallocated:  10G
>>    metadata profile:    RAID1
>>
>> In above case, we can at most allocate 1G chunk for metadata, due to
>> unbalanced disk free space.
>> But current can_overcommit() uses factor based calculation, which never
>> consider the disk free space balance.
>>
>>
>> To address this problem, here comes the per-profile available space
>> array, which gets updated every time a chunk get allocated/removed or a
>> device get grown or shrunk.
>>
>> This provides a quick way for hotter place like can_overcommit() to grab
>> an estimation on how many bytes it can over-commit.
>>
>> The per-profile available space calculation tries to keep the behavior
>> of chunk allocator, thus it can handle uneven disks pretty well.
>>
>> The RFC tag is here because I'm not yet confident enough about the
>> implementation.
>> I'm not sure this is the proper to go, or just a over-engineered mess.
>>
> 
> In general I like the approach, however re-doing the whole calculation
> once we add or remove a chunk seems like overkill.  Can we get away with
> just doing the big expensive calculation on mount, and then adjust
> available up and down as we add and remove chunks?

That looks good at first glance, but in practice it may not work as
expected, mostly due to a small difference in sorting.

The current chunk allocator sorts by max hole size as the primary
sort index, so it may behave differently in some corner cases.
Without a proper re-calculation, the difference may drift larger and larger.

Thus I prefer to be a little safer and do the extra calculation each time
a chunk gets allocated/removed.
And that calculation is not that heavy: it just iterates the device lists
a few times, and all accesses are in-memory without sleeping, so it should
be pretty fast.

Thanks,
Qu

>  We know the chunk
> allocator is not going to allow us to allocate more chunks than our
> smallest device, so we can be sure we're never going to go negative, and
> removing chunks is easy again because of the same reason.
> 
> Other than that the approach seems sound to me.  Thanks,
> 
> Josef




* Re: [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
  2019-12-28  1:09   ` Qu Wenruo
@ 2019-12-30 14:29     ` Josef Bacik
  0 siblings, 0 replies; 11+ messages in thread
From: Josef Bacik @ 2019-12-30 14:29 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 12/27/19 8:09 PM, Qu Wenruo wrote:
> 
> 
> On 2019/12/28 上午2:32, Josef Bacik wrote:
>> On 12/25/19 8:39 AM, Qu Wenruo wrote:
>>> There are several bug reports of ENOSPC error in
>>> btrfs_run_delalloc_range().
>>>
>>> With some extra info from one reporter, it turns out that
>>> can_overcommit() is using a wrong way to calculate allocatable metadata
>>> space.
>>>
>>> The most typical case would look like:
>>>     devid 1 unallocated:    1G
>>>     devid 2 unallocated:  10G
>>>     metadata profile:    RAID1
>>>
>>> In above case, we can at most allocate 1G chunk for metadata, due to
>>> unbalanced disk free space.
>>> But current can_overcommit() uses factor based calculation, which never
>>> consider the disk free space balance.
>>>
>>>
>>> To address this problem, here comes the per-profile available space
>>> array, which gets updated every time a chunk get allocated/removed or a
>>> device get grown or shrunk.
>>>
>>> This provides a quick way for hotter place like can_overcommit() to grab
>>> an estimation on how many bytes it can over-commit.
>>>
>>> The per-profile available space calculation tries to keep the behavior
>>> of chunk allocator, thus it can handle uneven disks pretty well.
>>>
>>> The RFC tag is here because I'm not yet confident enough about the
>>> implementation.
>>> I'm not sure this is the proper to go, or just a over-engineered mess.
>>>
>>
>> In general I like the approach, however re-doing the whole calculation
>> once we add or remove a chunk seems like overkill.  Can we get away with
>> just doing the big expensive calculation on mount, and then adjust
>> available up and down as we add and remove chunks?
> 
> That looks good on a quick glance, but in practice it may not work as
> expected, mostly due to the small difference in sort.
> 
> Current chunk allocator works by sorting the max hole size as primary
> sort index, thus it may cause difference on some corner case.
> Without proper re-calculation, the difference may drift larger and larger.
> 
> Thus I prefer to be a little safer to do extra calculation each time
> chunk get allocated/remove.
> And that calculation is not that heavy, it just iterate the device lists
> several times, and all access are in-memory without sleep, it should be
> pretty fast.
> 

Ahh I hadn't thought of different hole sizes.  You're right that it shouldn't 
matter in practice, it's not like chunk allocation is a fast path.  This seems 
reasonable to me then, I'll go through the patches properly.  Thanks,

Josef


* Re: [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility
  2019-12-25 13:39 ` [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2019-12-30 16:14   ` Josef Bacik
  0 siblings, 0 replies; 11+ messages in thread
From: Josef Bacik @ 2019-12-30 16:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 12/25/19 8:39 AM, Qu Wenruo wrote:
> [PROBLEM]
> There are some locations in btrfs requiring accurate estimation on how
> many new bytes can be allocated on unallocated space.
> 
> We have two types of estimation:
> - Factor based calculation
>    Just use all unallocated space, divide by the profile factor
>    One obvious user is can_overcommit().
> 
> - Chunk allocator like calculation
>    This will emulate the chunk allocator behavior, to get a proper
>    estimation.
>    The only user is btrfs_calc_avail_data_space(), utilized by
>    btrfs_statfs().
>    The problem is, that function is not generic purposed enough, can't
>    handle things like RAID5/6.
> 
> Current factor based calculation can't handle the following case:
>    devid 1 unallocated:	1T
>    devid 2 unallocated:	10T
>    metadata type:	RAID1
> 
> If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
> metadata.
> But in fact we can only get 1T free space, as we're limited by the
> smallest device for RAID1.
> 
> [SOLUTION]
> This patch will introduce the skeleton of per-profile available space
> calculation, which can more-or-less get to the point of chunk allocator.
> 
> The difference between it and chunk allocator is mostly on rounding and
> [0, 1M) reserved space handling, which shouldn't cause practical impact.
> 
> The newly introduced per-profile available space calculation will
> calculate available space for each type, using chunk-allocator like
> calculation.
> 
> With that facility, for above device layout we get the full available
> space array:
>    RAID10:	0  (not enough devices)
>    RAID1:	1T
>    RAID1C3:	0  (not enough devices)
>    RAID1C4:	0  (not enough devices)
>    DUP:		5.5T
>    RAID0:	2T
>    SINGLE:	11T
>    RAID5:	1T
>    RAID6:	0  (not enough devices)
> 
> Or for a more complex example:
>    devid 1 unallocated:	1T
>    devid 2 unallocated:  1T
>    devid 3 unallocated:	10T
> 
> We will get an array of:
>    RAID10:	0  (not enough devices)
>    RAID1:	2T
>    RAID1C3:	1T
>    RAID1C4:	0  (not enough devices)
>    DUP:		6T
>    RAID0:	3T
>    SINGLE:	12T
>    RAID5:	2T
>    RAID6:	0  (not enough devices)
> 
> And for the each profile , we go chunk allocator level calculation:
> The code code looks like:
> 
>    clear_virtual_used_space_of_all_rw_devices();
>    do {
>    	/*
>    	 * The same as chunk allocator, despite used space,
>    	 * we also take virtual used space into consideration.
>    	 */
>    	sort_device_with_virtual_free_space();
> 
>    	/*
>    	 * Unlike chunk allocator, we don't need to bother hole/stripe
>    	 * size, so we use the smallest device to make sure we can
>    	 * allocated as many stripes as regular chunk allocator
>    	 */
>    	stripe_size = device_with_smallest_free->avail_space;
> 
>    	/*
>    	 * Allocate a virtual chunk, allocated virtual chunk will
>    	 * increase virtual used space, allow next iteration to
>    	 * properly emulate chunk allocator behavior.
>    	 */
>    	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
>    	if (ret == 0)
>    		avail += allocated_size;
>    } while (ret == 0)
> 
> As we always select the device with least free space (just like chunk
> allocator), for above 1T + 10T device, we will allocate a 1T virtual chunk
> in the first iteration, then run out of device in next iteration.
> 
> Thus we only get 1T of free space for the RAID1 type, just like what the
> chunk allocator would do.
> 
> This patch is just the skeleton, we only do the per-profile chunk
> calculation at mount time.
> 
> Later commits will update per-profile available space at other proper
> timings.
> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 11+ messages in thread
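[Editor's note: the virtual chunk allocation loop described in the patch above can be sketched in Python. This emulates the algorithm as explained in the commit message, not the kernel implementation; the function and parameter names are illustrative, and only mirroring/striping profiles (RAID1/RAID1C3/RAID0/SINGLE) are modeled — DUP and parity profiles need extra per-device stripe and parity handling omitted here.]

```python
def virtual_avail(unallocated, min_stripes, max_stripes, ncopies):
    """Estimate allocatable bytes for a profile by emulating the
    chunk allocator on virtual free space.

    unallocated: per-device unallocated bytes
    min_stripes/max_stripes: device-count limits for the profile
                             (max_stripes=0 means "use every device")
    ncopies: copies of each byte the profile stores
    """
    free = list(unallocated)   # virtual free space per device
    avail = 0
    while True:
        # Like the chunk allocator, order devices by virtual free
        # space (real unallocated minus virtually allocated chunks).
        order = sorted(range(len(free)), key=lambda i: free[i],
                       reverse=True)
        order = [i for i in order if free[i] > 0]
        if len(order) < min_stripes:
            return avail       # ran out of devices for this profile
        nstripes = (len(order) if max_stripes == 0
                    else min(len(order), max_stripes))
        chosen = order[:nstripes]
        # The smallest chosen device bounds the stripe size.
        stripe = free[chosen[-1]]
        # Allocate a virtual chunk: deduct the stripe from each
        # chosen device so the next iteration sees the updated state.
        for i in chosen:
            free[i] -= stripe
        avail += stripe * nstripes // ncopies

# The thread's 1T + 10T layout (units in TiB):
print(virtual_avail([1, 10], 2, 2, 2))   # RAID1  -> 1
print(virtual_avail([1, 10], 2, 0, 1))   # RAID0  -> 2
print(virtual_avail([1, 10], 1, 1, 1))   # SINGLE -> 11
```

Running the same function over the 1T + 1T + 10T layout reproduces the second array in the message (RAID1: 2T, RAID1C3: 1T, RAID0: 3T), since each iteration zeroes out the smallest selected device, exactly as the real allocator would exhaust it.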

* Re: [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated
  2019-12-25 13:39 ` [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
@ 2019-12-30 16:17   ` Josef Bacik
  2019-12-31  0:25     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2019-12-30 16:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 12/25/19 8:39 AM, Qu Wenruo wrote:
> There are 4 locations where device size or used space get updated:
> - Chunk allocation
> - Chunk removal
> - Device grow
> - Device shrink
> 
> Now also update per-profile available space at those timings.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Looks like you are missing chunk allocation?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit()
  2019-12-25 13:39 ` [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2019-12-30 16:17   ` Josef Bacik
  0 siblings, 0 replies; 11+ messages in thread
From: Josef Bacik @ 2019-12-30 16:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Marc Lehmann

On 12/25/19 8:39 AM, Qu Wenruo wrote:
> For the following disk layout, can_overcommit() can cause false
> confidence in available space:
> 
>    devid 1 unallocated:	1T
>    devid 2 unallocated:	10T
>    metadata type:	RAID1
> 
> Since can_overcommit() simply applies the profile factor to the total
> unallocated space, it believes we still have 5.5T for metadata chunks,
> while the truth is we only have 1T available for metadata chunks.
> This can lead to ENOSPC at run_delalloc_range() and cause a transaction
> abort.
> 
> Since factor-based calculation can't distinguish RAID1/RAID10 from DUP at
> all, we need proper chunk-allocator-level awareness to do such estimation.
> 
> Thankfully, we have per-profile available space already calculated, just
> use that facility to avoid such false confidence.
> 
> Reported-by: Marc Lehmann <schmorp@schmorp.de>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 11+ messages in thread
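[Editor's note: the gap the patch above describes can be checked with a back-of-the-envelope computation. This is a sketch of the two estimates for the thread's two-device example, not kernel code; the variable names are illustrative, and the `min()` shortcut for the true value holds only for a two-device RAID1 layout.]

```python
unalloc = {1: 1.0, 2: 10.0}   # devid -> unallocated space, in TiB

# Factor-based estimate (old can_overcommit() behavior): RAID1 halves
# the total unallocated space, ignoring how it is spread across devices.
factor_estimate = sum(unalloc.values()) / 2    # 5.5 TiB

# A RAID1 chunk needs a stripe on two different devices, so allocation
# is bounded by the device with less unallocated space.
true_avail = min(unalloc.values())             # 1.0 TiB

print(factor_estimate, true_avail)   # 5.5 1.0
```

The 4.5 TiB of phantom space is exactly the over-confidence that lets delalloc reservations succeed and then hit ENOSPC in run_delalloc_range() once real chunk allocation fails.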

* Re: [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated
  2019-12-30 16:17   ` Josef Bacik
@ 2019-12-31  0:25     ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2019-12-31  0:25 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs


On 2019/12/31 12:17 AM, Josef Bacik wrote:
> On 12/25/19 8:39 AM, Qu Wenruo wrote:
>> There are 4 locations where device size or used space get updated:
>> - Chunk allocation
>> - Chunk removal
>> - Device grow
>> - Device shrink
>>
>> Now also update per-profile available space at those timings.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Looks like you are missing chunk allocation?  Thanks,

Oh, you're completely right!

I'll fix it and also enhance statfs() to use this facility in next version.

Thanks,
Qu

> 
> Josef



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-12-31  0:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-25 13:39 [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
2019-12-25 13:39 ` [PATCH RFC 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
2019-12-30 16:14   ` Josef Bacik
2019-12-25 13:39 ` [PATCH RFC 2/3] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
2019-12-30 16:17   ` Josef Bacik
2019-12-31  0:25     ` Qu Wenruo
2019-12-25 13:39 ` [PATCH RFC 3/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
2019-12-30 16:17   ` Josef Bacik
2019-12-27 18:32 ` [PATCH RFC 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Josef Bacik
2019-12-28  1:09   ` Qu Wenruo
2019-12-30 14:29     ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).