* [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
@ 2020-01-06  6:13 Qu Wenruo
  2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-06  6:13 UTC (permalink / raw)
  To: linux-btrfs

There are several bug reports of ENOSPC errors in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() is using an incorrect method to calculate the
allocatable metadata space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In the above case, we can allocate at most 1G of metadata chunks, due to
the unbalanced free space across the disks.
But the current can_overcommit() uses a factor based calculation, which
never considers how that free space is distributed.
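
To make the mismatch concrete, here is a hand-worked comparison for the
layout above (illustration only, not part of the patches):

  factor based estimate:  (1G + 10G) / 2 = 5.5G usable for RAID1 metadata
  real RAID1 limit:       min(1G, 10G)   = 1G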


To address this problem, here comes the per-profile available space
array, which gets updated every time a chunk gets allocated/removed or a
device gets grown or shrunk.

This provides a quick way for hot paths like can_overcommit() to grab
an estimate of how many bytes they can over-commit.

The per-profile available space calculation tries to follow the behavior
of the chunk allocator, thus it can handle uneven disks pretty well.
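
As a quick preview (the same code appears in patch 2 below), the lookup
in can_overcommit() becomes a simple array read under a spinlock:

  index = btrfs_bg_flags_to_raid_index(profile);
  spin_lock(&fs_info->fs_devices->per_profile_lock);
  avail = fs_info->fs_devices->per_profile_avail[index];
  spin_unlock(&fs_info->fs_devices->per_profile_lock);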

Although the per-profile array is not clever enough to handle the
estimation when both data and metadata chunks need to be considered, its
virtual chunk infrastructure is flexible enough to handle such a case.

So for statfs(), we also re-use the virtual chunk allocator to handle
the available data space, with metadata over-commit space considered.
This brings an unexpected advantage: now we can handle RAID5/6 pretty
well in statfs().

The execution time of this per-profile calculation is a little below
20 us for 5 iterations in my test VM.
Although all such calculations need to acquire the chunk mutex, the
impact should be small enough.
For the full statfs() execution time analysis, please see the commit
message of the last patch.

Changelog:
v1:
- Fix a bug where we forgot to update the per-profile array after
  allocating a chunk.
  To avoid an ABBA deadlock, this introduces a small window at the end of
  __btrfs_alloc_chunk(). It's not elegant, but should be good enough
  before we rework the chunk and device list mutexes.

- Make statfs() use the virtual chunk allocator to do better estimation
  Now statfs() can report not only a more accurate result, but can also
  handle RAID5/6 better.

v2:
- Fix a deadlock caused by acquiring device_list_mutex under
  __btrfs_alloc_chunk()
  There is no need to acquire device_list_mutex when holding
  chunk_mutex.
  Fix it and remove the lockdep assert.

v3:
- Use the proper chunk_mutex instead of device_list_mutex
  Since they protect two different things, and we only care about
  alloc_list, we should only use chunk_mutex.
  With the improved locking, it's easier to fold the
  calc_per_profile_available() calls into the first patch.

- Add performance benchmark for statfs() modification
  As Facebook seems to run into some problems with statfs() calls, add
  some basic ftrace results.

Qu Wenruo (3):
  btrfs: Introduce per-profile available space facility
  btrfs: space-info: Use per-profile available space in can_overcommit()
  btrfs: statfs: Use virtual chunk allocation to calculate available
    data space

 fs/btrfs/space-info.c |  15 ++-
 fs/btrfs/super.c      | 190 +++++++++++++-----------------------
 fs/btrfs/volumes.c    | 218 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  15 +++
 4 files changed, 289 insertions(+), 149 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-06  6:13 [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
@ 2020-01-06  6:13 ` Qu Wenruo
  2020-01-06 14:32   ` David Sterba
  2020-01-06 23:45     ` kbuild test robot
  2020-01-06  6:13 ` [PATCH v3 2/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-06  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[PROBLEM]
There are some locations in btrfs requiring an accurate estimation of
how many new bytes can be allocated from the unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divided by the profile factor.
  One obvious user is can_overcommit() (see the snippet after this list).

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is that the function is not generic enough and can't
  handle things like RAID5/6.
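
As a reference, the factor based estimation mentioned above is
essentially the code that patch 2 later removes from can_overcommit(),
roughly:

  avail = atomic64_read(&fs_info->free_chunk_space);
  factor = btrfs_bg_type_to_factor(profile);
  avail = div_u64(avail, factor);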

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If we use the factor, we would believe there is (1T + 10T) / 2 = 5.5T of
free space for metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and the chunk allocator is mostly in rounding
and in the [0, 1M) reserved space handling, which shouldn't cause any
practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for the above device layout we get the full
available space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for each profile, we do a chunk-allocator-level calculation.
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
  	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0);

As the stripe size is always limited by the selected device with the
least free space, and devices with more free space are preferred, this
emulates the chunk allocator's behavior.
For the above 1T + 10T devices, we will allocate a 1T virtual chunk in
the first iteration, then run out of devices in the next iteration.

Thus we only get 1T of free space for the RAID1 type, just like what the
chunk allocator would do.
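
For the three-device (1T + 1T + 10T) layout above, a hand-worked trace
of the same loop (ignoring the [0, 1M) reserve for readability) shows
where the RAID1: 2T figure comes from:

  iteration 1: free = {10T, 1T, 1T}, pick 2 devices (RAID1 devs_max),
               stripe = 1T (smallest selected) -> 1T allocated
  iteration 2: free = {9T, 1T, 0},   pick 2 devices,
               stripe = 1T           -> 1T allocated (2T in total)
  iteration 3: free = {8T, 0, 0},    only 1 device left (< devs_min)
               -> -ENOSPC, stop with 2T available for RAID1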

The patch will update such per-profile available space at the following
times:
- Mount time
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 218 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  11 +++
 2 files changed, 209 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..e38930390e89 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -349,6 +349,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2628,6 +2629,176 @@ static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and store its size in
+ * @allocated
+ * Return -ENOSPC if we have no more space to allocate virtual chunk
+ *
+ * *: virtual chunk is a space holder for per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affect per-profile available space calculation.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 to_alloc, u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail == 0)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocated space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+
+	/* We can't directly do round_up for (u64)-1 as that would result in 0 */
+	if (to_alloc != (u64)-1)
+		stripe_size = min_t(u64, stripe_size,
+				    round_up(div_u64(to_alloc, ndevs),
+					     fs_info->sectorsize));
+	if (stripe_size == 0)
+		return -ENOSPC;
+
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = alloc_virtual_chunk(fs_info, devices_info, type,
+					  (u64)-1, &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	spin_lock(&fs_devices->per_profile_lock);
+	fs_devices->per_profile_avail[type] = result;
+	spin_unlock(&fs_devices->per_profile_lock);
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded in updating the array.
+ * Return <0 if something went wrong (e.g. -ENOMEM).
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i);
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -2635,6 +2806,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2661,7 +2833,12 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
 			      &trans->transaction->dev_update_list);
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		return ret;
+	}
 
 	return btrfs_update_device(trans, device);
 }
@@ -2831,7 +3008,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 					device->bytes_used - dev_extent_len);
 			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
 			btrfs_clear_space_info_full(fs_info);
+			ret = calc_per_profile_avail(fs_info);
 			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret < 0) {
+				mutex_unlock(&fs_devices->device_list_mutex);
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
 		}
 
 		ret = btrfs_update_device(trans, device);
@@ -4526,6 +4709,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
 
+	ret = calc_per_profile_avail(fs_info);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		goto done;
+	}
 	/*
 	 * Once the device's size has been set to the new size, ensure all
 	 * in-memory chunks are synced to disk so that the loop below sees them
@@ -4690,25 +4879,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -4992,9 +5162,10 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	free_extent_map(em);
 	check_raid56_incompat_flag(info, type);
 	check_raid1c34_incompat_flag(info, type);
+	ret = calc_per_profile_avail(info);
 
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);
@@ -7629,6 +7800,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->chunk_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..5cddfe7cfee8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -138,6 +138,13 @@ struct btrfs_device {
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
 	struct extent_io_tree alloc_state;
+
+	/*
+	 * the "virtual" allocated space by virtual chunk allocator, which
+	 * is used to do accurate estimation on available space.
+	 * Doesn't affect real chunk allocator.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -257,6 +264,10 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 2/3] btrfs: space-info: Use per-profile available space in can_overcommit()
  2020-01-06  6:13 [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-01-06  6:13 ` Qu Wenruo
  2020-01-06  6:13 ` [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space Qu Wenruo
  2020-01-06 14:06 ` [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() David Sterba
  3 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-06  6:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc Lehmann, Josef Bacik

For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

can_overcommit() simply divides the unallocated space by the profile
factor to calculate the allocatable metadata chunk size.

So it believes we still have 5.5T for metadata chunks, while the truth
is we only have 1T available for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause a transaction
abort.

Since factor based calculation can't distinguish RAID1/RAID10 and DUP at
all, we need proper chunk-allocator level awareness to do such estimation.

Thankfully, we already have the per-profile available space calculated,
so just use that facility to avoid such false confidence.

Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..c26aba9e7124 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -164,10 +164,10 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 			  enum btrfs_reserve_flush_enum flush,
 			  bool system_chunk)
 {
+	enum btrfs_raid_types index;
 	u64 profile;
 	u64 avail;
 	u64 used;
-	int factor;
 
 	/* Don't overcommit when in mixed mode. */
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
@@ -179,16 +179,15 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = atomic64_read(&fs_info->free_chunk_space);
 
 	/*
-	 * If we have dup, raid1 or raid10 then only half of the free
-	 * space is actually usable.  For raid56, the space info used
-	 * doesn't include the parity drive, so we don't have to
-	 * change the math
+	 * Grab avail space from per-profile array which should be as accurate
+	 * as chunk allocator.
 	 */
-	factor = btrfs_bg_type_to_factor(profile);
-	avail = div_u64(avail, factor);
+	index = btrfs_bg_flags_to_raid_index(profile);
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	avail = fs_info->fs_devices->per_profile_avail[index];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space
  2020-01-06  6:13 [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
  2020-01-06  6:13 ` [PATCH v3 2/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2020-01-06  6:13 ` Qu Wenruo
  2020-01-06 14:44   ` Josef Bacik
  2020-01-06 14:06 ` [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() David Sterba
  3 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2020-01-06  6:13 UTC (permalink / raw)
  To: linux-btrfs

Although btrfs_calc_avail_data_space() tries to estimate how much data
chunk space can still be allocated, the estimation is far from
perfect:

- Metadata over-commit is not considered at all
- Chunk allocation doesn't take RAID5/6 into consideration

Although the current per-profile available space facility is not able to
handle metadata over-commit by itself, the virtual chunk infrastructure
can be re-used to address the above problems.

This patch will change btrfs_calc_avail_data_space() to do the following
things:
- Do metadata virtual chunk allocation first
  This is to address the over-commit behavior.
  If current metadata chunks have enough free space, we can completely
  skip this step.

- Allocate as many data virtual chunks as possible
  Just like what we did in the per-profile available space estimation.
  Here we only need to calculate one profile, since the statfs() call is
  a relatively cold path.

Now statfs() should be able to report a near-perfect estimation of the
available data space, and can handle RAID5/6 better.
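
The new flow can be summarized with the following rough pseudo code (see
the diff below for the real locking and error handling):

  meta_rsv = global_block_rsv.size;
  if (meta_rsv > free_meta)
          /* Cover the metadata over-commit deficit first */
          while (meta_allocated < meta_rsv - free_meta)
                  meta_allocated += alloc_virtual_chunk(meta_profile);

  /* Then allocate as many data virtual chunks as possible */
  while (alloc_virtual_chunk(data_profile, &allocated) == 0)
          data_allocated += allocated;

  *result = data_allocated;       /* reported as f_bavail */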

[BENCHMARK]
For the performance difference, here is the benchmark:
 Disk layout:
 - devid 1:	1G
 - devid 2:	2G
 - devid 3:	3G
 - devid 4:	4G
 - devid 5:	5G
 metadata:	RAID1
 data:		RAID5

 This layout should be the worst case for RAID5, as the allocation can
 go from a 5-disk RAID5 down to a 2-disk RAID5, with unusable space left
 over.

 Then use ftrace to trace the execution time of btrfs_statfs() after
 allocating 1G data chunk. Both have 12 samples.
				avg		std
 Patched:			17.59 us	7.04
 Without patch (v5.5-rc2):	14.98 us	6.16

When the fs is cold, there is a small performance penalty for this
particular case, as we need to do several more iterations to calculate
the correct RAID5 data space.
But it's still pretty good, and won't block chunk allocator for any
observable time.

When the fs is hot, the performance bottleneck is the chunk_mutex, where
the most common and longest holder would be __btrfs_chunk_alloc().
In that case, we may sleep much longer as __btrfs_chunk_alloc() can
trigger IO.

Since the new implementation is not observably slower than the old one,
and won't cause a meaningful delay for the chunk allocator, it should be
more or less OK.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c   | 190 ++++++++++++++++-----------------------------
 fs/btrfs/volumes.c |  12 +--
 fs/btrfs/volumes.h |   4 +
 3 files changed, 79 insertions(+), 127 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f452a94abdc3..4c0ba0f633ef 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1893,119 +1893,88 @@ static inline void btrfs_descending_sort_devices(
 /*
  * The helper to calc the free space on the devices that can be used to store
  * file data.
+ *
+ * The calculation will:
+ * - Allocate enough metadata virtual chunks to fulfill over-commit
+ *   To ensure we still have enough space to contain metadata chunks
+ * - Allocate as many data virtual chunks as possible
+ *   To get a true estimation on available data free space.
+ *
+ * Only with such a comprehensive check can we get a good result considering
+ * all the uneven disk layouts.
  */
 static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
-					      u64 *free_bytes)
+					      u64 free_meta, u64 *result)
 {
 	struct btrfs_device_info *devices_info;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *device;
-	u64 type;
-	u64 avail_space;
-	u64 min_stripe_size;
-	int num_stripes = 1;
-	int i = 0, nr_devices;
-	const struct btrfs_raid_attr *rattr;
+	u64 meta_index;
+	u64 data_index;
+	u64 meta_rsv;
+	u64 meta_allocated = 0;
+	u64 data_allocated = 0;
+	u64 allocated;
+	int nr_devices;
+	int ret = 0;
 
-	/*
-	 * We aren't under the device list lock, so this is racy-ish, but good
-	 * enough for our purposes.
-	 */
-	nr_devices = fs_info->fs_devices->open_devices;
-	if (!nr_devices) {
-		smp_mb();
-		nr_devices = fs_info->fs_devices->open_devices;
-		ASSERT(nr_devices);
-		if (!nr_devices) {
-			*free_bytes = 0;
-			return 0;
-		}
-	}
+	spin_lock(&fs_info->global_block_rsv.lock);
+	meta_rsv = fs_info->global_block_rsv.size;
+	spin_unlock(&fs_info->global_block_rsv.lock);
 
+	mutex_lock(&fs_info->chunk_mutex);
+	nr_devices = fs_devices->rw_devices;
 	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
 			       GFP_KERNEL);
-	if (!devices_info)
-		return -ENOMEM;
-
-	/* calc min stripe number for data space allocation */
-	type = btrfs_data_alloc_profile(fs_info);
-	rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)];
-
-	if (type & BTRFS_BLOCK_GROUP_RAID0)
-		num_stripes = nr_devices;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1)
-		num_stripes = 2;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C3)
-		num_stripes = 3;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C4)
-		num_stripes = 4;
-	else if (type & BTRFS_BLOCK_GROUP_RAID10)
-		num_stripes = 4;
-
-	/* Adjust for more than 1 stripe per device */
-	min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-						&device->dev_state) ||
-		    !device->bdev ||
-		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-			continue;
-
-		if (i >= nr_devices)
-			break;
-
-		avail_space = device->total_bytes - device->bytes_used;
-
-		/* align with stripe_len */
-		avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN);
-
-		/*
-		 * In order to avoid overwriting the superblock on the drive,
-		 * btrfs starts at an offset of at least 1MB when doing chunk
-		 * allocation.
-		 *
-		 * This ensures we have at least min_stripe_size free space
-		 * after excluding 1MB.
-		 */
-		if (avail_space <= SZ_1M + min_stripe_size)
-			continue;
-
-		avail_space -= SZ_1M;
-
-		devices_info[i].dev = device;
-		devices_info[i].max_avail = avail_space;
-
-		i++;
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
 	}
-	rcu_read_unlock();
-
-	nr_devices = i;
 
-	btrfs_descending_sort_devices(devices_info, nr_devices);
-
-	i = nr_devices - 1;
-	avail_space = 0;
-	while (nr_devices >= rattr->devs_min) {
-		num_stripes = min(num_stripes, nr_devices);
-
-		if (devices_info[i].max_avail >= min_stripe_size) {
-			int j;
-			u64 alloc_size;
-
-			avail_space += devices_info[i].max_avail * num_stripes;
-			alloc_size = devices_info[i].max_avail;
-			for (j = i + 1 - num_stripes; j <= i; j++)
-				devices_info[j].max_avail -= alloc_size;
-		}
-		i--;
-		nr_devices--;
+	data_index = btrfs_bg_flags_to_raid_index(
+			btrfs_data_alloc_profile(fs_info));
+	meta_index = btrfs_bg_flags_to_raid_index(
+			btrfs_metadata_alloc_profile(fs_info));
+
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+
+	/* Current metadata space is enough, no need for meta virtual chunks */
+	if (meta_rsv <= free_meta)
+		goto data_only;
+
+	/* Allocate space for exceeding meta space */
+	while (meta_allocated < meta_rsv - free_meta) {
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info,
+				meta_index,
+				meta_rsv - free_meta - meta_allocated,
+				&allocated);
+		if (ret < 0)
+			goto out;
+		meta_allocated += allocated;
+	}
+data_only:
+	/*
+	 * meta virtual chunks have been allocated, now allocate data virtual
+	 * chunks
+	 */
+	while (ret == 0) {
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info,
+				data_index, -1, &allocated);
+		if (ret < 0)
+			goto out;
+		data_allocated += allocated;
 	}
 
+out:
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	mutex_unlock(&fs_info->chunk_mutex);
 	kfree(devices_info);
-	*free_bytes = avail_space;
-	return 0;
+	*result = data_allocated;
+	if (ret == -ENOSPC)
+		ret = 0;
+	return ret;
 }
 
 /*
@@ -2034,7 +2003,6 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	unsigned factor = 1;
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
-	u64 thresh = 0;
 	int mixed = 0;
 
 	rcu_read_lock();
@@ -2082,31 +2050,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		buf->f_bfree = 0;
 	spin_unlock(&block_rsv->lock);
 
-	buf->f_bavail = div_u64(total_free_data, factor);
-	ret = btrfs_calc_avail_data_space(fs_info, &total_free_data);
+	ret = btrfs_calc_avail_data_space(fs_info, total_free_meta,
+					  &buf->f_bavail);
 	if (ret)
 		return ret;
-	buf->f_bavail += div_u64(total_free_data, factor);
 	buf->f_bavail = buf->f_bavail >> bits;
-
-	/*
-	 * We calculate the remaining metadata space minus global reserve. If
-	 * this is (supposedly) smaller than zero, there's no space. But this
-	 * does not hold in practice, the exhausted state happens where's still
-	 * some positive delta. So we apply some guesswork and compare the
-	 * delta to a 4M threshold.  (Practically observed delta was ~2M.)
-	 *
-	 * We probably cannot calculate the exact threshold value because this
-	 * depends on the internal reservations requested by various
-	 * operations, so some operations that consume a few metadata will
-	 * succeed even if the Avail is zero. But this is better than the other
-	 * way around.
-	 */
-	thresh = SZ_4M;
-
-	if (!mixed && total_free_meta - thresh < block_rsv->size)
-		buf->f_bavail = 0;
-
 	buf->f_type = BTRFS_SUPER_MAGIC;
 	buf->f_bsize = dentry->d_sb->s_blocksize;
 	buf->f_namelen = BTRFS_NAME_LEN;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e38930390e89..135948343932 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2658,10 +2658,10 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
  *    Such virtual chunks won't take on-disk space, thus called virtual, and
 *    only affect per-profile available space calculation.
  */
-static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
-			       struct btrfs_device_info *devices_info,
-			       enum btrfs_raid_types type,
-			       u64 to_alloc, u64 *allocated)
+int btrfs_alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			      struct btrfs_device_info *devices_info,
+			      enum btrfs_raid_types type,
+			      u64 to_alloc, u64 *allocated)
 {
 	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2763,8 +2763,8 @@ static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
 		device->virtual_allocated = 0;
 	while (ret == 0) {
-		ret = alloc_virtual_chunk(fs_info, devices_info, type,
-					  (u64)-1, &allocated);
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info, type,
+						(u64)-1, &allocated);
 		if (ret == 0)
 			result += allocated;
 	}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5cddfe7cfee8..0b4fe2603b0e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -460,6 +460,10 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size);
 struct btrfs_device *btrfs_find_device(struct btrfs_fs_devices *fs_devices,
 				       u64 devid, u8 *uuid, u8 *fsid, bool seed);
+int btrfs_alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			      struct btrfs_device_info *devices_info,
+			      enum btrfs_raid_types type,
+			      u64 to_alloc, u64 *allocated);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
  2020-01-06  6:13 [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
                   ` (2 preceding siblings ...)
  2020-01-06  6:13 ` [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space Qu Wenruo
@ 2020-01-06 14:06 ` David Sterba
  2020-01-07  2:04   ` Qu Wenruo
  3 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2020-01-06 14:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jan 06, 2020 at 02:13:40PM +0800, Qu Wenruo wrote:
> The execution time of this per-profile calculation is a little below
> 20 us per 5 iterations in my test VM.
> Although all such calculation will need to acquire chunk mutex, the
> impact should be small enough.

The problem is not only the execution time of statfs, but what happens
when the mutex is contended. This was the problem with the block group
mutex in the past that had to be converted to RCU.

If the chunk mutex gets locked because a new chunk is being allocated,
statfs will block until the allocation finishes. The time can vary a lot
depending on the workload, and a delay in seconds can trigger system
monitoring alerts.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-01-06 14:32   ` David Sterba
  2020-01-07  2:13     ` Qu Wenruo
  2020-01-06 23:45     ` kbuild test robot
  1 sibling, 1 reply; 15+ messages in thread
From: David Sterba @ 2020-01-06 14:32 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Mon, Jan 06, 2020 at 02:13:41PM +0800, Qu Wenruo wrote:
> +/*
> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
> + * @allocated_size
> + * Return -ENOSPC if we have no more space to allocate virtual chunk
> + *
> + * *: virtual chunk is a space holder for per-profile available space
> + *    calculator.
> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
> + *    only affects per-profile available space calulation.
> + */
> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_device_info *devices_info,
> +			       enum btrfs_raid_types type,
> +			       u64 to_alloc, u64 *allocated)
> +{
> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 stripe_size;
> +	int i;
> +	int ndevs = 0;
> +
> +	lockdep_assert_held(&fs_info->chunk_mutex);
> +
> +	/* Go through devices to collect their unallocated space */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> +		u64 avail;
> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> +					&device->dev_state) ||
> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
> +			continue;
> +
> +		if (device->total_bytes > device->bytes_used +
> +				device->virtual_allocated)
> +			avail = device->total_bytes - device->bytes_used -
> +				device->virtual_allocated;
> +		else
> +			avail = 0;
> +
> +		/* And exclude the [0, 1M) reserved space */
> +		if (avail > SZ_1M)
> +			avail -= SZ_1M;
> +		else
> +			avail = 0;
> +
> +		if (avail == 0)
> +			continue;
> +		/*
> +		 * Unlike chunk allocator, we don't care about stripe or hole
> +		 * size, so here we use @avail directly
> +		 */
> +		devices_info[ndevs].dev_offset = 0;
> +		devices_info[ndevs].total_avail = avail;
> +		devices_info[ndevs].max_avail = avail;
> +		devices_info[ndevs].dev = device;
> +		++ndevs;
> +	}
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +	ndevs -= ndevs % raid_attr->devs_increment;
> +	if (ndevs < raid_attr->devs_min)
> +		return -ENOSPC;
> +	if (raid_attr->devs_max)
> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
> +	else
> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
> +
> +	/*
> +	 * Now allocate a virtual chunk using the unallocate space of the
> +	 * device with the least unallocated space.
> +	 */
> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
> +				 fs_info->sectorsize);
> +
> +	/* We can't directly do round_up for (u64)-1 as that would result 0 */
> +	if (to_alloc != (u64)-1)
> +		stripe_size = min_t(u64, stripe_size,
> +				    round_up(div_u64(to_alloc, ndevs),
> +					     fs_info->sectorsize));
> +	if (stripe_size == 0)
> +		return -ENOSPC;
> +
> +	for (i = 0; i < ndevs; i++)
> +		devices_info[i].dev->virtual_allocated += stripe_size;
> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
> +		     raid_attr->ncopies;
> +	return 0;
> +}
> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_raid_types type)
> +{
> +	struct btrfs_device_info *devices_info = NULL;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 allocated;
> +	u64 result = 0;
> +	int ret = 0;
> +
> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> +	/* Not enough devices, quick exit, just update the result */
> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> +		goto out;
> +
> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> +			       GFP_NOFS);

Calling kcalloc is another potential slowdown, for the statfs
considerations.

> +	if (!devices_info) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	/* Clear virtual chunk used space for each device */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +	while (ret == 0) {
> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
> +					  (u64)-1, &allocated);
> +		if (ret == 0)
> +			result += allocated;
> +	}
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +out:
> +	kfree(devices_info);
> +	if (ret < 0 && ret != -ENOSPC)
> +		return ret;
> +	spin_lock(&fs_devices->per_profile_lock);
> +	fs_devices->per_profile_avail[type] = result;
> +	spin_unlock(&fs_devices->per_profile_lock);
> +	return 0;
> +}
> +
> +/*
> + * Calculate the per-profile available space array.
> + *
> + * Return 0 if we succeeded updating the array.
> + * Return <0 if something went wrong. (ENOMEM)
> + */
> +static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
> +{
> +	int i;
> +	int ret;
> +
> +	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +		ret = calc_one_profile_avail(fs_info, i);
> +		if (ret < 0)
> +			break;
> +	}
> +	return ret;
> +}
> +
>  int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  		      struct btrfs_device *device, u64 new_size)
>  {
> @@ -2635,6 +2806,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  	struct btrfs_super_block *super_copy = fs_info->super_copy;
>  	u64 old_total;
>  	u64 diff;
> +	int ret;
>  
>  	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
>  		return -EACCES;
> @@ -2661,7 +2833,12 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
>  	if (list_empty(&device->post_commit_list))
>  		list_add_tail(&device->post_commit_list,
>  			      &trans->transaction->dev_update_list);
> +	ret = calc_per_profile_avail(fs_info);
>  	mutex_unlock(&fs_info->chunk_mutex);
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, ret);
> +		return ret;
> +	}
>  
>  	return btrfs_update_device(trans, device);
>  }
> @@ -2831,7 +3008,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
>  					device->bytes_used - dev_extent_len);
>  			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
>  			btrfs_clear_space_info_full(fs_info);
> +			ret = calc_per_profile_avail(fs_info);

Adding new failure modes

>  			mutex_unlock(&fs_info->chunk_mutex);
> +			if (ret < 0) {
> +				mutex_unlock(&fs_devices->device_list_mutex);
> +				btrfs_abort_transaction(trans, ret);
> +				goto out;
> +			}
>  		}
>  
>  		ret = btrfs_update_device(trans, device);
> @@ -4526,6 +4709,12 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
>  		atomic64_sub(diff, &fs_info->free_chunk_space);
>  	}
>  
> +	ret = calc_per_profile_avail(fs_info);
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, ret);
> +		btrfs_end_transaction(trans);
> +		goto done;
> +	}
>  	/*
>  	 * Once the device's size has been set to the new size, ensure all
>  	 * in-memory chunks are synced to disk so that the loop below sees them
> @@ -4690,25 +4879,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -138,6 +138,13 @@ struct btrfs_device {
>  	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
>  
>  	struct extent_io_tree alloc_state;
> +
> +	/*
> +	 * the "virtual" allocated space by virtual chunk allocator, which
> +	 * is used to do accurate estimation on available space.
> +	 * Doesn't affect real chunk allocator.
> +	 */
> +	u64 virtual_allocated;

I find it a bit confusing to use 'virtual', though I get what you mean.
As this is per-device it accounts overall space, so the name should
reflect something like that. I may have a more concrete suggestion once
I read through the whole series.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space
  2020-01-06  6:13 ` [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space Qu Wenruo
@ 2020-01-06 14:44   ` Josef Bacik
  2020-01-07  1:03     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2020-01-06 14:44 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/6/20 1:13 AM, Qu Wenruo wrote:
> Although btrfs_calc_avail_data_space() is trying to do an estimation
> on how many data chunks it can allocate, the estimation is far from
> perfect:
> 
> - Metadata over-commit is not considered at all
> - Chunk allocation doesn't take RAID5/6 into consideration
> 
> Although current per-profile available space itself is not able to
> handle metadata over-commit itself, the virtual chunk infrastructure can
> be re-used to address above problems.
> 
> This patch will change btrfs_calc_avail_data_space() to do the following
> things:
> - Do metadata virtual chunk allocation first
>    This is to address the over-commit behavior.
>    If current metadata chunks have enough free space, we can completely
>    skip this step.
> 
> - Allocate data virtual chunks as many as possible
>    Just like what we did in per-profile available space estimation.
>    Here we only need to calculate one profile, since statfs() call is
>    a relative cold path.
> 
> Now statfs() should be able to report near perfect estimation on
> available data space, and can handle RAID5/6 better.
> 
> [BENCHMARK]
> For the performance difference, here is the benchmark:
>   Disk layout:
>   - devid 1:	1G
>   - devid 2:	2G
>   - devid 3:	3G
>   - devid 4:	4G
>   - devid 5:	5G
>   metadata:	RAID1
>   data:		RAID5
> 
>   This layout should be the worst case for RAID5, as it can
>   from 5 disks raid5 to 2 disks raid 5 with unusable space.
> 
>   Then use ftrace to trace the execution time of btrfs_statfs() after
>   allocating 1G data chunk. Both have 12 samples.
> 				avg		std
>   Patched:			17.59 us	7.04
>   WIthout patch (v5.5-rc2):	14.98 us	6.16
> 
> When the fs is cold, there is a small performance for this particular
> case, as we need to do several more iterations to calculate correct
> RAID5 data space.
> But it's still pretty good, and won't block chunk allocator for any
> observable time.
> 
> When the fs is hot, the performance bottleneck is the chunk_mutex, where
> the most common and longest holder would be __btrfs_chunk_alloc().
> In that case, we may sleep much longer as __btrfs_chunk_alloc() can
> trigger IO.
> 
> Since the new implementation is not observable slower than the old one,
> and won't cause meaningful delay for chunk allocator, it should be more
> or less OK.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Sorry I should have been more specific.  Taking the chunk_mutex here means all 
the people running statfs at the same time are going to be serialized.  Can we 
just do what overcommit does and take the already calculated free space instead 
of doing the calculation again?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-01-06 23:45     ` kbuild test robot
  2020-01-06 23:45     ` kbuild test robot
  1 sibling, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2020-01-06 23:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Josef Bacik

[-- Attachment #1: Type: text/plain, Size: 1307 bytes --]

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.5-rc5]
[also build test ERROR on next-20200106]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200107-025134
base:    c79f46a282390e0f5b306007bf7b11a46d529538
config: m68k-multi_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 16753 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
@ 2020-01-06 23:45     ` kbuild test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2020-01-06 23:45 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1340 bytes --]

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.5-rc5]
[also build test ERROR on next-20200106]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200107-025134
base:    c79f46a282390e0f5b306007bf7b11a46d529538
config: m68k-multi_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 7.5.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.5.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org Intel Corporation

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 16753 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculate available data space
  2020-01-06 14:44   ` Josef Bacik
@ 2020-01-07  1:03     ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-07  1:03 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3389 bytes --]



On 2020/1/6 10:44 PM, Josef Bacik wrote:
> On 1/6/20 1:13 AM, Qu Wenruo wrote:
>> Although btrfs_calc_avail_data_space() is trying to do an estimation
>> on how many data chunks it can allocate, the estimation is far from
>> perfect:
>>
>> - Metadata over-commit is not considered at all
>> - Chunk allocation doesn't take RAID5/6 into consideration
>>
>> Although current per-profile available space itself is not able to
>> handle metadata over-commit itself, the virtual chunk infrastructure can
>> be re-used to address above problems.
>>
>> This patch will change btrfs_calc_avail_data_space() to do the following
>> things:
>> - Do metadata virtual chunk allocation first
>>    This is to address the over-commit behavior.
>>    If current metadata chunks have enough free space, we can completely
>>    skip this step.
>>
>> - Allocate data virtual chunks as many as possible
>>    Just like what we did in per-profile available space estimation.
>>    Here we only need to calculate one profile, since statfs() call is
>>    a relative cold path.
>>
>> Now statfs() should be able to report near perfect estimation on
>> available data space, and can handle RAID5/6 better.
>>
>> [BENCHMARK]
>> For the performance difference, here is the benchmark:
>>   Disk layout:
>>   - devid 1:    1G
>>   - devid 2:    2G
>>   - devid 3:    3G
>>   - devid 4:    4G
>>   - devid 5:    5G
>>   metadata:    RAID1
>>   data:        RAID5
>>
>>   This layout should be the worst case for RAID5, as it can
>>   from 5 disks raid5 to 2 disks raid 5 with unusable space.
>>
>>   Then use ftrace to trace the execution time of btrfs_statfs() after
>>   allocating 1G data chunk. Both have 12 samples.
>>                 avg        std
>>   Patched:            17.59 us    7.04
>>   WIthout patch (v5.5-rc2):    14.98 us    6.16
>>
>> When the fs is cold, there is a small performance for this particular
>> case, as we need to do several more iterations to calculate correct
>> RAID5 data space.
>> But it's still pretty good, and won't block chunk allocator for any
>> observable time.
>>
>> When the fs is hot, the performance bottleneck is the chunk_mutex, where
>> the most common and longest holder would be __btrfs_chunk_alloc().
>> In that case, we may sleep much longer as __btrfs_chunk_alloc() can
>> trigger IO.
>>
>> Since the new implementation is not observable slower than the old one,
>> and won't cause meaningful delay for chunk allocator, it should be more
>> or less OK.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Sorry I should have been more specific.  Taking the chunk_mutex here
> means all the people running statfs at the same time are going to be
> serialized.  Can we just do what overcommit does and take the already
> calculated free space instead of doing the calculation again?  Thanks,

If we want to take metadata over-commit into consideration, then we can't.

E.g. we may have only a 256M metadata chunk, but over-commit to reserve
1G of metadata space.
We should take that 768M into the metadata profile allocation before we
do the data allocation.

And since metadata can use a completely different profile, we have to go
through this complex calculation again.

Thanks,
Qu

> 
> Josef


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit()
  2020-01-06 14:06 ` [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() David Sterba
@ 2020-01-07  2:04   ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-07  2:04 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1492 bytes --]



On 2020/1/6 10:06 PM, David Sterba wrote:
> On Mon, Jan 06, 2020 at 02:13:40PM +0800, Qu Wenruo wrote:
>> The execution time of this per-profile calculation is a little below
>> 20 us per 5 iterations in my test VM.
>> Although all such calculation will need to acquire chunk mutex, the
>> impact should be small enough.
> 
> The problem is not only the execution time of statfs, but what happens
> when them mutex is contended. This was the problem with the block group
> mutex in the past that had to be converted to RCU.
> 
> If the chunk mutex gets locked because a new chunk is allocated, until
> it finishes then statfs will block. The time can vary a lot depending on
> the workload and delay in seconds can trigger system monitors alert.
> 
Yes, that's exactly the same concern I have.

But I'm not sure how safe the old RCU implementation is when
device->virtual_allocated is modified during the RCU critical section.

That is to say, if a virtual chunk is being allocated during the
statfs(), then we get an incorrect result.
So I tend to keep it safe by protecting it with chunk_mutex, even if that
means chunk_mutex can block statfs().

Another solution is to completely forget the whole metadata part, just
grab the spinlock and the pre-calculated result, but that may report
more available space than what we really have.

If the delay is really a blocker, I can go the pre-calculated way,
making the result a little less accurate.

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-06 14:32   ` David Sterba
@ 2020-01-07  2:13     ` Qu Wenruo
  2020-01-08 15:04       ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2020-01-07  2:13 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik


[-- Attachment #1.1: Type: text/plain, Size: 2526 bytes --]



On 2020/1/6 10:32 PM, David Sterba wrote:
> On Mon, Jan 06, 2020 at 02:13:41PM +0800, Qu Wenruo wrote:
[...]
>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>> +				  enum btrfs_raid_types type)
>> +{
>> +	struct btrfs_device_info *devices_info = NULL;
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 allocated;
>> +	u64 result = 0;
>> +	int ret = 0;
>> +
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
> 
> Calling kcalloc is another potential slowdown, for the statfs
> considerations.

That's also what we did in statfs() before, so it shouldn't cause an
extra problem.
Furthermore, we didn't use calc_one_profile_avail() directly in statfs().

I get your point, but personally speaking the memory allocation
and extra in-memory device iteration should be pretty fast compared to
__btrfs_alloc_chunk().

Thus I don't think this memory allocation would cause extra trouble,
except for the error handling mentioned below.

[...]
>> +			ret = calc_per_profile_avail(fs_info);
> 
> Adding new failure modes

Another solution I have tried is to make calc_per_profile_avail() return
void, ignore the ENOMEM error, and just set the affected profile to 0
available space.

But that method just delays the ENOMEM, and would cause strange
per-profile values until the next successful update or mount cycle.

Any idea on which method is the lesser evil?

[...]
>>  
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -138,6 +138,13 @@ struct btrfs_device {
>>  	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
>>  
>>  	struct extent_io_tree alloc_state;
>> +
>> +	/*
>> +	 * the "virtual" allocated space by virtual chunk allocator, which
>> +	 * is used to do accurate estimation on available space.
>> +	 * Doesn't affect real chunk allocator.
>> +	 */
>> +	u64 virtual_allocated;
> 
> I find it a bit confusing to use 'virtual', though I get what you mean.
> As this is per-device it accounts overall space, so the name should
> reflect something like that. I may have a more concrete suggestion once
> I read through the whole series.
> 
Looking forward to that naming.

Thanks,
Qu



* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-07  2:13     ` Qu Wenruo
@ 2020-01-08 15:04       ` David Sterba
  2020-01-08 23:53         ` Qu WenRuo
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2020-01-08 15:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik

On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
> >> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> >> +			       GFP_NOFS);
> > 
> > Calling kcalloc is another potential slowdown, for the statfs
> > considerations.
> 
> That's also what we did in statfs() before, so it shouldn't cause any
> extra problem.
> Furthermore, we don't call calc_one_profile_avail() directly from
> statfs().
> 
> I get your point, but personally speaking, the memory allocation and
> extra in-memory device iteration should be pretty fast compared to
> __btrfs_alloc_chunk().
> 
> Thus I don't think this memory allocation would cause extra trouble,
> except for the error handling mentioned below.

Right, current statfs also does allocation via
btrfs_calc_avail_data_space, so it's the same as before.

> [...]
> >> +			ret = calc_per_profile_avail(fs_info);
> > 
> > Adding new failure modes
> 
> Another solution I have tried is to make calc_per_profile_avail() return
> void, ignore the ENOMEM error, and just set the affected profile to 0
> available space.
> 
> But that method just delays the ENOMEM, and would cause strange
> per-profile values until the next successful update or mount cycle.
> 
> Any idea on which method is the lesser evil?

Better to return the error than wrong values in this case. As the
numbers are sort of a cache and the mount cycle to get them fixed is not
very user friendly, we need some other way. As this is a global state, a
bit in fs_info::flags can be set and full recalculation attempted at
some point until it succeeds. This would leave the counters stale for
some time but I think still better than if they're suddenly 0.
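
A minimal sketch of what that stale-flag-and-retry pattern could look
like; the flag name and the retry hook point are assumptions made for
illustration, not part of the posted series:

/*
 * Sketch only: on ENOMEM keep the old (stale) per-profile numbers, mark
 * them stale and retry the recalculation later instead of zeroing them.
 * BTRFS_FS_PER_PROFILE_STALE is a hypothetical flag bit in fs_info::flags.
 */
static void update_per_profile_avail(struct btrfs_fs_info *fs_info)
{
	if (calc_per_profile_avail(fs_info) < 0) {
		set_bit(BTRFS_FS_PER_PROFILE_STALE, &fs_info->flags);
		return;
	}
	clear_bit(BTRFS_FS_PER_PROFILE_STALE, &fs_info->flags);
}

/* Called from some opportunistic context, e.g. around transaction commit. */
static void maybe_retry_per_profile_avail(struct btrfs_fs_info *fs_info)
{
	if (test_bit(BTRFS_FS_PER_PROFILE_STALE, &fs_info->flags))
		update_per_profile_avail(fs_info);
}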


* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-08 15:04       ` David Sterba
@ 2020-01-08 23:53         ` Qu WenRuo
  2020-01-09  6:26           ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Qu WenRuo @ 2020-01-08 23:53 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Josef Bacik



On 2020/1/8 11:04 PM, David Sterba wrote:
> On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
>>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>>> +			       GFP_NOFS);
>>>
>>> Calling kcalloc is another potential slowdown, for the statfs
>>> considerations.
>>
>> That's also what we did in statfs() before, so it shouldn't cause any
>> extra problem.
>> Furthermore, we don't call calc_one_profile_avail() directly from
>> statfs().
>>
>> I get your point, but personally speaking, the memory allocation and
>> extra in-memory device iteration should be pretty fast compared to
>> __btrfs_alloc_chunk().
>>
>> Thus I don't think this memory allocation would cause extra trouble,
>> except for the error handling mentioned below.
> 
> Right, current statfs also does allocation via
> btrfs_calc_avail_data_space, so it's the same as before.
> 
>> [...]
>>>> +			ret = calc_per_profile_avail(fs_info);
>>>
>>> Adding new failure modes
>>
>> Another solution I have tried is to make calc_per_profile_avail() return
>> void, ignore the ENOMEM error, and just set the affected profile to 0
>> available space.
>>
>> But that method just delays the ENOMEM, and would cause strange
>> per-profile values until the next successful update or mount cycle.
>>
>> Any idea on which method is the lesser evil?
> 
> Better to return the error than wrong values in this case. As the
> numbers are sort of a cache and the mount cycle to get them fixed is not
> very user friendly, we need some other way. As this is a global state, a
> bit in fs_info::flags can be set and full recalculation attempted at
> some point until it succeeds. This would leave the counters stale for
> some time but I think still better than if they're suddenly 0.
> 
If can_overcommit() is the only user of this facility, then either an
extra indicator or a sudden 0 is no problem.
In that case, we just don't over-commit and do an extra flush to free
metadata space.

But when statfs() is going to use this facility, neither a sudden 0 nor
an indicator is good.
Just imagine: seconds before, we still had several TiB of free space,
and all of a sudden there are only several GiB free (from the
already-allocated data chunks).

Users will definitely complain.

Thus I still prefer proper error handling, as when we're low on memory,
a lot of things can go wrong anyway.

Thanks,
Qu


* Re: [PATCH v3 1/3] btrfs: Introduce per-profile available space facility
  2020-01-08 23:53         ` Qu WenRuo
@ 2020-01-09  6:26           ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2020-01-09  6:26 UTC (permalink / raw)
  To: Qu WenRuo, dsterba, linux-btrfs, Josef Bacik


On 2020/1/9 7:53 AM, Qu WenRuo wrote:
> 
> 
> On 2020/1/8 11:04 PM, David Sterba wrote:
>> On Tue, Jan 07, 2020 at 10:13:43AM +0800, Qu Wenruo wrote:
>>>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>>>> +			       GFP_NOFS);
>>>>
>>>> Calling kcalloc is another potential slowdown, for the statfs
>>>> considerations.
>>>
>>> That's also what we did in statfs() before, so it shouldn't cause any
>>> extra problem.
>>> Furthermore, we don't call calc_one_profile_avail() directly from
>>> statfs().
>>>
>>> I get your point, but personally speaking, the memory allocation and
>>> extra in-memory device iteration should be pretty fast compared to
>>> __btrfs_alloc_chunk().
>>>
>>> Thus I don't think this memory allocation would cause extra trouble,
>>> except for the error handling mentioned below.
>>
>> Right, current statfs also does allocation via
>> btrfs_calc_avail_data_space, so it's the same as before.
>>
>>> [...]
>>>>> +			ret = calc_per_profile_avail(fs_info);
>>>>
>>>> Adding new failure modes
>>>
>>> Another solution I have tried is to make calc_per_profile_avail() return
>>> void, ignore the ENOMEM error, and just set the affected profile to 0
>>> available space.
>>>
>>> But that method just delays the ENOMEM, and would cause strange
>>> per-profile values until the next successful update or mount cycle.
>>>
>>> Any idea on which method is the lesser evil?
>>
>> Better to return the error than wrong values in this case. As the
>> numbers are sort of a cache and the mount cycle to get them fixed is not
>> very user friendly, we need some other way. As this is a global state, a
>> bit in fs_info::flags can be set and full recalculation attempted at
>> some point until it succeeds. This would leave the counters stale for
>> some time but I think still better than if they're suddenly 0.

BTW, not sure if this would make you feel less nervous.

When we return ENOMEM from this facility, the call sites are:
- Mount
  Nothing much would really happen, and no problem would be caused
  at all.

- Chunk allocation
  It's called from __btrfs_alloc_chunk(), which also does memory
  allocation by itself and could return ENOMEM, so there is no
  difference in error handling.

- Grow device
  This is a little complex.
  My new error handling aborts the transaction, as we haven't reset the
  device size back to its original value.
  But the existing btrfs_update_device() can return -ENOMEM too, even
  without resetting the device size.
  From this point of view, my new error handling is at least better at
  avoiding a device size mismatch.

- Shrink device
  This new error handling is overkill.
  At the done: label, we have a way to revert to the old device size,
  and we haven't done anything yet, so we should be able to recover from
  that situation.

Anyway, I will enhance the error handling part, to make the device grow
and shrink paths recover without aborting the transaction.
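
As a rough illustration of that recover-instead-of-abort idea in the
resize paths, here is a sketch; the helper name and the exact rollback
point are assumptions, not the actual code:

/*
 * Sketch only: remember the old device size and restore it if the
 * per-profile recalculation fails, instead of aborting the transaction.
 * Locking and the surrounding grow/shrink logic are omitted.
 */
static int resize_and_update_avail(struct btrfs_device *device, u64 new_size)
{
	struct btrfs_fs_info *fs_info = device->fs_info;
	u64 old_size = btrfs_device_get_total_bytes(device);
	int ret;

	btrfs_device_set_total_bytes(device, new_size);

	ret = calc_per_profile_avail(fs_info);
	if (ret < 0) {
		/* Nothing permanent has happened yet: roll the size back. */
		btrfs_device_set_total_bytes(device, old_size);
		return ret;
	}
	return 0;
}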

Thanks,
Qu

>>
> If can_overcommit() is the only user of this facility, then either an
> extra indicator or a sudden 0 is no problem.
> In that case, we just don't over-commit and do an extra flush to free
> metadata space.
> 
> But when statfs() is going to use this facility, neither a sudden 0 nor
> an indicator is good.
> Just imagine: seconds before, we still had several TiB of free space,
> and all of a sudden there are only several GiB free (from the
> already-allocated data chunks).
> 
> Users will definitely complain.
> 
> Thus I still prefer proper error handling, as when we're low on memory,
> a lot of things can go wrong anyway.
> 
> Thanks,
> Qu
> 




Thread overview: 15+ messages
2020-01-06  6:13 [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
2020-01-06  6:13 ` [PATCH v3 1/3] btrfs: Introduce per-profile available space facility Qu Wenruo
2020-01-06 14:32   ` David Sterba
2020-01-07  2:13     ` Qu Wenruo
2020-01-08 15:04       ` David Sterba
2020-01-08 23:53         ` Qu WenRuo
2020-01-09  6:26           ` Qu Wenruo
2020-01-06 23:45   ` kbuild test robot
2020-01-06 23:45     ` kbuild test robot
2020-01-06  6:13 ` [PATCH v3 2/3] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
2020-01-06  6:13 ` [PATCH v3 3/3] btrfs: statfs: Use virtual chunk allocation to calculation available data space Qu Wenruo
2020-01-06 14:44   ` Josef Bacik
2020-01-07  1:03     ` Qu Wenruo
2020-01-06 14:06 ` [PATCH v3 0/3] Introduce per-profile available space array to avoid over-confident can_overcommit() David Sterba
2020-01-07  2:04   ` Qu Wenruo
