Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit()
@ 2020-02-11  5:11 Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs

There are several bug reports of ENOSPC error in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() is using a wrong way to calculate allocatable metadata
space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In above case, we can at most allocate 1G chunk for metadata, due to
unbalanced disk free space.
But current can_overcommit() uses factor based calculation, which never
consider the disk free space balance.


To address this problem, here comes the per-profile available space
array, which gets updated every time a chunk get allocated/removed or a
device get grown or shrunk.

This provides a quick way for hotter place like can_overcommit() to grab
an estimation on how many bytes it can over-commit.

The per-profile available space calculation tries to keep the behavior
of chunk allocator, thus it can handle uneven disks pretty well.

And statfs() can also grab that pre-calculated value for instance usage.

Since this patch introduced a new failure pattern, some new error
handling are introduced:
- __btrfs_alloc_chunk()
  At the end of that function where calc_per_profile_avail() get called,
  if it failed due to -ENOMEM, we will revert device used space, and
  remove the allocated chunk.
  This is the only new error handling added by patch 5.

- btrfs_remove_chunk()
  There is no good way to revert the change. So here we abort
  transaction, just like what the old error handling does.

- btrfs_grow_device()
  This function has its problem by not reverting device used space from
  the very beginning.
  This patchset will enhance it in patch 4.

- btrfs_shrink_device()
  This function already has good error handling, reuse it.

- btrfs_verify_dev_extents()
  Mount time error will lead to mount failure, nothing to worry about.

Contents of the patchset:
Patch 1:	Core facility, with basic (not perfect) error handling
Patch 2:	Fix for over-confident can_overcommit()
Patch 3:	Make statfs() more accurate
Patch 4:	Better error handling for btrfs_grow_device()
Patch 5:	Better error handling for __btrfs_alloc_chunk()

If needed, patch 4 and patch 5 can be merged into patch 1.

Changelog:
v1:
- Fix a bug where we forgot to update per-profile array after allocating
  a chunk.
  To avoid ABBA deadlock, this introduce a small windows at the end
  __btrfs_alloc_chunk(), it's not elegant but should be good enough
  before we rework chunk and device list mutex.
  
- Make statfs() to use virtual chunk allocator to do better estimation
  Now statfs() can report not only more accurate result, but can also
  handle RAID5/6 better.

v2:
- Fix a deadlock caused by acquiring device_list_mutex under
  __btrfs_alloc_chunk()
  There is no need to acquire device_list_mutex when holding
  chunk_mutex.
  Fix it and remove the lockdep assert.

v3:
- Use proper chunk_mutex instead of device_list_mutex
  Since they are protecting two different things, and we only care about
  alloc_list, we should only use chunk_mutex.
  With improved lock situation, it's easier to fold
  calc_per_profile_available() calls into the first patch.

- Add performance benchmark for statfs() modification
  As Facebook seems to run into some problems with statfs() calls, add
  some basic ftrace results.

v4:
- Keep the lock-free design for statfs()
  As extra sleeping in statfs() may not be a good idea, keep the old
  lock-free design, and use factor based calculation as fall back.

v5:
- Enhance btrfs_update_device() error handling in btrfs_grow_device()
- Ensure all failure caused by calc_per_profile_available() is the same
  with existing error handling
- Fix a bug where chunk_mutex is not released in btrfs_shrink_device()

v6:
- Don't update the array if we hit any error.
  To avoid calling calc_per_profile_avail() in error handling path.

- Re-order the patchset
  Make the core facility the first patch.
  Error handling improvement in later patches.

- Add better error handling
  Improve one existing bad error handling, and provide a better solution
  for __btrfs_alloc_chunk()

v7:
- Remove btrfs_calc_avail_data_space() completely
  Now we only need to grab the pre-calculated number, no need for a
  function over 100 lines.

- Keep the 0-avail-if-metadata-exhausted behavior
  Now it's handled by space_info->full, which indicates if we can
  allocate new chunks in metadata space info.
  We have no need to bother that now.

Qu Wenruo (5):
  btrfs: Introduce per-profile available space facility
  btrfs: space-info: Use per-profile available space in can_overcommit()
  btrfs: statfs: Use pre-calculated per-profile available space
  btrfs: Reset device size when btrfs_update_device() failed in
    btrfs_grow_device()
  btrfs: volumes: Revert device used bytes when calc_per_profile_avail()
    failed

 fs/btrfs/space-info.c |  15 ++-
 fs/btrfs/super.c      | 133 ++---------------------
 fs/btrfs/volumes.c    | 245 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  11 ++
 4 files changed, 251 insertions(+), 153 deletions(-)

-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
@ 2020-02-11  5:11 ` Qu Wenruo
  2020-02-13 17:12   ` kbuild test robot
  2020-02-20 12:49   ` Nikolay Borisov
  2020-02-11  5:11 ` [PATCH v7 2/5] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[PROBLEM]
There are some locations in btrfs requiring accurate estimation on how
many new bytes can be allocated on unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic purposed enough, can't
  handle things like RAID5/6.

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce per-profile available space calculation,
which can give an estimation based on chunk-allocator-like behavior.

The difference between it and chunk allocator is mostly on rounding and
[0, 1M) reserved space handling, which shouldn't cause practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As we always select the device with least free space, the device with
the most space will be the first to be utilized, just like chunk
allocator.
For above 1T + 10T device, we will allocate a 1T virtual chunk
in the first iteration, then run out of device in next iteration.

Thus only get 1T free space for RAID1 type, just like what chunk
allocator would do.

The patch will update such per-profile available space at the following
timing:
- Mount time
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Those timing are all protected by chunk_mutex, and what we do are only
iterating in-memory only structures, no extra IO triggered, so the
performance impact should be pretty small.

For the extra error handling, the principle is to keep the old behavior.
That's to say, if old error handler would just return an error, then we
follow it, no matter if the caller reverts the device size.

For the proper error handling, they will be added in later patches.
As I don't want to make the core facility bloated by the error handling
code, especially some situation needs quite some new code to handle
errors.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  11 +++
 2 files changed, 207 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9cfc668f91f4..1886fa20530d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -352,6 +352,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2669,6 +2670,177 @@ static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and restore the size to
+ * @allocated_size
+ * Return -ENOSPC if we have no more space to allocate virtual chunk
+ *
+ * *: virtual chunk is a space holder for per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affects per-profile available space calulation.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail < fs_info->sectorsize)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocate space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+	if (stripe_size == 0)
+		return -ENOSPC;
+
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type,
+				  u64 *result_ret)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = alloc_virtual_chunk(fs_info, devices_info, type,
+					  &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	*result_ret = result;
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded updating the array.
+ * Return <0 if something went wrong (ENOMEM), and the array is not
+ * updated.
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	u64 results[BTRFS_NR_RAID_TYPES];
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i, &results[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+		fs_info->fs_devices->per_profile_avail[i] =
+			results[i];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -2676,6 +2848,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2702,7 +2875,10 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	if (list_empty(&device->post_commit_list))
 		list_add_tail(&device->post_commit_list,
 			      &trans->transaction->dev_update_list);
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret < 0)
+		return ret;
 
 	return btrfs_update_device(trans, device);
 }
@@ -2872,7 +3048,13 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 					device->bytes_used - dev_extent_len);
 			atomic64_add(dev_extent_len, &fs_info->free_chunk_space);
 			btrfs_clear_space_info_full(fs_info);
+			ret = calc_per_profile_avail(fs_info);
 			mutex_unlock(&fs_info->chunk_mutex);
+			if (ret < 0) {
+				mutex_unlock(&fs_devices->device_list_mutex);
+				btrfs_abort_transaction(trans, ret);
+				goto out;
+			}
 		}
 
 		ret = btrfs_update_device(trans, device);
@@ -4578,6 +4760,11 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 		atomic64_sub(diff, &fs_info->free_chunk_space);
 	}
 
+	ret = calc_per_profile_avail(fs_info);
+	if (ret < 0) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		goto done;
+	}
 	/*
 	 * Once the device's size has been set to the new size, ensure all
 	 * in-memory chunks are synced to disk so that the loop below sees them
@@ -4742,25 +4929,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -5038,6 +5206,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			list_add_tail(&dev->post_commit_list,
 				      &trans->transaction->dev_update_list);
 	}
+	ret = calc_per_profile_avail(info);
 
 	atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
 
@@ -5046,7 +5215,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	check_raid1c34_incompat_flag(info, type);
 
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);
@@ -7609,6 +7778,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->chunk_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 409f4816fb89..c1e4a32393b0 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -140,6 +140,13 @@ struct btrfs_device {
 	struct completion kobj_unregister;
 	/* For sysfs/FSID/devinfo/devid/ */
 	struct kobject devid_kobj;
+
+	/*
+	 * the "virtual" allocated space by virtual chunk allocator, which
+	 * is used to do accurate estimation on available space.
+	 * Doesn't affect real chunk allocator.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -259,6 +266,10 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *devices_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 2/5] btrfs: space-info: Use per-profile available space in can_overcommit()
  2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-02-11  5:11 ` Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 3/5] btrfs: statfs: Use pre-calculated per-profile available space Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc Lehmann, Josef Bacik

For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

As can_overcommit() simply uses unallocated space with factor to
calculate the allocatable metadata chunk size.

can_overcommit() believes we still have 5.5T for metadata chunks, while
the truth is, we only have 1T available for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause transaction
abort.

Since factor based calculation can't distinguish RAID1/RAID10 and DUP at
all, we need proper chunk-allocator level awareness to do such estimation.

Thankfully, we have per-profile available space already calculated, just
use that facility to avoid such false confidence.

Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 01297c5b2666..2fdc19a39e07 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -163,10 +163,10 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 			 struct btrfs_space_info *space_info, u64 bytes,
 			 enum btrfs_reserve_flush_enum flush)
 {
+	enum btrfs_raid_types index;
 	u64 profile;
 	u64 avail;
 	u64 used;
-	int factor;
 
 	/* Don't overcommit when in mixed mode. */
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
@@ -178,16 +178,15 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = atomic64_read(&fs_info->free_chunk_space);
 
 	/*
-	 * If we have dup, raid1 or raid10 then only half of the free
-	 * space is actually usable.  For raid56, the space info used
-	 * doesn't include the parity drive, so we don't have to
-	 * change the math
+	 * Grab avail space from per-profile array which should be as accurate
+	 * as chunk allocator.
 	 */
-	factor = btrfs_bg_type_to_factor(profile);
-	avail = div_u64(avail, factor);
+	index = btrfs_bg_flags_to_raid_index(profile);
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	avail = fs_info->fs_devices->per_profile_avail[index];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 3/5] btrfs: statfs: Use pre-calculated per-profile available space
  2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 2/5] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2020-02-11  5:11 ` Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed Qu Wenruo
  4 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs

Although btrfs_calc_avail_data_space() is trying to do an estimation
on how many data chunks it can allocate, the estimation is far from
perfect:

- Metadata over-commit is not considered at all
- Chunk allocation doesn't take RAID5/6 into consideration

This patch will change btrfs_calc_avail_data_space() to use
pre-calculated per-profile available space.

This provides the following benefits:
- Accurate unallocated data space estimation
  It's as accurate as chunk allocator, and can handle RAID5/6 and newly
  introduced RAID1C3/C4.

For the metadata over-commit part, we don't take that into consideration
yet. As metadata over-commit only happens when we have enough
unallocated space, and under most case we won't use that much metadata
space at all.

And we still have the existing 0-available space check, to prevent us
from reporting too optimistic f_bavail result.

Since we're keeping the old lock-free design, statfs should not experience
any extra delay.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c | 133 ++++-------------------------------------------
 1 file changed, 9 insertions(+), 124 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 0616a5434793..0e53f39e2d29 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1921,124 +1921,6 @@ static inline void btrfs_descending_sort_devices(
 	     btrfs_cmp_device_free_bytes, NULL);
 }
 
-/*
- * The helper to calc the free space on the devices that can be used to store
- * file data.
- */
-static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
-					      u64 *free_bytes)
-{
-	struct btrfs_device_info *devices_info;
-	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-	struct btrfs_device *device;
-	u64 type;
-	u64 avail_space;
-	u64 min_stripe_size;
-	int num_stripes = 1;
-	int i = 0, nr_devices;
-	const struct btrfs_raid_attr *rattr;
-
-	/*
-	 * We aren't under the device list lock, so this is racy-ish, but good
-	 * enough for our purposes.
-	 */
-	nr_devices = fs_info->fs_devices->open_devices;
-	if (!nr_devices) {
-		smp_mb();
-		nr_devices = fs_info->fs_devices->open_devices;
-		ASSERT(nr_devices);
-		if (!nr_devices) {
-			*free_bytes = 0;
-			return 0;
-		}
-	}
-
-	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
-			       GFP_KERNEL);
-	if (!devices_info)
-		return -ENOMEM;
-
-	/* calc min stripe number for data space allocation */
-	type = btrfs_data_alloc_profile(fs_info);
-	rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)];
-
-	if (type & BTRFS_BLOCK_GROUP_RAID0)
-		num_stripes = nr_devices;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1)
-		num_stripes = 2;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C3)
-		num_stripes = 3;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C4)
-		num_stripes = 4;
-	else if (type & BTRFS_BLOCK_GROUP_RAID10)
-		num_stripes = 4;
-
-	/* Adjust for more than 1 stripe per device */
-	min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-						&device->dev_state) ||
-		    !device->bdev ||
-		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-			continue;
-
-		if (i >= nr_devices)
-			break;
-
-		avail_space = device->total_bytes - device->bytes_used;
-
-		/* align with stripe_len */
-		avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN);
-
-		/*
-		 * In order to avoid overwriting the superblock on the drive,
-		 * btrfs starts at an offset of at least 1MB when doing chunk
-		 * allocation.
-		 *
-		 * This ensures we have at least min_stripe_size free space
-		 * after excluding 1MB.
-		 */
-		if (avail_space <= SZ_1M + min_stripe_size)
-			continue;
-
-		avail_space -= SZ_1M;
-
-		devices_info[i].dev = device;
-		devices_info[i].max_avail = avail_space;
-
-		i++;
-	}
-	rcu_read_unlock();
-
-	nr_devices = i;
-
-	btrfs_descending_sort_devices(devices_info, nr_devices);
-
-	i = nr_devices - 1;
-	avail_space = 0;
-	while (nr_devices >= rattr->devs_min) {
-		num_stripes = min(num_stripes, nr_devices);
-
-		if (devices_info[i].max_avail >= min_stripe_size) {
-			int j;
-			u64 alloc_size;
-
-			avail_space += devices_info[i].max_avail * num_stripes;
-			alloc_size = devices_info[i].max_avail;
-			for (j = i + 1 - num_stripes; j <= i; j++)
-				devices_info[j].max_avail -= alloc_size;
-		}
-		i--;
-		nr_devices--;
-	}
-
-	kfree(devices_info);
-	*free_bytes = avail_space;
-	return 0;
-}
-
 /*
  * Calculate numbers for 'df', pessimistic in case of mixed raid profiles.
  *
@@ -2055,6 +1937,7 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_super_block *disk_super = fs_info->super_copy;
 	struct btrfs_space_info *found;
 	u64 total_used = 0;
@@ -2064,7 +1947,7 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	__be32 *fsid = (__be32 *)fs_info->fs_devices->fsid;
 	unsigned factor = 1;
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
-	int ret;
+	enum btrfs_raid_types data_type;
 	u64 thresh = 0;
 	int mixed = 0;
 
@@ -2113,11 +1996,13 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		buf->f_bfree = 0;
 	spin_unlock(&block_rsv->lock);
 
-	buf->f_bavail = div_u64(total_free_data, factor);
-	ret = btrfs_calc_avail_data_space(fs_info, &total_free_data);
-	if (ret)
-		return ret;
-	buf->f_bavail += div_u64(total_free_data, factor);
+	data_type = btrfs_bg_flags_to_raid_index(
+			btrfs_data_alloc_profile(fs_info));
+
+	spin_lock(&fs_devices->per_profile_lock);
+	buf->f_bavail = fs_devices->per_profile_avail[data_type] +
+			total_free_data;
+	spin_unlock(&fs_devices->per_profile_lock);
 	buf->f_bavail = buf->f_bavail >> bits;
 
 	/*
-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device()
  2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
                   ` (2 preceding siblings ...)
  2020-02-11  5:11 ` [PATCH v7 3/5] btrfs: statfs: Use pre-calculated per-profile available space Qu Wenruo
@ 2020-02-11  5:11 ` Qu Wenruo
  2020-02-11  5:11 ` [PATCH v7 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed Qu Wenruo
  4 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs

When btrfs_update_device() failed due to ENOMEM, we didn't reset device
size back to its original size, causing the in-memory device size larger
than original.

If somehow the memory pressure get solved, and the fs committed, since
the device item is not updated, but super block total size get updated,
it would cause mount failure due to size mismatch.

So here revert device size and super size to its original size when
btrfs_update_device() failed, just like what we did in shrink_device().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1886fa20530d..4750255d02a8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2846,6 +2846,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
+	u64 old_device_size;
 	u64 old_total;
 	u64 diff;
 	int ret;
@@ -2856,6 +2857,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	new_size = round_down(new_size, fs_info->sectorsize);
 
 	mutex_lock(&fs_info->chunk_mutex);
+	old_device_size = device->total_bytes;
 	old_total = btrfs_super_total_bytes(super_copy);
 	diff = round_down(new_size - device->total_bytes, fs_info->sectorsize);
 
@@ -2878,9 +2880,25 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_info->chunk_mutex);
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	return btrfs_update_device(trans, device);
+	ret = btrfs_update_device(trans, device);
+out:
+	if (ret < 0) {
+		/*
+		 * Although we dropped chunk_mutex halfway for
+		 * btrfs_update_device(), we have FS_EXCL_OP bit to prevent
+		 * shrinking/growing race.
+		 * So we're safe to use the old size directly.
+		 */
+		mutex_lock(&fs_info->chunk_mutex);
+		btrfs_set_super_total_bytes(super_copy, old_total);
+		device->fs_devices->total_rw_bytes -= diff;
+		btrfs_device_set_total_bytes(device, old_device_size);
+		btrfs_device_set_disk_total_bytes(device, old_device_size);
+		mutex_unlock(&fs_info->chunk_mutex);
+	}
+	return ret;
 }
 
 static int btrfs_free_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed
  2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
                   ` (3 preceding siblings ...)
  2020-02-11  5:11 ` [PATCH v7 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() Qu Wenruo
@ 2020-02-11  5:11 ` Qu Wenruo
  4 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-11  5:11 UTC (permalink / raw)
  To: linux-btrfs

In __btrfs_alloc_chunk(), if calc_per_profile_avail() failed, it will
not update the per-profile available space array.
However since device used space is already updated, this would cause a
mismatch between them.

To fix this problem, this patch will revert device used bytes when
calc_per_profile_avail() failed, and remove the newly allocated chunk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4750255d02a8..27bd71d30e6f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5225,6 +5225,15 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 				      &trans->transaction->dev_update_list);
 	}
 	ret = calc_per_profile_avail(info);
+	if (ret < 0) {
+		for (i = 0; i < map->num_stripes; i++) {
+			struct btrfs_device *dev = map->stripes[i].dev;
+
+			btrfs_device_set_bytes_used(dev,
+					dev->bytes_used - stripe_size);
+		}
+		goto error_del_extent;
+	}
 
 	atomic64_sub(stripe_size * map->num_stripes, &info->free_chunk_space);
 
-- 
2.25.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-02-13 17:12   ` kbuild test robot
  2020-02-20 12:49   ` Nikolay Borisov
  1 sibling, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2020-02-13 17:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Josef Bacik

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.6-rc1]
[also build test ERROR on next-20200213]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200213-093844
base:    bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-4) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: fs/btrfs/volumes.o: in function `calc_per_profile_avail':
>> volumes.c:(.text+0x2603): undefined reference to `__udivdi3'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 71477 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
  2020-02-13 17:12   ` kbuild test robot
@ 2020-02-20 12:49   ` Nikolay Borisov
  2020-02-20 12:59     ` Qu Wenruo
  2020-02-21  5:16     ` Qu Wenruo
  1 sibling, 2 replies; 11+ messages in thread
From: Nikolay Borisov @ 2020-02-20 12:49 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Josef Bacik

<snip>

> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>  fs/btrfs/volumes.h |  11 +++
>  2 files changed, 207 insertions(+), 20 deletions(-)
> 

<snip>

> +
> +/*
> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
> + * @allocated_size
> + * Return -ENOSPC if we have no more space to allocate virtual chunk
> + *
> + * *: virtual chunk is a space holder for per-profile available space
> + *    calculator.
> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
> + *    only affects per-profile available space calulation.
> + */

Document that the last parameter is an output value which contains the
size of the allocated virtual chunk.

> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_device_info *devices_info,
> +			       enum btrfs_raid_types type,
> +			       u64 *allocated)
> +{
> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 stripe_size;
> +	int i;
> +	int ndevs = 0;
> +
> +	lockdep_assert_held(&fs_info->chunk_mutex);
> +
> +	/* Go through devices to collect their unallocated space */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
> +		u64 avail;
> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
> +					&device->dev_state) ||
> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
> +			continue;
> +
> +		if (device->total_bytes > device->bytes_used +
> +				device->virtual_allocated)
> +			avail = device->total_bytes - device->bytes_used -
> +				device->virtual_allocated;
> +		else
> +			avail = 0;
> +
> +		/* And exclude the [0, 1M) reserved space */
> +		if (avail > SZ_1M)
> +			avail -= SZ_1M;
> +		else
> +			avail = 0;
> +
> +		if (avail < fs_info->sectorsize)
> +			continue;
> +		/*
> +		 * Unlike chunk allocator, we don't care about stripe or hole
> +		 * size, so here we use @avail directly
> +		 */
> +		devices_info[ndevs].dev_offset = 0;
> +		devices_info[ndevs].total_avail = avail;
> +		devices_info[ndevs].max_avail = avail;
> +		devices_info[ndevs].dev = device;
> +		++ndevs;
> +	}
> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
> +	     btrfs_cmp_device_info, NULL);
> +	ndevs -= ndevs % raid_attr->devs_increment;

nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);
makes it more clear what's going on. Since you are working with at most
int it's not a problem for 32bits.


> +	if (ndevs < raid_attr->devs_min)
> +		return -ENOSPC;
> +	if (raid_attr->devs_max)
> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
> +	else
> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));

Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)

> +
> +	/*
> +	 * Now allocate a virtual chunk using the unallocate space of the

nit: missing d after 'unallocate'

> +	 * device with the least unallocated space.
> +	 */
> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
> +				 fs_info->sectorsize);
> +	if (stripe_size == 0)
> +		return -ENOSPC;

Isn't this check redundant - in the loop where you iterate the devices
you always ensure total_avail is at least a sector size, this guarantees
that stripe_size cannot be 0 at this point.

> +
> +	for (i = 0; i < ndevs; i++)
> +		devices_info[i].dev->virtual_allocated += stripe_size;
> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
> +		     raid_attr->ncopies;
> +	return 0;
> +}
> +
> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
> +				  enum btrfs_raid_types type,
> +				  u64 *result_ret)
> +{
> +	struct btrfs_device_info *devices_info = NULL;
> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	struct btrfs_device *device;
> +	u64 allocated;
> +	u64 result = 0;
> +	int ret = 0;
> +

lockdep assert that chunk mutex is held since you access alloc_list.

> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
> +
> +	/* Not enough devices, quick exit, just update the result */
> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
> +		goto out;

You can directly return.

> +
> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> +			       GFP_NOFS);
> +	if (!devices_info) {
> +		ret = -ENOMEM;
> +		goto out;

Same here.

> +	}
> +	/* Clear virtual chunk used space for each device */
> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +	while (ret == 0) {
> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
> +					  &allocated);
The 'allocated' variable is used only in this loop so declare it in the
loop. Ideally we want variables to have the shortest possible lifecycle.

> +		if (ret == 0)
> +			result += allocated;
> +	}

Why do you need to call this in a loop calling alloc_virtual_chunk once
simply calculate the available space for the given raid type.

> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
> +		device->virtual_allocated = 0;
> +out:
> +	kfree(devices_info);
> +	if (ret < 0 && ret != -ENOSPC)
> +		return ret;
> +	*result_ret = result;
> +	return 0;
> +}

<snip>

> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>  	struct kobject fsid_kobj;
>  	struct kobject *devices_kobj;
>  	struct completion kobj_unregister;
> +
> +	/* Records per-type available space */
> +	spinlock_t per_profile_lock;
> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];

Since every avail quantity is independent of any other, can you turn
this into an array of atomic64 values and get rid of the spinlock? My
head spins how many locks we have in btrfs.

>  };
>  
>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-20 12:49   ` Nikolay Borisov
@ 2020-02-20 12:59     ` Qu Wenruo
  2020-02-20 13:19       ` Nikolay Borisov
  2020-02-21  5:16     ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-02-20 12:59 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: Josef Bacik



On 2020/2/20 下午8:49, Nikolay Borisov wrote:
> <snip>
>
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>  fs/btrfs/volumes.h |  11 +++
>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>
>
> <snip>
>
>> +
>> +/*
>> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
>> + * @allocated_size
>> + * Return -ENOSPC if we have no more space to allocate virtual chunk
>> + *
>> + * *: virtual chunk is a space holder for per-profile available space
>> + *    calculator.
>> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
>> + *    only affects per-profile available space calulation.
>> + */
>
> Document that the last parameter is an output value which contains the
> size of the allocated virtual chunk.
>
>> +static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
>> +			       struct btrfs_device_info *devices_info,
>> +			       enum btrfs_raid_types type,
>> +			       u64 *allocated)
>> +{
>> +	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 stripe_size;
>> +	int i;
>> +	int ndevs = 0;
>> +
>> +	lockdep_assert_held(&fs_info->chunk_mutex);
>> +
>> +	/* Go through devices to collect their unallocated space */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
>> +		u64 avail;
>> +		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
>> +					&device->dev_state) ||
>> +		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
>> +			continue;
>> +
>> +		if (device->total_bytes > device->bytes_used +
>> +				device->virtual_allocated)
>> +			avail = device->total_bytes - device->bytes_used -
>> +				device->virtual_allocated;
>> +		else
>> +			avail = 0;
>> +
>> +		/* And exclude the [0, 1M) reserved space */
>> +		if (avail > SZ_1M)
>> +			avail -= SZ_1M;
>> +		else
>> +			avail = 0;
>> +
>> +		if (avail < fs_info->sectorsize)
>> +			continue;
>> +		/*
>> +		 * Unlike chunk allocator, we don't care about stripe or hole
>> +		 * size, so here we use @avail directly
>> +		 */
>> +		devices_info[ndevs].dev_offset = 0;
>> +		devices_info[ndevs].total_avail = avail;
>> +		devices_info[ndevs].max_avail = avail;
>> +		devices_info[ndevs].dev = device;
>> +		++ndevs;
>> +	}
>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>> +	     btrfs_cmp_device_info, NULL);
>> +	ndevs -= ndevs % raid_attr->devs_increment;
>
> nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);

IIRC round_down() can only be used when the alignment is power of 2.

Don't forget we have RAID1C3 now.

> makes it more clear what's going on. Since you are working with at most
> int it's not a problem for 32bits.
>
>
>> +	if (ndevs < raid_attr->devs_min)
>> +		return -ENOSPC;
>> +	if (raid_attr->devs_max)
>> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
>> +	else
>> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
>
> Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)

That looks the same to me...

>
>> +
>> +	/*
>> +	 * Now allocate a virtual chunk using the unallocate space of the
>
> nit: missing d after 'unallocate'
>
>> +	 * device with the least unallocated space.
>> +	 */
>> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
>> +				 fs_info->sectorsize);
>> +	if (stripe_size == 0)
>> +		return -ENOSPC;
>
> Isn't this check redundant - in the loop where you iterate the devices
> you always ensure total_avail is at least a sector size, this guarantees
> that stripe_size cannot be 0 at this point.
>
>> +
>> +	for (i = 0; i < ndevs; i++)
>> +		devices_info[i].dev->virtual_allocated += stripe_size;
>> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
>> +		     raid_attr->ncopies;
>> +	return 0;
>> +}
>> +
>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>> +				  enum btrfs_raid_types type,
>> +				  u64 *result_ret)
>> +{
>> +	struct btrfs_device_info *devices_info = NULL;
>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>> +	struct btrfs_device *device;
>> +	u64 allocated;
>> +	u64 result = 0;
>> +	int ret = 0;
>> +
>
> lockdep assert that chunk mutex is held since you access alloc_list.
>
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>
> You can directly return.
>
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
>> +	if (!devices_info) {
>> +		ret = -ENOMEM;
>> +		goto out;
>
> Same here.
>
>> +	}
>> +	/* Clear virtual chunk used space for each device */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +	while (ret == 0) {
>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>> +					  &allocated);
> The 'allocated' variable is used only in this loop so declare it in the
> loop. Ideally we want variables to have the shortest possible lifecycle.
>
>> +		if (ret == 0)
>> +			result += allocated;
>> +	}
>
> Why do you need to call this in a loop calling alloc_virtual_chunk once
> simply calculate the available space for the given raid type.

For this case, we must go several loops:
Dev1: 10G
Dev2: 5G
Dev3: 5G
Type: RAID1.

The first loop will use 5G from dev1, 5G from dev2.
Then the 2nd loop will use the remaining 5G from dev1, 5G from dev3.

And that's the core problem per-profile available space system want to
address.

>
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +out:
>> +	kfree(devices_info);
>> +	if (ret < 0 && ret != -ENOSPC)
>> +		return ret;
>> +	*result_ret = result;
>> +	return 0;
>> +}
>
> <snip>
>
>> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>>  	struct kobject fsid_kobj;
>>  	struct kobject *devices_kobj;
>>  	struct completion kobj_unregister;
>> +
>> +	/* Records per-type available space */
>> +	spinlock_t per_profile_lock;
>> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
>
> Since every avail quantity is independent of any other, can you turn
> this into an array of atomic64 values and get rid of the spinlock? My
> head spins how many locks we have in btrfs.

OK, that won't hurt, since they are updated separately, there is indeed
no need for a spinlock.

Thanks,
Qu
>
>>  };
>>
>>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-20 12:59     ` Qu Wenruo
@ 2020-02-20 13:19       ` Nikolay Borisov
  0 siblings, 0 replies; 11+ messages in thread
From: Nikolay Borisov @ 2020-02-20 13:19 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs; +Cc: Josef Bacik



On 20.02.20 г. 14:59 ч., Qu Wenruo wrote:
> 
> 
> On 2020/2/20 下午8:49, Nikolay Borisov wrote:
>> <snip>
>>
>>>
>>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>>  fs/btrfs/volumes.h |  11 +++
>>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>>
>>

<snip>

>>> +	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> +	     btrfs_cmp_device_info, NULL);
>>> +	ndevs -= ndevs % raid_attr->devs_increment;
>>
>> nit: ndevs = rounddown(ndevs, raid_attr->devs_increment);
> 
> IIRC round_down() can only be used when the alignment is power of 2.
> 
> Don't forget we have RAID1C3 now.

Sure, but I'm referring to rounddown and not round_down :)

> 
>> makes it more clear what's going on. Since you are working with at most
>> int it's not a problem for 32bits.
>>
>>
>>> +	if (ndevs < raid_attr->devs_min)
>>> +		return -ENOSPC;
>>> +	if (raid_attr->devs_max)
>>> +		ndevs = min(ndevs, (int)raid_attr->devs_max);
>>> +	else
>>> +		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
>>
>> Instead of casting simply use min_t(int, ndevs, BTRFS_MAX_DEVS...)
> 
> That looks the same to me...

I guess it's a matter of preference so I will defer to David to decide.

> 
>>
>>> +
>>> +	/*
>>> +	 * Now allocate a virtual chunk using the unallocate space of the
>>
>> nit: missing d after 'unallocate'
>>
>>> +	 * device with the least unallocated space.
>>> +	 */
>>> +	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
>>> +				 fs_info->sectorsize);
>>> +	if (stripe_size == 0)
>>> +		return -ENOSPC;
>>
>> Isn't this check redundant - in the loop where you iterate the devices
>> you always ensure total_avail is at least a sector size, this guarantees
>> that stripe_size cannot be 0 at this point.
>>
>>> +
>>> +	for (i = 0; i < ndevs; i++)
>>> +		devices_info[i].dev->virtual_allocated += stripe_size;
>>> +	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
>>> +		     raid_attr->ncopies;
>>> +	return 0;
>>> +}
>>> +
>>> +static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
>>> +				  enum btrfs_raid_types type,
>>> +				  u64 *result_ret)
>>> +{
>>> +	struct btrfs_device_info *devices_info = NULL;
>>> +	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> +	struct btrfs_device *device;
>>> +	u64 allocated;
>>> +	u64 result = 0;
>>> +	int ret = 0;
>>> +
>>
>> lockdep assert that chunk mutex is held since you access alloc_list.
>>
>>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>>> +
>>> +	/* Not enough devices, quick exit, just update the result */
>>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>>> +		goto out;
>>
>> You can directly return.
>>
>>> +
>>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>>> +			       GFP_NOFS);
>>> +	if (!devices_info) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>
>> Same here.
>>
>>> +	}
>>> +	/* Clear virtual chunk used space for each device */
>>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>>> +		device->virtual_allocated = 0;
>>> +	while (ret == 0) {
>>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>>> +					  &allocated);
>> The 'allocated' variable is used only in this loop so declare it in the
>> loop. Ideally we want variables to have the shortest possible lifecycle.
>>
>>> +		if (ret == 0)
>>> +			result += allocated;
>>> +	}
>>

After the explanation on IRC I think it's better if this is written as:

while(!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
  result += allocated

That way it's easier (at least for me) to spot you are "draining"
something. IN this case you try to allocate/simulate multiple
allocations to calculate the real available space.

> 
> For this case, we must go several loops:
> Dev1: 10G
> Dev2: 5G
> Dev3: 5G
> Type: RAID1.
> 
> The first loop will use 5G from dev1, 5G from dev2.
> Then the 2nd loop will use the remaining 5G from dev1, 5G from dev3.
> 
> And that's the core problem per-profile available space system want to
> address.
> 

<snip>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/5] btrfs: Introduce per-profile available space facility
  2020-02-20 12:49   ` Nikolay Borisov
  2020-02-20 12:59     ` Qu Wenruo
@ 2020-02-21  5:16     ` Qu Wenruo
  1 sibling, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2020-02-21  5:16 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs; +Cc: Josef Bacik



On 2020/2/20 下午8:49, Nikolay Borisov wrote:
> <snip>
>
>>
>> Suggested-by: Josef Bacik <josef@toxicpanda.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/volumes.c | 216 ++++++++++++++++++++++++++++++++++++++++-----
>>  fs/btrfs/volumes.h |  11 +++
>>  2 files changed, 207 insertions(+), 20 deletions(-)
>>
>
> <snip>
>
>> +
>> +/*
>> + * Return 0 if we allocated any virtual(*) chunk, and restore the size to
>> + * @allocated_size
>> + * Return -ENOSPC if we have no more space to allocate virtual chunk
>> + *
>> + * *: virtual chunk is a space holder for per-profile available space
>> + *    calculator.
>> + *    Such virtual chunks won't take on-disk space, thus called virtual, and
>> + *    only affects per-profile available space calulation.
>> + */
>
> Document that the last parameter is an output value which contains the
> size of the allocated virtual chunk.

Isn't that mentioned in the 3rd line of the comment? (Although the
parameter name doesn't match after some updates).

...
> lockdep assert that chunk mutex is held since you access alloc_list.
>
>> +	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
>> +
>> +	/* Not enough devices, quick exit, just update the result */
>> +	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
>> +		goto out;
>
> You can directly return.

Then you won't update the @result_ret.
And don't forget that, @result_ret is from an uninitialized local array.
Thus we must go out tag.

>
>> +
>> +	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
>> +			       GFP_NOFS);
>> +	if (!devices_info) {
>> +		ret = -ENOMEM;
>> +		goto out;
>
> Same here.

Same as above.

>
>> +	}
>> +	/* Clear virtual chunk used space for each device */
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +	while (ret == 0) {
>> +		ret = alloc_virtual_chunk(fs_info, devices_info, type,
>> +					  &allocated);
> The 'allocated' variable is used only in this loop so declare it in the
> loop. Ideally we want variables to have the shortest possible lifecycle.

Then your new while() loop idea will still need @allocated declared at
at the beginning of the function.

Thanks,
Qu

>
>> +		if (ret == 0)
>> +			result += allocated;
>> +	}
>
> Why do you need to call this in a loop calling alloc_virtual_chunk once
> simply calculate the available space for the given raid type.
>
>> +	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
>> +		device->virtual_allocated = 0;
>> +out:
>> +	kfree(devices_info);
>> +	if (ret < 0 && ret != -ENOSPC)
>> +		return ret;
>> +	*result_ret = result;
>> +	return 0;
>> +}
>
> <snip>
>
>> @@ -259,6 +266,10 @@ struct btrfs_fs_devices {
>>  	struct kobject fsid_kobj;
>>  	struct kobject *devices_kobj;
>>  	struct completion kobj_unregister;
>> +
>> +	/* Records per-type available space */
>> +	spinlock_t per_profile_lock;
>> +	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
>
> Since every avail quantity is independent of any other, can you turn
> this into an array of atomic64 values and get rid of the spinlock? My
> head spins how many locks we have in btrfs.
>
>>  };
>>
>>  #define BTRFS_BIO_INLINE_CSUM_SIZE	64
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-11  5:11 [PATCH v7 0/5] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
2020-02-11  5:11 ` [PATCH v7 1/5] btrfs: Introduce per-profile available space facility Qu Wenruo
2020-02-13 17:12   ` kbuild test robot
2020-02-20 12:49   ` Nikolay Borisov
2020-02-20 12:59     ` Qu Wenruo
2020-02-20 13:19       ` Nikolay Borisov
2020-02-21  5:16     ` Qu Wenruo
2020-02-11  5:11 ` [PATCH v7 2/5] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
2020-02-11  5:11 ` [PATCH v7 3/5] btrfs: statfs: Use pre-calculated per-profile available space Qu Wenruo
2020-02-11  5:11 ` [PATCH v7 4/5] btrfs: Reset device size when btrfs_update_device() failed in btrfs_grow_device() Qu Wenruo
2020-02-11  5:11 ` [PATCH v7 5/5] btrfs: volumes: Revert device used bytes when calc_per_profile_avail() failed Qu Wenruo

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git