[PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit()
@ 2020-01-02 11:27 Qu Wenruo
  2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-01-02 11:27 UTC (permalink / raw)
  To: linux-btrfs

There are several bug reports of ENOSPC error in
btrfs_run_delalloc_range().

With some extra info from one reporter, it turns out that
can_overcommit() is using a wrong way to calculate allocatable metadata
space.

The most typical case would look like:
  devid 1 unallocated:	1G
  devid 2 unallocated:  10G
  metadata profile:	RAID1

In above case, we can at most allocate 1G chunk for metadata, due to
unbalanced disk free space.
But current can_overcommit() uses factor based calculation, which never
consider the disk free space balance.

To address this problem, here comes the per-profile available space
array, which gets updated every time a chunk get allocated/removed or a
device get grown or shrunk.

This provides a quick way for hotter place like can_overcommit() to grab
an estimation on how many bytes it can over-commit.

The per-profile available space calculation tries to keep the behavior
of chunk allocator, thus it can handle uneven disks pretty well.

Although per-profile is not clever enough to handle estimation when both
data and metadata chunks need to be considered, its virtual chunk
infrastructure is flex enough to handle such case.

So for statfs(), we also re-use virtual chunk allocator to handle
available data space, with metadata over-commit space considered.
This brings an unexpected advantage, now we can handle RAID5/6 pretty OK
in statfs().

Changelog:
v1:
- Fix a bug where we forgot to update per-profile array after allocating
  a chunk.
  To avoid ABBA deadlock, this introduce a small windows at the end
  __btrfs_alloc_chunk(), it's not elegant but should be good enough
  before we rework chunk and device list mutex.

- Make statfs() to use virtual chunk allocator to do better estimation
  Now statfs() can report not only more accurate result, but can also
  handle RAID5/6 better.

v2:
- Fix a deadlock caused by acquiring device_list_mutex under
  __btrfs_alloc_chunk()
  There is no need to acquire device_list_mutex when holding
  chunk_mutex.
  Fix it and remove the lockdep assert.

Qu Wenruo (4):
  btrfs: Introduce per-profile available space facility
  btrfs: Update per-profile available space when device size/used space
    get updated
  btrfs: space-info: Use per-profile available space in can_overcommit()
  btrfs: statfs: Use virtual chunk allocation to calculation available
    data space

 fs/btrfs/space-info.c |  15 ++-
 fs/btrfs/super.c      | 190 +++++++++++++-----------------------
 fs/btrfs/volumes.c    | 218 ++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/volumes.h    |  14 +++
 4 files changed, 288 insertions(+), 149 deletions(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] btrfs: Introduce per-profile available space facility
  2020-01-02 11:27 [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
@ 2020-01-02 11:27 ` Qu Wenruo
  2020-01-02 16:13   ` Josef Bacik
  2020-01-04  6:40   ` kbuild test robot
  2020-01-02 11:27 ` [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-01-02 11:27 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[PROBLEM]
There are some locations in btrfs requiring accurate estimation on how
many new bytes can be allocated on unallocated space.

We have two types of estimation:
- Factor based calculation
  Just use all unallocated space, divide by the profile factor
  One obvious user is can_overcommit().

- Chunk allocator like calculation
  This will emulate the chunk allocator behavior, to get a proper
  estimation.
  The only user is btrfs_calc_avail_data_space(), utilized by
  btrfs_statfs().
  The problem is, that function is not generic purposed enough, can't
  handle things like RAID5/6.

Current factor based calculation can't handle the following case:
  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
metadata.
But in fact we can only get 1T free space, as we're limited by the
smallest device for RAID1.

[SOLUTION]
This patch will introduce the skeleton of per-profile available space
calculation, which can more-or-less get to the point of chunk allocator.

The difference between it and chunk allocator is mostly on rounding and
[0, 1M) reserved space handling, which shouldn't cause practical impact.

The newly introduced per-profile available space calculation will
calculate available space for each type, using chunk-allocator like
calculation.

With that facility, for above device layout we get the full available
space array:
  RAID10:	0  (not enough devices)
  RAID1:	1T
  RAID1C3:	0  (not enough devices)
  RAID1C4:	0  (not enough devices)
  DUP:		5.5T
  RAID0:	2T
  SINGLE:	11T
  RAID5:	1T
  RAID6:	0  (not enough devices)

Or for a more complex example:
  devid 1 unallocated:	1T
  devid 2 unallocated:  1T
  devid 3 unallocated:	10T

We will get an array of:
  RAID10:	0  (not enough devices)
  RAID1:	2T
  RAID1C3:	1T
  RAID1C4:	0  (not enough devices)
  DUP:		6T
  RAID0:	3T
  SINGLE:	12T
  RAID5:	2T
  RAID6:	0  (not enough devices)

And for the each profile , we go chunk allocator level calculation:
The code code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as chunk allocator, despite used space,
  	 * we also take virtual used space into consideration.
  	 */
  	sort_device_with_virtual_free_space();

  	/*
  	 * Unlike chunk allocator, we don't need to bother hole/stripe
  	 * size, so we use the smallest device to make sure we can
  	 * allocated as many stripes as regular chunk allocator
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk, allocated virtual chunk will
  	 * increase virtual used space, allow next iteration to
  	 * properly emulate chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0)

As we always select the device with least free space (just like chunk
allocator), for above 1T + 10T device, we will allocate a 1T virtual chunk
in the first iteration, then run out of device in next iteration.

Thus only get 1T free space for RAID1 type, just like what chunk
allocator would do.

This patch is just the skeleton, we only do the per-profile chunk
calculation at mount time.

Later commits will update per-profile available space at other proper
timings.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 195 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h |  10 +++
 2 files changed, 186 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..fa948f7eebc2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -349,6 +349,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid,
 	INIT_LIST_HEAD(&fs_devs->devices);
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 	if (fsid)
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
 
@@ -2628,6 +2629,174 @@ static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * sort the devices in descending order by max_avail, total_avail
+ */
+static int btrfs_cmp_device_info(const void *a, const void *b)
+{
+	const struct btrfs_device_info *di_a = a;
+	const struct btrfs_device_info *di_b = b;
+
+	if (di_a->max_avail > di_b->max_avail)
+		return -1;
+	if (di_a->max_avail < di_b->max_avail)
+		return 1;
+	if (di_a->total_avail > di_b->total_avail)
+		return -1;
+	if (di_a->total_avail < di_b->total_avail)
+		return 1;
+	return 0;
+}
+
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and restore the size to
+ * @allocated_size
+ * Return -ENOSPC if we have no more space to allocate virtual chunk
+ *
+ * *: virtual chunk is a space holder for per-profile available space
+ *    calculator.
+ *    Such virtual chunks won't take on-disk space, thus called virtual, and
+ *    only affects per-profile available space calulation.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 to_alloc, u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int i;
+	int ndevs = 0;
+
+	/* Go through devices to collect their unallocated space */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+					&device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+				device->virtual_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->virtual_allocated;
+		else
+			avail = 0;
+
+		/* And exclude the [0, 1M) reserved space */
+		if (avail > SZ_1M)
+			avail -= SZ_1M;
+		else
+			avail = 0;
+
+		if (avail == 0)
+			continue;
+		/*
+		 * Unlike chunk allocator, we don't care about stripe or hole
+		 * size, so here we use @avail directly
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs -= ndevs % raid_attr->devs_increment;
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Now allocate a virtual chunk using the unallocate space of the
+	 * device with the least unallocated space.
+	 */
+	stripe_size = round_down(devices_info[ndevs - 1].total_avail,
+				 fs_info->sectorsize);
+
+	/* We can't directly do round_up for (u64)-1 as that would result 0 */
+	if (to_alloc != (u64)-1)
+		stripe_size = min_t(u64, stripe_size,
+				    round_up(div_u64(to_alloc, ndevs),
+					     fs_info->sectorsize));
+	if (stripe_size == 0)
+		return -ENOSPC;
+
+	for (i = 0; i < ndevs; i++)
+		devices_info[i].dev->virtual_allocated += stripe_size;
+	*allocated = stripe_size * (ndevs - raid_attr->nparity) /
+		     raid_attr->ncopies;
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min)
+		goto out;
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	while (ret == 0) {
+		ret = alloc_virtual_chunk(fs_info, devices_info, type,
+					  (u64)-1, &allocated);
+		if (ret == 0)
+			result += allocated;
+	}
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	spin_lock(&fs_devices->per_profile_lock);
+	fs_devices->per_profile_avail[type] = result;
+	spin_unlock(&fs_devices->per_profile_lock);
+	return 0;
+}
+
+/*
+ * Calculate the per-profile available space array.
+ *
+ * Return 0 if we succeeded updating the array.
+ * Return <0 if something went wrong. (ENOMEM)
+ */
+static int calc_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i);
+		if (ret < 0)
+			break;
+	}
+	return ret;
+}
+
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size)
 {
@@ -4690,25 +4859,6 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-/*
- * sort the devices in descending order by max_avail, total_avail
- */
-static int btrfs_cmp_device_info(const void *a, const void *b)
-{
-	const struct btrfs_device_info *di_a = a;
-	const struct btrfs_device_info *di_b = b;
-
-	if (di_a->max_avail > di_b->max_avail)
-		return -1;
-	if (di_a->max_avail < di_b->max_avail)
-		return 1;
-	if (di_a->total_avail > di_b->total_avail)
-		return -1;
-	if (di_a->total_avail < di_b->total_avail)
-		return 1;
-	return 0;
-}
-
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
@@ -7629,6 +7779,13 @@ int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info)
 
 	/* Ensure all chunks have corresponding dev extents */
 	ret = verify_chunk_dev_extent_mapping(fs_info);
+	if (ret < 0)
+		goto out;
+
+	/* All dev extents are verified, update per-profile available space */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..81cdab0d864a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -138,6 +138,12 @@ struct btrfs_device {
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 
 	struct extent_io_tree alloc_state;
+
+	/*
+	 * the "virtual" allocated space by per-profile available space
+	 * calculator. Doesn't affect chunk allocator at all.
+	 */
+	u64 virtual_allocated;
 };
 
 /*
@@ -257,6 +263,10 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	/* Records per-type available space */
+	spinlock_t per_profile_lock;
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated
  2020-01-02 11:27 [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-01-02 11:27 ` Qu Wenruo
  2020-01-02 16:17   ` Josef Bacik
  2020-01-02 11:27 ` [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
  2020-01-02 11:27 ` [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space Qu Wenruo
  3 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-02 11:27 UTC (permalink / raw)
  To: linux-btrfs

There are 4 locations where device size or used space get updated:
- Chunk allocation
- Chunk removal
- Device grow
- Device shrink

Now also update per-profile available space at those timings.

For __btrfs_alloc_chunk() we can't acquire device_list_mutex as in
btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause
dead lock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fa948f7eebc2..be8250332c60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2804,6 +2804,7 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
 	u64 old_total;
 	u64 diff;
+	int ret;
 
 	if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
 		return -EACCES;
@@ -2832,6 +2833,11 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 			      &trans->transaction->dev_update_list);
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	ret = calc_per_profile_avail(fs_info);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	if (ret < 0)
+		return ret;
 	return btrfs_update_device(trans, device);
 }
 
@@ -3010,7 +3016,12 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset)
 			goto out;
 		}
 	}
+	ret = calc_per_profile_avail(fs_info);
 	mutex_unlock(&fs_devices->device_list_mutex);
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
 
 	ret = btrfs_free_chunk(trans, chunk_offset);
 	if (ret) {
@@ -4826,6 +4837,10 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 			device->fs_devices->total_rw_bytes += diff;
 		atomic64_add(diff, &fs_info->free_chunk_space);
 		mutex_unlock(&fs_info->chunk_mutex);
+	} else {
+		mutex_lock(&fs_info->fs_devices->device_list_mutex);
+		ret = calc_per_profile_avail(fs_info);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 	}
 	return ret;
 }
@@ -5143,8 +5158,14 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	check_raid56_incompat_flag(info, type);
 	check_raid1c34_incompat_flag(info, type);
 
+	/*
+	 * In this context we don't need device_list_mutex as chunk_mutex are
+	 * protecting us.
+	 */
+	ret = calc_per_profile_avail(info);
+
 	kfree(devices_info);
-	return 0;
+	return ret;
 
 error_del_extent:
 	write_lock(&em_tree->lock);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit()
  2020-01-02 11:27 [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
  2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
  2020-01-02 11:27 ` [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
@ 2020-01-02 11:27 ` Qu Wenruo
  2020-01-02 16:18   ` Josef Bacik
  2020-01-02 11:27 ` [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space Qu Wenruo
  3 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-02 11:27 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc Lehmann

For the following disk layout, can_overcommit() can cause false
confidence in available space:

  devid 1 unallocated:	1T
  devid 2 unallocated:	10T
  metadata type:	RAID1

As can_overcommit() simply uses unallocated space with factor to
calculate the allocatable metadata chunk size.

can_overcommit() believes we still have 5.5T for metadata chunks, while
the truth is, we only have 1T available for metadata chunks.
This can lead to ENOSPC at run_delalloc_range() and cause transaction
abort.

Since factor based calculation can't distinguish RAID1/RAID10 and DUP at
all, we need proper chunk-allocator level awareness to do such estimation.

Thankfully, we have per-profile available space already calculated, just
use that facility to avoid such false confidence.

Reported-by: Marc Lehmann <schmorp@schmorp.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/space-info.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..c26aba9e7124 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -164,10 +164,10 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 			  enum btrfs_reserve_flush_enum flush,
 			  bool system_chunk)
 {
+	enum btrfs_raid_types index;
 	u64 profile;
 	u64 avail;
 	u64 used;
-	int factor;
 
 	/* Don't overcommit when in mixed mode. */
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
@@ -179,16 +179,15 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 		profile = btrfs_metadata_alloc_profile(fs_info);
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = atomic64_read(&fs_info->free_chunk_space);
 
 	/*
-	 * If we have dup, raid1 or raid10 then only half of the free
-	 * space is actually usable.  For raid56, the space info used
-	 * doesn't include the parity drive, so we don't have to
-	 * change the math
+	 * Grab avail space from per-profile array which should be as accurate
+	 * as chunk allocator.
 	 */
-	factor = btrfs_bg_type_to_factor(profile);
-	avail = div_u64(avail, factor);
+	index = btrfs_bg_flags_to_raid_index(profile);
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	avail = fs_info->fs_devices->per_profile_avail[index];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space
  2020-01-02 11:27 [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
                   ` (2 preceding siblings ...)
  2020-01-02 11:27 ` [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2020-01-02 11:27 ` Qu Wenruo
  2020-01-02 16:20   ` Josef Bacik
  3 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-02 11:27 UTC (permalink / raw)
  To: linux-btrfs

Although btrfs_calc_avail_data_space() is trying to do an estimation
on how many data chunks it can allocate, the estimation is far from
perfect:

- Metadata over-commit is not considered at all
- Chunk allocation doesn't take RAID5/6 into consideration

Although current per-profile available space itself is not able to
handle metadata over-commit itself, the virtual chunk infrastructure can
be re-used to address above problems.

This patch will change btrfs_calc_avail_data_space() to do the following
things:
- Do metadata virtual chunk allocation first
  This is to address the over-commit behavior.
  If current metadata chunks have enough free space, we can completely
  skip this step.

- Allocate data virtual chunks as many as possible
  Just like what we did in per-profile available space estimation.
  Here we only need to calculate one profile, since statfs() call is
  a relative cold path.

Now statfs() should be able to report near perfect estimation on
available data space, and can handle RAID5/6 better.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c   | 190 ++++++++++++++++-----------------------------
 fs/btrfs/volumes.c |  12 +--
 fs/btrfs/volumes.h |   4 +
 3 files changed, 79 insertions(+), 127 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f452a94abdc3..718ff54b62f0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1893,119 +1893,88 @@ static inline void btrfs_descending_sort_devices(
 /*
  * The helper to calc the free space on the devices that can be used to store
  * file data.
+ *
+ * The calculation will:
+ * - Allocate enough metadata virtual chunks to fulfill over-commit
+ *   To ensure we still have enough space to contain metadata chunks
+ * - Allocate any many data virtual chunks as possible
+ *   To get a true estimation on available data free space.
+ *
+ * Only with such comprehensive check, we can get a good result considering
+ * all the uneven disk layouts.
  */
 static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
-					      u64 *free_bytes)
+					      u64 free_meta, u64 *result)
 {
 	struct btrfs_device_info *devices_info;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	struct btrfs_device *device;
-	u64 type;
-	u64 avail_space;
-	u64 min_stripe_size;
-	int num_stripes = 1;
-	int i = 0, nr_devices;
-	const struct btrfs_raid_attr *rattr;
+	u64 meta_index;
+	u64 data_index;
+	u64 meta_rsv;
+	u64 meta_allocated = 0;
+	u64 data_allocated = 0;
+	u64 allocated;
+	int nr_devices;
+	int ret = 0;
 
-	/*
-	 * We aren't under the device list lock, so this is racy-ish, but good
-	 * enough for our purposes.
-	 */
-	nr_devices = fs_info->fs_devices->open_devices;
-	if (!nr_devices) {
-		smp_mb();
-		nr_devices = fs_info->fs_devices->open_devices;
-		ASSERT(nr_devices);
-		if (!nr_devices) {
-			*free_bytes = 0;
-			return 0;
-		}
-	}
+	spin_lock(&fs_info->global_block_rsv.lock);
+	meta_rsv = fs_info->global_block_rsv.size;
+	spin_unlock(&fs_info->global_block_rsv.lock);
 
+	mutex_lock(&fs_devices->device_list_mutex);
+	nr_devices = fs_devices->rw_devices;
 	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
 			       GFP_KERNEL);
-	if (!devices_info)
-		return -ENOMEM;
-
-	/* calc min stripe number for data space allocation */
-	type = btrfs_data_alloc_profile(fs_info);
-	rattr = &btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)];
-
-	if (type & BTRFS_BLOCK_GROUP_RAID0)
-		num_stripes = nr_devices;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1)
-		num_stripes = 2;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C3)
-		num_stripes = 3;
-	else if (type & BTRFS_BLOCK_GROUP_RAID1C4)
-		num_stripes = 4;
-	else if (type & BTRFS_BLOCK_GROUP_RAID10)
-		num_stripes = 4;
-
-	/* Adjust for more than 1 stripe per device */
-	min_stripe_size = rattr->dev_stripes * BTRFS_STRIPE_LEN;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
-		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
-						&device->dev_state) ||
-		    !device->bdev ||
-		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
-			continue;
-
-		if (i >= nr_devices)
-			break;
-
-		avail_space = device->total_bytes - device->bytes_used;
-
-		/* align with stripe_len */
-		avail_space = rounddown(avail_space, BTRFS_STRIPE_LEN);
-
-		/*
-		 * In order to avoid overwriting the superblock on the drive,
-		 * btrfs starts at an offset of at least 1MB when doing chunk
-		 * allocation.
-		 *
-		 * This ensures we have at least min_stripe_size free space
-		 * after excluding 1MB.
-		 */
-		if (avail_space <= SZ_1M + min_stripe_size)
-			continue;
-
-		avail_space -= SZ_1M;
-
-		devices_info[i].dev = device;
-		devices_info[i].max_avail = avail_space;
-
-		i++;
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
 	}
-	rcu_read_unlock();
-
-	nr_devices = i;
 
-	btrfs_descending_sort_devices(devices_info, nr_devices);
-
-	i = nr_devices - 1;
-	avail_space = 0;
-	while (nr_devices >= rattr->devs_min) {
-		num_stripes = min(num_stripes, nr_devices);
-
-		if (devices_info[i].max_avail >= min_stripe_size) {
-			int j;
-			u64 alloc_size;
-
-			avail_space += devices_info[i].max_avail * num_stripes;
-			alloc_size = devices_info[i].max_avail;
-			for (j = i + 1 - num_stripes; j <= i; j++)
-				devices_info[j].max_avail -= alloc_size;
-		}
-		i--;
-		nr_devices--;
+	data_index = btrfs_bg_flags_to_raid_index(
+			btrfs_data_alloc_profile(fs_info));
+	meta_index = btrfs_bg_flags_to_raid_index(
+			btrfs_metadata_alloc_profile(fs_info));
+
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+
+	/* Current metadata space is enough, no need to bother meta space */
+	if (meta_rsv <= free_meta)
+		goto data_only;
+
+	/* Allocate space for exceeding meta space */
+	while (meta_allocated < meta_rsv - free_meta) {
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info,
+				meta_index,
+				meta_rsv - free_meta - meta_allocated,
+				&allocated);
+		if (ret < 0)
+			goto out;
+		meta_allocated += allocated;
+	}
+data_only:
+	/*
+	 * meta virtual chunks have been allocated, now allocate data virtual
+	 * chunks
+	 */
+	while (ret == 0) {
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info,
+				data_index, -1, &allocated);
+		if (ret < 0)
+			goto out;
+		data_allocated += allocated;
 	}
 
+out:
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->virtual_allocated = 0;
+	mutex_unlock(&fs_devices->device_list_mutex);
 	kfree(devices_info);
-	*free_bytes = avail_space;
-	return 0;
+	*result = data_allocated;
+	if (ret == -ENOSPC)
+		ret = 0;
+	return ret;
 }
 
 /*
@@ -2034,7 +2003,6 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	unsigned factor = 1;
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
-	u64 thresh = 0;
 	int mixed = 0;
 
 	rcu_read_lock();
@@ -2082,31 +2050,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 		buf->f_bfree = 0;
 	spin_unlock(&block_rsv->lock);
 
-	buf->f_bavail = div_u64(total_free_data, factor);
-	ret = btrfs_calc_avail_data_space(fs_info, &total_free_data);
+	ret = btrfs_calc_avail_data_space(fs_info, total_free_meta,
+					  &buf->f_bavail);
 	if (ret)
 		return ret;
-	buf->f_bavail += div_u64(total_free_data, factor);
 	buf->f_bavail = buf->f_bavail >> bits;
-
-	/*
-	 * We calculate the remaining metadata space minus global reserve. If
-	 * this is (supposedly) smaller than zero, there's no space. But this
-	 * does not hold in practice, the exhausted state happens where's still
-	 * some positive delta. So we apply some guesswork and compare the
-	 * delta to a 4M threshold.  (Practically observed delta was ~2M.)
-	 *
-	 * We probably cannot calculate the exact threshold value because this
-	 * depends on the internal reservations requested by various
-	 * operations, so some operations that consume a few metadata will
-	 * succeed even if the Avail is zero. But this is better than the other
-	 * way around.
-	 */
-	thresh = SZ_4M;
-
-	if (!mixed && total_free_meta - thresh < block_rsv->size)
-		buf->f_bavail = 0;
-
 	buf->f_type = BTRFS_SUPER_MAGIC;
 	buf->f_bsize = dentry->d_sb->s_blocksize;
 	buf->f_namelen = BTRFS_NAME_LEN;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index be8250332c60..b2155a143740 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2658,10 +2658,10 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
  *    Such virtual chunks won't take on-disk space, thus called virtual, and
  *    only affects per-profile available space calulation.
  */
-static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
-			       struct btrfs_device_info *devices_info,
-			       enum btrfs_raid_types type,
-			       u64 to_alloc, u64 *allocated)
+int btrfs_alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			      struct btrfs_device_info *devices_info,
+			      enum btrfs_raid_types type,
+			      u64 to_alloc, u64 *allocated)
 {
 	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
@@ -2761,8 +2761,8 @@ static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
 		device->virtual_allocated = 0;
 	while (ret == 0) {
-		ret = alloc_virtual_chunk(fs_info, devices_info, type,
-					  (u64)-1, &allocated);
+		ret = btrfs_alloc_virtual_chunk(fs_info, devices_info, type,
+						(u64)-1, &allocated);
 		if (ret == 0)
 			result += allocated;
 	}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 81cdab0d864a..3c327ad6924e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -459,6 +459,10 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size);
 struct btrfs_device *btrfs_find_device(struct btrfs_fs_devices *fs_devices,
 				       u64 devid, u8 *uuid, u8 *fsid, bool seed);
+int btrfs_alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			      struct btrfs_device_info *devices_info,
+			      enum btrfs_raid_types type,
+			      u64 to_alloc, u64 *allocated);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *path);
 int btrfs_balance(struct btrfs_fs_info *fs_info,
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] btrfs: Introduce per-profile available space facility
  2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
@ 2020-01-02 16:13   ` Josef Bacik
  2020-01-04  6:40   ` kbuild test robot
  1 sibling, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2020-01-02 16:13 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/2/20 6:27 AM, Qu Wenruo wrote:
> [PROBLEM]
> There are some locations in btrfs requiring accurate estimation on how
> many new bytes can be allocated on unallocated space.
> 
> We have two types of estimation:
> - Factor based calculation
>    Just use all unallocated space, divide by the profile factor
>    One obvious user is can_overcommit().
> 
> - Chunk allocator like calculation
>    This will emulate the chunk allocator behavior, to get a proper
>    estimation.
>    The only user is btrfs_calc_avail_data_space(), utilized by
>    btrfs_statfs().
>    The problem is, that function is not generic purposed enough, can't
>    handle things like RAID5/6.
> 
> Current factor based calculation can't handle the following case:
>    devid 1 unallocated:	1T
>    devid 2 unallocated:	10T
>    metadata type:	RAID1
> 
> If using factor, we can use (1T + 10T) / 2 = 5.5T free space for
> metadata.
> But in fact we can only get 1T free space, as we're limited by the
> smallest device for RAID1.
> 
> [SOLUTION]
> This patch will introduce the skeleton of per-profile available space
> calculation, which can more-or-less get to the point of chunk allocator.
> 
> The difference between it and chunk allocator is mostly on rounding and
> [0, 1M) reserved space handling, which shouldn't cause practical impact.
> 
> The newly introduced per-profile available space calculation will
> calculate available space for each type, using chunk-allocator like
> calculation.
> 
> With that facility, for above device layout we get the full available
> space array:
>    RAID10:	0  (not enough devices)
>    RAID1:	1T
>    RAID1C3:	0  (not enough devices)
>    RAID1C4:	0  (not enough devices)
>    DUP:		5.5T
>    RAID0:	2T
>    SINGLE:	11T
>    RAID5:	1T
>    RAID6:	0  (not enough devices)
> 
> Or for a more complex example:
>    devid 1 unallocated:	1T
>    devid 2 unallocated:  1T
>    devid 3 unallocated:	10T
> 
> We will get an array of:
>    RAID10:	0  (not enough devices)
>    RAID1:	2T
>    RAID1C3:	1T
>    RAID1C4:	0  (not enough devices)
>    DUP:		6T
>    RAID0:	3T
>    SINGLE:	12T
>    RAID5:	2T
>    RAID6:	0  (not enough devices)
> 
> And for the each profile , we go chunk allocator level calculation:
> The code code looks like:
> 
>    clear_virtual_used_space_of_all_rw_devices();
>    do {
>    	/*
>    	 * The same as chunk allocator, despite used space,
>    	 * we also take virtual used space into consideration.
>    	 */
>    	sort_device_with_virtual_free_space();
> 
>    	/*
>    	 * Unlike chunk allocator, we don't need to bother hole/stripe
>    	 * size, so we use the smallest device to make sure we can
>    	 * allocated as many stripes as regular chunk allocator
>    	 */
>    	stripe_size = device_with_smallest_free->avail_space;
> 	stripe_size = min(stripe_size, to_alloc / ndevs);
> 
>    	/*
>    	 * Allocate a virtual chunk, allocated virtual chunk will
>    	 * increase virtual used space, allow next iteration to
>    	 * properly emulate chunk allocator behavior.
>    	 */
>    	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
>    	if (ret == 0)
>    		avail += allocated_size;
>    } while (ret == 0)
> 
> As we always select the device with least free space (just like chunk
> allocator), for above 1T + 10T device, we will allocate a 1T virtual chunk
> in the first iteration, then run out of device in next iteration.
> 
> Thus only get 1T free space for RAID1 type, just like what chunk
> allocator would do.
> 
> This patch is just the skeleton, we only do the per-profile chunk
> calculation at mount time.
> 
> Later commits will update per-profile available space at other proper
> timings.
> 
> Suggested-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated
  2020-01-02 11:27 ` [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
@ 2020-01-02 16:17   ` Josef Bacik
  2020-01-03  0:51     ` Qu Wenruo
  2020-01-03  9:52     ` Qu Wenruo
  0 siblings, 2 replies; 14+ messages in thread
From: Josef Bacik @ 2020-01-02 16:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/2/20 6:27 AM, Qu Wenruo wrote:
> There are 4 locations where device size or used space get updated:
> - Chunk allocation
> - Chunk removal
> - Device grow
> - Device shrink
> 
> Now also update per-profile available space at those timings.
> 
> For __btrfs_alloc_chunk() we can't acquire device_list_mutex as in
> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause
> dead lock.
> 

These are protecting two different things though, holding the chunk_mutex 
doesn't keep things from being removed from the device list.

Looking at patch 1 can't we just do the device list traversal under RCU and then 
not have to worry about the locking at all?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit()
  2020-01-02 11:27 ` [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
@ 2020-01-02 16:18   ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2020-01-02 16:18 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Marc Lehmann

On 1/2/20 6:27 AM, Qu Wenruo wrote:
> For the following disk layout, can_overcommit() can cause false
> confidence in available space:
> 
>    devid 1 unallocated:	1T
>    devid 2 unallocated:	10T
>    metadata type:	RAID1
> 
> As can_overcommit() simply uses unallocated space with factor to
> calculate the allocatable metadata chunk size.
> 
> can_overcommit() believes we still have 5.5T for metadata chunks, while
> the truth is, we only have 1T available for metadata chunks.
> This can lead to ENOSPC at run_delalloc_range() and cause transaction
> abort.
> 
> Since factor based calculation can't distinguish RAID1/RAID10 and DUP at
> all, we need proper chunk-allocator level awareness to do such estimation.
> 
> Thankfully, we have per-profile available space already calculated, just
> use that facility to avoid such false confidence.
> 
> Reported-by: Marc Lehmann <schmorp@schmorp.de>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I don't expect this will change much as you mess with the other code, so you can 
go ahead and add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

To this one, thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space
  2020-01-02 11:27 ` [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space Qu Wenruo
@ 2020-01-02 16:20   ` Josef Bacik
  2020-01-03  0:38     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2020-01-02 16:20 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 1/2/20 6:27 AM, Qu Wenruo wrote:
> Although btrfs_calc_avail_data_space() is trying to do an estimation
> on how many data chunks it can allocate, the estimation is far from
> perfect:
> 
> - Metadata over-commit is not considered at all
> - Chunk allocation doesn't take RAID5/6 into consideration
> 
> Although current per-profile available space itself is not able to
> handle metadata over-commit itself, the virtual chunk infrastructure can
> be re-used to address above problems.
> 
> This patch will change btrfs_calc_avail_data_space() to do the following
> things:
> - Do metadata virtual chunk allocation first
>    This is to address the over-commit behavior.
>    If current metadata chunks have enough free space, we can completely
>    skip this step.
> 
> - Allocate data virtual chunks as many as possible
>    Just like what we did in per-profile available space estimation.
>    Here we only need to calculate one profile, since statfs() call is
>    a relative cold path.
> 
> Now statfs() should be able to report near perfect estimation on
> available data space, and can handle RAID5/6 better.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Can you put a comparison of the statfs call time for the old way vs the new way 
in your changelog?  Say make a raid5 fs for example, populate it a little bit, 
and then run statfs 10 times and take the average with and without your patch so 
we can make sure there's no performance penalty.  You'd be surprised how many 
times statfs() things have caused problems for us in production.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space
  2020-01-02 16:20   ` Josef Bacik
@ 2020-01-03  0:38     ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-01-03  0:38 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1819 bytes --]



On 2020/1/3 上午12:20, Josef Bacik wrote:
> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>> Although btrfs_calc_avail_data_space() is trying to do an estimation
>> on how many data chunks it can allocate, the estimation is far from
>> perfect:
>>
>> - Metadata over-commit is not considered at all
>> - Chunk allocation doesn't take RAID5/6 into consideration
>>
>> Although current per-profile available space itself is not able to
>> handle metadata over-commit itself, the virtual chunk infrastructure can
>> be re-used to address above problems.
>>
>> This patch will change btrfs_calc_avail_data_space() to do the following
>> things:
>> - Do metadata virtual chunk allocation first
>>    This is to address the over-commit behavior.
>>    If current metadata chunks have enough free space, we can completely
>>    skip this step.
>>
>> - Allocate data virtual chunks as many as possible
>>    Just like what we did in per-profile available space estimation.
>>    Here we only need to calculate one profile, since statfs() call is
>>    a relative cold path.
>>
>> Now statfs() should be able to report near perfect estimation on
>> available data space, and can handle RAID5/6 better.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Can you put a comparison of the statfs call time for the old way vs the
> new way in your changelog?  Say make a raid5 fs for example, populate it
> a little bit, and then run statfs 10 times and take the average with and
> without your patch so we can make sure there's no performance penalty. 
> You'd be surprised how many times statfs() things have caused problems
> for us in production.  Thanks,

Sure no problem.

Never considered statfs() can be a problem.
Will keep an eye on that.

Thanks,
Qu

> 
> Josef


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated
  2020-01-02 16:17   ` Josef Bacik
@ 2020-01-03  0:51     ` Qu Wenruo
  2020-01-03  9:52     ` Qu Wenruo
  1 sibling, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2020-01-03  0:51 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 1174 bytes --]



On 2020/1/3 上午12:17, Josef Bacik wrote:
> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>> There are 4 locations where device size or used space get updated:
>> - Chunk allocation
>> - Chunk removal
>> - Device grow
>> - Device shrink
>>
>> Now also update per-profile available space at those timings.
>>
>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex as in
>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause
>> dead lock.
>>
> 
> These are protecting two different things though, holding the
> chunk_mutex doesn't keep things from being removed from the device list.
> 
> Looking at patch 1 can't we just do the device list traversal under RCU
> and then not have to worry about the locking at all?  Thanks,

That's very interesting solution.

But from the comment of btrfs_fs_devices::alloc_list, it says:
```
        /*
         * Devices which can satisfy space allocation. Protected by
         * chunk_mutex
         */
        struct list_head alloc_list;
```

And __btrfs_chunk_alloc() is iterating alloc_list without extra
protection, so it should be OK I guess.

Thanks,
Qu
> 
> Josef


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated
  2020-01-02 16:17   ` Josef Bacik
  2020-01-03  0:51     ` Qu Wenruo
@ 2020-01-03  9:52     ` Qu Wenruo
  2020-01-03 16:35       ` Josef Bacik
  1 sibling, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2020-01-03  9:52 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 1550 bytes --]

On 2020/1/3 上午12:17, Josef Bacik wrote:
> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>> There are 4 locations where device size or used space get updated:
>> - Chunk allocation
>> - Chunk removal
>> - Device grow
>> - Device shrink
>>
>> Now also update per-profile available space at those timings.
>>
>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex as in
>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause
>> dead lock.
>>
> 
> These are protecting two different things though, holding the
> chunk_mutex doesn't keep things from being removed from the device list.

Further looking into the lock schema, Nikolay's comment is right,
chunk_mutex protects alloc_list (rw devices).

Device won't disappear from alloc_list, as in btrfs_rm_device() we took
chunk_mutex before removing writeable device.

In fact, device_list_mutex is unrelated to alloc_list.

So other calc_per_profile_avaiable() call sites are in fact not safe, I
shouldn't take device_list_mutex, but chunk_mutex which are much easier
to get.

> 
> Looking at patch 1 can't we just do the device list traversal under RCU
> and then not have to worry about the locking at all?  Thanks,

I guess we need SRCU, since in __btrfs_alloc_chunk() context we need to
sleep for search dev extent tree.

And I'm not confident enough for some corner case, maybe some RCU sync
case would cause unexpected performance degrade as the RCU sync can be
more expensive than simple mutex lock.

Thanks,
Qu

> 
> Josef

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated
  2020-01-03  9:52     ` Qu Wenruo
@ 2020-01-03 16:35       ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2020-01-03 16:35 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 1/3/20 4:52 AM, Qu Wenruo wrote:
> 
> 
> On 2020/1/3 上午12:17, Josef Bacik wrote:
>> On 1/2/20 6:27 AM, Qu Wenruo wrote:
>>> There are 4 locations where device size or used space get updated:
>>> - Chunk allocation
>>> - Chunk removal
>>> - Device grow
>>> - Device shrink
>>>
>>> Now also update per-profile available space at those timings.
>>>
>>> For __btrfs_alloc_chunk() we can't acquire device_list_mutex as in
>>> btrfs_finish_chunk_alloc() we could hold device_list_mutex and cause
>>> dead lock.
>>>
>>
>> These are protecting two different things though, holding the
>> chunk_mutex doesn't keep things from being removed from the device list.
> 
> Further looking into the lock schema, Nikolay's comment is right,
> chunk_mutex protects alloc_list (rw devices).
> 
> Device won't disappear from alloc_list, as in btrfs_rm_device() we took
> chunk_mutex before removing writeable device.
> 
> In fact, device_list_mutex is unrelated to alloc_list.
> 
> So other calc_per_profile_avaiable() call sites are in fact not safe, I
> shouldn't take device_list_mutex, but chunk_mutex which are much easier
> to get.
> 
>>
>> Looking at patch 1 can't we just do the device list traversal under RCU
>> and then not have to worry about the locking at all?  Thanks,
> 
> I guess we need SRCU, since in __btrfs_alloc_chunk() context we need to
> sleep for search dev extent tree.
> 
> And I'm not confident enough for some corner case, maybe some RCU sync
> case would cause unexpected performance degrade as the RCU sync can be
> more expensive than simple mutex lock.
> 

Sigh sorry I was reading it like we were doing fs_devices->devices, which is all 
handled via RCU.  But you're right, alloc_list is covered by the chunk mutex, so 
just make sure the chunk mutex is taken every where and you should be good to 
go.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] btrfs: Introduce per-profile available space facility
  2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
  2020-01-02 16:13   ` Josef Bacik
@ 2020-01-04  6:40   ` kbuild test robot
  1 sibling, 0 replies; 14+ messages in thread
From: kbuild test robot @ 2020-01-04  6:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Josef Bacik

[-- Attachment #1: Type: text/plain, Size: 1324 bytes --]

Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v5.5-rc4]
[also build test ERROR on next-20191219]
[cannot apply to btrfs/next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Introduce-per-profile-available-space-array-to-avoid-over-confident-can_overcommit/20200104-042806
base:    fd6988496e79a6a4bdb514a4655d2920209eb85d
config: openrisc-randconfig-a001-20200103 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=openrisc 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24191 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-01-04  6:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-02 11:27 [PATCH v2 0/4] Introduce per-profile available space array to avoid over-confident can_overcommit() Qu Wenruo
2020-01-02 11:27 ` [PATCH v2 1/4] btrfs: Introduce per-profile available space facility Qu Wenruo
2020-01-02 16:13   ` Josef Bacik
2020-01-04  6:40   ` kbuild test robot
2020-01-02 11:27 ` [PATCH v2 2/4] btrfs: Update per-profile available space when device size/used space get updated Qu Wenruo
2020-01-02 16:17   ` Josef Bacik
2020-01-03  0:51     ` Qu Wenruo
2020-01-03  9:52     ` Qu Wenruo
2020-01-03 16:35       ` Josef Bacik
2020-01-02 11:27 ` [PATCH v2 3/4] btrfs: space-info: Use per-profile available space in can_overcommit() Qu Wenruo
2020-01-02 16:18   ` Josef Bacik
2020-01-02 11:27 ` [PATCH v2 4/4] btrfs: statfs: Use virtual chunk allocation to calculation available data space Qu Wenruo
2020-01-02 16:20   ` Josef Bacik
2020-01-03  0:38     ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).