All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] ENOSPC refinements
@ 2019-04-10 19:56 Josef Bacik
  2019-04-10 19:56 ` [PATCH 1/2] btrfs: track odirect bytes in flight Josef Bacik
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Josef Bacik @ 2019-04-10 19:56 UTC (permalink / raw)
  To: linux-btrfs

I noticed our continuous testing for btrfs started timing out recently, and dug
down to discover that generic/224 was taking upwards of an hour to run.  This
test should take seconds to run, so this is a problem.

Fix this by reworking how we do delalloc metadata reservations, as described in
patch 2/2.

I also noticed we don't quite do the right thing when we are mostly O_DIRECT,
and addressed this in patch 1/2.  I've run these through xfstests and everything
is back to normal.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] btrfs: track odirect bytes in flight
  2019-04-10 19:56 [PATCH 0/2] ENOSPC refinements Josef Bacik
@ 2019-04-10 19:56 ` Josef Bacik
  2019-04-12 10:17   ` Nikolay Borisov
  2019-04-10 19:56 ` [PATCH 2/2] btrfs: reserve delalloc metadata differently Josef Bacik
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-04-10 19:56 UTC (permalink / raw)
  To: linux-btrfs

When diagnosing a slowdown of generic/224 I noticed we were wasting a
lot of time in shrink_delalloc() despite all writes being O_DIRECT
writes.  O_DIRECT writes still have outstanding extents, but obviously
cannot be directly flushed, instead we need to wait on their
corresponding ordered extent.  Track the outstanding odirect write bytes
and if this amount is higher than the delalloc bytes in the system go
ahead and force us to wait on the ordered extents.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h        |  1 +
 fs/btrfs/disk-io.c      | 15 ++++++++++++++-
 fs/btrfs/extent-tree.c  | 17 +++++++++++++++--
 fs/btrfs/ordered-data.c |  9 ++++++++-
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7e774d48c48c..e293d74b2ead 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1016,6 +1016,7 @@ struct btrfs_fs_info {
 	/* used to keep from writing metadata until there is a nice batch */
 	struct percpu_counter dirty_metadata_bytes;
 	struct percpu_counter delalloc_bytes;
+	struct percpu_counter odirect_bytes;
 	s32 dirty_metadata_batch;
 	s32 delalloc_batch;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7a88de4be8d7..3f0b1854cedc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2641,11 +2641,17 @@ int open_ctree(struct super_block *sb,
 		goto fail;
 	}
 
-	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
+	ret = percpu_counter_init(&fs_info->odirect_bytes, 0, GFP_KERNEL);
 	if (ret) {
 		err = ret;
 		goto fail_srcu;
 	}
+
+	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
+	if (ret) {
+		err = ret;
+		goto fail_odirect_bytes;
+	}
 	fs_info->dirty_metadata_batch = PAGE_SIZE *
 					(1 + ilog2(nr_cpu_ids));
 
@@ -3344,6 +3350,8 @@ int open_ctree(struct super_block *sb,
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 fail_dirty_metadata_bytes:
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
+fail_odirect_bytes:
+	percpu_counter_destroy(&fs_info->odirect_bytes);
 fail_srcu:
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
 fail:
@@ -4025,6 +4033,10 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 		       percpu_counter_sum(&fs_info->delalloc_bytes));
 	}
 
+	if (percpu_counter_sum(&fs_info->odirect_bytes))
+		btrfs_info(fs_info, "at unmount odirect count %lld",
+			   percpu_counter_sum(&fs_info->odirect_bytes));
+
 	btrfs_sysfs_remove_mounted(fs_info);
 	btrfs_sysfs_remove_fsid(fs_info->fs_devices);
 
@@ -4056,6 +4068,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
+	percpu_counter_destroy(&fs_info->odirect_bytes);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d0626f945de2..0982456ebabb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,6 +4727,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 	struct btrfs_space_info *space_info;
 	struct btrfs_trans_handle *trans;
 	u64 delalloc_bytes;
+	u64 odirect_bytes;
 	u64 async_pages;
 	u64 items;
 	long time_left;
@@ -4742,7 +4743,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 
 	delalloc_bytes = percpu_counter_sum_positive(
 						&fs_info->delalloc_bytes);
-	if (delalloc_bytes == 0) {
+	odirect_bytes = percpu_counter_sum_positive(
+						&fs_info->odirect_bytes);
+	if (delalloc_bytes == 0 && odirect_bytes == 0) {
 		if (trans)
 			return;
 		if (wait_ordered)
@@ -4750,8 +4753,16 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 		return;
 	}
 
+	/*
+	 * If we are doing more ordered than delalloc we need to just wait on
+	 * ordered extents, otherwise we'll waste time trying to flush delalloc
+	 * that likely won't give us the space back we need.
+	 */
+	if (odirect_bytes > delalloc_bytes)
+		wait_ordered = true;
+
 	loops = 0;
-	while (delalloc_bytes && loops < 3) {
+	while ((delalloc_bytes || odirect_bytes)  && loops < 3) {
 		nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
 
 		/*
@@ -4801,6 +4812,8 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 		}
 		delalloc_bytes = percpu_counter_sum_positive(
 						&fs_info->delalloc_bytes);
+		odirect_bytes = percpu_counter_sum_positive(
+						&fs_info->odirect_bytes);
 	}
 }
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6fde2b2741ef..967c62b85d77 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -194,8 +194,11 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
-	if (dio)
+	if (dio) {
+		percpu_counter_add_batch(&fs_info->odirect_bytes, len,
+					 fs_info->delalloc_batch);
 		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
+	}
 
 	/* one ref for the tree */
 	refcount_set(&entry->refs, 1);
@@ -468,6 +471,10 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 	if (root != fs_info->tree_root)
 		btrfs_delalloc_release_metadata(btrfs_inode, entry->len, false);
 
+	if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
+		percpu_counter_add_batch(&fs_info->odirect_bytes, -entry->len,
+					 fs_info->delalloc_batch);
+
 	tree = &btrfs_inode->ordered_tree;
 	spin_lock_irq(&tree->lock);
 	node = &entry->rb_node;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-10 19:56 [PATCH 0/2] ENOSPC refinements Josef Bacik
  2019-04-10 19:56 ` [PATCH 1/2] btrfs: track odirect bytes in flight Josef Bacik
@ 2019-04-10 19:56 ` Josef Bacik
  2019-04-12 13:06   ` Nikolay Borisov
  2019-04-29 18:33   ` David Sterba
  2019-04-25 15:22 ` [PATCH 0/2] ENOSPC refinements David Sterba
  2019-04-25 20:52 ` [PATCH][v2] btrfs: track odirect bytes in flight Josef Bacik
  3 siblings, 2 replies; 14+ messages in thread
From: Josef Bacik @ 2019-04-10 19:56 UTC (permalink / raw)
  To: linux-btrfs

With the per-inode block rsvs we started refilling the reserve based on
the calculated size of the outstanding csum bytes and extents for the
inode, including the amount we were adding with the new operation.

However generic/224 exposed a problem with this approach.  With 1000
files all writing at the same time we ended up with a bunch of bytes
being reserved but unusable.

When you write to a file we reserve space for the csum leaves for those
bytes, the number of extent items required to cover those bytes, and a
single credit for updating the inode at ordered extent finish for that
range of bytes.  This is held until the ordered extent finishes and we
release all of the reserved space.

If a second write comes in at this point we would add a single
reservation for the new outstanding extent and however many reservations
for the csum leaves.  At this point we find the delta of how much we
have reserved and how much outstanding size this is and attempt to
reserve this delta.  If the first write finishes it will not release any
space, because the space it had reserved for the initial write is still
needed for the second write.  However some space would have been used,
as we have added csums, extent items, and dirtied the inode.  Our
reserved space would be > 0 but < the total needed reserved space.

This is just for a single inode, now consider generic/224.  This has
1000 inodes writing in parallel to a very small file system, 1gib.  In
my testing this usually means we get about a 120mib metadata area to
work with, more than enough to allow the writes to continue, but not
enough if all of the inodes are stuck trying to reserve the slack space
while continuing to hold their leftovers from their initial writes.

Fix this by pre-reserved _only_ for the space we are currently trying to
add.  Then once that is successful modify our inodes csum count and
outstanding extents, and then add the newly reserved space to the inodes
block_rsv.  This allows us to actually pass generic/224 without running
out of metadata space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
 1 file changed, 53 insertions(+), 92 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0982456ebabb..9aff7a8817d9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5811,85 +5811,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
-static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
-				u64 *metadata_bytes, u64 *qgroup_bytes)
-{
-	*metadata_bytes = 0;
-	*qgroup_bytes = 0;
-
-	spin_lock(&block_rsv->lock);
-	if (block_rsv->reserved < block_rsv->size)
-		*metadata_bytes = block_rsv->size - block_rsv->reserved;
-	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-		*qgroup_bytes = block_rsv->qgroup_rsv_size -
-			block_rsv->qgroup_rsv_reserved;
-	spin_unlock(&block_rsv->lock);
-}
-
-/**
- * btrfs_inode_rsv_refill - refill the inode block rsv.
- * @inode - the inode we are refilling.
- * @flush - the flushing restriction.
- *
- * Essentially the same as btrfs_block_rsv_refill, except it uses the
- * block_rsv->size as the minimum size.  We'll either refill the missing amount
- * or return if we already have enough space.  This will also handle the reserve
- * tracepoint for the reserved amount.
- */
-static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
-				  enum btrfs_reserve_flush_enum flush)
-{
-	struct btrfs_root *root = inode->root;
-	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes, last = 0;
-	u64 qgroup_num_bytes;
-	int ret = -ENOSPC;
-
-	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
-	if (num_bytes == 0)
-		return 0;
-
-	do {
-		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
-							 true);
-		if (ret)
-			return ret;
-		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
-		if (ret) {
-			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
-			last = num_bytes;
-			/*
-			 * If we are fragmented we can end up with a lot of
-			 * outstanding extents which will make our size be much
-			 * larger than our reserved amount.
-			 *
-			 * If the reservation happens here, it might be very
-			 * big though not needed in the end, if the delalloc
-			 * flushing happens.
-			 *
-			 * If this is the case try and do the reserve again.
-			 */
-			if (flush == BTRFS_RESERVE_FLUSH_ALL)
-				calc_refill_bytes(block_rsv, &num_bytes,
-						   &qgroup_num_bytes);
-			if (num_bytes == 0)
-				return 0;
-		}
-	} while (ret && last != num_bytes);
-
-	if (!ret) {
-		block_rsv_add_bytes(block_rsv, num_bytes, false);
-		trace_btrfs_space_reservation(root->fs_info, "delalloc",
-					      btrfs_ino(inode), num_bytes, 1);
-
-		/* Don't forget to increase qgroup_rsv_reserved */
-		spin_lock(&block_rsv->lock);
-		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
-		spin_unlock(&block_rsv->lock);
-	}
-	return ret;
-}
-
 static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 				     struct btrfs_block_rsv *block_rsv,
 				     u64 num_bytes, u64 *qgroup_to_release)
@@ -6190,9 +6111,26 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	spin_unlock(&block_rsv->lock);
 }
 
+static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,
+					   struct btrfs_inode *inode,
+					   u64 num_bytes, u64 *meta_reserve,
+					   u64 *qgroup_reserve)
+{
+	u64 nr_extents = count_max_extents(num_bytes);
+	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
+
+	/* We add one for the inode update at finish ordered time. */
+	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
+						nr_extents + csum_leaves + 1);
+	*qgroup_reserve = nr_extents * fs_info->nodesize;
+}
+
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	u64 meta_reserve, qgroup_reserve;
 	unsigned nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
@@ -6222,7 +6160,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
 
-	/* Add our new extents and calculate the new rsv size. */
+	/*
+	 * Josef, we always want to do it this way, every other way is wrong and
+	 * ends in tears.  Pre-reserving the amount we are going to add will
+	 * always be the right way, because otherwise if we have enough
+	 * parallelism we could end up with thousands of inodes all holding
+	 * little bits of reservations they were able to make previously and the
+	 * only way to reclaim that space is to ENOSPC out the operations and
+	 * clear everything out and try again, which is shitty.  This way we
+	 * just over-reserve slightly, and clean up the mess when we are done.
+	 */
+	calc_inode_reservations(fs_info, inode, num_bytes, &meta_reserve,
+				&qgroup_reserve);
+	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
+	if (ret)
+		goto out_fail;
+	ret = reserve_metadata_bytes(root, block_rsv, meta_reserve, flush);
+	if (ret)
+		goto out_qgroup;
+
+	/*
+	 * Now we need to update our outstanding extents and csum bytes _first_
+	 * and then add the reservation to the block_rsv.  This keeps us from
+	 * racing with an ordered completion or some such that would think it
+	 * needs to free the reservation we just made.
+	 */
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
@@ -6230,22 +6192,21 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
-	ret = btrfs_inode_rsv_refill(inode, flush);
-	if (unlikely(ret))
-		goto out_fail;
+	/* Now we can safely add our space to our block rsv. */
+	block_rsv_add_bytes(block_rsv, meta_reserve, false);
+	trace_btrfs_space_reservation(root->fs_info, "delalloc",
+				      btrfs_ino(inode), meta_reserve, 1);
+
+	spin_lock(&block_rsv->lock);
+	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
+	spin_unlock(&block_rsv->lock);
 
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
 	return 0;
-
+out_qgroup:
+	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
 out_fail:
-	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
-	btrfs_mod_outstanding_extents(inode, -nr_extents);
-	inode->csum_bytes -= num_bytes;
-	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
-	spin_unlock(&inode->lock);
-
 	btrfs_inode_rsv_release(inode, true);
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] btrfs: track odirect bytes in flight
  2019-04-10 19:56 ` [PATCH 1/2] btrfs: track odirect bytes in flight Josef Bacik
@ 2019-04-12 10:17   ` Nikolay Borisov
  2019-04-12 13:30     ` Josef Bacik
  2019-04-24 17:26     ` David Sterba
  0 siblings, 2 replies; 14+ messages in thread
From: Nikolay Borisov @ 2019-04-12 10:17 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> When diagnosing a slowdown of generic/224 I noticed we were wasting a
> lot of time in shrink_delalloc() despite all writes being O_DIRECT
> writes.  O_DIRECT writes still have outstanding extents, but obviously
> cannot be directly flushed, instead we need to wait on their
> corresponding ordered extent.  Track the outstanding odirect write bytes
> and if this amount is higher than the delalloc bytes in the system go
> ahead and force us to wait on the ordered extents.

This is way too sparse. I've been running generic/224 to try and
reproduce your slowdown. So far I can confirm that this test exhibits
drastic swings in performance - I've seen it complete from 30s up to
300s. I've also been taking an offcputime[0] measurements in the case
where high completion times were observed but so far I haven't really
seen shrink_delalloc standing out.

Provide more information how you measured the said slowdown as well as
more information in the changelog about why it's happening. At the very
least this could be split into 2 patches:

1. Could add the percpu counter init + modification in ordered extent
routines

2. Should add the logic in shrink_delalloc. Ideally that patch will
include detailed explanation of how the problem manifests.


Slight off topic:

What purpose do the checks of trans in shrink_delalloc serve? Does it
mean "if there is currently an open transaction don't do any ordered
wait because that's expensive" ?


[0] https://drive.google.com/open?id=1rEtMchqll6LZ0hq7uAzYkC4vY975Mw4i

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/ctree.h        |  1 +
>  fs/btrfs/disk-io.c      | 15 ++++++++++++++-
>  fs/btrfs/extent-tree.c  | 17 +++++++++++++++--
>  fs/btrfs/ordered-data.c |  9 ++++++++-
>  4 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7e774d48c48c..e293d74b2ead 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1016,6 +1016,7 @@ struct btrfs_fs_info {
>  	/* used to keep from writing metadata until there is a nice batch */
>  	struct percpu_counter dirty_metadata_bytes;
>  	struct percpu_counter delalloc_bytes;
> +	struct percpu_counter odirect_bytes;
>  	s32 dirty_metadata_batch;
>  	s32 delalloc_batch;
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7a88de4be8d7..3f0b1854cedc 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2641,11 +2641,17 @@ int open_ctree(struct super_block *sb,
>  		goto fail;
>  	}
>  
> -	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
> +	ret = percpu_counter_init(&fs_info->odirect_bytes, 0, GFP_KERNEL);
>  	if (ret) {
>  		err = ret;
>  		goto fail_srcu;
>  	}
> +
> +	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
> +	if (ret) {
> +		err = ret;
> +		goto fail_odirect_bytes;
> +	}
>  	fs_info->dirty_metadata_batch = PAGE_SIZE *
>  					(1 + ilog2(nr_cpu_ids));
>  
> @@ -3344,6 +3350,8 @@ int open_ctree(struct super_block *sb,
>  	percpu_counter_destroy(&fs_info->delalloc_bytes);
>  fail_dirty_metadata_bytes:
>  	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
> +fail_odirect_bytes:
> +	percpu_counter_destroy(&fs_info->odirect_bytes);
>  fail_srcu:
>  	cleanup_srcu_struct(&fs_info->subvol_srcu);
>  fail:
> @@ -4025,6 +4033,10 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  		       percpu_counter_sum(&fs_info->delalloc_bytes));
>  	}
>  
> +	if (percpu_counter_sum(&fs_info->odirect_bytes))
> +		btrfs_info(fs_info, "at unmount odirect count %lld",
> +			   percpu_counter_sum(&fs_info->odirect_bytes));
> +
>  	btrfs_sysfs_remove_mounted(fs_info);
>  	btrfs_sysfs_remove_fsid(fs_info->fs_devices);
>  
> @@ -4056,6 +4068,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  
>  	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
>  	percpu_counter_destroy(&fs_info->delalloc_bytes);
> +	percpu_counter_destroy(&fs_info->odirect_bytes);
>  	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
>  	cleanup_srcu_struct(&fs_info->subvol_srcu);
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d0626f945de2..0982456ebabb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4727,6 +4727,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  	struct btrfs_space_info *space_info;
>  	struct btrfs_trans_handle *trans;
>  	u64 delalloc_bytes;
> +	u64 odirect_bytes;
>  	u64 async_pages;
>  	u64 items;
>  	long time_left;
> @@ -4742,7 +4743,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  
>  	delalloc_bytes = percpu_counter_sum_positive(
>  						&fs_info->delalloc_bytes);
> -	if (delalloc_bytes == 0) {
> +	odirect_bytes = percpu_counter_sum_positive(
> +						&fs_info->odirect_bytes);
> +	if (delalloc_bytes == 0 && odirect_bytes == 0) {
>  		if (trans)
>  			return;
>  		if (wait_ordered)
> @@ -4750,8 +4753,16 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  		return;
>  	}
>  
> +	/*
> +	 * If we are doing more ordered than delalloc we need to just wait on
> +	 * ordered extents, otherwise we'll waste time trying to flush delalloc
> +	 * that likely won't give us the space back we need.
> +	 */
> +	if (odirect_bytes > delalloc_bytes)
> +		wait_ordered = true;
> +
>  	loops = 0;
> -	while (delalloc_bytes && loops < 3) {
> +	while ((delalloc_bytes || odirect_bytes)  && loops < 3) {
>  		nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
>  
>  		/*
> @@ -4801,6 +4812,8 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  		}
>  		delalloc_bytes = percpu_counter_sum_positive(
>  						&fs_info->delalloc_bytes);
> +		odirect_bytes = percpu_counter_sum_positive(
> +						&fs_info->odirect_bytes);
>  	}
>  }
>  
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 6fde2b2741ef..967c62b85d77 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -194,8 +194,11 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>  	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
>  		set_bit(type, &entry->flags);
>  
> -	if (dio)
> +	if (dio) {
> +		percpu_counter_add_batch(&fs_info->odirect_bytes, len,
> +					 fs_info->delalloc_batch);
>  		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
> +	}
>  
>  	/* one ref for the tree */
>  	refcount_set(&entry->refs, 1);
> @@ -468,6 +471,10 @@ void btrfs_remove_ordered_extent(struct inode *inode,
>  	if (root != fs_info->tree_root)
>  		btrfs_delalloc_release_metadata(btrfs_inode, entry->len, false);
>  
> +	if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
> +		percpu_counter_add_batch(&fs_info->odirect_bytes, -entry->len,
> +					 fs_info->delalloc_batch);
> +
>  	tree = &btrfs_inode->ordered_tree;
>  	spin_lock_irq(&tree->lock);
>  	node = &entry->rb_node;
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-10 19:56 ` [PATCH 2/2] btrfs: reserve delalloc metadata differently Josef Bacik
@ 2019-04-12 13:06   ` Nikolay Borisov
  2019-04-12 13:26     ` Josef Bacik
  2019-04-29 18:33   ` David Sterba
  1 sibling, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2019-04-12 13:06 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs



On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> With the per-inode block rsvs we started refilling the reserve based on
> the calculated size of the outstanding csum bytes and extents for the
> inode, including the amount we were adding with the new operation.
> 
> However generic/224 exposed a problem with this approach.  With 1000
> files all writing at the same time we ended up with a bunch of bytes
> being reserved but unusable.
> 
> When you write to a file we reserve space for the csum leaves for those
> bytes, the number of extent items required to cover those bytes, and a
> single credit for updating the inode at ordered extent finish for that
> range of bytes.  This is held until the ordered extent finishes and we
> release all of the reserved space.
> 
> If a second write comes in at this point we would add a single
> reservation for the new outstanding extent and however many reservations
> for the csum leaves.  

If a second write comes we won't do anything different than the first
i.e calculate the number of extent items + csums bytes required, add
them to the block reservation and call btrfs_inode_rsv_refill which
should refill the delta necessary for the 2nd write.


At this point we find the delta of how much we
> have reserved and how much outstanding size this is and attempt to
> reserve this delta.  If the first write finishes it will not release any
> space, because the space it had reserved for the initial write is still
> needed for the second write.  However some space would have been used,

Each and every reservation is responsible for itself how come the first
one will know some of its space is required for the second, hence it
won't be released?


> as we have added csums, extent items, and dirtied the inode.  Our
> reserved space would be > 0 but < the total needed reserved space.
> 
> This is just for a single inode, now consider generic/224.  This has
> 1000 inodes writing in parallel to a very small file system, 1gib.  In
> my testing this usually means we get about a 120mib metadata area to
> work with, more than enough to allow the writes to continue, but not
> enough if all of the inodes are stuck trying to reserve the slack space
> while continuing to hold their leftovers from their initial writes.
> 
> Fix this by pre-reserved _only_ for the space we are currently trying to
> add.  Then once that is successful modify our inodes csum count and
> outstanding extents, and then add the newly reserved space to the inodes
> block_rsv.  This allows us to actually pass generic/224 without running
> out of metadata space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
>  1 file changed, 53 insertions(+), 92 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0982456ebabb..9aff7a8817d9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5811,85 +5811,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> -static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
> -				u64 *metadata_bytes, u64 *qgroup_bytes)
> -{
> -	*metadata_bytes = 0;
> -	*qgroup_bytes = 0;
> -
> -	spin_lock(&block_rsv->lock);
> -	if (block_rsv->reserved < block_rsv->size)
> -		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> -		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> -			block_rsv->qgroup_rsv_reserved;
> -	spin_unlock(&block_rsv->lock);
> -}
> -
> -/**
> - * btrfs_inode_rsv_refill - refill the inode block rsv.
> - * @inode - the inode we are refilling.
> - * @flush - the flushing restriction.
> - *
> - * Essentially the same as btrfs_block_rsv_refill, except it uses the
> - * block_rsv->size as the minimum size.  We'll either refill the missing amount
> - * or return if we already have enough space.  This will also handle the reserve
> - * tracepoint for the reserved amount.
> - */
> -static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> -				  enum btrfs_reserve_flush_enum flush)
> -{
> -	struct btrfs_root *root = inode->root;
> -	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes, last = 0;
> -	u64 qgroup_num_bytes;
> -	int ret = -ENOSPC;
> -
> -	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
> -	if (num_bytes == 0)
> -		return 0;
> -
> -	do {
> -		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
> -							 true);
> -		if (ret)
> -			return ret;
> -		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> -		if (ret) {
> -			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> -			last = num_bytes;
> -			/*
> -			 * If we are fragmented we can end up with a lot of
> -			 * outstanding extents which will make our size be much
> -			 * larger than our reserved amount.
> -			 *
> -			 * If the reservation happens here, it might be very
> -			 * big though not needed in the end, if the delalloc
> -			 * flushing happens.
> -			 *
> -			 * If this is the case try and do the reserve again.
> -			 */
> -			if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -				calc_refill_bytes(block_rsv, &num_bytes,
> -						   &qgroup_num_bytes);
> -			if (num_bytes == 0)
> -				return 0;
> -		}
> -	} while (ret && last != num_bytes);
> -
> -	if (!ret) {
> -		block_rsv_add_bytes(block_rsv, num_bytes, false);
> -		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> -					      btrfs_ino(inode), num_bytes, 1);
> -
> -		/* Don't forget to increase qgroup_rsv_reserved */
> -		spin_lock(&block_rsv->lock);
> -		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
> -		spin_unlock(&block_rsv->lock);
> -	}
> -	return ret;
> -}
> -
>  static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  				     struct btrfs_block_rsv *block_rsv,
>  				     u64 num_bytes, u64 *qgroup_to_release)
> @@ -6190,9 +6111,26 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&block_rsv->lock);
>  }
>  
> +static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,
> +					   struct btrfs_inode *inode,
> +					   u64 num_bytes, u64 *meta_reserve,
> +					   u64 *qgroup_reserve)
> +{
> +	u64 nr_extents = count_max_extents(num_bytes);
> +	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
> +
> +	/* We add one for the inode update at finish ordered time. */
> +	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
> +						nr_extents + csum_leaves + 1);
> +	*qgroup_reserve = nr_extents * fs_info->nodesize;
> +}
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 meta_reserve, qgroup_reserve;
>  	unsigned nr_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
> @@ -6222,7 +6160,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  
> -	/* Add our new extents and calculate the new rsv size. */
> +	/*
> +	 * Josef, we always want to do it this way, every other way is wrong and
> +	 * ends in tears.  Pre-reserving the amount we are going to add will
> +	 * always be the right way, because otherwise if we have enough
> +	 * parallelism we could end up with thousands of inodes all holding
> +	 * little bits of reservations they were able to make previously and the
> +	 * only way to reclaim that space is to ENOSPC out the operations and
> +	 * clear everything out and try again, which is shitty.  This way we
> +	 * just over-reserve slightly, and clean up the mess when we are done.
> +	 */
> +	calc_inode_reservations(fs_info, inode, num_bytes, &meta_reserve,
> +				&qgroup_reserve);
> +	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
> +	if (ret)
> +		goto out_fail;
> +	ret = reserve_metadata_bytes(root, block_rsv, meta_reserve, flush);
> +	if (ret)
> +		goto out_qgroup;
> +
> +	/*
> +	 * Now we need to update our outstanding extents and csum bytes _first_
> +	 * and then add the reservation to the block_rsv.  This keeps us from
> +	 * racing with an ordered completion or some such that would think it
> +	 * needs to free the reservation we just made.
> +	 */
>  	spin_lock(&inode->lock);
>  	nr_extents = count_max_extents(num_bytes);
>  	btrfs_mod_outstanding_extents(inode, nr_extents);
> @@ -6230,22 +6192,21 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
>  
> -	ret = btrfs_inode_rsv_refill(inode, flush);
> -	if (unlikely(ret))
> -		goto out_fail;
> +	/* Now we can safely add our space to our block rsv. */
> +	block_rsv_add_bytes(block_rsv, meta_reserve, false);
> +	trace_btrfs_space_reservation(root->fs_info, "delalloc",
> +				      btrfs_ino(inode), meta_reserve, 1);
> +
> +	spin_lock(&block_rsv->lock);
> +	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> +	spin_unlock(&block_rsv->lock);
>  
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
>  	return 0;
> -
> +out_qgroup:
> +	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
>  out_fail:
> -	spin_lock(&inode->lock);
> -	nr_extents = count_max_extents(num_bytes);
> -	btrfs_mod_outstanding_extents(inode, -nr_extents);
> -	inode->csum_bytes -= num_bytes;
> -	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> -	spin_unlock(&inode->lock);
> -
>  	btrfs_inode_rsv_release(inode, true);
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-12 13:06   ` Nikolay Borisov
@ 2019-04-12 13:26     ` Josef Bacik
  2019-04-12 13:35       ` Nikolay Borisov
  0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-04-12 13:26 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
> 
> 
> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> > With the per-inode block rsvs we started refilling the reserve based on
> > the calculated size of the outstanding csum bytes and extents for the
> > inode, including the amount we were adding with the new operation.
> > 
> > However generic/224 exposed a problem with this approach.  With 1000
> > files all writing at the same time we ended up with a bunch of bytes
> > being reserved but unusable.
> > 
> > When you write to a file we reserve space for the csum leaves for those
> > bytes, the number of extent items required to cover those bytes, and a
> > single credit for updating the inode at ordered extent finish for that
> > range of bytes.  This is held until the ordered extent finishes and we
> > release all of the reserved space.
> > 
> > If a second write comes in at this point we would add a single
> > reservation for the new outstanding extent and however many reservations
> > for the csum leaves.  
> 
> If a second write comes we won't do anything different than the first
> i.e calculate the number of extent items + csums bytes required, add
> them to the block reservation and call btrfs_inode_rsv_refill which
> should refill the delta necessary for the 2nd write.
> 
> 
> At this point we find the delta of how much we
> > have reserved and how much outstanding size this is and attempt to
> > reserve this delta.  If the first write finishes it will not release any
> > space, because the space it had reserved for the initial write is still
> > needed for the second write.  However some space would have been used,
> 
> Each and every reservation is responsible for itself how come the first
> one will know some of its space is required for the second, hence it
> won't be released?
> 

Write 1 comes in, sets the size to 3mib, reserves 3mib.
Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
  - can't reserve because there's not enough space, starts flushing.
Write 1 finishes, used 1mib of it's 3mib reservation
Write 1 sets the size to 3mib
We still have 2mib in reserves, which is less than 3mib, so no bytes are
  released to the space info.

Now multiply this by 1000, you have 1000 files with 2mib sitting in their
reservations, but they need 2mib, and there's no space to be squeezed from the
rest of the fs, so they start to ENOSPC out one by one.

With the new thing we get this

Write 1 comes in, reserves 3mib, sets the size to 3mib.
Write 2 comes in, attempts to reserve 3mib.
  - can't reserve because there's not enough space, starts flushing.
Write 1 finishes, used 1mib of it's 3mib reservation
Write 1 sets the size to 0mib
Write 1 releases 2mib to the space_info, allowing the next waiter to claim 2mib
  of whatever it's reservation was.

Multiply this by 1000.  As we get smaller and smaller amounts of metadata space
to work with, we get less and less writes happening in parallel, because we only
have X inodes worth of reservations to be in flight at any given time.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] btrfs: track odirect bytes in flight
  2019-04-12 10:17   ` Nikolay Borisov
@ 2019-04-12 13:30     ` Josef Bacik
  2019-04-24 17:26     ` David Sterba
  1 sibling, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-04-12 13:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Apr 12, 2019 at 01:17:40PM +0300, Nikolay Borisov wrote:
> 
> 
> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> > When diagnosing a slowdown of generic/224 I noticed we were wasting a
> > lot of time in shrink_delalloc() despite all writes being O_DIRECT
> > writes.  O_DIRECT writes still have outstanding extents, but obviously
> > cannot be directly flushed, instead we need to wait on their
> > corresponding ordered extent.  Track the outstanding odirect write bytes
> > and if this amount is higher than the delalloc bytes in the system go
> > ahead and force us to wait on the ordered extents.
> 
> This is way too sparse. I've been running generic/224 to try and
> reproduce your slowdown. So far I can confirm that this test exhibits
> drastic swings in performance - I've seen it complete from 30s up to
> 300s. I've also been taking an offcputime[0] measurements in the case
> where high completion times were observed but so far I haven't really
> seen shrink_delalloc standing out.
> 

It's not, I shouldn't have said "wasting time" I should have said most calls to
shrink_delalloc did nothing.  The rest is self explanatory.  shrink_delalloc()
keys off of fs_info->delalloc_bytes, which is 0 if you have nothing but
O_DIRECT.  Thus we need a mechanism for seeing that there's O_DIRECT in flight
and we would benefit from waiting on ordered extents.  The slowdown is addressed
in patch 2, this patch just makes it so shrink_delalloc() actually does
something if we are O_DIRECT.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-12 13:26     ` Josef Bacik
@ 2019-04-12 13:35       ` Nikolay Borisov
  2019-04-12 13:37         ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2019-04-12 13:35 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs



On 12.04.19 г. 16:26 ч., Josef Bacik wrote:
> On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
>>> With the per-inode block rsvs we started refilling the reserve based on
>>> the calculated size of the outstanding csum bytes and extents for the
>>> inode, including the amount we were adding with the new operation.
>>>
>>> However generic/224 exposed a problem with this approach.  With 1000
>>> files all writing at the same time we ended up with a bunch of bytes
>>> being reserved but unusable.
>>>
>>> When you write to a file we reserve space for the csum leaves for those
>>> bytes, the number of extent items required to cover those bytes, and a
>>> single credit for updating the inode at ordered extent finish for that
>>> range of bytes.  This is held until the ordered extent finishes and we
>>> release all of the reserved space.
>>>
>>> If a second write comes in at this point we would add a single
>>> reservation for the new outstanding extent and however many reservations
>>> for the csum leaves.  
>>
>> If a second write comes we won't do anything different than the first
>> i.e calculate the number of extent items + csums bytes required, add
>> them to the block reservation and call btrfs_inode_rsv_refill which
>> should refill the delta necessary for the 2nd write.
>>
>>
>> At this point we find the delta of how much we
>>> have reserved and how much outstanding size this is and attempt to
>>> reserve this delta.  If the first write finishes it will not release any
>>> space, because the space it had reserved for the initial write is still
>>> needed for the second write.  However some space would have been used,
>>
>> Each and every reservation is responsible for itself how come the first
>> one will know some of its space is required for the second, hence it
>> won't be released?
>>
> 
> Write 1 comes in, sets the size to 3mib, reserves 3mib.
> Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
>   - can't reserve because there's not enough space, starts flushing.
> Write 1 finishes, used 1mib of it's 3mib reservation
> Write 1 sets the size to 3mib
> We still have 2mib in reserves, which is less than 3mib, so no bytes are
>   released to the space info.
> 
> Now multiply this by 1000, you have 1000 files with 2mib sitting in their
> reservations, but they need 2mib, and there's no space to be squeezed from the
> rest of the fs, so they start to ENOSPC out one by one.
> 
> With the new thing we get this
> 
> Write 1 comes in, reserves 3mib, sets the size to 3mib.
> Write 2 comes in, attempts to reserve 3mib.
>   - can't reserve because there's not enough space, starts flushing.

Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
happening in btrfs_delalloc_reserve_metadata whatsoever. None of the 2
remaining callers of btrfs_delalloc_reserve_metadata does any flushing
based on the retval of that function.

This actually means that you should also remove the 'flush' variable in
the same function otherwise you are leaving unused variable behind,
which is not nice.

> Write 1 finishes, used 1mib of it's 3mib reservation
> Write 1 sets the size to 0mib

Reveweing the way oustanding_extents are used I'm getting a little bit
confused - on the one hand we are modifying this value when we reserve
metadata on the other hand we are also modifying the count when adding
ordered extents. Shouldn't that count have been already accounted when
doing the initial metadata reservation?

> Write 1 releases 2mib to the space_info, allowing the next waiter to claim 2mib
>   of whatever it's reservation was.
> 
> Multiply this by 1000.  As we get smaller and smaller amounts of metadata space
> to work with, we get less and less writes happening in parallel, because we only
> have X inodes worth of reservations to be in flight at any given time.  Thanks,
> 
> Josef
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-12 13:35       ` Nikolay Borisov
@ 2019-04-12 13:37         ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2019-04-12 13:37 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Apr 12, 2019 at 04:35:20PM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.04.19 г. 16:26 ч., Josef Bacik wrote:
> > On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> >>> With the per-inode block rsvs we started refilling the reserve based on
> >>> the calculated size of the outstanding csum bytes and extents for the
> >>> inode, including the amount we were adding with the new operation.
> >>>
> >>> However generic/224 exposed a problem with this approach.  With 1000
> >>> files all writing at the same time we ended up with a bunch of bytes
> >>> being reserved but unusable.
> >>>
> >>> When you write to a file we reserve space for the csum leaves for those
> >>> bytes, the number of extent items required to cover those bytes, and a
> >>> single credit for updating the inode at ordered extent finish for that
> >>> range of bytes.  This is held until the ordered extent finishes and we
> >>> release all of the reserved space.
> >>>
> >>> If a second write comes in at this point we would add a single
> >>> reservation for the new outstanding extent and however many reservations
> >>> for the csum leaves.  
> >>
> >> If a second write comes we won't do anything different than the first
> >> i.e calculate the number of extent items + csums bytes required, add
> >> them to the block reservation and call btrfs_inode_rsv_refill which
> >> should refill the delta necessary for the 2nd write.
> >>
> >>
> >> At this point we find the delta of how much we
> >>> have reserved and how much outstanding size this is and attempt to
> >>> reserve this delta.  If the first write finishes it will not release any
> >>> space, because the space it had reserved for the initial write is still
> >>> needed for the second write.  However some space would have been used,
> >>
> >> Each and every reservation is responsible for itself how come the first
> >> one will know some of its space is required for the second, hence it
> >> won't be released?
> >>
> > 
> > Write 1 comes in, sets the size to 3mib, reserves 3mib.
> > Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
> >   - can't reserve because there's not enough space, starts flushing.
> > Write 1 finishes, used 1mib of it's 3mib reservation
> > Write 1 sets the size to 3mib
> > We still have 2mib in reserves, which is less than 3mib, so no bytes are
> >   released to the space info.
> > 
> > Now multiply this by 1000, you have 1000 files with 2mib sitting in their
> > reservations, but they need 2mib, and there's no space to be squeezed from the
> > rest of the fs, so they start to ENOSPC out one by one.
> > 
> > With the new thing we get this
> > 
> > Write 1 comes in, reserves 3mib, sets the size to 3mib.
> > Write 2 comes in, attempts to reserve 3mib.
> >   - can't reserve because there's not enough space, starts flushing.
> 
> Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
> happening in btrfs_delalloc_reserve_metadata whatsoever. None of the 2
> remaining callers of btrfs_delalloc_reserve_metadata does any flushing
> based on the retval of that function.
> 

Please go read the code again.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] btrfs: track odirect bytes in flight
  2019-04-12 10:17   ` Nikolay Borisov
  2019-04-12 13:30     ` Josef Bacik
@ 2019-04-24 17:26     ` David Sterba
  1 sibling, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-04-24 17:26 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, linux-btrfs

On Fri, Apr 12, 2019 at 01:17:40PM +0300, Nikolay Borisov wrote:
> 
> 
> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> > When diagnosing a slowdown of generic/224 I noticed we were wasting a
> > lot of time in shrink_delalloc() despite all writes being O_DIRECT
> > writes.  O_DIRECT writes still have outstanding extents, but obviously
> > cannot be directly flushed, instead we need to wait on their
> > corresponding ordered extent.  Track the outstanding odirect write bytes
> > and if this amount is higher than the delalloc bytes in the system go
> > ahead and force us to wait on the ordered extents.
> 
> This is way too sparse. I've been running generic/224 to try and
> reproduce your slowdown. So far I can confirm that this test exhibits
> drastic swings in performance - I've seen it complete from 30s up to
> 300s. I've also been taking an offcputime[0] measurements in the case
> where high completion times were observed but so far I haven't really
> seen shrink_delalloc standing out.
> 
> Provide more information how you measured the said slowdown as well as
> more information in the changelog about why it's happening. At the very
> least this could be split into 2 patches:
> 
> 1. Could add the percpu counter init + modification in ordered extent
> routines
> 
> 2. Should add the logic in shrink_delalloc. Ideally that patch will
> include detailed explanation of how the problem manifests.

I don't think splitting the init code is required here, I did not find
it distracting while reading the patch.

The 'ideally' part of your comment is be something that we should not
ask for. Missing or vague explanation for change like that is a bad
practice. Josef, please update the changelog and resend, thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] ENOSPC refinements
  2019-04-10 19:56 [PATCH 0/2] ENOSPC refinements Josef Bacik
  2019-04-10 19:56 ` [PATCH 1/2] btrfs: track odirect bytes in flight Josef Bacik
  2019-04-10 19:56 ` [PATCH 2/2] btrfs: reserve delalloc metadata differently Josef Bacik
@ 2019-04-25 15:22 ` David Sterba
  2019-04-25 20:52 ` [PATCH][v2] btrfs: track odirect bytes in flight Josef Bacik
  3 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-04-25 15:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Wed, Apr 10, 2019 at 03:56:08PM -0400, Josef Bacik wrote:
> I noticed our continuous testing for btrfs started timing out recently, and dug
> down to discover that generic/224 was taking upwards of an hour to run.  This
> test should take seconds to run, so this is a problem.
> 
> Fix this by reworking how we do delalloc metadata reservations, as described in
> patch 2/2.
> 
> I also noticed we don't quite do the right thing when we are mostly O_DIRECT,
> and addressed this in patch 1/2.  I've run these through xfstests and everything
> is back to normal.  Thanks,

The patches have been in for-next for some time, I'd like to merge them
to 5.2 queue. Please update the changelogs, either post just the text or
whole patches. Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH][v2] btrfs: track odirect bytes in flight
  2019-04-10 19:56 [PATCH 0/2] ENOSPC refinements Josef Bacik
                   ` (2 preceding siblings ...)
  2019-04-25 15:22 ` [PATCH 0/2] ENOSPC refinements David Sterba
@ 2019-04-25 20:52 ` Josef Bacik
  2019-04-29 18:17   ` David Sterba
  3 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2019-04-25 20:52 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When diagnosing a slowdown of generic/224 I noticed we were not doing
anything when calling into shrink_delalloc().  This is because all
writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
counter is 0, which short circuits most of the work inside of
shrink_delalloc().  However O_DIRECT writes still consume metadata
resources and generate ordered extents, which we can still wait on.

Fix this by tracking outstandingn odirect write bytes, and use this as
well as the delalloc byts counter to decide if we need to lookup and
wait on any ordered extents.  If we have more odirect writes than
delalloc bytes we'll go ahead and wait on any ordered extents
irregardless of our flush state as flushing delalloc is likely to not
gain us anything.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
v1->v2:
- updated the changelog

 fs/btrfs/ctree.h        |  1 +
 fs/btrfs/disk-io.c      | 15 ++++++++++++++-
 fs/btrfs/extent-tree.c  | 17 +++++++++++++++--
 fs/btrfs/ordered-data.c |  9 ++++++++-
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7e774d48c48c..e293d74b2ead 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1016,6 +1016,7 @@ struct btrfs_fs_info {
 	/* used to keep from writing metadata until there is a nice batch */
 	struct percpu_counter dirty_metadata_bytes;
 	struct percpu_counter delalloc_bytes;
+	struct percpu_counter odirect_bytes;
 	s32 dirty_metadata_batch;
 	s32 delalloc_batch;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7a88de4be8d7..3f0b1854cedc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2641,11 +2641,17 @@ int open_ctree(struct super_block *sb,
 		goto fail;
 	}
 
-	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
+	ret = percpu_counter_init(&fs_info->odirect_bytes, 0, GFP_KERNEL);
 	if (ret) {
 		err = ret;
 		goto fail_srcu;
 	}
+
+	ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
+	if (ret) {
+		err = ret;
+		goto fail_odirect_bytes;
+	}
 	fs_info->dirty_metadata_batch = PAGE_SIZE *
 					(1 + ilog2(nr_cpu_ids));
 
@@ -3344,6 +3350,8 @@ int open_ctree(struct super_block *sb,
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
 fail_dirty_metadata_bytes:
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
+fail_odirect_bytes:
+	percpu_counter_destroy(&fs_info->odirect_bytes);
 fail_srcu:
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
 fail:
@@ -4025,6 +4033,10 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 		       percpu_counter_sum(&fs_info->delalloc_bytes));
 	}
 
+	if (percpu_counter_sum(&fs_info->odirect_bytes))
+		btrfs_info(fs_info, "at unmount odirect count %lld",
+			   percpu_counter_sum(&fs_info->odirect_bytes));
+
 	btrfs_sysfs_remove_mounted(fs_info);
 	btrfs_sysfs_remove_fsid(fs_info->fs_devices);
 
@@ -4056,6 +4068,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
 	percpu_counter_destroy(&fs_info->delalloc_bytes);
+	percpu_counter_destroy(&fs_info->odirect_bytes);
 	percpu_counter_destroy(&fs_info->dev_replace.bio_counter);
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d0626f945de2..0982456ebabb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,6 +4727,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 	struct btrfs_space_info *space_info;
 	struct btrfs_trans_handle *trans;
 	u64 delalloc_bytes;
+	u64 odirect_bytes;
 	u64 async_pages;
 	u64 items;
 	long time_left;
@@ -4742,7 +4743,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 
 	delalloc_bytes = percpu_counter_sum_positive(
 						&fs_info->delalloc_bytes);
-	if (delalloc_bytes == 0) {
+	odirect_bytes = percpu_counter_sum_positive(
+						&fs_info->odirect_bytes);
+	if (delalloc_bytes == 0 && odirect_bytes == 0) {
 		if (trans)
 			return;
 		if (wait_ordered)
@@ -4750,8 +4753,16 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 		return;
 	}
 
+	/*
+	 * If we are doing more ordered than delalloc we need to just wait on
+	 * ordered extents, otherwise we'll waste time trying to flush delalloc
+	 * that likely won't give us the space back we need.
+	 */
+	if (odirect_bytes > delalloc_bytes)
+		wait_ordered = true;
+
 	loops = 0;
-	while (delalloc_bytes && loops < 3) {
+	while ((delalloc_bytes || odirect_bytes)  && loops < 3) {
 		nr_pages = min(delalloc_bytes, to_reclaim) >> PAGE_SHIFT;
 
 		/*
@@ -4801,6 +4812,8 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 		}
 		delalloc_bytes = percpu_counter_sum_positive(
 						&fs_info->delalloc_bytes);
+		odirect_bytes = percpu_counter_sum_positive(
+						&fs_info->odirect_bytes);
 	}
 }
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6fde2b2741ef..967c62b85d77 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -194,8 +194,11 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
-	if (dio)
+	if (dio) {
+		percpu_counter_add_batch(&fs_info->odirect_bytes, len,
+					 fs_info->delalloc_batch);
 		set_bit(BTRFS_ORDERED_DIRECT, &entry->flags);
+	}
 
 	/* one ref for the tree */
 	refcount_set(&entry->refs, 1);
@@ -468,6 +471,10 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 	if (root != fs_info->tree_root)
 		btrfs_delalloc_release_metadata(btrfs_inode, entry->len, false);
 
+	if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
+		percpu_counter_add_batch(&fs_info->odirect_bytes, -entry->len,
+					 fs_info->delalloc_batch);
+
 	tree = &btrfs_inode->ordered_tree;
 	spin_lock_irq(&tree->lock);
 	node = &entry->rb_node;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH][v2] btrfs: track odirect bytes in flight
  2019-04-25 20:52 ` [PATCH][v2] btrfs: track odirect bytes in flight Josef Bacik
@ 2019-04-29 18:17   ` David Sterba
  0 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-04-29 18:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Thu, Apr 25, 2019 at 04:52:47PM -0400, Josef Bacik wrote:
> When diagnosing a slowdown of generic/224 I noticed we were not doing
> anything when calling into shrink_delalloc().  This is because all
> writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
> counter is 0, which short circuits most of the work inside of
> shrink_delalloc().  However O_DIRECT writes still consume metadata
> resources and generate ordered extents, which we can still wait on.
> 
> Fix this by tracking outstandingn odirect write bytes, and use this as
> well as the delalloc byts counter to decide if we need to lookup and
> wait on any ordered extents.  If we have more odirect writes than
> delalloc bytes we'll go ahead and wait on any ordered extents
> irregardless of our flush state as flushing delalloc is likely to not
> gain us anything.

Ok, that helps.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> v1->v2:
> - updated the changelog
> 
>  fs/btrfs/ctree.h        |  1 +
>  fs/btrfs/disk-io.c      | 15 ++++++++++++++-
>  fs/btrfs/extent-tree.c  | 17 +++++++++++++++--
>  fs/btrfs/ordered-data.c |  9 ++++++++-
>  4 files changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7e774d48c48c..e293d74b2ead 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1016,6 +1016,7 @@ struct btrfs_fs_info {
>  	/* used to keep from writing metadata until there is a nice batch */
>  	struct percpu_counter dirty_metadata_bytes;
>  	struct percpu_counter delalloc_bytes;
> +	struct percpu_counter odirect_bytes;

I've changed odirect to dio everywhere, as this is the common reference
to direct IO kernel code, O_DIRECT is for userspace.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently
  2019-04-10 19:56 ` [PATCH 2/2] btrfs: reserve delalloc metadata differently Josef Bacik
  2019-04-12 13:06   ` Nikolay Borisov
@ 2019-04-29 18:33   ` David Sterba
  1 sibling, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-04-29 18:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Wed, Apr 10, 2019 at 03:56:10PM -0400, Josef Bacik wrote:
> With the per-inode block rsvs we started refilling the reserve based on
> the calculated size of the outstanding csum bytes and extents for the
> inode, including the amount we were adding with the new operation.
> 
> However generic/224 exposed a problem with this approach.  With 1000
> files all writing at the same time we ended up with a bunch of bytes
> being reserved but unusable.
> 
> When you write to a file we reserve space for the csum leaves for those
> bytes, the number of extent items required to cover those bytes, and a
> single credit for updating the inode at ordered extent finish for that
> range of bytes.  This is held until the ordered extent finishes and we
> release all of the reserved space.
> 
> If a second write comes in at this point we would add a single
> reservation for the new outstanding extent and however many reservations
> for the csum leaves.  At this point we find the delta of how much we
> have reserved and how much outstanding size this is and attempt to
> reserve this delta.  If the first write finishes it will not release any
> space, because the space it had reserved for the initial write is still
> needed for the second write.  However some space would have been used,
> as we have added csums, extent items, and dirtied the inode.  Our
> reserved space would be > 0 but < the total needed reserved space.
> 
> This is just for a single inode, now consider generic/224.  This has
> 1000 inodes writing in parallel to a very small file system, 1gib.  In
> my testing this usually means we get about a 120mib metadata area to
> work with, more than enough to allow the writes to continue, but not
> enough if all of the inodes are stuck trying to reserve the slack space
> while continuing to hold their leftovers from their initial writes.
> 
> Fix this by pre-reserved _only_ for the space we are currently trying to
> add.  Then once that is successful modify our inodes csum count and
> outstanding extents, and then add the newly reserved space to the inodes
> block_rsv.  This allows us to actually pass generic/224 without running
> out of metadata space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
>  1 file changed, 53 insertions(+), 92 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0982456ebabb..9aff7a8817d9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5811,85 +5811,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> -static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
> -				u64 *metadata_bytes, u64 *qgroup_bytes)
> -{
> -	*metadata_bytes = 0;
> -	*qgroup_bytes = 0;
> -
> -	spin_lock(&block_rsv->lock);
> -	if (block_rsv->reserved < block_rsv->size)
> -		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> -		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> -			block_rsv->qgroup_rsv_reserved;
> -	spin_unlock(&block_rsv->lock);
> -}
> -
> -/**
> - * btrfs_inode_rsv_refill - refill the inode block rsv.
> - * @inode - the inode we are refilling.
> - * @flush - the flushing restriction.
> - *
> - * Essentially the same as btrfs_block_rsv_refill, except it uses the
> - * block_rsv->size as the minimum size.  We'll either refill the missing amount
> - * or return if we already have enough space.  This will also handle the reserve
> - * tracepoint for the reserved amount.
> - */
> -static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> -				  enum btrfs_reserve_flush_enum flush)
> -{
> -	struct btrfs_root *root = inode->root;
> -	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes, last = 0;
> -	u64 qgroup_num_bytes;
> -	int ret = -ENOSPC;
> -
> -	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
> -	if (num_bytes == 0)
> -		return 0;
> -
> -	do {
> -		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
> -							 true);
> -		if (ret)
> -			return ret;
> -		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> -		if (ret) {
> -			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> -			last = num_bytes;
> -			/*
> -			 * If we are fragmented we can end up with a lot of
> -			 * outstanding extents which will make our size be much
> -			 * larger than our reserved amount.
> -			 *
> -			 * If the reservation happens here, it might be very
> -			 * big though not needed in the end, if the delalloc
> -			 * flushing happens.
> -			 *
> -			 * If this is the case try and do the reserve again.
> -			 */
> -			if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -				calc_refill_bytes(block_rsv, &num_bytes,
> -						   &qgroup_num_bytes);
> -			if (num_bytes == 0)
> -				return 0;
> -		}
> -	} while (ret && last != num_bytes);
> -
> -	if (!ret) {
> -		block_rsv_add_bytes(block_rsv, num_bytes, false);
> -		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> -					      btrfs_ino(inode), num_bytes, 1);
> -
> -		/* Don't forget to increase qgroup_rsv_reserved */
> -		spin_lock(&block_rsv->lock);
> -		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
> -		spin_unlock(&block_rsv->lock);
> -	}
> -	return ret;
> -}
> -
>  static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  				     struct btrfs_block_rsv *block_rsv,
>  				     u64 num_bytes, u64 *qgroup_to_release)
> @@ -6190,9 +6111,26 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&block_rsv->lock);
>  }
>  
> +static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,

I don't think this needs to be a static inline, just static.

> +					   struct btrfs_inode *inode,

Unused patameter.

> +					   u64 num_bytes, u64 *meta_reserve,
> +					   u64 *qgroup_reserve)
> +{
> +	u64 nr_extents = count_max_extents(num_bytes);
> +	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
> +
> +	/* We add one for the inode update at finish ordered time. */
> +	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
> +						nr_extents + csum_leaves + 1);
> +	*qgroup_reserve = nr_extents * fs_info->nodesize;
> +}
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 meta_reserve, qgroup_reserve;
>  	unsigned nr_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
> @@ -6222,7 +6160,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  
> -	/* Add our new extents and calculate the new rsv size. */
> +	/*
> +	 * Josef, we always want to do it this way, every other way is wrong and

Come on, talking to yourself in comments again?

> +	 * ends in tears.  Pre-reserving the amount we are going to add will
> +	 * always be the right way, because otherwise if we have enough
> +	 * parallelism we could end up with thousands of inodes all holding
> +	 * little bits of reservations they were able to make previously and the
> +	 * only way to reclaim that space is to ENOSPC out the operations and
> +	 * clear everything out and try again, which is shitty.  This way we

'shitty' replaced by 'bad'

Otherwise, patches updated and added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-04-29 18:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10 19:56 [PATCH 0/2] ENOSPC refinements Josef Bacik
2019-04-10 19:56 ` [PATCH 1/2] btrfs: track odirect bytes in flight Josef Bacik
2019-04-12 10:17   ` Nikolay Borisov
2019-04-12 13:30     ` Josef Bacik
2019-04-24 17:26     ` David Sterba
2019-04-10 19:56 ` [PATCH 2/2] btrfs: reserve delalloc metadata differently Josef Bacik
2019-04-12 13:06   ` Nikolay Borisov
2019-04-12 13:26     ` Josef Bacik
2019-04-12 13:35       ` Nikolay Borisov
2019-04-12 13:37         ` Josef Bacik
2019-04-29 18:33   ` David Sterba
2019-04-25 15:22 ` [PATCH 0/2] ENOSPC refinements David Sterba
2019-04-25 20:52 ` [PATCH][v2] btrfs: track odirect bytes in flight Josef Bacik
2019-04-29 18:17   ` David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.