linux-btrfs.vger.kernel.org archive mirror
* [PATCH 00/14] Enospc rework
@ 2016-03-25 17:25 Josef Bacik
  2016-03-25 17:25 ` [PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once Josef Bacik
                   ` (13 more replies)
  0 siblings, 14 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

1) Huge latency spikes.  One guy starts flushing; he doesn't wake up until the
flushers have finished their work, and only then does he check whether he can
continue.  Meanwhile everybody else is backed up waiting for that guy to finish
getting his reservation.

2) The flushers flush everything.  They have no idea when to stop, so they just
flush all of delalloc or all of the delayed inodes.  At first they try to flush
a little bit and hope they can get away with it, but the tighter you get on
space the more it becomes flush the world and hope for the best.

3) Some of the flushing isn't async, yay more latency.

The new approach introduces the idea of tickets for reservations.  If you cannot
make your reservation immediately you initialize a ticket with how much space
you need and put yourself on a list.  If you cannot flush anything (things like
dirtying an inode) then you add yourself to the priority queue and wait for a
little bit.  If you can flush then you add yourself to the normal queue and wait
for flushing to happen.  Each ticket has its own waitqueue, so as we add space
back into the system we can satisfy reservations and wake the waiters
immediately, which greatly reduces latencies.
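
To make that concrete, this is roughly the shape of a ticket (the real structure
and the code that services it are in patch 7; shown here just for illustration):

	struct reserve_ticket {
		u64 bytes;		/* bytes this waiter is still owed */
		int error;		/* set to -ENOSPC if flushing gives up */
		struct list_head list;	/* entry on the tickets/priority_tickets lists */
		wait_queue_head_t wait;	/* per-ticket waitqueue */
	};

As space comes back to the space_info (block rsv releases, unpinned extents, new
chunks) we walk the ticket lists in order, subtract from ->bytes, and wake the
owner as soon as it hits zero.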

I've been testing these patches for a while and will be building on them from
here, but the results are pretty excellent so far.  For an fs_mark test that is
all metadata, here are the results (on an empty file system):

Without Patch
Average Files/sec:     212897.2
p50 Files/sec: 207495
p90 Files/sec: 196709
p99 Files/sec: 189682

Creat Max Latency in usec
p50: 264665
p90: 456347.2
p99: 659489.32
max: 1001413

With Patch
Average Files/sec:     238613.4  
p50 Files/sec: 235764  
p90 Files/sec: 223308  
p99 Files/sec: 216291 

Creat Max Latency in usec
p50: 206771.5
p90: 355430.6
p99: 469634.98
max: 512389

So as you can see, latency is quite a bit better and throughput is better
overall.  There will be more work as I test the worst case scenarios and get
the worst latencies down further, but this is the initial work.  Thanks,

Josef



* [PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 02/14] Btrfs: fix callers of btrfs_block_rsv_migrate Josef Bacik
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

For some reason we're adding bytes_readonly to the space info only after we
update the space info with the block group info.  This creates a tiny race where
we could over-reserve space because we haven't yet accounted for the read-only
bytes.  Since we already know this information at the time we call
update_space_info, just pass it along so everything can be updated at once.
Thanks,
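
For illustration, here is the btrfs_read_block_groups() caller before and after
(condensed from the hunks below); the read-only bytes now go in under the same
space_info update instead of a second, later locked section:

	/* before: bytes_readonly accounted separately, after the update */
	ret = update_space_info(info, cache->flags, found_key.offset,
				btrfs_block_group_used(&cache->item),
				&space_info);
	...
	spin_lock(&cache->space_info->lock);
	cache->space_info->bytes_readonly += cache->bytes_super;
	spin_unlock(&cache->space_info->lock);

	/* after: accounted in the same update_space_info() call */
	ret = update_space_info(info, cache->flags, found_key.offset,
				btrfs_block_group_used(&cache->item),
				cache->bytes_super, &space_info);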

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..c357c96 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3843,6 +3843,7 @@ static const char *alloc_name(u64 flags)
 
 static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
+			     u64 bytes_readonly,
 			     struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
@@ -3863,6 +3864,7 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		found->disk_total += total_bytes * factor;
 		found->bytes_used += bytes_used;
 		found->disk_used += bytes_used * factor;
+		found->bytes_readonly += bytes_readonly;
 		if (total_bytes > 0)
 			found->full = 0;
 		spin_unlock(&found->lock);
@@ -3890,7 +3892,7 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->disk_used = bytes_used * factor;
 	found->bytes_pinned = 0;
 	found->bytes_reserved = 0;
-	found->bytes_readonly = 0;
+	found->bytes_readonly = bytes_readonly;
 	found->bytes_may_use = 0;
 	found->full = 0;
 	found->max_extent_size = 0;
@@ -4400,7 +4402,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	space_info = __find_space_info(extent_root->fs_info, flags);
 	if (!space_info) {
 		ret = update_space_info(extent_root->fs_info, flags,
-					0, 0, &space_info);
+					0, 0, 0, &space_info);
 		BUG_ON(ret); /* -ENOMEM */
 	}
 	BUG_ON(!space_info); /* Logic error */
@@ -9862,7 +9864,7 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 
 		ret = update_space_info(info, cache->flags, found_key.offset,
 					btrfs_block_group_used(&cache->item),
-					&space_info);
+					cache->bytes_super, &space_info);
 		if (ret) {
 			btrfs_remove_free_space_cache(cache);
 			spin_lock(&info->block_group_cache_lock);
@@ -9875,9 +9877,6 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 		}
 
 		cache->space_info = space_info;
-		spin_lock(&cache->space_info->lock);
-		cache->space_info->bytes_readonly += cache->bytes_super;
-		spin_unlock(&cache->space_info->lock);
 
 		__link_block_group(space_info, cache);
 
@@ -9969,7 +9968,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	int ret;
 	struct btrfs_root *extent_root;
 	struct btrfs_block_group_cache *cache;
-
 	extent_root = root->fs_info->extent_root;
 
 	btrfs_set_log_full_commit(root->fs_info, trans);
@@ -10015,7 +10013,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	 * assigned to our block group, but don't update its counters just yet.
 	 * We want our bg to be added to the rbtree with its ->space_info set.
 	 */
-	ret = update_space_info(root->fs_info, cache->flags, 0, 0,
+	ret = update_space_info(root->fs_info, cache->flags, 0, 0, 0,
 				&cache->space_info);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
@@ -10035,7 +10033,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	 * the rbtree, update the space info's counters.
 	 */
 	ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
-				&cache->space_info);
+				cache->bytes_super, &cache->space_info);
 	if (ret) {
 		btrfs_remove_free_space_cache(cache);
 		spin_lock(&root->fs_info->block_group_cache_lock);
@@ -10048,16 +10046,11 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	}
 	update_global_block_rsv(root->fs_info);
 
-	spin_lock(&cache->space_info->lock);
-	cache->space_info->bytes_readonly += cache->bytes_super;
-	spin_unlock(&cache->space_info->lock);
-
 	__link_block_group(cache->space_info, cache);
 
 	list_add_tail(&cache->bg_list, &trans->new_bgs);
 
 	set_avail_alloc_bits(extent_root->fs_info, type);
-
 	return 0;
 }
 
@@ -10602,21 +10595,21 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 		mixed = 1;
 
 	flags = BTRFS_BLOCK_GROUP_SYSTEM;
-	ret = update_space_info(fs_info, flags, 0, 0, &space_info);
+	ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
 	if (ret)
 		goto out;
 
 	if (mixed) {
 		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
-		ret = update_space_info(fs_info, flags, 0, 0, &space_info);
+		ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
 	} else {
 		flags = BTRFS_BLOCK_GROUP_METADATA;
-		ret = update_space_info(fs_info, flags, 0, 0, &space_info);
+		ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
 		if (ret)
 			goto out;
 
 		flags = BTRFS_BLOCK_GROUP_DATA;
-		ret = update_space_info(fs_info, flags, 0, 0, &space_info);
+		ret = update_space_info(fs_info, flags, 0, 0, 0, &space_info);
 	}
 out:
 	return ret;
-- 
2.5.0



* [PATCH 02/14] Btrfs: fix callers of btrfs_block_rsv_migrate
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
  2016-03-25 17:25 ` [PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents Josef Bacik
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only that, it unconditionally grows the size of the destination block_rsv.
This isn't a bug strictly speaking, but it makes the truncate block rsv look
funny because every time we migrate bytes over its size grows, even though we
only want it to be a specific size.  So collapse this into one function that
takes an update_size argument, and make truncate and evict not update the size,
for consistency's sake.  Thanks,
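
For example, after this change the delayed-inode path still grows the
destination rsv while the truncate path just moves the bytes (condensed from
the hunks below):

	/* delayed inode: the migration should also grow dst_rsv->size */
	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);

	/* truncate: rsv is already the size we want, just move the bytes */
	ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv,
				      min_size, 0);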

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.h         |  4 ++--
 fs/btrfs/delayed-inode.c |  8 ++++----
 fs/btrfs/extent-tree.c   | 18 ++++++------------
 fs/btrfs/file.c          |  4 ++--
 fs/btrfs/inode.c         |  7 +++----
 fs/btrfs/relocation.c    |  2 +-
 6 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..b675066 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3646,8 +3646,8 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 			   struct btrfs_block_rsv *block_rsv, u64 min_reserved,
 			   enum btrfs_reserve_flush_enum flush);
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
-			    struct btrfs_block_rsv *dst_rsv,
-			    u64 num_bytes);
+			    struct btrfs_block_rsv *dst_rsv, u64 num_bytes,
+			    int update_size);
 int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 			     struct btrfs_block_rsv *dest, u64 num_bytes,
 			     int min_factor);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 6cef006..d3cda0f 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -553,7 +553,7 @@ static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans,
 	dst_rsv = &root->fs_info->delayed_block_rsv;
 
 	num_bytes = btrfs_calc_trans_metadata_size(root, 1);
-	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 	if (!ret) {
 		trace_btrfs_space_reservation(root->fs_info, "delayed_item",
 					      item->key.objectid,
@@ -649,7 +649,7 @@ static int btrfs_delayed_inode_reserve_metadata(
 		if (!ret)
 			goto out;
 
-		ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+		ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 		if (!ret)
 			goto out;
 
@@ -663,12 +663,12 @@ static int btrfs_delayed_inode_reserve_metadata(
 		 * since this really shouldn't happen that often.
 		 */
 		ret = btrfs_block_rsv_migrate(&root->fs_info->global_block_rsv,
-					      dst_rsv, num_bytes);
+					      dst_rsv, num_bytes, 1);
 		goto out;
 	}
 
 migrate:
-	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes);
+	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 
 out:
 	/*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c357c96..06f4e7b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5192,8 +5192,9 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 	}
 }
 
-static int block_rsv_migrate_bytes(struct btrfs_block_rsv *src,
-				   struct btrfs_block_rsv *dst, u64 num_bytes)
+int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
+			    struct btrfs_block_rsv *dst, u64 num_bytes,
+			    int update_size)
 {
 	int ret;
 
@@ -5201,7 +5202,7 @@ static int block_rsv_migrate_bytes(struct btrfs_block_rsv *src,
 	if (ret)
 		return ret;
 
-	block_rsv_add_bytes(dst, num_bytes, 1);
+	block_rsv_add_bytes(dst, num_bytes, update_size);
 	return 0;
 }
 
@@ -5308,13 +5309,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
-int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
-			    struct btrfs_block_rsv *dst_rsv,
-			    u64 num_bytes)
-{
-	return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
-}
-
 void btrfs_block_rsv_release(struct btrfs_root *root,
 			     struct btrfs_block_rsv *block_rsv,
 			     u64 num_bytes)
@@ -5494,7 +5488,7 @@ int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
 	u64 num_bytes = btrfs_calc_trans_metadata_size(root, 1);
 	trace_btrfs_space_reservation(root->fs_info, "orphan",
 				      btrfs_ino(inode), num_bytes, 1);
-	return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
+	return btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 }
 
 void btrfs_orphan_release_metadata(struct inode *inode)
@@ -5549,7 +5543,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 				  BTRFS_RESERVE_FLUSH_ALL);
 
 	if (ret == -ENOSPC && use_global_rsv)
-		ret = btrfs_block_rsv_migrate(global_rsv, rsv, num_bytes);
+		ret = btrfs_block_rsv_migrate(global_rsv, rsv, num_bytes, 1);
 
 	if (ret && *qgroup_reserved)
 		btrfs_qgroup_free_meta(root, *qgroup_reserved);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 15a09cb..0ce4bb3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2474,7 +2474,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	}
 
 	ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv,
-				      min_size);
+				      min_size, 0);
 	BUG_ON(ret);
 	trans->block_rsv = rsv;
 
@@ -2517,7 +2517,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		}
 
 		ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv,
-					      rsv, min_size);
+					      rsv, min_size, 0);
 		BUG_ON(ret);	/* shouldn't happen */
 		trans->block_rsv = rsv;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..723e4bb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5239,7 +5239,7 @@ void btrfs_evict_inode(struct inode *inode)
 		if (steal_from_global) {
 			if (!btrfs_check_space_for_delayed_refs(trans, root))
 				ret = btrfs_block_rsv_migrate(global_rsv, rsv,
-							      min_size);
+							      min_size, 0);
 			else
 				ret = -ENOSPC;
 		}
@@ -9072,7 +9072,7 @@ static int btrfs_truncate(struct inode *inode)
 
 	/* Migrate the slack space for the truncate to our reserve */
 	ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv, rsv,
-				      min_size);
+				      min_size, 0);
 	BUG_ON(ret);
 
 	/*
@@ -9112,7 +9112,7 @@ static int btrfs_truncate(struct inode *inode)
 		}
 
 		ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv,
-					      rsv, min_size);
+					      rsv, min_size, 0);
 		BUG_ON(ret);	/* shouldn't happen */
 		trans->block_rsv = rsv;
 	}
@@ -9133,7 +9133,6 @@ static int btrfs_truncate(struct inode *inode)
 		ret = btrfs_end_transaction(trans, root);
 		btrfs_btree_balance_dirty(root);
 	}
-
 out:
 	btrfs_free_block_rsv(root, rsv);
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 2bd0011..61efa5f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4644,7 +4644,7 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 	if (rc->merge_reloc_tree) {
 		ret = btrfs_block_rsv_migrate(&pending->block_rsv,
 					      rc->block_rsv,
-					      rc->nodes_relocated);
+					      rc->nodes_relocated, 1);
 		if (ret)
 			return ret;
 	}
-- 
2.5.0



* [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
  2016-03-25 17:25 ` [PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once Josef Bacik
  2016-03-25 17:25 ` [PATCH 02/14] Btrfs: fix callers of btrfs_block_rsv_migrate Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 18:04   ` Liu Bo
  2016-03-25 17:25 ` [PATCH 04/14] Btrfs: change delayed reservation fallback behavior Josef Bacik
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

There are a few races in the metadata reservation stuff.  First, we add the
bytes to the block_rsv well after we've set the bit on the inode saying that we
have space for it, and well after we've reserved the bytes.  So use the normal
btrfs_block_rsv_add helper for this case.  Second, we can flush delalloc extents
while we try to reserve space for our write, which means we could have used up
the space for the inode update and wouldn't know it, because we only check
before making the reservation.  So instead make sure we always reserve space for
the inode update, and then release those bytes afterward if we don't need them.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 35 +++++++++++++----------------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 06f4e7b..157a0b6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5653,12 +5653,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	u64 to_reserve = 0;
 	u64 csum_bytes;
 	unsigned nr_extents = 0;
-	int extra_reserve = 0;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
+	bool release_extra = false;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5684,24 +5684,15 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 					 BTRFS_MAX_EXTENT_SIZE - 1,
 					 BTRFS_MAX_EXTENT_SIZE);
 	BTRFS_I(inode)->outstanding_extents += nr_extents;
-	nr_extents = 0;
 
+	nr_extents = 0;
 	if (BTRFS_I(inode)->outstanding_extents >
 	    BTRFS_I(inode)->reserved_extents)
-		nr_extents = BTRFS_I(inode)->outstanding_extents -
+		nr_extents += BTRFS_I(inode)->outstanding_extents -
 			BTRFS_I(inode)->reserved_extents;
 
-	/*
-	 * Add an item to reserve for updating the inode when we complete the
-	 * delalloc io.
-	 */
-	if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-		      &BTRFS_I(inode)->runtime_flags)) {
-		nr_extents++;
-		extra_reserve = 1;
-	}
-
-	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
+	/* We always want to reserve a slot for updating the inode. */
+	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents + 1);
 	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
 	csum_bytes = BTRFS_I(inode)->csum_bytes;
 	spin_unlock(&BTRFS_I(inode)->lock);
@@ -5713,18 +5704,16 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 			goto out_fail;
 	}
 
-	ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
+	ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
 	if (unlikely(ret)) {
 		btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
 		goto out_fail;
 	}
 
 	spin_lock(&BTRFS_I(inode)->lock);
-	if (extra_reserve) {
-		set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-			&BTRFS_I(inode)->runtime_flags);
-		nr_extents--;
-	}
+	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+			     &BTRFS_I(inode)->runtime_flags))
+		release_extra = true;
 	BTRFS_I(inode)->reserved_extents += nr_extents;
 	spin_unlock(&BTRFS_I(inode)->lock);
 
@@ -5734,8 +5723,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	if (to_reserve)
 		trace_btrfs_space_reservation(root->fs_info, "delalloc",
 					      btrfs_ino(inode), to_reserve, 1);
-	block_rsv_add_bytes(block_rsv, to_reserve, 1);
-
+	if (release_extra)
+		btrfs_block_rsv_release(root, block_rsv,
+					btrfs_calc_trans_metadata_size(root,
+								       1));
 	return 0;
 
 out_fail:
-- 
2.5.0



* [PATCH 04/14] Btrfs: change delayed reservation fallback behavior
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (2 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 05/14] Btrfs: warn_on for unaccounted spaces Josef Bacik
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

We reserve space for the inode update when we first reserve space for writing to
a file.  However there are lots of ways we can use up that reservation and not
have it for subsequent ordered extents.  Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve.  The problem with this is that it's
unnecessary: we can easily just return ENOSPC and fall back to updating the
inode item directly.  In the worst case (assuming a 4k nodesize) we'd steal
64KiB from the global reserve if we fell all the way through, whereas if we just
fall back and update the inode directly the worst case is only
4k * BTRFS_PATH_MAX, which is 32KiB.
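
Condensed, the new check in btrfs_delayed_inode_reserve_metadata() looks like
this (full hunk is in the diff below):

	if (src_rsv && src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
		spin_lock(&BTRFS_I(inode)->lock);
		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
				       &BTRFS_I(inode)->runtime_flags))
			/* our extra reservation is still there, migrate it */
			release = true;
		else
			/*
			 * Nothing to migrate; fall through to the quick
			 * no-flush attempt, and on ENOSPC the caller updates
			 * the inode item directly.
			 */
			src_rsv = NULL;
		spin_unlock(&BTRFS_I(inode)->lock);
	}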

We will also have just added the extent item for the inode, so we've likely
already cow'ed down most of the way to the leaf containing the inode item, which
means more often than not we only need one or two nodesizes' worth of
reservation.  And since the reservation for the extent itself is also a
worst-case amount, we will likely already have space to cover the inode update.

This change makes us behave better in the theoretical worst case, and much
better in the case where we don't have our reservation and cannot reserve more
metadata.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-inode.c | 64 +++++++++++++++++-------------------------------
 1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index d3cda0f..1c77103 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -598,6 +598,29 @@ static int btrfs_delayed_inode_reserve_metadata(
 	num_bytes = btrfs_calc_trans_metadata_size(root, 1);
 
 	/*
+	 * If our block_rsv is the delalloc block reserve then check and see if
+	 * we have our extra reservation for updating the inode.  If not fall
+	 * through and try to reserve space quickly.
+	 *
+	 * We used to try and steal from the delalloc block rsv or the global
+	 * reserve, but we'd steal a full reservation, which isn't kind.  We are
+	 * here through delalloc which means we've likely just cowed down close
+	 * to the leaf that contains the inode, so we would steal less just
+	 * doing the fallback inode update, so if we do end up having to steal
+	 * from the global block rsv we hopefully only steal one or two blocks
+	 * worth which is less likely to hurt us.
+	 */
+	if (src_rsv && src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
+		spin_lock(&BTRFS_I(inode)->lock);
+		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+				       &BTRFS_I(inode)->runtime_flags))
+			release = true;
+		else
+			src_rsv = NULL;
+		spin_unlock(&BTRFS_I(inode)->lock);
+	}
+
+	/*
 	 * btrfs_dirty_inode will update the inode under btrfs_join_transaction
 	 * which doesn't reserve space for speed.  This is a problem since we
 	 * still need to reserve space for this update, so try to reserve the
@@ -626,51 +649,10 @@ static int btrfs_delayed_inode_reserve_metadata(
 						      num_bytes, 1);
 		}
 		return ret;
-	} else if (src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
-		spin_lock(&BTRFS_I(inode)->lock);
-		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-				       &BTRFS_I(inode)->runtime_flags)) {
-			spin_unlock(&BTRFS_I(inode)->lock);
-			release = true;
-			goto migrate;
-		}
-		spin_unlock(&BTRFS_I(inode)->lock);
-
-		/* Ok we didn't have space pre-reserved.  This shouldn't happen
-		 * too often but it can happen if we do delalloc to an existing
-		 * inode which gets dirtied because of the time update, and then
-		 * isn't touched again until after the transaction commits and
-		 * then we try to write out the data.  First try to be nice and
-		 * reserve something strictly for us.  If not be a pain and try
-		 * to steal from the delalloc block rsv.
-		 */
-		ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
-					  BTRFS_RESERVE_NO_FLUSH);
-		if (!ret)
-			goto out;
-
-		ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
-		if (!ret)
-			goto out;
-
-		if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
-			btrfs_debug(root->fs_info,
-				    "block rsv migrate returned %d", ret);
-			WARN_ON(1);
-		}
-		/*
-		 * Ok this is a problem, let's just steal from the global rsv
-		 * since this really shouldn't happen that often.
-		 */
-		ret = btrfs_block_rsv_migrate(&root->fs_info->global_block_rsv,
-					      dst_rsv, num_bytes, 1);
-		goto out;
 	}
 
-migrate:
 	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
 
-out:
 	/*
 	 * Migrate only takes a reservation, it doesn't touch the size of the
 	 * block_rsv.  This is to simplify people who don't normally have things
-- 
2.5.0



* [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (3 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 04/14] Btrfs: change delayed reservation fallback behavior Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-06-27  4:47   ` Qu Wenruo
  2016-03-25 17:25 ` [PATCH 06/14] Btrfs: add tracepoint for adding block groups Josef Bacik
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

These warnings were hidden behind the enospc_debug mount option, which isn't
helpful since they indicate actual bugs, unlike the rest of the enospc_debug
stuff which really is just debug information.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 157a0b6..90ac821 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9633,13 +9633,15 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		space_info = list_entry(info->space_info.next,
 					struct btrfs_space_info,
 					list);
-		if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
-			if (WARN_ON(space_info->bytes_pinned > 0 ||
+
+		/*
+		 * Do not hide this behind enospc_debug, this is actually
+		 * important and indicates a real bug if this happens.
+		 */
+		if (WARN_ON(space_info->bytes_pinned > 0 ||
 			    space_info->bytes_reserved > 0 ||
-			    space_info->bytes_may_use > 0)) {
-				dump_space_info(space_info, 0, 0);
-			}
-		}
+			    space_info->bytes_may_use > 0))
+			dump_space_info(space_info, 0, 0);
 		list_del(&space_info->list);
 		for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
 			struct kobject *kobj;
-- 
2.5.0



* [PATCH 06/14] Btrfs: add tracepoint for adding block groups
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (4 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 05/14] Btrfs: warn_on for unaccounted spaces Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure Josef Bacik
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

I'm writing a tool to visualize the enospc system inside btrfs, and I need this
tracepoint in order to keep track of the block groups in the system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c       |  2 ++
 include/trace/events/btrfs.h | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90ac821..0db4319 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9849,6 +9849,7 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 			goto error;
 		}
 
+		trace_btrfs_add_block_group(root->fs_info, cache, 0);
 		ret = update_space_info(info, cache->flags, found_key.offset,
 					btrfs_block_group_used(&cache->item),
 					cache->bytes_super, &space_info);
@@ -10019,6 +10020,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	 * Now that our block group has its ->space_info set and is inserted in
 	 * the rbtree, update the space info's counters.
 	 */
+	trace_btrfs_add_block_group(root->fs_info, cache, 1);
 	ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
 				cache->bytes_super, &cache->space_info);
 	if (ret) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d866f21..3e61deb8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -440,6 +440,46 @@ TRACE_EVENT(btrfs_sync_fs,
 	TP_printk("wait = %d", __entry->wait)
 );
 
+TRACE_EVENT(btrfs_add_block_group,
+
+	TP_PROTO(struct btrfs_fs_info *fs_info,
+		 struct btrfs_block_group_cache *block_group, int create),
+
+	TP_ARGS(fs_info, block_group, create),
+
+	TP_STRUCT__entry(
+		__array(	u8,	fsid,	BTRFS_UUID_SIZE	)
+		__field(	u64,	offset			)
+		__field(	u64,	size			)
+		__field(	u64,	flags			)
+		__field(	u64,	bytes_used		)
+		__field(	u64,	bytes_super		)
+		__field(	int,	create			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE);
+		__entry->offset		= block_group->key.objectid;
+		__entry->size		= block_group->key.offset;
+		__entry->flags		= block_group->flags;
+		__entry->bytes_used	=
+			btrfs_block_group_used(&block_group->item);
+		__entry->bytes_super	= block_group->bytes_super;
+		__entry->create		= create;
+	),
+
+	TP_printk("%pU: block_group offset = %llu, size = %llu, "
+		  "flags = %llu(%s), bytes_used = %llu, bytes_super = %llu, "
+		  "create = %d", __entry->fsid,
+		  (unsigned long long)__entry->offset,
+		  (unsigned long long)__entry->size,
+		  (unsigned long long)__entry->flags,
+		  __print_flags((unsigned long)__entry->flags, "|",
+				BTRFS_GROUP_FLAGS),
+		  (unsigned long long)__entry->bytes_used,
+		  (unsigned long long)__entry->bytes_super, __entry->create)
+);
+
 #define show_ref_action(action)						\
 	__print_symbolic(action,					\
 		{ BTRFS_ADD_DELAYED_REF,    "ADD_DELAYED_REF" },	\
-- 
2.5.0



* [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (5 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 06/14] Btrfs: add tracepoint for adding block groups Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-05-09 21:29   ` Liu Bo
  2016-05-17 17:30   ` [PATCH V2] " Josef Bacik
  2016-03-25 17:25 ` [PATCH 08/14] Btrfs: trace pinned extents Josef Bacik
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

Our enospc flushing sucks.  It was born of a time when we were early enospc'ing
constantly because multiple threads would race in for the same reservation and
randomly starve others out.  So I came up with this solution of blocking any
other reservations from happening while one guy tried to flush stuff to satisfy
his reservation.  This gives us pretty good correctness, but completely crap
latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks whether we have any tickets that need satisfying,
adds space to those tickets, and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could be doing anything and we need to avoid deadlocks.  These guys get priority
for having their reservation made, and will still do manual flushing themselves
in case the async flusher isn't running.
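
The flusher's notion of "progress" is simple: each pass it looks at the first
ticket on the list, and only if it is still the same ticket as last time does it
escalate to the next flush state; once it has run through all the flush states a
few times without retiring that ticket it gives up and fails everybody
(condensed from the diff below):

	ticket = list_first_entry(&space_info->tickets,
				  struct reserve_ticket, list);
	if (last_ticket == ticket) {
		flush_state++;			/* no progress, try harder */
	} else {
		last_ticket = ticket;		/* made progress, start over */
		flush_state = FLUSH_DELAYED_ITEMS_NR;
		if (commit_cycles)
			commit_cycles--;
	}

	if (flush_state > COMMIT_TRANS) {
		commit_cycles++;
		if (commit_cycles > 2) {
			/* give up: every remaining waiter gets -ENOSPC */
			wake_all_tickets(&space_info->tickets);
			space_info->flush = 0;
		} else {
			flush_state = FLUSH_DELAYED_ITEMS_NR;
		}
	}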

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/extent-tree.c | 524 +++++++++++++++++++++++++++++++++++--------------
 2 files changed, 375 insertions(+), 151 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b675066..7437c8a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1229,6 +1229,8 @@ struct btrfs_space_info {
 	struct list_head list;
 	/* Protected by the spinlock 'lock'. */
 	struct list_head ro_bgs;
+	struct list_head priority_tickets;
+	struct list_head tickets;
 
 	struct rw_semaphore groups_sem;
 	/* for block groups in our same type */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0db4319..1673365 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -111,6 +111,16 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
 			       u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
 		     u64 bytenr, u64 num_bytes, int reserved);
+static int __reserve_metadata_bytes(struct btrfs_root *root,
+				    struct btrfs_space_info *space_info,
+				    u64 orig_bytes,
+				    enum btrfs_reserve_flush_enum flush);
+static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes);
+static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -3867,6 +3877,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		found->bytes_readonly += bytes_readonly;
 		if (total_bytes > 0)
 			found->full = 0;
+		space_info_add_new_bytes(info, found, total_bytes -
+					 bytes_used - bytes_readonly);
 		spin_unlock(&found->lock);
 		*space_info = found;
 		return 0;
@@ -3901,6 +3913,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->flush = 0;
 	init_waitqueue_head(&found->wait);
 	INIT_LIST_HEAD(&found->ro_bgs);
+	INIT_LIST_HEAD(&found->tickets);
+	INIT_LIST_HEAD(&found->priority_tickets);
 
 	ret = kobject_init_and_add(&found->kobj, &space_info_ktype,
 				    info->space_info_kobj, "%s",
@@ -4514,12 +4528,19 @@ static int can_overcommit(struct btrfs_root *root,
 			  struct btrfs_space_info *space_info, u64 bytes,
 			  enum btrfs_reserve_flush_enum flush)
 {
-	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
-	u64 profile = btrfs_get_alloc_profile(root, 0);
+	struct btrfs_block_rsv *global_rsv;
+	u64 profile;
 	u64 space_size;
 	u64 avail;
 	u64 used;
 
+	/* Don't overcommit when in mixed mode. */
+	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
+		return 0;
+
+	BUG_ON(root->fs_info == NULL);
+	global_rsv = &root->fs_info->global_block_rsv;
+	profile = btrfs_get_alloc_profile(root, 0);
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly;
 
@@ -4669,6 +4690,11 @@ skip_async:
 			spin_unlock(&space_info->lock);
 			break;
 		}
+		if (list_empty(&space_info->tickets) &&
+		    list_empty(&space_info->priority_tickets)) {
+			spin_unlock(&space_info->lock);
+			break;
+		}
 		spin_unlock(&space_info->lock);
 
 		loops++;
@@ -4745,6 +4771,13 @@ enum flush_state {
 	COMMIT_TRANS		=	6,
 };
 
+struct reserve_ticket {
+	u64 bytes;
+	int error;
+	struct list_head list;
+	wait_queue_head_t wait;
+};
+
 static int flush_space(struct btrfs_root *root,
 		       struct btrfs_space_info *space_info, u64 num_bytes,
 		       u64 orig_bytes, int state)
@@ -4802,17 +4835,22 @@ static inline u64
 btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
 				 struct btrfs_space_info *space_info)
 {
+	struct reserve_ticket *ticket;
 	u64 used;
 	u64 expected;
-	u64 to_reclaim;
+	u64 to_reclaim = 0;
 
 	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	spin_lock(&space_info->lock);
 	if (can_overcommit(root, space_info, to_reclaim,
-			   BTRFS_RESERVE_FLUSH_ALL)) {
-		to_reclaim = 0;
-		goto out;
-	}
+			   BTRFS_RESERVE_FLUSH_ALL))
+		return 0;
+
+	list_for_each_entry(ticket, &space_info->tickets, list)
+		to_reclaim += ticket->bytes;
+	list_for_each_entry(ticket, &space_info->priority_tickets, list)
+		to_reclaim += ticket->bytes;
+	if (to_reclaim)
+		return to_reclaim;
 
 	used = space_info->bytes_used + space_info->bytes_reserved +
 	       space_info->bytes_pinned + space_info->bytes_readonly +
@@ -4828,9 +4866,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
 		to_reclaim = 0;
 	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
 				     space_info->bytes_reserved);
-out:
-	spin_unlock(&space_info->lock);
-
 	return to_reclaim;
 }
 
@@ -4847,69 +4882,169 @@ static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static int btrfs_need_do_async_reclaim(struct btrfs_space_info *space_info,
-				       struct btrfs_fs_info *fs_info,
-				       int flush_state)
+static void wake_all_tickets(struct list_head *head)
 {
-	u64 used;
+	struct reserve_ticket *ticket;
 
-	spin_lock(&space_info->lock);
-	/*
-	 * We run out of space and have not got any free space via flush_space,
-	 * so don't bother doing async reclaim.
-	 */
-	if (flush_state > COMMIT_TRANS && space_info->full) {
-		spin_unlock(&space_info->lock);
-		return 0;
+	while (!list_empty(head)) {
+		ticket = list_first_entry(head, struct reserve_ticket, list);
+		list_del_init(&ticket->list);
+		ticket->error = -ENOSPC;
+		wake_up(&ticket->wait);
 	}
-
-	used = space_info->bytes_used + space_info->bytes_reserved +
-	       space_info->bytes_pinned + space_info->bytes_readonly +
-	       space_info->bytes_may_use;
-	if (need_do_async_reclaim(space_info, fs_info, used)) {
-		spin_unlock(&space_info->lock);
-		return 1;
-	}
-	spin_unlock(&space_info->lock);
-
-	return 0;
 }
 
+/*
+ * This is for normal flushers, we can wait all goddamned day if we want to.  We
+ * will loop and continuously try to flush as long as we are making progress.
+ * We count progress as clearing off tickets each time we have to loop.
+ */
 static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 {
+	struct reserve_ticket *last_ticket = NULL;
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_space_info *space_info;
 	u64 to_reclaim;
 	int flush_state;
+	int commit_cycles = 0;
 
 	fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
 	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 
+	spin_lock(&space_info->lock);
 	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
 						      space_info);
-	if (!to_reclaim)
+	if (!to_reclaim) {
+		space_info->flush = 0;
+		spin_unlock(&space_info->lock);
 		return;
+	}
+	last_ticket = list_first_entry(&space_info->tickets,
+				       struct reserve_ticket, list);
+	spin_unlock(&space_info->lock);
 
 	flush_state = FLUSH_DELAYED_ITEMS_NR;
 	do {
+		struct reserve_ticket *ticket;
+		int ret;
+
+		ret = flush_space(fs_info->fs_root, space_info, to_reclaim,
+			    to_reclaim, flush_state);
+		spin_lock(&space_info->lock);
+		if (list_empty(&space_info->tickets)) {
+			space_info->flush = 0;
+			spin_unlock(&space_info->lock);
+			return;
+		}
+		to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+							      space_info);
+		ticket = list_first_entry(&space_info->tickets,
+					  struct reserve_ticket, list);
+		if (last_ticket == ticket) {
+			flush_state++;
+		} else {
+			last_ticket = ticket;
+			flush_state = FLUSH_DELAYED_ITEMS_NR;
+			if (commit_cycles)
+				commit_cycles--;
+		}
+
+		if (flush_state > COMMIT_TRANS) {
+			commit_cycles++;
+			if (commit_cycles > 2) {
+				wake_all_tickets(&space_info->tickets);
+				space_info->flush = 0;
+			} else {
+				flush_state = FLUSH_DELAYED_ITEMS_NR;
+			}
+		}
+		spin_unlock(&space_info->lock);
+	} while (flush_state <= COMMIT_TRANS);
+}
+
+void btrfs_init_async_reclaim_work(struct work_struct *work)
+{
+	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+}
+
+static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
+					    struct btrfs_space_info *space_info,
+					    struct reserve_ticket *ticket)
+{
+	u64 to_reclaim;
+	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+
+	spin_lock(&space_info->lock);
+	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+						      space_info);
+	if (!to_reclaim) {
+		spin_unlock(&space_info->lock);
+		return;
+	}
+	spin_unlock(&space_info->lock);
+
+	do {
 		flush_space(fs_info->fs_root, space_info, to_reclaim,
 			    to_reclaim, flush_state);
 		flush_state++;
-		if (!btrfs_need_do_async_reclaim(space_info, fs_info,
-						 flush_state))
+		spin_lock(&space_info->lock);
+		if (ticket->bytes == 0) {
+			spin_unlock(&space_info->lock);
 			return;
+		}
+		spin_unlock(&space_info->lock);
+
+		/*
+		 * Priority flushers can't wait on delalloc without
+		 * deadlocking.
+		 */
+		if (flush_state == FLUSH_DELALLOC ||
+		    flush_state == FLUSH_DELALLOC_WAIT)
+			flush_state = ALLOC_CHUNK;
 	} while (flush_state < COMMIT_TRANS);
 }
 
-void btrfs_init_async_reclaim_work(struct work_struct *work)
+static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
+			       struct btrfs_space_info *space_info,
+			       struct reserve_ticket *ticket, u64 orig_bytes)
+
 {
-	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	spin_lock(&space_info->lock);
+	while (ticket->bytes > 0 && ticket->error == 0) {
+		ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
+		if (ret) {
+			ret = -EINTR;
+			break;
+		}
+		spin_unlock(&space_info->lock);
+
+		schedule();
+
+		finish_wait(&ticket->wait, &wait);
+		spin_lock(&space_info->lock);
+	}
+	if (!ret)
+		ret = ticket->error;
+	if (!list_empty(&ticket->list))
+		list_del_init(&ticket->list);
+	if (ticket->bytes && ticket->bytes < orig_bytes) {
+		u64 num_bytes = orig_bytes - ticket->bytes;
+		space_info->bytes_may_use -= num_bytes;
+		trace_btrfs_space_reservation(fs_info, "space_info",
+					      space_info->flags, num_bytes, 0);
+	}
+	spin_unlock(&space_info->lock);
+
+	return ret;
 }
 
 /**
  * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
  * @root - the root we're allocating for
- * @block_rsv - the block_rsv we're allocating for
+ * @space_info - the space info we want to allocate from
  * @orig_bytes - the number of bytes we want
  * @flush - whether or not we can flush to make our reservation
  *
@@ -4920,81 +5055,34 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
  * regain reservations will be made and this will fail if there is not enough
  * space already.
  */
-static int reserve_metadata_bytes(struct btrfs_root *root,
-				  struct btrfs_block_rsv *block_rsv,
-				  u64 orig_bytes,
-				  enum btrfs_reserve_flush_enum flush)
+static int __reserve_metadata_bytes(struct btrfs_root *root,
+				    struct btrfs_space_info *space_info,
+				    u64 orig_bytes,
+				    enum btrfs_reserve_flush_enum flush)
 {
-	struct btrfs_space_info *space_info = block_rsv->space_info;
+	struct reserve_ticket ticket;
 	u64 used;
-	u64 num_bytes = orig_bytes;
-	int flush_state = FLUSH_DELAYED_ITEMS_NR;
 	int ret = 0;
-	bool flushing = false;
 
-again:
-	ret = 0;
+	ASSERT(orig_bytes);
 	spin_lock(&space_info->lock);
-	/*
-	 * We only want to wait if somebody other than us is flushing and we
-	 * are actually allowed to flush all things.
-	 */
-	while (flush == BTRFS_RESERVE_FLUSH_ALL && !flushing &&
-	       space_info->flush) {
-		spin_unlock(&space_info->lock);
-		/*
-		 * If we have a trans handle we can't wait because the flusher
-		 * may have to commit the transaction, which would mean we would
-		 * deadlock since we are waiting for the flusher to finish, but
-		 * hold the current transaction open.
-		 */
-		if (current->journal_info)
-			return -EAGAIN;
-		ret = wait_event_killable(space_info->wait, !space_info->flush);
-		/* Must have been killed, return */
-		if (ret)
-			return -EINTR;
-
-		spin_lock(&space_info->lock);
-	}
-
 	ret = -ENOSPC;
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly +
 		space_info->bytes_may_use;
 
 	/*
-	 * The idea here is that we've not already over-reserved the block group
-	 * then we can go ahead and save our reservation first and then start
-	 * flushing if we need to.  Otherwise if we've already overcommitted
-	 * lets start flushing stuff first and then come back and try to make
-	 * our reservation.
+	 * If we have enough space then hooray, make our reservation and carry
+	 * on.  If not see if we can overcommit, and if we can, hooray carry on.
+	 * If not things get more complicated.
 	 */
-	if (used <= space_info->total_bytes) {
-		if (used + orig_bytes <= space_info->total_bytes) {
-			space_info->bytes_may_use += orig_bytes;
-			trace_btrfs_space_reservation(root->fs_info,
-				"space_info", space_info->flags, orig_bytes, 1);
-			ret = 0;
-		} else {
-			/*
-			 * Ok set num_bytes to orig_bytes since we aren't
-			 * overocmmitted, this way we only try and reclaim what
-			 * we need.
-			 */
-			num_bytes = orig_bytes;
-		}
-	} else {
-		/*
-		 * Ok we're over committed, set num_bytes to the overcommitted
-		 * amount plus the amount of bytes that we need for this
-		 * reservation.
-		 */
-		num_bytes = used - space_info->total_bytes +
-			(orig_bytes * 2);
-	}
-
-	if (ret && can_overcommit(root, space_info, orig_bytes, flush)) {
+	if (used + orig_bytes <= space_info->total_bytes) {
+		space_info->bytes_may_use += orig_bytes;
+		trace_btrfs_space_reservation(root->fs_info, "space_info",
+					      space_info->flags, orig_bytes,
+					      1);
+		ret = 0;
+	} else if (can_overcommit(root, space_info, orig_bytes, flush)) {
 		space_info->bytes_may_use += orig_bytes;
 		trace_btrfs_space_reservation(root->fs_info, "space_info",
 					      space_info->flags, orig_bytes,
@@ -5003,16 +5091,27 @@ again:
 	}
 
 	/*
-	 * Couldn't make our reservation, save our place so while we're trying
-	 * to reclaim space we can actually use it instead of somebody else
-	 * stealing it from us.
+	 * If we couldn't make a reservation then setup our reservation ticket
+	 * and kick the async worker if it's not already running.
 	 *
-	 * We make the other tasks wait for the flush only when we can flush
-	 * all things.
+	 * If we are a priority flusher then we just need to add our ticket to
+	 * the list and we will do our own flushing further down.
 	 */
 	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
-		flushing = true;
-		space_info->flush = 1;
+		ticket.bytes = orig_bytes;
+		ticket.error = 0;
+		init_waitqueue_head(&ticket.wait);
+		if (flush == BTRFS_RESERVE_FLUSH_ALL) {
+			list_add_tail(&ticket.list, &space_info->tickets);
+			if (!space_info->flush) {
+				space_info->flush = 1;
+				queue_work(system_unbound_wq,
+					   &root->fs_info->async_reclaim_work);
+			}
+		} else {
+			list_add_tail(&ticket.list,
+				      &space_info->priority_tickets);
+		}
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		used += orig_bytes;
 		/*
@@ -5027,33 +5126,56 @@ again:
 				   &root->fs_info->async_reclaim_work);
 	}
 	spin_unlock(&space_info->lock);
-
 	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
-		goto out;
+		return ret;
 
-	ret = flush_space(root, space_info, num_bytes, orig_bytes,
-			  flush_state);
-	flush_state++;
+	if (flush == BTRFS_RESERVE_FLUSH_ALL)
+		return wait_reserve_ticket(root->fs_info, space_info, &ticket,
+					   orig_bytes);
 
-	/*
-	 * If we are FLUSH_LIMIT, we can not flush delalloc, or the deadlock
-	 * would happen. So skip delalloc flush.
-	 */
-	if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
-	    (flush_state == FLUSH_DELALLOC ||
-	     flush_state == FLUSH_DELALLOC_WAIT))
-		flush_state = ALLOC_CHUNK;
+	ret = 0;
+	priority_reclaim_metadata_space(root->fs_info, space_info, &ticket);
+	spin_lock(&space_info->lock);
+	if (ticket.bytes) {
+		if (ticket.bytes < orig_bytes) {
+			u64 num_bytes = orig_bytes - ticket.bytes;
+			space_info->bytes_may_use -= num_bytes;
+			trace_btrfs_space_reservation(root->fs_info,
+					"space_info", space_info->flags,
+					num_bytes, 0);
 
-	if (!ret)
-		goto again;
-	else if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
-		 flush_state < COMMIT_TRANS)
-		goto again;
-	else if (flush == BTRFS_RESERVE_FLUSH_ALL &&
-		 flush_state <= COMMIT_TRANS)
-		goto again;
+		}
+		list_del_init(&ticket.list);
+		ret = -ENOSPC;
+	}
+	spin_unlock(&space_info->lock);
+	ASSERT(list_empty(&ticket.list));
+	return ret;
+}
 
-out:
+/**
+ * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
+ * @root - the root we're allocating for
+ * @block_rsv - the block_rsv we're allocating for
+ * @orig_bytes - the number of bytes we want
+ * @flush - whether or not we can flush to make our reservation
+ *
+ * This will reserve orgi_bytes number of bytes from the space info associated
+ * with the block_rsv.  If there is not enough space it will make an attempt to
+ * flush out space to make room.  It will do this by flushing delalloc if
+ * possible or committing the transaction.  If flush is 0 then no attempts to
+ * regain reservations will be made and this will fail if there is not enough
+ * space already.
+ */
+static int reserve_metadata_bytes(struct btrfs_root *root,
+				  struct btrfs_block_rsv *block_rsv,
+				  u64 orig_bytes,
+				  enum btrfs_reserve_flush_enum flush)
+{
+	int ret;
+
+	ret = __reserve_metadata_bytes(root, block_rsv->space_info, orig_bytes,
+				       flush);
 	if (ret == -ENOSPC &&
 	    unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
 		struct btrfs_block_rsv *global_rsv =
@@ -5066,13 +5188,8 @@ out:
 	if (ret == -ENOSPC)
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
-					      space_info->flags, orig_bytes, 1);
-	if (flushing) {
-		spin_lock(&space_info->lock);
-		space_info->flush = 0;
-		wake_up_all(&space_info->wait);
-		spin_unlock(&space_info->lock);
-	}
+					      block_rsv->space_info->flags,
+					      orig_bytes, 1);
 	return ret;
 }
 
@@ -5148,6 +5265,103 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * This is for space we already have accounted in space_info->bytes_may_use, so
+ * basically when we're returning space from block_rsv's.
+ */
+static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes)
+{
+	struct reserve_ticket *ticket;
+	struct list_head *head;
+	u64 used;
+	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_NO_FLUSH;
+	bool check_overcommit = false;
+
+	spin_lock(&space_info->lock);
+	head = &space_info->priority_tickets;
+
+	/*
+	 * First we want to see if we're over our limit, because if we are then
+	 * we need to make sure we are still ok overcommitting before we satisfy
+	 * another reservation.
+	 */
+	used = space_info->bytes_used + space_info->bytes_reserved +
+		space_info->bytes_pinned + space_info->bytes_readonly;
+	if (used - num_bytes >= space_info->total_bytes)
+		check_overcommit = true;
+again:
+	while (!list_empty(head) && num_bytes) {
+		ticket = list_first_entry(head, struct reserve_ticket,
+					  list);
+		if (check_overcommit &&
+		    !can_overcommit(fs_info->extent_root, space_info,
+				    ticket->bytes, flush))
+			break;
+		if (num_bytes >= ticket->bytes) {
+			list_del_init(&ticket->list);
+			num_bytes -= ticket->bytes;
+			ticket->bytes = 0;
+			wake_up(&ticket->wait);
+		} else {
+			ticket->bytes -= num_bytes;
+			num_bytes = 0;
+		}
+	}
+
+	if (num_bytes && head == &space_info->priority_tickets) {
+		head = &space_info->tickets;
+		flush = BTRFS_RESERVE_FLUSH_ALL;
+		goto again;
+	}
+	space_info->bytes_may_use -= num_bytes;
+	trace_btrfs_space_reservation(fs_info, "space_info",
+				      space_info->flags, num_bytes, 0);
+	spin_unlock(&space_info->lock);
+}
+
+/*
+ * This is for newly allocated space that isn't accounted in
+ * space_info->bytes_may_use yet.  So if we allocate a chunk or unpin an extent
+ * we use this helper.
+ */
+static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes)
+{
+	struct reserve_ticket *ticket;
+	struct list_head *head = &space_info->priority_tickets;
+
+again:
+	while (!list_empty(head) && num_bytes) {
+		ticket = list_first_entry(head, struct reserve_ticket,
+					  list);
+		if (num_bytes >= ticket->bytes) {
+			trace_btrfs_space_reservation(fs_info, "space_info",
+						      space_info->flags,
+						      ticket->bytes, 1);
+			list_del_init(&ticket->list);
+			num_bytes -= ticket->bytes;
+			space_info->bytes_may_use += ticket->bytes;
+			ticket->bytes = 0;
+			wake_up(&ticket->wait);
+		} else {
+			trace_btrfs_space_reservation(fs_info, "space_info",
+						      space_info->flags,
+						      num_bytes, 1);
+			space_info->bytes_may_use += num_bytes;
+			ticket->bytes -= num_bytes;
+			num_bytes = 0;
+		}
+	}
+
+	if (num_bytes && head == &space_info->priority_tickets) {
+		head = &space_info->tickets;
+		goto again;
+	}
+}
+
 static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_rsv *block_rsv,
 				    struct btrfs_block_rsv *dest, u64 num_bytes)
@@ -5182,13 +5396,9 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 			}
 			spin_unlock(&dest->lock);
 		}
-		if (num_bytes) {
-			spin_lock(&space_info->lock);
-			space_info->bytes_may_use -= num_bytes;
-			trace_btrfs_space_reservation(fs_info, "space_info",
-					space_info->flags, num_bytes, 0);
-			spin_unlock(&space_info->lock);
-		}
+		if (num_bytes)
+			space_info_add_old_bytes(fs_info, space_info,
+						 num_bytes);
 	}
 }
 
@@ -6346,17 +6556,29 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
 			readonly = true;
 		}
 		spin_unlock(&cache->lock);
-		if (!readonly && global_rsv->space_info == space_info) {
+		if (!readonly && return_free_space &&
+		    global_rsv->space_info == space_info) {
+			u64 to_add = len;
+			WARN_ON(!return_free_space);
 			spin_lock(&global_rsv->lock);
 			if (!global_rsv->full) {
-				len = min(len, global_rsv->size -
-					  global_rsv->reserved);
-				global_rsv->reserved += len;
-				space_info->bytes_may_use += len;
+				to_add = min(len, global_rsv->size -
+					     global_rsv->reserved);
+				global_rsv->reserved += to_add;
+				space_info->bytes_may_use += to_add;
 				if (global_rsv->reserved >= global_rsv->size)
 					global_rsv->full = 1;
+				trace_btrfs_space_reservation(fs_info,
+							      "space_info",
+							      space_info->flags,
+							      to_add, 1);
+				len -= to_add;
 			}
 			spin_unlock(&global_rsv->lock);
+			/* Add to any tickets we may have */
+			if (len)
+				space_info_add_new_bytes(fs_info, space_info,
+							 len);
 		}
 		spin_unlock(&space_info->lock);
 	}
-- 
2.5.0



* [PATCH 08/14] Btrfs: trace pinned extents
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (6 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 09/14] Btrfs: fix delalloc reservation amount tracepoint Josef Bacik
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

Pinned extents are an important metric to keep track of for enospc.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1673365..26f7a9d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6168,6 +6168,9 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 
+			trace_btrfs_space_reservation(root->fs_info, "pinned",
+						      cache->space_info->flags,
+						      num_bytes, 1);
 			set_extent_dirty(info->pinned_extents,
 					 bytenr, bytenr + num_bytes - 1,
 					 GFP_NOFS | __GFP_NOFAIL);
@@ -6242,6 +6245,8 @@ static int pin_down_extent(struct btrfs_root *root,
 	spin_unlock(&cache->lock);
 	spin_unlock(&cache->space_info->lock);
 
+	trace_btrfs_space_reservation(root->fs_info, "pinned",
+				      cache->space_info->flags, num_bytes, 1);
 	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
 	if (reserved)
@@ -6549,6 +6554,9 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
 		spin_lock(&cache->lock);
 		cache->pinned -= len;
 		space_info->bytes_pinned -= len;
+
+		trace_btrfs_space_reservation(fs_info, "pinned",
+					      space_info->flags, len, 0);
 		space_info->max_extent_size = 0;
 		percpu_counter_add(&space_info->total_bytes_pinned, -len);
 		if (cache->ro) {
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/14] Btrfs: fix delalloc reservation amount tracepoint
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (7 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 08/14] Btrfs: trace pinned extents Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 10/14] Btrfs: add tracepoints for flush events Josef Bacik
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 26f7a9d..1221c07 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5922,8 +5922,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-			     &BTRFS_I(inode)->runtime_flags))
+			     &BTRFS_I(inode)->runtime_flags)) {
+		to_reserve -= btrfs_calc_trans_metadata_size(root, 1);
 		release_extra = true;
+	}
 	BTRFS_I(inode)->reserved_extents += nr_extents;
 	spin_unlock(&BTRFS_I(inode)->lock);
 
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/14] Btrfs: add tracepoints for flush events
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (8 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 09/14] Btrfs: fix delalloc reservation amount tracepoint Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 11/14] Btrfs: add fsid to some tracepoints Josef Bacik
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

We want to track when flushing is triggered from our reservation code and which
flush state is being run once flushing starts.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.h             |  9 +++++
 fs/btrfs/extent-tree.c       | 22 ++++++------
 include/trace/events/btrfs.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 103 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7437c8a..55a24c5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3611,6 +3611,15 @@ enum btrfs_reserve_flush_enum {
 	BTRFS_RESERVE_FLUSH_ALL,
 };
 
+enum btrfs_flush_state {
+	FLUSH_DELAYED_ITEMS_NR	=	1,
+	FLUSH_DELAYED_ITEMS	=	2,
+	FLUSH_DELALLOC		=	3,
+	FLUSH_DELALLOC_WAIT	=	4,
+	ALLOC_CHUNK		=	5,
+	COMMIT_TRANS		=	6,
+};
+
 int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1221c07..0ecceea 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4762,15 +4762,6 @@ commit:
 	return btrfs_commit_transaction(trans, root);
 }
 
-enum flush_state {
-	FLUSH_DELAYED_ITEMS_NR	=	1,
-	FLUSH_DELAYED_ITEMS	=	2,
-	FLUSH_DELALLOC		=	3,
-	FLUSH_DELALLOC_WAIT	=	4,
-	ALLOC_CHUNK		=	5,
-	COMMIT_TRANS		=	6,
-};
-
 struct reserve_ticket {
 	u64 bytes;
 	int error;
@@ -4828,6 +4819,8 @@ static int flush_space(struct btrfs_root *root,
 		break;
 	}
 
+	trace_btrfs_flush_space(root->fs_info, space_info->flags, num_bytes,
+				orig_bytes, state, ret);
 	return ret;
 }
 
@@ -5105,6 +5098,10 @@ static int __reserve_metadata_bytes(struct btrfs_root *root,
 			list_add_tail(&ticket.list, &space_info->tickets);
 			if (!space_info->flush) {
 				space_info->flush = 1;
+				trace_btrfs_trigger_flush(root->fs_info,
+							  space_info->flags,
+							  orig_bytes, flush,
+							  "enospc");
 				queue_work(system_unbound_wq,
 					   &root->fs_info->async_reclaim_work);
 			}
@@ -5121,9 +5118,14 @@ static int __reserve_metadata_bytes(struct btrfs_root *root,
 		 */
 		if (!root->fs_info->log_root_recovering &&
 		    need_do_async_reclaim(space_info, root->fs_info, used) &&
-		    !work_busy(&root->fs_info->async_reclaim_work))
+		    !work_busy(&root->fs_info->async_reclaim_work)) {
+			trace_btrfs_trigger_flush(root->fs_info,
+						  space_info->flags,
+						  orig_bytes, flush,
+						  "preempt");
 			queue_work(system_unbound_wq,
 				   &root->fs_info->async_reclaim_work);
+		}
 	}
 	spin_unlock(&space_info->lock);
 	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 3e61deb8..6c192dc 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -784,6 +784,88 @@ TRACE_EVENT(btrfs_space_reservation,
 		  __entry->bytes)
 );
 
+#define show_flush_action(action)						\
+	__print_symbolic(action,						\
+		{ BTRFS_RESERVE_NO_FLUSH,	"BTRFS_RESERVE_NO_FLUSH"},	\
+		{ BTRFS_RESERVE_FLUSH_LIMIT,	"BTRFS_RESERVE_FLUSH_LIMIT"},	\
+		{ BTRFS_RESERVE_FLUSH_ALL,	"BTRFS_RESERVE_FLUSH_ALL"})
+
+TRACE_EVENT(btrfs_trigger_flush,
+
+	TP_PROTO(struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
+		 int flush, char *reason),
+
+	TP_ARGS(fs_info, flags, bytes, flush, reason),
+
+	TP_STRUCT__entry(
+		__array(	u8,	fsid,	BTRFS_UUID_SIZE	)
+		__field(	u64,	flags			)
+		__field(	u64,	bytes			)
+		__field(	int,	flush			)
+		__string(	reason,	reason			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE);
+		__entry->flags	= flags;
+		__entry->bytes	= bytes;
+		__entry->flush	= flush;
+		__assign_str(reason, reason)
+	),
+
+	TP_printk("%pU: %s: flush = %d(%s), flags = %llu(%s), bytes = %llu",
+		  __entry->fsid, __get_str(reason), __entry->flush,
+		  show_flush_action(__entry->flush),
+		  (unsigned long long)__entry->flags,
+		  __print_flags((unsigned long)__entry->flags, "|",
+				BTRFS_GROUP_FLAGS),
+		  (unsigned long long)__entry->bytes)
+);
+
+#define show_flush_state(state)							\
+	__print_symbolic(state,							\
+		{ FLUSH_DELAYED_ITEMS_NR,	"FLUSH_DELAYED_ITEMS_NR"},	\
+		{ FLUSH_DELAYED_ITEMS,		"FLUSH_DELAYED_ITEMS"},		\
+		{ FLUSH_DELALLOC,		"FLUSH_DELALLOC"},		\
+		{ FLUSH_DELALLOC_WAIT,		"FLUSH_DELALLOC_WAIT"},		\
+		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
+		{ COMMIT_TRANS,			"COMMIT_TRANS"})
+
+TRACE_EVENT(btrfs_flush_space,
+
+	TP_PROTO(struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
+		 u64 orig_bytes, int state, int ret),
+
+	TP_ARGS(fs_info, flags, num_bytes, orig_bytes, state, ret),
+
+	TP_STRUCT__entry(
+		__array(	u8,	fsid,	BTRFS_UUID_SIZE	)
+		__field(	u64,	flags			)
+		__field(	u64,	num_bytes		)
+		__field(	u64,	orig_bytes		)
+		__field(	int,	state			)
+		__field(	int,	ret			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->fsid, fs_info->fsid, BTRFS_UUID_SIZE);
+		__entry->flags		=	flags;
+		__entry->num_bytes	=	num_bytes;
+		__entry->orig_bytes	=	orig_bytes;
+		__entry->state		=	state;
+		__entry->ret		=	ret;
+	),
+
+	TP_printk("%pU: state = %d(%s), flags = %llu(%s), num_bytes = %llu, "
+		  "orig_bytes = %llu, ret = %d", __entry->fsid, __entry->state,
+		  show_flush_state(__entry->state),
+		  (unsigned long long)__entry->flags,
+		  __print_flags((unsigned long)__entry->flags, "|",
+				BTRFS_GROUP_FLAGS),
+		  (unsigned long long)__entry->num_bytes,
+		  (unsigned long long)__entry->orig_bytes, __entry->ret)
+);
+
 DECLARE_EVENT_CLASS(btrfs__reserved_extent,
 
 	TP_PROTO(struct btrfs_root *root, u64 start, u64 len),
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 11/14] Btrfs: add fsid to some tracepoints
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (9 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 10/14] Btrfs: add tracepoints for flush events Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:25 ` [PATCH 12/14] Btrfs: fix release reserved extents trace points Josef Bacik
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

When tracing enospc problems on a box with multiple file systems mounted I need
to be able to differentiate between them.  Most of the important trace points
I'm looking at already have an fsid, but the reserved extent trace points do
not, so add it there as well to make it possible to figure out which trace
point belongs to which file system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/trace/events/btrfs.h | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 6c192dc..b0f555e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -873,18 +873,21 @@ DECLARE_EVENT_CLASS(btrfs__reserved_extent,
 	TP_ARGS(root, start, len),
 
 	TP_STRUCT__entry(
-		__field(	u64,  root_objectid		)
-		__field(	u64,  start			)
-		__field(	u64,  len			)
+		__array(	u8,	fsid,	BTRFS_UUID_SIZE	)
+		__field(	u64,	root_objectid		)
+		__field(	u64,	start			)
+		__field(	u64,	len			)
 	),
 
 	TP_fast_assign(
+		memcpy(__entry->fsid, root->fs_info->fsid, BTRFS_UUID_SIZE);
 		__entry->root_objectid	= root->root_key.objectid;
 		__entry->start		= start;
 		__entry->len		= len;
 	),
 
-	TP_printk("root = %llu(%s), start = %llu, len = %llu",
+	TP_printk("%pU: root = %llu(%s), start = %llu, len = %llu",
+		  __entry->fsid,
 		  show_root_type(__entry->root_objectid),
 		  (unsigned long long)__entry->start,
 		  (unsigned long long)__entry->len)
@@ -941,6 +944,7 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
 	TP_ARGS(root, block_group, start, len),
 
 	TP_STRUCT__entry(
+		__array(	u8,	fsid,	BTRFS_UUID_SIZE	)
 		__field(	u64,	root_objectid		)
 		__field(	u64,	bg_objectid		)
 		__field(	u64,	flags			)
@@ -949,6 +953,7 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
 	),
 
 	TP_fast_assign(
+		memcpy(__entry->fsid, root->fs_info->fsid, BTRFS_UUID_SIZE);
 		__entry->root_objectid	= root->root_key.objectid;
 		__entry->bg_objectid	= block_group->key.objectid;
 		__entry->flags		= block_group->flags;
@@ -956,8 +961,8 @@ DECLARE_EVENT_CLASS(btrfs__reserve_extent,
 		__entry->len		= len;
 	),
 
-	TP_printk("root = %Lu(%s), block_group = %Lu, flags = %Lu(%s), "
-		  "start = %Lu, len = %Lu",
+	TP_printk("%pU: root = %Lu(%s), block_group = %Lu, flags = %Lu(%s), "
+		  "start = %Lu, len = %Lu", __entry->fsid,
 		  show_root_type(__entry->root_objectid), __entry->bg_objectid,
 		  __entry->flags, __print_flags((unsigned long)__entry->flags,
 						"|", BTRFS_GROUP_FLAGS),
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 12/14] Btrfs: fix release reserved extents trace points
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (10 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 11/14] Btrfs: add fsid to some tracepoints Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-05-09 21:33   ` Liu Bo
  2016-03-25 17:25 ` [PATCH 13/14] Btrfs: don't bother kicking async if there's nothing to reclaim Josef Bacik
  2016-03-25 17:26 ` [PATCH 14/14] Btrfs: don't do nocow check unless we have to Josef Bacik
  13 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

We were doing trace_btrfs_reserved_extent_free() in pin_down_extent(), which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space.  We
were also doing it unconditionally in __btrfs_free_reserved_extent(), when we
should only do it if we actually free the reservation instead of pinning the
extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ecceea..273e18d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6253,8 +6253,6 @@ static int pin_down_extent(struct btrfs_root *root,
 				      cache->space_info->flags, num_bytes, 1);
 	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
-	if (reserved)
-		trace_btrfs_reserved_extent_free(root, bytenr, num_bytes);
 	return 0;
 }
 
@@ -7877,12 +7875,10 @@ static int __btrfs_free_reserved_extent(struct btrfs_root *root,
 			ret = btrfs_discard_extent(root, start, len, NULL);
 		btrfs_add_free_space(cache, start, len);
 		btrfs_update_reserved_bytes(cache, len, RESERVE_FREE, delalloc);
+		trace_btrfs_reserved_extent_free(root, start, len);
 	}
 
 	btrfs_put_block_group(cache);
-
-	trace_btrfs_reserved_extent_free(root, start, len);
-
 	return ret;
 }
 
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 13/14] Btrfs: don't bother kicking async if there's nothing to reclaim
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (11 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 12/14] Btrfs: fix release reserved extents trace points Josef Bacik
@ 2016-03-25 17:25 ` Josef Bacik
  2016-03-25 17:26 ` [PATCH 14/14] Btrfs: don't do nocow check unless we have to Josef Bacik
  13 siblings, 0 replies; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:25 UTC (permalink / raw)
  To: linux-btrfs

We already do this check when the async reclaimer thread starts, so we might as
well check before we kick it off and save ourselves some cycles.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 273e18d..4b5a517 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4871,6 +4871,9 @@ static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
 		return 0;
 
+	if (!btrfs_calc_reclaim_metadata_size(fs_info->fs_root, space_info))
+		return 0;
+
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 14/14] Btrfs: don't do nocow check unless we have to
  2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
                   ` (12 preceding siblings ...)
  2016-03-25 17:25 ` [PATCH 13/14] Btrfs: don't bother kicking async if there's nothing to reclaim Josef Bacik
@ 2016-03-25 17:26 ` Josef Bacik
  2016-03-25 17:50   ` Liu Bo
  13 siblings, 1 reply; 26+ messages in thread
From: Josef Bacik @ 2016-03-25 17:26 UTC (permalink / raw)
  To: linux-btrfs

Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
	--end_fsync=1

we get the worst case scenario where we have to fall back to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,
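
The reordered logic boils down to the sketch below (simplified from the diff
that follows; page math and error handling are omitted):

	ret = btrfs_check_data_free_space(inode, pos, write_bytes);
	if (ret < 0) {
		/*
		 * Only when the cheap data reservation fails do we pay for
		 * the extent/csum tree lookups in check_can_nocow().
		 */
		if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
					      BTRFS_INODE_PREALLOC)) &&
		    check_can_nocow(inode, pos, &write_bytes) > 0)
			only_release_metadata = true;	/* nocow, no data rsv */
		else
			break;				/* real ENOSPC */
	}
	ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);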

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/file.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0ce4bb3..7c80208 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1534,30 +1534,30 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		reserve_bytes = round_up(write_bytes + sector_offset,
 				root->sectorsize);
 
-		if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
-					      BTRFS_INODE_PREALLOC)) &&
-		    check_can_nocow(inode, pos, &write_bytes) > 0) {
-			/*
-			 * For nodata cow case, no need to reserve
-			 * data space.
-			 */
-			only_release_metadata = true;
-			/*
-			 * our prealloc extent may be smaller than
-			 * write_bytes, so scale down.
-			 */
-			num_pages = DIV_ROUND_UP(write_bytes + offset,
-						 PAGE_CACHE_SIZE);
-			reserve_bytes = round_up(write_bytes + sector_offset,
-					root->sectorsize);
-			goto reserve_metadata;
-		}
-
 		ret = btrfs_check_data_free_space(inode, pos, write_bytes);
-		if (ret < 0)
-			break;
+		if (ret < 0) {
+			if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+						      BTRFS_INODE_PREALLOC)) &&
+			    check_can_nocow(inode, pos, &write_bytes) > 0) {
+				/*
+				 * For nodata cow case, no need to reserve
+				 * data space.
+				 */
+				only_release_metadata = true;
+				/*
+				 * our prealloc extent may be smaller than
+				 * write_bytes, so scale down.
+				 */
+				num_pages = DIV_ROUND_UP(write_bytes + offset,
+							 PAGE_CACHE_SIZE);
+				reserve_bytes = round_up(write_bytes +
+							 sector_offset,
+							 root->sectorsize);
+			} else {
+				break;
+			}
+		}
 
-reserve_metadata:
 		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
 		if (ret) {
 			if (!only_release_metadata)
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 14/14] Btrfs: don't do nocow check unless we have to
  2016-03-25 17:26 ` [PATCH 14/14] Btrfs: don't do nocow check unless we have to Josef Bacik
@ 2016-03-25 17:50   ` Liu Bo
  0 siblings, 0 replies; 26+ messages in thread
From: Liu Bo @ 2016-03-25 17:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Mar 25, 2016 at 01:26:00PM -0400, Josef Bacik wrote:
> Before we write into prealloc/nocow space we have to make sure that there are no
> references to the extents we are writing into, which means checking the extent
> tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
> unless we can't reserve data space, since it's a serious drag on performance.
> With the following sequence
> 
> fallocate -l10737418240 /mnt/btrfs-test/file
> cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
> fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
> 	--end_fsync=1
> 
> we get the worst case scenario where we have to fall back on to doing the check
> anyway.
> 
> Without this patch
> lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
> write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
> 
> With this patch
> lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
> write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
> 
> We get twice the throughput, half of the runtime, and half of the average
> latency.  Thanks,

I've submitted a similar one, but it looks like this one is cleaner; I
forgot to remove the goto reserve_metadata in mine.

Thanks,

-liubo

> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/file.c | 44 ++++++++++++++++++++++----------------------
>  1 file changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0ce4bb3..7c80208 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1534,30 +1534,30 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
>  		reserve_bytes = round_up(write_bytes + sector_offset,
>  				root->sectorsize);
>  
> -		if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
> -					      BTRFS_INODE_PREALLOC)) &&
> -		    check_can_nocow(inode, pos, &write_bytes) > 0) {
> -			/*
> -			 * For nodata cow case, no need to reserve
> -			 * data space.
> -			 */
> -			only_release_metadata = true;
> -			/*
> -			 * our prealloc extent may be smaller than
> -			 * write_bytes, so scale down.
> -			 */
> -			num_pages = DIV_ROUND_UP(write_bytes + offset,
> -						 PAGE_CACHE_SIZE);
> -			reserve_bytes = round_up(write_bytes + sector_offset,
> -					root->sectorsize);
> -			goto reserve_metadata;
> -		}
> -
>  		ret = btrfs_check_data_free_space(inode, pos, write_bytes);
> -		if (ret < 0)
> -			break;
> +		if (ret < 0) {
> +			if ((BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
> +						      BTRFS_INODE_PREALLOC)) &&
> +			    check_can_nocow(inode, pos, &write_bytes) > 0) {
> +				/*
> +				 * For nodata cow case, no need to reserve
> +				 * data space.
> +				 */
> +				only_release_metadata = true;
> +				/*
> +				 * our prealloc extent may be smaller than
> +				 * write_bytes, so scale down.
> +				 */
> +				num_pages = DIV_ROUND_UP(write_bytes + offset,
> +							 PAGE_CACHE_SIZE);
> +				reserve_bytes = round_up(write_bytes +
> +							 sector_offset,
> +							 root->sectorsize);
> +			} else {
> +				break;
> +			}
> +		}
>  
> -reserve_metadata:
>  		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
>  		if (ret) {
>  			if (!only_release_metadata)
> -- 
> 2.5.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents
  2016-03-25 17:25 ` [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents Josef Bacik
@ 2016-03-25 18:04   ` Liu Bo
  0 siblings, 0 replies; 26+ messages in thread
From: Liu Bo @ 2016-03-25 18:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Mar 25, 2016 at 01:25:49PM -0400, Josef Bacik wrote:
> There are a few races in the metadata reservation stuff.  First we add the bytes
> to the block_rsv well after we've set the bit on the inode saying that we have
> space for it and after we've reserved the bytes.  So use the normal
> btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
> extents when we try to reserve space for our write, which means that we could
> have used up the space for the inode and we wouldn't know because we only check
> before the reservation.  So instead make sure we are always reserving space for
> the inode update, and then if we don't need it release those bytes afterward.
> Thanks,

Looks fine.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 35 +++++++++++++----------------------
>  1 file changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 06f4e7b..157a0b6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5653,12 +5653,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>  	u64 to_reserve = 0;
>  	u64 csum_bytes;
>  	unsigned nr_extents = 0;
> -	int extra_reserve = 0;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
>  	bool delalloc_lock = true;
>  	u64 to_free = 0;
>  	unsigned dropped;
> +	bool release_extra = false;
>  
>  	/* If we are a free space inode we need to not flush since we will be in
>  	 * the middle of a transaction commit.  We also don't need the delalloc
> @@ -5684,24 +5684,15 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>  					 BTRFS_MAX_EXTENT_SIZE - 1,
>  					 BTRFS_MAX_EXTENT_SIZE);
>  	BTRFS_I(inode)->outstanding_extents += nr_extents;
> -	nr_extents = 0;
>  
> +	nr_extents = 0;
>  	if (BTRFS_I(inode)->outstanding_extents >
>  	    BTRFS_I(inode)->reserved_extents)
> -		nr_extents = BTRFS_I(inode)->outstanding_extents -
> +		nr_extents += BTRFS_I(inode)->outstanding_extents -
>  			BTRFS_I(inode)->reserved_extents;
>  
> -	/*
> -	 * Add an item to reserve for updating the inode when we complete the
> -	 * delalloc io.
> -	 */
> -	if (!test_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -		      &BTRFS_I(inode)->runtime_flags)) {
> -		nr_extents++;
> -		extra_reserve = 1;
> -	}
> -
> -	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents);
> +	/* We always want to reserve a slot for updating the inode. */
> +	to_reserve = btrfs_calc_trans_metadata_size(root, nr_extents + 1);
>  	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
>  	csum_bytes = BTRFS_I(inode)->csum_bytes;
>  	spin_unlock(&BTRFS_I(inode)->lock);
> @@ -5713,18 +5704,16 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>  			goto out_fail;
>  	}
>  
> -	ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
> +	ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
>  	if (unlikely(ret)) {
>  		btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
>  		goto out_fail;
>  	}
>  
>  	spin_lock(&BTRFS_I(inode)->lock);
> -	if (extra_reserve) {
> -		set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -			&BTRFS_I(inode)->runtime_flags);
> -		nr_extents--;
> -	}
> +	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> +			     &BTRFS_I(inode)->runtime_flags))
> +		release_extra = true;
>  	BTRFS_I(inode)->reserved_extents += nr_extents;
>  	spin_unlock(&BTRFS_I(inode)->lock);
>  
> @@ -5734,8 +5723,10 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>  	if (to_reserve)
>  		trace_btrfs_space_reservation(root->fs_info, "delalloc",
>  					      btrfs_ino(inode), to_reserve, 1);
> -	block_rsv_add_bytes(block_rsv, to_reserve, 1);
> -
> +	if (release_extra)
> +		btrfs_block_rsv_release(root, block_rsv,
> +					btrfs_calc_trans_metadata_size(root,
> +								       1));
>  	return 0;
>  
>  out_fail:
> -- 
> 2.5.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure
  2016-03-25 17:25 ` [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure Josef Bacik
@ 2016-05-09 21:29   ` Liu Bo
  2016-05-17 17:30   ` [PATCH V2] " Josef Bacik
  1 sibling, 0 replies; 26+ messages in thread
From: Liu Bo @ 2016-05-09 21:29 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Mar 25, 2016 at 01:25:53PM -0400, Josef Bacik wrote:
> Our enospc flushing sucks.  It is born from a time where we were early
> enospc'ing constantly because multiple threads would race in for the same
> reservation and randomly starve other ones out.  So I came up with this solution
> to block any other reservations from happening while one guy tried to flush
> stuff to satisfy his reservation.  This gives us pretty good correctness, but
> completely crap latency.
> 
> The solution I've come up with is ticketed reservations.  Basically we try to
> make our reservation, and if we can't we put a ticket on a list in order and
> kick off an async flusher thread.  This async flusher thread does the same old
> flushing we always did, just asynchronously.  As space is freed and added back
> to the space_info it checks and sees if we have any tickets that need
> satisfying, and adds space to the tickets and wakes up anything we've satisfied.
> 
> Once the flusher thread stops making progress it wakes up all the current
> tickets and tells them to take a hike.
> 
> There is a priority list for things that can't flush, since the async flusher
> could do anything we need to avoid deadlocks.  These guys get priority for
> having their reservation made, and will still do manual flushing themselves in
> case the async flusher isn't running.
> 
> This patch gives us significantly better latencies.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/ctree.h       |   2 +
>  fs/btrfs/extent-tree.c | 524 +++++++++++++++++++++++++++++++++++--------------
>  2 files changed, 375 insertions(+), 151 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b675066..7437c8a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1229,6 +1229,8 @@ struct btrfs_space_info {
>  	struct list_head list;
>  	/* Protected by the spinlock 'lock'. */
>  	struct list_head ro_bgs;
> +	struct list_head priority_tickets;
> +	struct list_head tickets;
>  
>  	struct rw_semaphore groups_sem;
>  	/* for block groups in our same type */
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0db4319..1673365 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -111,6 +111,16 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
>  			       u64 num_bytes);
>  int btrfs_pin_extent(struct btrfs_root *root,
>  		     u64 bytenr, u64 num_bytes, int reserved);
> +static int __reserve_metadata_bytes(struct btrfs_root *root,
> +				    struct btrfs_space_info *space_info,
> +				    u64 orig_bytes,
> +				    enum btrfs_reserve_flush_enum flush);
> +static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_space_info *space_info,
> +				     u64 num_bytes);
> +static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_space_info *space_info,
> +				     u64 num_bytes);
>  
>  static noinline int
>  block_group_cache_done(struct btrfs_block_group_cache *cache)
> @@ -3867,6 +3877,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
>  		found->bytes_readonly += bytes_readonly;
>  		if (total_bytes > 0)
>  			found->full = 0;
> +		space_info_add_new_bytes(info, found, total_bytes -
> +					 bytes_used - bytes_readonly);
>  		spin_unlock(&found->lock);
>  		*space_info = found;
>  		return 0;
> @@ -3901,6 +3913,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
>  	found->flush = 0;
>  	init_waitqueue_head(&found->wait);
>  	INIT_LIST_HEAD(&found->ro_bgs);
> +	INIT_LIST_HEAD(&found->tickets);
> +	INIT_LIST_HEAD(&found->priority_tickets);
>  
>  	ret = kobject_init_and_add(&found->kobj, &space_info_ktype,
>  				    info->space_info_kobj, "%s",
> @@ -4514,12 +4528,19 @@ static int can_overcommit(struct btrfs_root *root,
>  			  struct btrfs_space_info *space_info, u64 bytes,
>  			  enum btrfs_reserve_flush_enum flush)
>  {
> -	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
> -	u64 profile = btrfs_get_alloc_profile(root, 0);
> +	struct btrfs_block_rsv *global_rsv;
> +	u64 profile;
>  	u64 space_size;
>  	u64 avail;
>  	u64 used;
>  
> +	/* Don't overcommit when in mixed mode. */
> +	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
> +		return 0;
> +
> +	BUG_ON(root->fs_info == NULL);
> +	global_rsv = &root->fs_info->global_block_rsv;
> +	profile = btrfs_get_alloc_profile(root, 0);
>  	used = space_info->bytes_used + space_info->bytes_reserved +
>  		space_info->bytes_pinned + space_info->bytes_readonly;
>  
> @@ -4669,6 +4690,11 @@ skip_async:
>  			spin_unlock(&space_info->lock);
>  			break;
>  		}
> +		if (list_empty(&space_info->tickets) &&
> +		    list_empty(&space_info->priority_tickets)) {
> +			spin_unlock(&space_info->lock);
> +			break;
> +		}
>  		spin_unlock(&space_info->lock);
>  
>  		loops++;
> @@ -4745,6 +4771,13 @@ enum flush_state {
>  	COMMIT_TRANS		=	6,
>  };
>  
> +struct reserve_ticket {
> +	u64 bytes;
> +	int error;
> +	struct list_head list;
> +	wait_queue_head_t wait;
> +};
> +
>  static int flush_space(struct btrfs_root *root,
>  		       struct btrfs_space_info *space_info, u64 num_bytes,
>  		       u64 orig_bytes, int state)
> @@ -4802,17 +4835,22 @@ static inline u64
>  btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
>  				 struct btrfs_space_info *space_info)
>  {
> +	struct reserve_ticket *ticket;
>  	u64 used;
>  	u64 expected;
> -	u64 to_reclaim;
> +	u64 to_reclaim = 0;
>  
>  	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
> -	spin_lock(&space_info->lock);
>  	if (can_overcommit(root, space_info, to_reclaim,
> -			   BTRFS_RESERVE_FLUSH_ALL)) {
> -		to_reclaim = 0;
> -		goto out;
> -	}
> +			   BTRFS_RESERVE_FLUSH_ALL))
> +		return 0;
> +
> +	list_for_each_entry(ticket, &space_info->tickets, list)
> +		to_reclaim += ticket->bytes;
> +	list_for_each_entry(ticket, &space_info->priority_tickets, list)
> +		to_reclaim += ticket->bytes;
> +	if (to_reclaim)
> +		return to_reclaim;
>  
>  	used = space_info->bytes_used + space_info->bytes_reserved +
>  	       space_info->bytes_pinned + space_info->bytes_readonly +
> @@ -4828,9 +4866,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
>  		to_reclaim = 0;
>  	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
>  				     space_info->bytes_reserved);
> -out:
> -	spin_unlock(&space_info->lock);
> -
>  	return to_reclaim;
>  }
>  
> @@ -4847,69 +4882,169 @@ static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
>  		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
>  }
>  
> -static int btrfs_need_do_async_reclaim(struct btrfs_space_info *space_info,
> -				       struct btrfs_fs_info *fs_info,
> -				       int flush_state)
> +static void wake_all_tickets(struct list_head *head)
>  {
> -	u64 used;
> +	struct reserve_ticket *ticket;
>  
> -	spin_lock(&space_info->lock);
> -	/*
> -	 * We run out of space and have not got any free space via flush_space,
> -	 * so don't bother doing async reclaim.
> -	 */
> -	if (flush_state > COMMIT_TRANS && space_info->full) {
> -		spin_unlock(&space_info->lock);
> -		return 0;
> +	while (!list_empty(head)) {
> +		ticket = list_first_entry(head, struct reserve_ticket, list);
> +		list_del_init(&ticket->list);
> +		ticket->error = -ENOSPC;
> +		wake_up(&ticket->wait);
>  	}
> -
> -	used = space_info->bytes_used + space_info->bytes_reserved +
> -	       space_info->bytes_pinned + space_info->bytes_readonly +
> -	       space_info->bytes_may_use;
> -	if (need_do_async_reclaim(space_info, fs_info, used)) {
> -		spin_unlock(&space_info->lock);
> -		return 1;
> -	}
> -	spin_unlock(&space_info->lock);
> -
> -	return 0;
>  }
>  
> +/*
> + * This is for normal flushers, we can wait all goddamned day if we want to.  We
> + * will loop and continuously try to flush as long as we are making progress.
> + * We count progress as clearing off tickets each time we have to loop.
> + */
>  static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  {
> +	struct reserve_ticket *last_ticket = NULL;
>  	struct btrfs_fs_info *fs_info;
>  	struct btrfs_space_info *space_info;
>  	u64 to_reclaim;
>  	int flush_state;
> +	int commit_cycles = 0;
>  
>  	fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
>  	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
>  
> +	spin_lock(&space_info->lock);
>  	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
>  						      space_info);
> -	if (!to_reclaim)
> +	if (!to_reclaim) {
> +		space_info->flush = 0;
> +		spin_unlock(&space_info->lock);
>  		return;
> +	}
> +	last_ticket = list_first_entry(&space_info->tickets,
> +				       struct reserve_ticket, list);
> +	spin_unlock(&space_info->lock);
>  
>  	flush_state = FLUSH_DELAYED_ITEMS_NR;
>  	do {
> +		struct reserve_ticket *ticket;
> +		int ret;
> +
> +		ret = flush_space(fs_info->fs_root, space_info, to_reclaim,
> +			    to_reclaim, flush_state);
> +		spin_lock(&space_info->lock);
> +		if (list_empty(&space_info->tickets)) {
> +			space_info->flush = 0;
> +			spin_unlock(&space_info->lock);
> +			return;
> +		}
> +		to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
> +							      space_info);
> +		ticket = list_first_entry(&space_info->tickets,
> +					  struct reserve_ticket, list);
> +		if (last_ticket == ticket) {
> +			flush_state++;
> +		} else {
> +			last_ticket = ticket;
> +			flush_state = FLUSH_DELAYED_ITEMS_NR;
> +			if (commit_cycles)
> +				commit_cycles--;
> +		}
> +
> +		if (flush_state > COMMIT_TRANS) {
> +			commit_cycles++;
> +			if (commit_cycles > 2) {
> +				wake_all_tickets(&space_info->tickets);
> +				space_info->flush = 0;
> +			} else {
> +				flush_state = FLUSH_DELAYED_ITEMS_NR;
> +			}
> +		}
> +		spin_unlock(&space_info->lock);
> +	} while (flush_state <= COMMIT_TRANS);
> +}
> +
> +void btrfs_init_async_reclaim_work(struct work_struct *work)
> +{
> +	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
> +}
> +
> +static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
> +					    struct btrfs_space_info *space_info,
> +					    struct reserve_ticket *ticket)
> +{
> +	u64 to_reclaim;
> +	int flush_state = FLUSH_DELAYED_ITEMS_NR;
> +
> +	spin_lock(&space_info->lock);
> +	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
> +						      space_info);
> +	if (!to_reclaim) {
> +		spin_unlock(&space_info->lock);
> +		return;
> +	}
> +	spin_unlock(&space_info->lock);
> +
> +	do {
>  		flush_space(fs_info->fs_root, space_info, to_reclaim,
>  			    to_reclaim, flush_state);
>  		flush_state++;
> -		if (!btrfs_need_do_async_reclaim(space_info, fs_info,
> -						 flush_state))
> +		spin_lock(&space_info->lock);
> +		if (ticket->bytes == 0) {
> +			spin_unlock(&space_info->lock);
>  			return;
> +		}
> +		spin_unlock(&space_info->lock);
> +
> +		/*
> +		 * Priority flushers can't wait on delalloc without
> +		 * deadlocking.
> +		 */
> +		if (flush_state == FLUSH_DELALLOC ||
> +		    flush_state == FLUSH_DELALLOC_WAIT)
> +			flush_state = ALLOC_CHUNK;
>  	} while (flush_state < COMMIT_TRANS);
>  }
>  
> -void btrfs_init_async_reclaim_work(struct work_struct *work)
> +static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
> +			       struct btrfs_space_info *space_info,
> +			       struct reserve_ticket *ticket, u64 orig_bytes)
> +
>  {
> -	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
> +	DEFINE_WAIT(wait);
> +	int ret = 0;
> +
> +	spin_lock(&space_info->lock);
> +	while (ticket->bytes > 0 && ticket->error == 0) {
> +		ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
> +		if (ret) {
> +			ret = -EINTR;
> +			break;
> +		}
> +		spin_unlock(&space_info->lock);
> +
> +		schedule();
> +
> +		finish_wait(&ticket->wait, &wait);
> +		spin_lock(&space_info->lock);
> +	}
> +	if (!ret)
> +		ret = ticket->error;
> +	if (!list_empty(&ticket->list))
> +		list_del_init(&ticket->list);
> +	if (ticket->bytes && ticket->bytes < orig_bytes) {
> +		u64 num_bytes = orig_bytes - ticket->bytes;
> +		space_info->bytes_may_use -= num_bytes;
> +		trace_btrfs_space_reservation(fs_info, "space_info",
> +					      space_info->flags, num_bytes, 0);
> +	}
> +	spin_unlock(&space_info->lock);
> +
> +	return ret;
>  }
>  
>  /**
>   * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
>   * @root - the root we're allocating for
> - * @block_rsv - the block_rsv we're allocating for
> + * @space_info - the space info we want to allocate from
>   * @orig_bytes - the number of bytes we want
>   * @flush - whether or not we can flush to make our reservation
>   *
> @@ -4920,81 +5055,34 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
>   * regain reservations will be made and this will fail if there is not enough
>   * space already.
>   */
> -static int reserve_metadata_bytes(struct btrfs_root *root,
> -				  struct btrfs_block_rsv *block_rsv,
> -				  u64 orig_bytes,
> -				  enum btrfs_reserve_flush_enum flush)
> +static int __reserve_metadata_bytes(struct btrfs_root *root,
> +				    struct btrfs_space_info *space_info,
> +				    u64 orig_bytes,
> +				    enum btrfs_reserve_flush_enum flush)
>  {
> -	struct btrfs_space_info *space_info = block_rsv->space_info;
> +	struct reserve_ticket ticket;
>  	u64 used;
> -	u64 num_bytes = orig_bytes;
> -	int flush_state = FLUSH_DELAYED_ITEMS_NR;
>  	int ret = 0;
> -	bool flushing = false;
>  
> -again:
> -	ret = 0;
> +	ASSERT(orig_bytes);
>  	spin_lock(&space_info->lock);
> -	/*
> -	 * We only want to wait if somebody other than us is flushing and we
> -	 * are actually allowed to flush all things.
> -	 */
> -	while (flush == BTRFS_RESERVE_FLUSH_ALL && !flushing &&
> -	       space_info->flush) {
> -		spin_unlock(&space_info->lock);
> -		/*
> -		 * If we have a trans handle we can't wait because the flusher
> -		 * may have to commit the transaction, which would mean we would
> -		 * deadlock since we are waiting for the flusher to finish, but
> -		 * hold the current transaction open.
> -		 */
> -		if (current->journal_info)
> -			return -EAGAIN;
> -		ret = wait_event_killable(space_info->wait, !space_info->flush);
> -		/* Must have been killed, return */
> -		if (ret)
> -			return -EINTR;
> -
> -		spin_lock(&space_info->lock);
> -	}
> -
>  	ret = -ENOSPC;
>  	used = space_info->bytes_used + space_info->bytes_reserved +
>  		space_info->bytes_pinned + space_info->bytes_readonly +
>  		space_info->bytes_may_use;
>  
>  	/*
> -	 * The idea here is that we've not already over-reserved the block group
> -	 * then we can go ahead and save our reservation first and then start
> -	 * flushing if we need to.  Otherwise if we've already overcommitted
> -	 * lets start flushing stuff first and then come back and try to make
> -	 * our reservation.
> +	 * If we have enough space then hooray, make our reservation and carry
> +	 * on.  If not see if we can overcommit, and if we can, hooray carry on.
> +	 * If not things get more complicated.
>  	 */
> -	if (used <= space_info->total_bytes) {
> -		if (used + orig_bytes <= space_info->total_bytes) {
> -			space_info->bytes_may_use += orig_bytes;
> -			trace_btrfs_space_reservation(root->fs_info,
> -				"space_info", space_info->flags, orig_bytes, 1);
> -			ret = 0;
> -		} else {
> -			/*
> -			 * Ok set num_bytes to orig_bytes since we aren't
> -			 * overocmmitted, this way we only try and reclaim what
> -			 * we need.
> -			 */
> -			num_bytes = orig_bytes;
> -		}
> -	} else {
> -		/*
> -		 * Ok we're over committed, set num_bytes to the overcommitted
> -		 * amount plus the amount of bytes that we need for this
> -		 * reservation.
> -		 */
> -		num_bytes = used - space_info->total_bytes +
> -			(orig_bytes * 2);
> -	}
> -
> -	if (ret && can_overcommit(root, space_info, orig_bytes, flush)) {
> +	if (used + orig_bytes <= space_info->total_bytes) {
> +		space_info->bytes_may_use += orig_bytes;
> +		trace_btrfs_space_reservation(root->fs_info, "space_info",
> +					      space_info->flags, orig_bytes,
> +					      1);
> +		ret = 0;
> +	} else if (can_overcommit(root, space_info, orig_bytes, flush)) {
>  		space_info->bytes_may_use += orig_bytes;
>  		trace_btrfs_space_reservation(root->fs_info, "space_info",
>  					      space_info->flags, orig_bytes,
> @@ -5003,16 +5091,27 @@ again:
>  	}
>  
>  	/*
> -	 * Couldn't make our reservation, save our place so while we're trying
> -	 * to reclaim space we can actually use it instead of somebody else
> -	 * stealing it from us.
> +	 * If we couldn't make a reservation then setup our reservation ticket
> +	 * and kick the async worker if it's not already running.
>  	 *
> -	 * We make the other tasks wait for the flush only when we can flush
> -	 * all things.
> +	 * If we are a priority flusher then we just need to add our ticket to
> +	 * the list and we will do our own flushing further down.
>  	 */
>  	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
> -		flushing = true;
> -		space_info->flush = 1;
> +		ticket.bytes = orig_bytes;
> +		ticket.error = 0;
> +		init_waitqueue_head(&ticket.wait);
> +		if (flush == BTRFS_RESERVE_FLUSH_ALL) {
> +			list_add_tail(&ticket.list, &space_info->tickets);
> +			if (!space_info->flush) {
> +				space_info->flush = 1;
> +				queue_work(system_unbound_wq,
> +					   &root->fs_info->async_reclaim_work);
> +			}
> +		} else {
> +			list_add_tail(&ticket.list,
> +				      &space_info->priority_tickets);
> +		}
>  	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
>  		used += orig_bytes;
>  		/*
> @@ -5027,33 +5126,56 @@ again:
>  				   &root->fs_info->async_reclaim_work);
>  	}
>  	spin_unlock(&space_info->lock);
> -
>  	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
> -		goto out;
> +		return ret;
>  
> -	ret = flush_space(root, space_info, num_bytes, orig_bytes,
> -			  flush_state);
> -	flush_state++;
> +	if (flush == BTRFS_RESERVE_FLUSH_ALL)
> +		return wait_reserve_ticket(root->fs_info, space_info, &ticket,
> +					   orig_bytes);
>  
> -	/*
> -	 * If we are FLUSH_LIMIT, we can not flush delalloc, or the deadlock
> -	 * would happen. So skip delalloc flush.
> -	 */
> -	if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
> -	    (flush_state == FLUSH_DELALLOC ||
> -	     flush_state == FLUSH_DELALLOC_WAIT))
> -		flush_state = ALLOC_CHUNK;
> +	ret = 0;
> +	priority_reclaim_metadata_space(root->fs_info, space_info, &ticket);
> +	spin_lock(&space_info->lock);
> +	if (ticket.bytes) {
> +		if (ticket.bytes < orig_bytes) {
> +			u64 num_bytes = orig_bytes - ticket.bytes;
> +			space_info->bytes_may_use -= num_bytes;
> +			trace_btrfs_space_reservation(root->fs_info,
> +					"space_info", space_info->flags,
> +					num_bytes, 0);
>  
> -	if (!ret)
> -		goto again;
> -	else if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
> -		 flush_state < COMMIT_TRANS)
> -		goto again;
> -	else if (flush == BTRFS_RESERVE_FLUSH_ALL &&
> -		 flush_state <= COMMIT_TRANS)
> -		goto again;
> +		}
> +		list_del_init(&ticket.list);
> +		ret = -ENOSPC;
> +	}
> +	spin_unlock(&space_info->lock);
> +	ASSERT(list_empty(&ticket.list));
> +	return ret;
> +}
>  
> -out:
> +/**
> + * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
> + * @root - the root we're allocating for
> + * @block_rsv - the block_rsv we're allocating for
> + * @orig_bytes - the number of bytes we want
> + * @flush - whether or not we can flush to make our reservation
> + *
> + * This will reserve orgi_bytes number of bytes from the space info associated
> + * with the block_rsv.  If there is not enough space it will make an attempt to
> + * flush out space to make room.  It will do this by flushing delalloc if
> + * possible or committing the transaction.  If flush is 0 then no attempts to
> + * regain reservations will be made and this will fail if there is not enough
> + * space already.
> + */
> +static int reserve_metadata_bytes(struct btrfs_root *root,
> +				  struct btrfs_block_rsv *block_rsv,
> +				  u64 orig_bytes,
> +				  enum btrfs_reserve_flush_enum flush)
> +{
> +	int ret;
> +
> +	ret = __reserve_metadata_bytes(root, block_rsv->space_info, orig_bytes,
> +				       flush);
>  	if (ret == -ENOSPC &&
>  	    unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
>  		struct btrfs_block_rsv *global_rsv =
> @@ -5066,13 +5188,8 @@ out:
>  	if (ret == -ENOSPC)
>  		trace_btrfs_space_reservation(root->fs_info,
>  					      "space_info:enospc",
> -					      space_info->flags, orig_bytes, 1);
> -	if (flushing) {
> -		spin_lock(&space_info->lock);
> -		space_info->flush = 0;
> -		wake_up_all(&space_info->wait);
> -		spin_unlock(&space_info->lock);
> -	}
> +					      block_rsv->space_info->flags,
> +					      orig_bytes, 1);
>  	return ret;
>  }
>  
> @@ -5148,6 +5265,103 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>  
> +/*
> + * This is for space we already have accounted in space_info->bytes_may_use, so
> + * basically when we're returning space from block_rsv's.
> + */
> +static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_space_info *space_info,
> +				     u64 num_bytes)
> +{
> +	struct reserve_ticket *ticket;
> +	struct list_head *head;
> +	u64 used;
> +	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_NO_FLUSH;
> +	bool check_overcommit = false;
> +
> +	spin_lock(&space_info->lock);
> +	head = &space_info->priority_tickets;
> +
> +	/*
> +	 * First we want to see if we're over our limit, because if we are then
> +	 * we need to make sure we are still ok overcommitting before we satisfy
> +	 * another reservation.
> +	 */
> +	used = space_info->bytes_used + space_info->bytes_reserved +
> +		space_info->bytes_pinned + space_info->bytes_readonly;
> +	if (used - num_bytes >= space_info->total_bytes)
> +		check_overcommit = true;

'used' without bytes_may_use should be less than ->total_bytes,
you wanna check if (used + num_bytes >= space_info->total_bytes) ?

Others are sane to me.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> +again:
> +	while (!list_empty(head) && num_bytes) {
> +		ticket = list_first_entry(head, struct reserve_ticket,
> +					  list);
> +		if (check_overcommit &&
> +		    !can_overcommit(fs_info->extent_root, space_info,
> +				    ticket->bytes, flush))
> +			break;
> +		if (num_bytes >= ticket->bytes) {
> +			list_del_init(&ticket->list);
> +			num_bytes -= ticket->bytes;
> +			ticket->bytes = 0;
> +			wake_up(&ticket->wait);
> +		} else {
> +			ticket->bytes -= num_bytes;
> +			num_bytes = 0;
> +		}
> +	}
> +
> +	if (num_bytes && head == &space_info->priority_tickets) {
> +		head = &space_info->tickets;
> +		flush = BTRFS_RESERVE_FLUSH_ALL;
> +		goto again;
> +	}
> +	space_info->bytes_may_use -= num_bytes;
> +	trace_btrfs_space_reservation(fs_info, "space_info",
> +				      space_info->flags, num_bytes, 0);
> +	spin_unlock(&space_info->lock);
> +}
> +
> +/*
> + * This is for newly allocated space that isn't accounted in
> + * space_info->bytes_may_use yet.  So if we allocate a chunk or unpin an extent
> + * we use this helper.
> + */
> +static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
> +				     struct btrfs_space_info *space_info,
> +				     u64 num_bytes)
> +{
> +	struct reserve_ticket *ticket;
> +	struct list_head *head = &space_info->priority_tickets;
> +
> +again:
> +	while (!list_empty(head) && num_bytes) {
> +		ticket = list_first_entry(head, struct reserve_ticket,
> +					  list);
> +		if (num_bytes >= ticket->bytes) {
> +			trace_btrfs_space_reservation(fs_info, "space_info",
> +						      space_info->flags,
> +						      ticket->bytes, 1);
> +			list_del_init(&ticket->list);
> +			num_bytes -= ticket->bytes;
> +			space_info->bytes_may_use += ticket->bytes;
> +			ticket->bytes = 0;
> +			wake_up(&ticket->wait);
> +		} else {
> +			trace_btrfs_space_reservation(fs_info, "space_info",
> +						      space_info->flags,
> +						      num_bytes, 1);
> +			space_info->bytes_may_use += num_bytes;
> +			ticket->bytes -= num_bytes;
> +			num_bytes = 0;
> +		}
> +	}
> +
> +	if (num_bytes && head == &space_info->priority_tickets) {
> +		head = &space_info->tickets;
> +		goto again;
> +	}
> +}
> +
>  static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  				    struct btrfs_block_rsv *block_rsv,
>  				    struct btrfs_block_rsv *dest, u64 num_bytes)
> @@ -5182,13 +5396,9 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  			}
>  			spin_unlock(&dest->lock);
>  		}
> -		if (num_bytes) {
> -			spin_lock(&space_info->lock);
> -			space_info->bytes_may_use -= num_bytes;
> -			trace_btrfs_space_reservation(fs_info, "space_info",
> -					space_info->flags, num_bytes, 0);
> -			spin_unlock(&space_info->lock);
> -		}
> +		if (num_bytes)
> +			space_info_add_old_bytes(fs_info, space_info,
> +						 num_bytes);
>  	}
>  }
>  
> @@ -6346,17 +6556,29 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
>  			readonly = true;
>  		}
>  		spin_unlock(&cache->lock);
> -		if (!readonly && global_rsv->space_info == space_info) {
> +		if (!readonly && return_free_space &&
> +		    global_rsv->space_info == space_info) {
> +			u64 to_add = len;
> +			WARN_ON(!return_free_space);
>  			spin_lock(&global_rsv->lock);
>  			if (!global_rsv->full) {
> -				len = min(len, global_rsv->size -
> -					  global_rsv->reserved);
> -				global_rsv->reserved += len;
> -				space_info->bytes_may_use += len;
> +				to_add = min(len, global_rsv->size -
> +					     global_rsv->reserved);
> +				global_rsv->reserved += to_add;
> +				space_info->bytes_may_use += to_add;
>  				if (global_rsv->reserved >= global_rsv->size)
>  					global_rsv->full = 1;
> +				trace_btrfs_space_reservation(fs_info,
> +							      "space_info",
> +							      space_info->flags,
> +							      to_add, 1);
> +				len -= to_add;
>  			}
>  			spin_unlock(&global_rsv->lock);
> +			/* Add to any tickets we may have */
> +			if (len)
> +				space_info_add_new_bytes(fs_info, space_info,
> +							 len);
>  		}
>  		spin_unlock(&space_info->lock);
>  	}
> -- 
> 2.5.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 12/14] Btrfs: fix release reserved extents trace points
  2016-03-25 17:25 ` [PATCH 12/14] Btrfs: fix release reserved extents trace points Josef Bacik
@ 2016-05-09 21:33   ` Liu Bo
  0 siblings, 0 replies; 26+ messages in thread
From: Liu Bo @ 2016-05-09 21:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Fri, Mar 25, 2016 at 01:25:58PM -0400, Josef Bacik wrote:
> We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
> isn't quite right because we will go through and free that extent later when we
> unpin, so it messes up apps that are accounting for the reservation space.  We
> were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
> only actually free the reservation instead of pinning the extent.  Thanks,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ecceea..273e18d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -6253,8 +6253,6 @@ static int pin_down_extent(struct btrfs_root *root,
>  				      cache->space_info->flags, num_bytes, 1);
>  	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
>  			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
> -	if (reserved)
> -		trace_btrfs_reserved_extent_free(root, bytenr, num_bytes);
>  	return 0;
>  }
>  
> @@ -7877,12 +7875,10 @@ static int __btrfs_free_reserved_extent(struct btrfs_root *root,
>  			ret = btrfs_discard_extent(root, start, len, NULL);
>  		btrfs_add_free_space(cache, start, len);
>  		btrfs_update_reserved_bytes(cache, len, RESERVE_FREE, delalloc);
> +		trace_btrfs_reserved_extent_free(root, start, len);
>  	}
>  
>  	btrfs_put_block_group(cache);
> -
> -	trace_btrfs_reserved_extent_free(root, start, len);
> -
>  	return ret;
>  }
>  
> -- 
> 2.5.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH V2] Btrfs: introduce ticketed enospc infrastructure
  2016-03-25 17:25 ` [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure Josef Bacik
  2016-05-09 21:29   ` Liu Bo
@ 2016-05-17 17:30   ` Josef Bacik
  2016-05-18 11:24     ` Austin S. Hemmelgarn
  2016-05-18 22:46     ` David Sterba
  1 sibling, 2 replies; 26+ messages in thread
From: Josef Bacik @ 2016-05-17 17:30 UTC (permalink / raw)
  To: linux-btrfs

Our enospc flushing sucks.  It was born in a time when we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.  So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation.  This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread.  This async flusher thread does the same old
flushing we always did, just asynchronously.  As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything and we need to avoid deadlocks.  These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
V1->V2:
-fixed a check in space_info_add_old_bytes where we didn't take into account
 bytes_may_use for the space used.
-don't count ticket->bytes when checking overcommit.
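
For reviewers who want the shape of the mechanism without reading the whole
diff, below is a much-simplified, userspace-only sketch of the core idea:
freed space gets handed to waiting tickets in FIFO order, partially filling
the ticket at the head when there isn't enough for all of it.  The struct,
function and field names here are invented for the illustration and are not
the names used in the patch; in the kernel the 'satisfied' flag is really a
per-ticket waitqueue that gets woken.

/*
 * Illustrative userspace sketch only; not the kernel code.  Builds with any
 * C99 compiler, and every name below is made up for this example.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct ticket {
	uint64_t bytes;		/* bytes still missing for this reservation */
	int satisfied;		/* stands in for wake_up(&ticket->wait) */
};

/* Hand newly freed space to waiting tickets in FIFO order. */
static uint64_t add_bytes(struct ticket *tickets, size_t nr, uint64_t bytes)
{
	for (size_t i = 0; i < nr && bytes; i++) {
		struct ticket *t = &tickets[i];

		if (t->satisfied)
			continue;
		if (bytes >= t->bytes) {
			/* enough for the whole ticket: wake the waiter */
			bytes -= t->bytes;
			t->bytes = 0;
			t->satisfied = 1;
		} else {
			/* partial progress, the waiter keeps sleeping */
			t->bytes -= bytes;
			bytes = 0;
		}
	}
	return bytes;	/* leftover space nobody was waiting for */
}

int main(void)
{
	struct ticket tickets[3] = { { 4096, 0 }, { 8192, 0 }, { 16384, 0 } };
	uint64_t left = add_bytes(tickets, 3, 10000);

	for (int i = 0; i < 3; i++)
		printf("ticket %d: still needs %llu, satisfied=%d\n", i,
		       (unsigned long long)tickets[i].bytes,
		       tickets[i].satisfied);
	printf("leftover: %llu\n", (unsigned long long)left);
	return 0;
}

The patch itself does this walk twice, priority tickets first and then the
normal list, and uses a per-ticket waitqueue instead of a flag.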

 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/extent-tree.c | 529 +++++++++++++++++++++++++++++++++++--------------
 2 files changed, 380 insertions(+), 151 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b675066..7437c8a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1229,6 +1229,8 @@ struct btrfs_space_info {
 	struct list_head list;
 	/* Protected by the spinlock 'lock'. */
 	struct list_head ro_bgs;
+	struct list_head priority_tickets;
+	struct list_head tickets;
 
 	struct rw_semaphore groups_sem;
 	/* for block groups in our same type */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0db4319..00e3344 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -111,6 +111,16 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
 			       u64 num_bytes);
 int btrfs_pin_extent(struct btrfs_root *root,
 		     u64 bytenr, u64 num_bytes, int reserved);
+static int __reserve_metadata_bytes(struct btrfs_root *root,
+				    struct btrfs_space_info *space_info,
+				    u64 orig_bytes,
+				    enum btrfs_reserve_flush_enum flush);
+static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes);
+static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes);
 
 static noinline int
 block_group_cache_done(struct btrfs_block_group_cache *cache)
@@ -3867,6 +3877,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 		found->bytes_readonly += bytes_readonly;
 		if (total_bytes > 0)
 			found->full = 0;
+		space_info_add_new_bytes(info, found, total_bytes -
+					 bytes_used - bytes_readonly);
 		spin_unlock(&found->lock);
 		*space_info = found;
 		return 0;
@@ -3901,6 +3913,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	found->flush = 0;
 	init_waitqueue_head(&found->wait);
 	INIT_LIST_HEAD(&found->ro_bgs);
+	INIT_LIST_HEAD(&found->tickets);
+	INIT_LIST_HEAD(&found->priority_tickets);
 
 	ret = kobject_init_and_add(&found->kobj, &space_info_ktype,
 				    info->space_info_kobj, "%s",
@@ -4514,12 +4528,19 @@ static int can_overcommit(struct btrfs_root *root,
 			  struct btrfs_space_info *space_info, u64 bytes,
 			  enum btrfs_reserve_flush_enum flush)
 {
-	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
-	u64 profile = btrfs_get_alloc_profile(root, 0);
+	struct btrfs_block_rsv *global_rsv;
+	u64 profile;
 	u64 space_size;
 	u64 avail;
 	u64 used;
 
+	/* Don't overcommit when in mixed mode. */
+	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
+		return 0;
+
+	BUG_ON(root->fs_info == NULL);
+	global_rsv = &root->fs_info->global_block_rsv;
+	profile = btrfs_get_alloc_profile(root, 0);
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly;
 
@@ -4669,6 +4690,11 @@ skip_async:
 			spin_unlock(&space_info->lock);
 			break;
 		}
+		if (list_empty(&space_info->tickets) &&
+		    list_empty(&space_info->priority_tickets)) {
+			spin_unlock(&space_info->lock);
+			break;
+		}
 		spin_unlock(&space_info->lock);
 
 		loops++;
@@ -4745,6 +4771,13 @@ enum flush_state {
 	COMMIT_TRANS		=	6,
 };
 
+struct reserve_ticket {
+	u64 bytes;
+	int error;
+	struct list_head list;
+	wait_queue_head_t wait;
+};
+
 static int flush_space(struct btrfs_root *root,
 		       struct btrfs_space_info *space_info, u64 num_bytes,
 		       u64 orig_bytes, int state)
@@ -4802,17 +4835,22 @@ static inline u64
 btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
 				 struct btrfs_space_info *space_info)
 {
+	struct reserve_ticket *ticket;
 	u64 used;
 	u64 expected;
-	u64 to_reclaim;
+	u64 to_reclaim = 0;
 
 	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	spin_lock(&space_info->lock);
 	if (can_overcommit(root, space_info, to_reclaim,
-			   BTRFS_RESERVE_FLUSH_ALL)) {
-		to_reclaim = 0;
-		goto out;
-	}
+			   BTRFS_RESERVE_FLUSH_ALL))
+		return 0;
+
+	list_for_each_entry(ticket, &space_info->tickets, list)
+		to_reclaim += ticket->bytes;
+	list_for_each_entry(ticket, &space_info->priority_tickets, list)
+		to_reclaim += ticket->bytes;
+	if (to_reclaim)
+		return to_reclaim;
 
 	used = space_info->bytes_used + space_info->bytes_reserved +
 	       space_info->bytes_pinned + space_info->bytes_readonly +
@@ -4828,9 +4866,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
 		to_reclaim = 0;
 	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
 				     space_info->bytes_reserved);
-out:
-	spin_unlock(&space_info->lock);
-
 	return to_reclaim;
 }
 
@@ -4847,69 +4882,169 @@ static inline int need_do_async_reclaim(struct btrfs_space_info *space_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static int btrfs_need_do_async_reclaim(struct btrfs_space_info *space_info,
-				       struct btrfs_fs_info *fs_info,
-				       int flush_state)
+static void wake_all_tickets(struct list_head *head)
 {
-	u64 used;
-
-	spin_lock(&space_info->lock);
-	/*
-	 * We run out of space and have not got any free space via flush_space,
-	 * so don't bother doing async reclaim.
-	 */
-	if (flush_state > COMMIT_TRANS && space_info->full) {
-		spin_unlock(&space_info->lock);
-		return 0;
-	}
+	struct reserve_ticket *ticket;
 
-	used = space_info->bytes_used + space_info->bytes_reserved +
-	       space_info->bytes_pinned + space_info->bytes_readonly +
-	       space_info->bytes_may_use;
-	if (need_do_async_reclaim(space_info, fs_info, used)) {
-		spin_unlock(&space_info->lock);
-		return 1;
+	while (!list_empty(head)) {
+		ticket = list_first_entry(head, struct reserve_ticket, list);
+		list_del_init(&ticket->list);
+		ticket->error = -ENOSPC;
+		wake_up(&ticket->wait);
 	}
-	spin_unlock(&space_info->lock);
-
-	return 0;
 }
 
+/*
+ * This is for normal flushers, we can wait all goddamned day if we want to.  We
+ * will loop and continuously try to flush as long as we are making progress.
+ * We count progress as clearing off tickets each time we have to loop.
+ */
 static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 {
+	struct reserve_ticket *last_ticket = NULL;
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_space_info *space_info;
 	u64 to_reclaim;
 	int flush_state;
+	int commit_cycles = 0;
 
 	fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
 	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 
+	spin_lock(&space_info->lock);
 	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
 						      space_info);
-	if (!to_reclaim)
+	if (!to_reclaim) {
+		space_info->flush = 0;
+		spin_unlock(&space_info->lock);
 		return;
+	}
+	last_ticket = list_first_entry(&space_info->tickets,
+				       struct reserve_ticket, list);
+	spin_unlock(&space_info->lock);
 
 	flush_state = FLUSH_DELAYED_ITEMS_NR;
 	do {
+		struct reserve_ticket *ticket;
+		int ret;
+
+		ret = flush_space(fs_info->fs_root, space_info, to_reclaim,
+			    to_reclaim, flush_state);
+		spin_lock(&space_info->lock);
+		if (list_empty(&space_info->tickets)) {
+			space_info->flush = 0;
+			spin_unlock(&space_info->lock);
+			return;
+		}
+		to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+							      space_info);
+		ticket = list_first_entry(&space_info->tickets,
+					  struct reserve_ticket, list);
+		if (last_ticket == ticket) {
+			flush_state++;
+		} else {
+			last_ticket = ticket;
+			flush_state = FLUSH_DELAYED_ITEMS_NR;
+			if (commit_cycles)
+				commit_cycles--;
+		}
+
+		if (flush_state > COMMIT_TRANS) {
+			commit_cycles++;
+			if (commit_cycles > 2) {
+				wake_all_tickets(&space_info->tickets);
+				space_info->flush = 0;
+			} else {
+				flush_state = FLUSH_DELAYED_ITEMS_NR;
+			}
+		}
+		spin_unlock(&space_info->lock);
+	} while (flush_state <= COMMIT_TRANS);
+}
+
+void btrfs_init_async_reclaim_work(struct work_struct *work)
+{
+	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+}
+
+static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
+					    struct btrfs_space_info *space_info,
+					    struct reserve_ticket *ticket)
+{
+	u64 to_reclaim;
+	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+
+	spin_lock(&space_info->lock);
+	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info->fs_root,
+						      space_info);
+	if (!to_reclaim) {
+		spin_unlock(&space_info->lock);
+		return;
+	}
+	spin_unlock(&space_info->lock);
+
+	do {
 		flush_space(fs_info->fs_root, space_info, to_reclaim,
 			    to_reclaim, flush_state);
 		flush_state++;
-		if (!btrfs_need_do_async_reclaim(space_info, fs_info,
-						 flush_state))
+		spin_lock(&space_info->lock);
+		if (ticket->bytes == 0) {
+			spin_unlock(&space_info->lock);
 			return;
+		}
+		spin_unlock(&space_info->lock);
+
+		/*
+		 * Priority flushers can't wait on delalloc without
+		 * deadlocking.
+		 */
+		if (flush_state == FLUSH_DELALLOC ||
+		    flush_state == FLUSH_DELALLOC_WAIT)
+			flush_state = ALLOC_CHUNK;
 	} while (flush_state < COMMIT_TRANS);
 }
 
-void btrfs_init_async_reclaim_work(struct work_struct *work)
+static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
+			       struct btrfs_space_info *space_info,
+			       struct reserve_ticket *ticket, u64 orig_bytes)
+
 {
-	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	spin_lock(&space_info->lock);
+	while (ticket->bytes > 0 && ticket->error == 0) {
+		ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
+		if (ret) {
+			ret = -EINTR;
+			break;
+		}
+		spin_unlock(&space_info->lock);
+
+		schedule();
+
+		finish_wait(&ticket->wait, &wait);
+		spin_lock(&space_info->lock);
+	}
+	if (!ret)
+		ret = ticket->error;
+	if (!list_empty(&ticket->list))
+		list_del_init(&ticket->list);
+	if (ticket->bytes && ticket->bytes < orig_bytes) {
+		u64 num_bytes = orig_bytes - ticket->bytes;
+		space_info->bytes_may_use -= num_bytes;
+		trace_btrfs_space_reservation(fs_info, "space_info",
+					      space_info->flags, num_bytes, 0);
+	}
+	spin_unlock(&space_info->lock);
+
+	return ret;
 }
 
 /**
  * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
  * @root - the root we're allocating for
- * @block_rsv - the block_rsv we're allocating for
+ * @space_info - the space info we want to allocate from
  * @orig_bytes - the number of bytes we want
  * @flush - whether or not we can flush to make our reservation
  *
@@ -4920,81 +5055,34 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
  * regain reservations will be made and this will fail if there is not enough
  * space already.
  */
-static int reserve_metadata_bytes(struct btrfs_root *root,
-				  struct btrfs_block_rsv *block_rsv,
-				  u64 orig_bytes,
-				  enum btrfs_reserve_flush_enum flush)
+static int __reserve_metadata_bytes(struct btrfs_root *root,
+				    struct btrfs_space_info *space_info,
+				    u64 orig_bytes,
+				    enum btrfs_reserve_flush_enum flush)
 {
-	struct btrfs_space_info *space_info = block_rsv->space_info;
+	struct reserve_ticket ticket;
 	u64 used;
-	u64 num_bytes = orig_bytes;
-	int flush_state = FLUSH_DELAYED_ITEMS_NR;
 	int ret = 0;
-	bool flushing = false;
 
-again:
-	ret = 0;
+	ASSERT(orig_bytes);
 	spin_lock(&space_info->lock);
-	/*
-	 * We only want to wait if somebody other than us is flushing and we
-	 * are actually allowed to flush all things.
-	 */
-	while (flush == BTRFS_RESERVE_FLUSH_ALL && !flushing &&
-	       space_info->flush) {
-		spin_unlock(&space_info->lock);
-		/*
-		 * If we have a trans handle we can't wait because the flusher
-		 * may have to commit the transaction, which would mean we would
-		 * deadlock since we are waiting for the flusher to finish, but
-		 * hold the current transaction open.
-		 */
-		if (current->journal_info)
-			return -EAGAIN;
-		ret = wait_event_killable(space_info->wait, !space_info->flush);
-		/* Must have been killed, return */
-		if (ret)
-			return -EINTR;
-
-		spin_lock(&space_info->lock);
-	}
-
 	ret = -ENOSPC;
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly +
 		space_info->bytes_may_use;
 
 	/*
-	 * The idea here is that we've not already over-reserved the block group
-	 * then we can go ahead and save our reservation first and then start
-	 * flushing if we need to.  Otherwise if we've already overcommitted
-	 * lets start flushing stuff first and then come back and try to make
-	 * our reservation.
+	 * If we have enough space then hooray, make our reservation and carry
+	 * on.  If not see if we can overcommit, and if we can, hooray carry on.
+	 * If not things get more complicated.
 	 */
-	if (used <= space_info->total_bytes) {
-		if (used + orig_bytes <= space_info->total_bytes) {
-			space_info->bytes_may_use += orig_bytes;
-			trace_btrfs_space_reservation(root->fs_info,
-				"space_info", space_info->flags, orig_bytes, 1);
-			ret = 0;
-		} else {
-			/*
-			 * Ok set num_bytes to orig_bytes since we aren't
-			 * overocmmitted, this way we only try and reclaim what
-			 * we need.
-			 */
-			num_bytes = orig_bytes;
-		}
-	} else {
-		/*
-		 * Ok we're over committed, set num_bytes to the overcommitted
-		 * amount plus the amount of bytes that we need for this
-		 * reservation.
-		 */
-		num_bytes = used - space_info->total_bytes +
-			(orig_bytes * 2);
-	}
-
-	if (ret && can_overcommit(root, space_info, orig_bytes, flush)) {
+	if (used + orig_bytes <= space_info->total_bytes) {
+		space_info->bytes_may_use += orig_bytes;
+		trace_btrfs_space_reservation(root->fs_info, "space_info",
+					      space_info->flags, orig_bytes,
+					      1);
+		ret = 0;
+	} else if (can_overcommit(root, space_info, orig_bytes, flush)) {
 		space_info->bytes_may_use += orig_bytes;
 		trace_btrfs_space_reservation(root->fs_info, "space_info",
 					      space_info->flags, orig_bytes,
@@ -5003,16 +5091,27 @@ again:
 	}
 
 	/*
-	 * Couldn't make our reservation, save our place so while we're trying
-	 * to reclaim space we can actually use it instead of somebody else
-	 * stealing it from us.
+	 * If we couldn't make a reservation then setup our reservation ticket
+	 * and kick the async worker if it's not already running.
 	 *
-	 * We make the other tasks wait for the flush only when we can flush
-	 * all things.
+	 * If we are a priority flusher then we just need to add our ticket to
+	 * the list and we will do our own flushing further down.
 	 */
 	if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
-		flushing = true;
-		space_info->flush = 1;
+		ticket.bytes = orig_bytes;
+		ticket.error = 0;
+		init_waitqueue_head(&ticket.wait);
+		if (flush == BTRFS_RESERVE_FLUSH_ALL) {
+			list_add_tail(&ticket.list, &space_info->tickets);
+			if (!space_info->flush) {
+				space_info->flush = 1;
+				queue_work(system_unbound_wq,
+					   &root->fs_info->async_reclaim_work);
+			}
+		} else {
+			list_add_tail(&ticket.list,
+				      &space_info->priority_tickets);
+		}
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		used += orig_bytes;
 		/*
@@ -5027,33 +5126,56 @@ again:
 				   &root->fs_info->async_reclaim_work);
 	}
 	spin_unlock(&space_info->lock);
-
 	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
-		goto out;
+		return ret;
 
-	ret = flush_space(root, space_info, num_bytes, orig_bytes,
-			  flush_state);
-	flush_state++;
+	if (flush == BTRFS_RESERVE_FLUSH_ALL)
+		return wait_reserve_ticket(root->fs_info, space_info, &ticket,
+					   orig_bytes);
 
-	/*
-	 * If we are FLUSH_LIMIT, we can not flush delalloc, or the deadlock
-	 * would happen. So skip delalloc flush.
-	 */
-	if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
-	    (flush_state == FLUSH_DELALLOC ||
-	     flush_state == FLUSH_DELALLOC_WAIT))
-		flush_state = ALLOC_CHUNK;
+	ret = 0;
+	priority_reclaim_metadata_space(root->fs_info, space_info, &ticket);
+	spin_lock(&space_info->lock);
+	if (ticket.bytes) {
+		if (ticket.bytes < orig_bytes) {
+			u64 num_bytes = orig_bytes - ticket.bytes;
+			space_info->bytes_may_use -= num_bytes;
+			trace_btrfs_space_reservation(root->fs_info,
+					"space_info", space_info->flags,
+					num_bytes, 0);
 
-	if (!ret)
-		goto again;
-	else if (flush == BTRFS_RESERVE_FLUSH_LIMIT &&
-		 flush_state < COMMIT_TRANS)
-		goto again;
-	else if (flush == BTRFS_RESERVE_FLUSH_ALL &&
-		 flush_state <= COMMIT_TRANS)
-		goto again;
+		}
+		list_del_init(&ticket.list);
+		ret = -ENOSPC;
+	}
+	spin_unlock(&space_info->lock);
+	ASSERT(list_empty(&ticket.list));
+	return ret;
+}
 
-out:
+/**
+ * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
+ * @root - the root we're allocating for
+ * @block_rsv - the block_rsv we're allocating for
+ * @orig_bytes - the number of bytes we want
+ * @flush - whether or not we can flush to make our reservation
+ *
+ * This will reserve orig_bytes number of bytes from the space info associated
+ * with the block_rsv.  If there is not enough space it will make an attempt to
+ * flush out space to make room.  It will do this by flushing delalloc if
+ * possible or committing the transaction.  If flush is 0 then no attempts to
+ * regain reservations will be made and this will fail if there is not enough
+ * space already.
+ */
+static int reserve_metadata_bytes(struct btrfs_root *root,
+				  struct btrfs_block_rsv *block_rsv,
+				  u64 orig_bytes,
+				  enum btrfs_reserve_flush_enum flush)
+{
+	int ret;
+
+	ret = __reserve_metadata_bytes(root, block_rsv->space_info, orig_bytes,
+				       flush);
 	if (ret == -ENOSPC &&
 	    unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
 		struct btrfs_block_rsv *global_rsv =
@@ -5066,13 +5188,8 @@ out:
 	if (ret == -ENOSPC)
 		trace_btrfs_space_reservation(root->fs_info,
 					      "space_info:enospc",
-					      space_info->flags, orig_bytes, 1);
-	if (flushing) {
-		spin_lock(&space_info->lock);
-		space_info->flush = 0;
-		wake_up_all(&space_info->wait);
-		spin_unlock(&space_info->lock);
-	}
+					      block_rsv->space_info->flags,
+					      orig_bytes, 1);
 	return ret;
 }
 
@@ -5148,6 +5265,108 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
+/*
+ * This is for space we already have accounted in space_info->bytes_may_use, so
+ * basically when we're returning space from block_rsv's.
+ */
+static void space_info_add_old_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes)
+{
+	struct reserve_ticket *ticket;
+	struct list_head *head;
+	u64 used;
+	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_NO_FLUSH;
+	bool check_overcommit = false;
+
+	spin_lock(&space_info->lock);
+	head = &space_info->priority_tickets;
+
+	/*
+	 * If we are over our limit then we need to check and see if we can
+	 * overcommit, and if we can't then we just need to free up our space
+	 * and not satisfy any requests.
+	 */
+	used = space_info->bytes_used + space_info->bytes_reserved +
+		space_info->bytes_pinned + space_info->bytes_readonly +
+		space_info->bytes_may_use;
+	if (used - num_bytes >= space_info->total_bytes)
+		check_overcommit = true;
+again:
+	while (!list_empty(head) && num_bytes) {
+		ticket = list_first_entry(head, struct reserve_ticket,
+					  list);
+		/*
+		 * We use 0 bytes because this space is already reserved, so
+		 * adding the ticket space would be a double count.
+		 */
+		if (check_overcommit &&
+		    !can_overcommit(fs_info->extent_root, space_info, 0,
+				    flush))
+			break;
+		if (num_bytes >= ticket->bytes) {
+			list_del_init(&ticket->list);
+			num_bytes -= ticket->bytes;
+			ticket->bytes = 0;
+			wake_up(&ticket->wait);
+		} else {
+			ticket->bytes -= num_bytes;
+			num_bytes = 0;
+		}
+	}
+
+	if (num_bytes && head == &space_info->priority_tickets) {
+		head = &space_info->tickets;
+		flush = BTRFS_RESERVE_FLUSH_ALL;
+		goto again;
+	}
+	space_info->bytes_may_use -= num_bytes;
+	trace_btrfs_space_reservation(fs_info, "space_info",
+				      space_info->flags, num_bytes, 0);
+	spin_unlock(&space_info->lock);
+}
+
+/*
+ * This is for newly allocated space that isn't accounted in
+ * space_info->bytes_may_use yet.  So if we allocate a chunk or unpin an extent
+ * we use this helper.
+ */
+static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
+				     struct btrfs_space_info *space_info,
+				     u64 num_bytes)
+{
+	struct reserve_ticket *ticket;
+	struct list_head *head = &space_info->priority_tickets;
+
+again:
+	while (!list_empty(head) && num_bytes) {
+		ticket = list_first_entry(head, struct reserve_ticket,
+					  list);
+		if (num_bytes >= ticket->bytes) {
+			trace_btrfs_space_reservation(fs_info, "space_info",
+						      space_info->flags,
+						      ticket->bytes, 1);
+			list_del_init(&ticket->list);
+			num_bytes -= ticket->bytes;
+			space_info->bytes_may_use += ticket->bytes;
+			ticket->bytes = 0;
+			wake_up(&ticket->wait);
+		} else {
+			trace_btrfs_space_reservation(fs_info, "space_info",
+						      space_info->flags,
+						      num_bytes, 1);
+			space_info->bytes_may_use += num_bytes;
+			ticket->bytes -= num_bytes;
+			num_bytes = 0;
+		}
+	}
+
+	if (num_bytes && head == &space_info->priority_tickets) {
+		head = &space_info->tickets;
+		goto again;
+	}
+}
+
 static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_rsv *block_rsv,
 				    struct btrfs_block_rsv *dest, u64 num_bytes)
@@ -5182,13 +5401,9 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 			}
 			spin_unlock(&dest->lock);
 		}
-		if (num_bytes) {
-			spin_lock(&space_info->lock);
-			space_info->bytes_may_use -= num_bytes;
-			trace_btrfs_space_reservation(fs_info, "space_info",
-					space_info->flags, num_bytes, 0);
-			spin_unlock(&space_info->lock);
-		}
+		if (num_bytes)
+			space_info_add_old_bytes(fs_info, space_info,
+						 num_bytes);
 	}
 }
 
@@ -6346,17 +6561,29 @@ static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end,
 			readonly = true;
 		}
 		spin_unlock(&cache->lock);
-		if (!readonly && global_rsv->space_info == space_info) {
+		if (!readonly && return_free_space &&
+		    global_rsv->space_info == space_info) {
+			u64 to_add = len;
+			WARN_ON(!return_free_space);
 			spin_lock(&global_rsv->lock);
 			if (!global_rsv->full) {
-				len = min(len, global_rsv->size -
-					  global_rsv->reserved);
-				global_rsv->reserved += len;
-				space_info->bytes_may_use += len;
+				to_add = min(len, global_rsv->size -
+					     global_rsv->reserved);
+				global_rsv->reserved += to_add;
+				space_info->bytes_may_use += to_add;
 				if (global_rsv->reserved >= global_rsv->size)
 					global_rsv->full = 1;
+				trace_btrfs_space_reservation(fs_info,
+							      "space_info",
+							      space_info->flags,
+							      to_add, 1);
+				len -= to_add;
 			}
 			spin_unlock(&global_rsv->lock);
+			/* Add to any tickets we may have */
+			if (len)
+				space_info_add_new_bytes(fs_info, space_info,
+							 len);
 		}
 		spin_unlock(&space_info->lock);
 	}
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH V2] Btrfs: introduce ticketed enospc infrastructure
  2016-05-17 17:30   ` [PATCH V2] " Josef Bacik
@ 2016-05-18 11:24     ` Austin S. Hemmelgarn
  2016-05-19 12:47       ` Austin S. Hemmelgarn
  2016-05-18 22:46     ` David Sterba
  1 sibling, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-18 11:24 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs

On 2016-05-17 13:30, Josef Bacik wrote:
> Our enospc flushing sucks.  It is born from a time where we were early
> enospc'ing constantly because multiple threads would race in for the same
> reservation and randomly starve other ones out.  So I came up with this solution
> to block any other reservations from happening while one guy tried to flush
> stuff to satisfy his reservation.  This gives us pretty good correctness, but
> completely crap latency.
>
> The solution I've come up with is ticketed reservations.  Basically we try to
> make our reservation, and if we can't we put a ticket on a list in order and
> kick off an async flusher thread.  This async flusher thread does the same old
> flushing we always did, just asynchronously.  As space is freed and added back
> to the space_info it checks and sees if we have any tickets that need
> satisfying, and adds space to the tickets and wakes up anything we've satisfied.
>
> Once the flusher thread stops making progress it wakes up all the current
> tickets and tells them to take a hike.
>
> There is a priority list for things that can't flush, since the async flusher
> could do anything we need to avoid deadlocks.  These guys get priority for
> having their reservation made, and will still do manual flushing themselves in
> case the async flusher isn't running.
>
> This patch gives us significantly better latencies.  Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
I've had this running on my test system (which is _finally_ working 
again) for about 16 hours now, nothing is breaking, and a number of the 
tests are actually completing marginally faster, so you can add:

Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2] Btrfs: introduce ticketed enospc infrastructure
  2016-05-17 17:30   ` [PATCH V2] " Josef Bacik
  2016-05-18 11:24     ` Austin S. Hemmelgarn
@ 2016-05-18 22:46     ` David Sterba
  1 sibling, 0 replies; 26+ messages in thread
From: David Sterba @ 2016-05-18 22:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Tue, May 17, 2016 at 01:30:55PM -0400, Josef Bacik wrote:
> V1->V2:
> -fixed a check in space_info_add_old_bytes where we didn't take into account
>  bytes_may_used for the space used.
> -don't count ticket->bytes when checking overcommit.

I still see the warning in generic/333, same as with v1.

2016-05-19T00:44:33.232414+02:00 ben kernel: ------------[ cut here ]------------
2016-05-19T00:44:33.232446+02:00 ben kernel: WARNING: CPU: 2 PID: 3559 at fs/btrfs/extent-tree.c:2956 btrfs_run_delayed_refs+0x27d/0x2a0 [btrfs]
2016-05-19T00:44:33.232448+02:00 ben kernel: BTRFS: Transaction aborted (error -28)
2016-05-19T00:44:33.242936+02:00 ben kernel: Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache af_packet bridge stp llc iscsi_ibft iscsi_boot_sysfs msr btrfs radeon i2c_algo_bit drm_kms_helper coretemp kvm_intel syscopyarea sysfillrect sysimgblt fb_sys_fops ttm xor kvm raid6_pq drm e1000e dm_mod pcspkr ptp iTCO_wdt gpio_ich i5000_edac pps_core edac_core i2c_i801 iTCO_vendor_support lpc_ich mfd_core ppdev parport_pc intel_rng i5k_amb parport irqbypass serio_raw button ata_generic uhci_hcd ehci_pci ehci_hcd ata_piix usbcore aic79xx usb_common scsi_transport_spi sg [last unloaded: floppy]
2016-05-19T00:44:33.242960+02:00 ben kernel: CPU: 2 PID: 3559 Comm: btrfs-transacti Not tainted 4.6.0-rc5-vanilla+ #14
2016-05-19T00:44:33.242963+02:00 ben kernel: Hardware name: Supermicro X7DB8/X7DB8, BIOS 6.00 07/26/2006
2016-05-19T00:44:33.242964+02:00 ben kernel: 0000000000000000 ffff88017de1bd20 ffffffff8139d86a ffff88017de1bd70
2016-05-19T00:44:33.242966+02:00 ben kernel: 0000000000000000 ffff88017de1bd60 ffffffff81080941 00000b8c3279c6c0
2016-05-19T00:44:33.242967+02:00 ben kernel: ffff8802274fb0a0 ffff8800b5f8d000 ffff88023279c6c0 ffff8800b5f8d000
2016-05-19T00:44:33.242969+02:00 ben kernel: Call Trace:
2016-05-19T00:44:33.242970+02:00 ben kernel: [<ffffffff8139d86a>] dump_stack+0x63/0x89
2016-05-19T00:44:33.242972+02:00 ben kernel: [<ffffffff81080941>] __warn+0xd1/0xf0
2016-05-19T00:44:33.242973+02:00 ben kernel: [<ffffffff810809af>] warn_slowpath_fmt+0x4f/0x60
2016-05-19T00:44:33.242975+02:00 ben kernel: [<ffffffffa0b1a09d>] btrfs_run_delayed_refs+0x27d/0x2a0 [btrfs]
2016-05-19T00:44:33.242977+02:00 ben kernel: [<ffffffff810ecef8>] ? del_timer_sync+0x48/0x50
2016-05-19T00:44:33.242978+02:00 ben kernel: [<ffffffffa0b300d3>] btrfs_commit_transaction+0x43/0xae0 [btrfs]
2016-05-19T00:44:33.242979+02:00 ben kernel: [<ffffffffa0b2b1cc>] transaction_kthread+0x1cc/0x1f0 [btrfs]
2016-05-19T00:44:33.242981+02:00 ben kernel: [<ffffffffa0b2b000>] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
2016-05-19T00:44:33.242983+02:00 ben kernel: [<ffffffff8109fab9>] kthread+0xc9/0xe0
2016-05-19T00:44:33.242984+02:00 ben kernel: [<ffffffff816b2462>] ret_from_fork+0x22/0x40
2016-05-19T00:44:33.242986+02:00 ben kernel: [<ffffffff8109f9f0>] ? kthread_create_on_node+0x180/0x180
2016-05-19T00:44:33.242988+02:00 ben kernel: ---[ end trace a9fa5269514f9444 ]---
2016-05-19T00:44:33.242990+02:00 ben kernel: BTRFS: error (device sdc) in btrfs_run_delayed_refs:2956: errno=-28 No space left
2016-05-19T00:44:33.242992+02:00 ben kernel: BTRFS info (device sdc): forced readonly

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V2] Btrfs: introduce ticketed enospc infrastructure
  2016-05-18 11:24     ` Austin S. Hemmelgarn
@ 2016-05-19 12:47       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2016-05-19 12:47 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs

On 2016-05-18 07:24, Austin S. Hemmelgarn wrote:
> On 2016-05-17 13:30, Josef Bacik wrote:
>> Our enospc flushing sucks.  It is born from a time where we were early
>> enospc'ing constantly because multiple threads would race in for the same
>> reservation and randomly starve other ones out.  So I came up with
>> this solution
>> to block any other reservations from happening while one guy tried to
>> flush
>> stuff to satisfy his reservation.  This gives us pretty good
>> correctness, but
>> completely crap latency.
>>
>> The solution I've come up with is ticketed reservations.  Basically we
>> try to
>> make our reservation, and if we can't we put a ticket on a list in
>> order and
>> kick off an async flusher thread.  This async flusher thread does the
>> same old
>> flushing we always did, just asynchronously.  As space is freed and
>> added back
>> to the space_info it checks and sees if we have any tickets that need
>> satisfying, and adds space to the tickets and wakes up anything we've
>> satisfied.
>>
>> Once the flusher thread stops making progress it wakes up all the current
>> tickets and tells them to take a hike.
>>
>> There is a priority list for things that can't flush, since the async
>> flusher
>> could do anything we need to avoid deadlocks.  These guys get priority
>> for
>> having their reservation made, and will still do manual flushing
>> themselves in
>> case the async flusher isn't running.
>>
>> This patch gives us significantly better latencies.  Thanks,
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
> I've had this running on my test system (which is _finally_ working
> again) for about 16 hours now, nothing is breaking, and a number of the
> tests are actually completing marginally faster, so you can add:
>
> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Hmm, this is troubling: I just did some manual testing and hit the exact
same warning that David did, so it looks like I need to go do some more
thorough testing of my test system...

FWIW though, everything else does appear to be working correctly.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
  2016-03-25 17:25 ` [PATCH 05/14] Btrfs: warn_on for unaccounted spaces Josef Bacik
@ 2016-06-27  4:47   ` Qu Wenruo
  2016-06-27 13:03     ` Chris Mason
  0 siblings, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2016-06-27  4:47 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs

Hi Josef,

Would you please move this patch to the first of the patchset?

It's making bisect quite hard, as it will always stop at this patch, making
it hard to check whether a failure is a regression or an existing bug.

Thanks,
Qu

At 03/26/2016 01:25 AM, Josef Bacik wrote:
> These were hidden behind enospc_debug, which isn't helpful as they indicate
> actual bugs, unlike the rest of the enospc_debug stuff which is really debug
> information.  Thanks,
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 157a0b6..90ac821 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9633,13 +9633,15 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>  		space_info = list_entry(info->space_info.next,
>  					struct btrfs_space_info,
>  					list);
> -		if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
> -			if (WARN_ON(space_info->bytes_pinned > 0 ||
> +
> +		/*
> +		 * Do not hide this behind enospc_debug, this is actually
> +		 * important and indicates a real bug if this happens.
> +		 */
> +		if (WARN_ON(space_info->bytes_pinned > 0 ||
>  			    space_info->bytes_reserved > 0 ||
> -			    space_info->bytes_may_use > 0)) {
> -				dump_space_info(space_info, 0, 0);
> -			}
> -		}
> +			    space_info->bytes_may_use > 0))
> +			dump_space_info(space_info, 0, 0);
>  		list_del(&space_info->list);
>  		for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
>  			struct kobject *kobj;
>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
  2016-06-27  4:47   ` Qu Wenruo
@ 2016-06-27 13:03     ` Chris Mason
  2016-06-28  0:16       ` Qu Wenruo
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Mason @ 2016-06-27 13:03 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik, linux-btrfs



On 06/27/2016 12:47 AM, Qu Wenruo wrote:
> Hi Josef,
>
> Would you please move this patch to the first of the patchset?
>
> It's making bisect quite hard, as it will always stop at this patch,
> hard to check if it's a regression or existing bug.

That's a good idea.  Which workload are you having trouble with?

-chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
  2016-06-27 13:03     ` Chris Mason
@ 2016-06-28  0:16       ` Qu Wenruo
  0 siblings, 0 replies; 26+ messages in thread
From: Qu Wenruo @ 2016-06-28  0:16 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, linux-btrfs



At 06/27/2016 09:03 PM, Chris Mason wrote:
>
>
> On 06/27/2016 12:47 AM, Qu Wenruo wrote:
>> Hi Josef,
>>
>> Would you please move this patch to the first of the patchset?
>>
>> It's making bisect quite hard, as it will always stop at this patch,
>> hard to check if it's a regression or existing bug.
>
> That's a good idea.  Which workload are you having trouble with?
>
> -chris
>
>
The qgroup tests which hit EDQUOT.

After hitting EDQUOT, unmount will always trigger a kernel warning for
DATA space whose bytes_may_use is not zero.

The problem is long-standing; it seems buffered write doesn't clean up
its delalloc-reserved DATA space.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-06-28  0:16 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-25 17:25 [PATCH 00/14] Enospc rework Josef Bacik
2016-03-25 17:25 ` [PATCH 01/14] Btrfs: add bytes_readonly to the spaceinfo at once Josef Bacik
2016-03-25 17:25 ` [PATCH 02/14] Btrfs: fix callers of btrfs_block_rsv_migrate Josef Bacik
2016-03-25 17:25 ` [PATCH 03/14] Btrfs: always reserve metadata for delalloc extents Josef Bacik
2016-03-25 18:04   ` Liu Bo
2016-03-25 17:25 ` [PATCH 04/14] Btrfs: change delayed reservation fallback behavior Josef Bacik
2016-03-25 17:25 ` [PATCH 05/14] Btrfs: warn_on for unaccounted spaces Josef Bacik
2016-06-27  4:47   ` Qu Wenruo
2016-06-27 13:03     ` Chris Mason
2016-06-28  0:16       ` Qu Wenruo
2016-03-25 17:25 ` [PATCH 06/14] Btrfs: add tracepoint for adding block groups Josef Bacik
2016-03-25 17:25 ` [PATCH 07/14] Btrfs: introduce ticketed enospc infrastructure Josef Bacik
2016-05-09 21:29   ` Liu Bo
2016-05-17 17:30   ` [PATCH V2] " Josef Bacik
2016-05-18 11:24     ` Austin S. Hemmelgarn
2016-05-19 12:47       ` Austin S. Hemmelgarn
2016-05-18 22:46     ` David Sterba
2016-03-25 17:25 ` [PATCH 08/14] Btrfs: trace pinned extents Josef Bacik
2016-03-25 17:25 ` [PATCH 09/14] Btrfs: fix delalloc reservation amount tracepoint Josef Bacik
2016-03-25 17:25 ` [PATCH 10/14] Btrfs: add tracepoints for flush events Josef Bacik
2016-03-25 17:25 ` [PATCH 11/14] Btrfs: add fsid to some tracepoints Josef Bacik
2016-03-25 17:25 ` [PATCH 12/14] Btrfs: fix release reserved extents trace points Josef Bacik
2016-05-09 21:33   ` Liu Bo
2016-03-25 17:25 ` [PATCH 13/14] Btrfs: don't bother kicking async if there's nothing to reclaim Josef Bacik
2016-03-25 17:26 ` [PATCH 14/14] Btrfs: don't do nocow check unless we have to Josef Bacik
2016-03-25 17:50   ` Liu Bo
