* [PATCH 00/21] My current btrfs patch queue
@ 2017-09-29 19:43 Josef Bacik
  2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
                   ` (21 more replies)
  0 siblings, 22 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

This is my current set of outstanding patches.  A lot of these have had reviews
and I've incorporated the feedback.  They have been pretty thoroughly tested and
are pretty solid.

[PATCH 01/21] Btrfs: rework outstanding_extents
[PATCH 02/21] btrfs: add tracepoints for outstanding extents mods
[PATCH 03/21] btrfs: make the delalloc block rsv per inode

These are simplification patches.  The way we handled metadata reservation for
multiple extents for an inode was really complicated and fragile, so these
patches go a long way towards simplifying that code.  We only have one
outstanding_extents counter, and the rules for changing that counter are more
straightforward and logical.  There are some uglier cases still, but this is as
clean as I could think to do it at the moment.

[PATCH 04/21] btrfs: add ref-verify mount option
[PATCH 05/21] btrfs: pass root to various extent ref mod functions
[PATCH 06/21] Btrfs: add a extent ref verify tool

This is the ref verification set.  I've broken the patches up a little more
since I originally posted them, but it's a big feature so it's just going to be
a bit large unfortunately.  I also converted all of the printk() calls to
btrfs_err().  I folded in all of the various fixes that came from Marc Merlin's
usage of the patches.

[PATCH 07/21] Btrfs: only check delayed ref usage in
[PATCH 08/21] btrfs: add a helper to return a head ref
[PATCH 09/21] btrfs: move extent_op cleanup to a helper
[PATCH 10/21] btrfs: breakout empty head cleanup to a helper
[PATCH 11/21] btrfs: move ref_mod modification into the if (ref)
[PATCH 12/21] btrfs: move all ref head cleanup to the helper function
[PATCH 13/21] btrfs: remove delayed_ref_node from ref_head

Delayed refs cleanup.  run_delayed_refs was kind of complicated: we either ran
the actual delayed ref or we cleaned up the delayed_ref_head, and we did this
in different places depending on how things went.  These patches are mostly
moving code around, adding helpers to make it clear what's happening, and
making __btrfs_run_delayed_refs smaller and clearer.  I also made
btrfs_delayed_ref_head its own thing and took out the old delayed_ref_node that
was embedded in it.  That existed because heads and refs used to be on the same
list, but since that isn't the case anymore we don't need it to be that way.
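
As a rough illustration of the resulting shape (a userspace sketch with made-up
struct fields and stubbed helpers, not the kernel code), the loop either runs
the next ref on a head or, once the head is empty, hands the whole head to a
single cleanup helper:

#include <stdio.h>

struct ref_head {
	int nr_refs;			/* refs still queued on this head */
	int must_insert_reserved;	/* state the cleanup helper consumes */
};

static int run_one_delayed_ref(struct ref_head *head)
{
	printf("run one ref, %d left on this head\n", head->nr_refs - 1);
	head->nr_refs--;
	return 0;
}

static int cleanup_ref_head(struct ref_head *head)
{
	/* run any pending extent_op, unaccount and free the head, etc. */
	printf("cleanup head, must_insert_reserved=%d\n",
	       head->must_insert_reserved);
	return 0;
}

static int run_delayed_refs(struct ref_head *head)
{
	while (head->nr_refs > 0) {
		int ret = run_one_delayed_ref(head);

		if (ret)
			return ret;
	}
	/* the empty-head cleanup now lives in exactly one place */
	return cleanup_ref_head(head);
}

int main(void)
{
	struct ref_head head = { .nr_refs = 2, .must_insert_reserved = 1 };

	return run_delayed_refs(&head);
}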

[PATCH 14/21] btrfs: remove type argument from comp_tree_refs
[PATCH 15/21] btrfs: switch args for comp_*_refs
[PATCH 16/21] btrfs: add a comp_refs() helper
[PATCH 17/21] btrfs: track refs in a rb_tree instead of a list

With lots of snapshots we can end up in merge_delayed_refs for _ever_ because
we're constantly going over the list to find matches.  I converted the ref list
in the ref_head to an rb_tree so we can go straight to our refs, which makes
searching and on-the-fly merging happen immediately and makes post-processing
of the tree significantly faster.  Now I'm no longer seeing multi-minute commit
times with lots of snapshots.
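
To show the idea (purely a userspace sketch; the real code uses the kernel
rb_tree and the comp_refs() comparator added in the patches), keying the refs
by a comparator means a matching ref is found at insert time instead of by
walking a list:

#include <stdio.h>

/* simplified stand-in for a delayed ref; fields are illustrative only */
struct ref {
	unsigned long long root;
	unsigned long long parent;
	int action;
	struct ref *left, *right;
};

static int comp_refs(const struct ref *a, const struct ref *b)
{
	if (a->root != b->root)
		return a->root < b->root ? -1 : 1;
	if (a->parent != b->parent)
		return a->parent < b->parent ? -1 : 1;
	return 0;
}

/* insert @ref, or return the existing node it should be merged with */
static struct ref *tree_insert(struct ref **link, struct ref *ref)
{
	while (*link) {
		int cmp = comp_refs(ref, *link);

		if (cmp < 0)
			link = &(*link)->left;
		else if (cmp > 0)
			link = &(*link)->right;
		else
			return *link;	/* merge candidate found immediately */
	}
	*link = ref;
	return NULL;
}

int main(void)
{
	struct ref *tree = NULL;
	struct ref a = { .root = 5, .parent = 100, .action = 1 };
	struct ref b = { .root = 5, .parent = 100, .action = -1 };

	tree_insert(&tree, &a);
	if (tree_insert(&tree, &b))
		printf("found matching ref to merge on insert\n");
	return 0;
}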

[PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel
[PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in
[PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle
[PATCH 21/21] btrfs: add assertions for releasing trans handle

Fixes for various problems I found while running xfstests.  One is for 32bit
ioctls on a 64bit kernel, two are lockdep fixes, and the last one is just so I
can catch myself doing stupid stuff in the future.

Dave, this is based on for-next-20170929, and is in the master branch of my
btrfs-next tree if you want to just pull it.  Thanks,

Josef


* [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13  8:39   ` Nikolay Borisov
  2017-10-19  3:14   ` Edmund Nadolski
  2017-09-29 19:43 ` [PATCH 02/21] btrfs: add tracepoints for outstanding extents mods Josef Bacik
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Right now we jump through a lot of weird hoops around outstanding_extents in
order to keep the extent count consistent.  This is because we logically
transfer the outstanding_extents count from the initial reservation through
the set_delalloc bits.  This makes it pretty difficult to get a handle on how
and when we need to mess with outstanding_extents.

Fix this by revamping the rules of how we deal with outstanding_extents.  Now,
instead, everybody that is holding on to a delalloc extent is required to
increase the outstanding_extents count for itself.  This means we'll have
something like this:

btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
 btrfs_set_delalloc		- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extents = 1

for an initial file write.  Now take the append write, where we extend an
existing delalloc range while staying under the maximum extent size:

btrfs_delalloc_reserve_metadata - outstanding_extents = 2
  btrfs_set_delalloc
    btrfs_set_bit_hook		- outstanding_extents = 3
    btrfs_merge_bit_hook	- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extents = 1

In order to make the ordered extent transition we of course must now make
ordered extents carry their own outstanding_extents reservation, so for
cow_file_range we end up with:

btrfs_add_ordered_extent	- outstanding_extents = 2
clear_extent_bit		- outstanding_extents = 1
btrfs_remove_ordered_extent	- outstanding_extents = 0

This makes all manipulations of outstanding_extents much more explicit.  Every
successful call to btrfs_delalloc_reserve_metadata _must_ now be paired with a
call to btrfs_delalloc_release_extents, even in the error case, as that is what
drops the temporary outstanding_extents count taken at reserve time.
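
A minimal sketch of that calling convention follows.  These are userspace stubs
that only print what the real functions would do, and set_range_delalloc() is a
made-up stand-in for the delalloc-setting step, not an actual function:

#include <stdio.h>

static int btrfs_delalloc_reserve_metadata(unsigned long long bytes)
{
	printf("reserve metadata for %llu bytes, outstanding_extents bumped\n",
	       bytes);
	return 0;
}

static int set_range_delalloc(unsigned long long bytes)
{
	printf("mark %llu bytes delalloc, delalloc now owns an extent\n", bytes);
	return 0;	/* change to nonzero to exercise the error path */
}

static void btrfs_delalloc_release_extents(unsigned long long bytes)
{
	printf("drop the temporary outstanding_extents for %llu bytes\n", bytes);
}

static int buffered_write(unsigned long long bytes)
{
	int ret = btrfs_delalloc_reserve_metadata(bytes);

	if (ret)
		return ret;

	ret = set_range_delalloc(bytes);
	/*
	 * Success or failure, the extent count taken at reserve time must be
	 * dropped here; delalloc (or the ordered extent) carries its own
	 * count if the operation succeeded.
	 */
	btrfs_delalloc_release_extents(bytes);
	return ret;
}

int main(void)
{
	return buffered_write(4096);
}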

The drawback to this is that we are now much more likely to have transient
cases where outstanding_extents is much larger than it actually should be.
This could happen before as we manipulated the delalloc bits, but now it
happens on basically every write.  This may put more pressure on the ENOSPC
flushing code, but I think making this code simpler is worth the cost.  I have
another change coming to mitigate this side-effect somewhat.

I also added tracepoints for the counter manipulation.  These were used by a
bpf script I wrote to help track down leak issues.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/btrfs_inode.h       |  18 ++++++
 fs/btrfs/ctree.h             |   2 +
 fs/btrfs/extent-tree.c       | 139 ++++++++++++++++++++++++++++---------------
 fs/btrfs/file.c              |  22 +++----
 fs/btrfs/inode-map.c         |   3 +-
 fs/btrfs/inode.c             | 114 +++++++++++------------------------
 fs/btrfs/ioctl.c             |   2 +
 fs/btrfs/ordered-data.c      |  21 ++++++-
 fs/btrfs/relocation.c        |   3 +
 fs/btrfs/tests/inode-tests.c |  18 ++----
 10 files changed, 186 insertions(+), 156 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eccadb5f62a5..a6a22ef41f91 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -267,6 +267,24 @@ static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode)
 	return false;
 }
 
+static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
+						 int mod)
+{
+	ASSERT(spin_is_locked(&inode->lock));
+	inode->outstanding_extents += mod;
+	if (btrfs_is_free_space_inode(inode))
+		return;
+}
+
+static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
+					      int mod)
+{
+	ASSERT(spin_is_locked(&inode->lock));
+	inode->reserved_extents += mod;
+	if (btrfs_is_free_space_inode(inode))
+		return;
+}
+
 static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
 {
 	int ret = 0;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a7f68c304b4c..1262612fbf78 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2742,6 +2742,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 				     u64 *qgroup_reserved, bool use_global_rsv);
 void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
 				      struct btrfs_block_rsv *rsv);
+void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
+
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1a6aced00a19..aa0f5c8953b0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5971,42 +5971,31 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
 }
 
 /**
- * drop_outstanding_extent - drop an outstanding extent
+ * drop_over_reserved_extents - drop our extra extent reservations
  * @inode: the inode we're dropping the extent for
- * @num_bytes: the number of bytes we're releasing.
  *
- * This is called when we are freeing up an outstanding extent, either called
- * after an error or after an extent is written.  This will return the number of
- * reserved extents that need to be freed.  This must be called with
- * BTRFS_I(inode)->lock held.
+ * We reserve extents we may use, but they may have been merged with other
+ * extents and we may not need the extra reservation.
+ *
+ * We also call this when we've completed io to an extent or had an error and
+ * cleared the outstanding extent, in either case we no longer need our
+ * reservation and can drop the excess.
  */
-static unsigned drop_outstanding_extent(struct btrfs_inode *inode,
-		u64 num_bytes)
+static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
 {
-	unsigned drop_inode_space = 0;
-	unsigned dropped_extents = 0;
-	unsigned num_extents;
+	unsigned num_extents = 0;
 
-	num_extents = count_max_extents(num_bytes);
-	ASSERT(num_extents);
-	ASSERT(inode->outstanding_extents >= num_extents);
-	inode->outstanding_extents -= num_extents;
+	if (inode->reserved_extents > inode->outstanding_extents) {
+		num_extents = inode->reserved_extents -
+			inode->outstanding_extents;
+		btrfs_mod_reserved_extents(inode, -num_extents);
+	}
 
 	if (inode->outstanding_extents == 0 &&
 	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
 			       &inode->runtime_flags))
-		drop_inode_space = 1;
-
-	/*
-	 * If we have more or the same amount of outstanding extents than we have
-	 * reserved then we need to leave the reserved extents count alone.
-	 */
-	if (inode->outstanding_extents >= inode->reserved_extents)
-		return drop_inode_space;
-
-	dropped_extents = inode->reserved_extents - inode->outstanding_extents;
-	inode->reserved_extents -= dropped_extents;
-	return dropped_extents + drop_inode_space;
+		num_extents++;
+	return num_extents;
 }
 
 /**
@@ -6061,13 +6050,15 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	struct btrfs_block_rsv *block_rsv = &fs_info->delalloc_block_rsv;
 	u64 to_reserve = 0;
 	u64 csum_bytes;
-	unsigned nr_extents;
+	unsigned nr_extents, reserve_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
 	bool release_extra = false;
+	bool underflow = false;
+	bool did_retry = false;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -6092,18 +6083,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 		mutex_lock(&inode->delalloc_mutex);
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
-
+retry:
 	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
-	inode->outstanding_extents += nr_extents;
+	reserve_extents = nr_extents = count_max_extents(num_bytes);
+	btrfs_mod_outstanding_extents(inode, nr_extents);
 
-	nr_extents = 0;
-	if (inode->outstanding_extents > inode->reserved_extents)
-		nr_extents += inode->outstanding_extents -
+	/*
+	 * Because we add an outstanding extent for ordered before we clear
+	 * delalloc we will double count our outstanding extents slightly.  This
+	 * could mean that we transiently over-reserve, which could result in an
+	 * early ENOSPC if our timing is unlucky.  Keep track of the case that
+	 * we had a reservation underflow so we can retry if we fail.
+	 *
+	 * Keep in mind we can legitimately have more outstanding extents than
+	 * reserved because of fragmentation, so only allow a retry once.
+	 */
+	if (inode->outstanding_extents >
+	    inode->reserved_extents + nr_extents) {
+		reserve_extents = inode->outstanding_extents -
 			inode->reserved_extents;
+		underflow = true;
+	}
 
 	/* We always want to reserve a slot for updating the inode. */
-	to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
+	to_reserve = btrfs_calc_trans_metadata_size(fs_info,
+						    reserve_extents + 1);
 	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
 	csum_bytes = inode->csum_bytes;
 	spin_unlock(&inode->lock);
@@ -6128,7 +6132,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 		to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
 		release_extra = true;
 	}
-	inode->reserved_extents += nr_extents;
+	btrfs_mod_reserved_extents(inode, reserve_extents);
 	spin_unlock(&inode->lock);
 
 	if (delalloc_lock)
@@ -6144,7 +6148,10 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 
 out_fail:
 	spin_lock(&inode->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
+	nr_extents = count_max_extents(num_bytes);
+	btrfs_mod_outstanding_extents(inode, -nr_extents);
+
+	dropped = drop_over_reserved_extents(inode);
 	/*
 	 * If the inodes csum_bytes is the same as the original
 	 * csum_bytes then we know we haven't raced with any free()ers
@@ -6201,6 +6208,11 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 		trace_btrfs_space_reservation(fs_info, "delalloc",
 					      btrfs_ino(inode), to_free, 0);
 	}
+	if (underflow && !did_retry) {
+		did_retry = true;
+		underflow = false;
+		goto retry;
+	}
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
 	return ret;
@@ -6208,12 +6220,12 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 
 /**
  * btrfs_delalloc_release_metadata - release a metadata reservation for an inode
- * @inode: the inode to release the reservation for
- * @num_bytes: the number of bytes we're releasing
+ * @inode: the inode to release the reservation for.
+ * @num_bytes: the number of bytes we are releasing.
  *
  * This will release the metadata reservation for an inode.  This can be called
  * once we complete IO for a given set of bytes to release their metadata
- * reservations.
+ * reservations, or on error for the same reason.
  */
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
@@ -6223,8 +6235,7 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
 	spin_lock(&inode->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
-
+	dropped = drop_over_reserved_extents(inode);
 	if (num_bytes)
 		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
 	spin_unlock(&inode->lock);
@@ -6241,6 +6252,42 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
 }
 
 /**
+ * btrfs_delalloc_release_extents - release our outstanding_extents
+ * @inode: the inode to balance the reservation for.
+ * @num_bytes: the number of bytes we originally reserved with
+ *
+ * When we reserve space we increase outstanding_extents for the extents we may
+ * add.  Once we've set the range as delalloc or created our ordered extents we
+ * have outstanding_extents to track the real usage, so we use this to free our
+ * temporarily tracked outstanding_extents.  This _must_ be used in conjunction
+ * with btrfs_delalloc_reserve_metadata.
+ */
+void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
+	unsigned num_extents;
+	u64 to_free;
+	unsigned dropped;
+
+	spin_lock(&inode->lock);
+	num_extents = count_max_extents(num_bytes);
+	btrfs_mod_outstanding_extents(inode, -num_extents);
+	dropped = drop_over_reserved_extents(inode);
+	spin_unlock(&inode->lock);
+
+	if (!dropped)
+		return;
+
+	if (btrfs_is_testing(fs_info))
+		return;
+
+	to_free = btrfs_calc_trans_metadata_size(fs_info, dropped);
+	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
+				      to_free, 0);
+	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
+}
+
+/**
  * btrfs_delalloc_reserve_space - reserve data and metadata space for
  * delalloc
  * @inode: inode we're writing to
@@ -6284,10 +6331,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
  * @inode: inode we're releasing space for
  * @start: start position of the space already reserved
  * @len: the len of the space already reserved
- *
- * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
- * called in the case that we don't need the metadata AND data reservations
- * anymore.  So if there is an error or we insert an inline extent.
+ * @release_bytes: the len of the space we consumed or didn't use
  *
  * This function will release the metadata space that was not used and will
  * decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
@@ -6295,7 +6339,8 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
  * Also it will handle the qgroup reserved space.
  */
 void btrfs_delalloc_release_space(struct inode *inode,
-			struct extent_changeset *reserved, u64 start, u64 len)
+				  struct extent_changeset *reserved,
+				  u64 start, u64 len)
 {
 	btrfs_delalloc_release_metadata(BTRFS_I(inode), len);
 	btrfs_free_reserved_data_space(inode, reserved, start, len);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index aafcc785f840..2a7f1b27149b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1656,6 +1656,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			}
 		}
 
+		WARN_ON(reserve_bytes == 0);
 		ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
 				reserve_bytes);
 		if (ret) {
@@ -1679,8 +1680,11 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		ret = prepare_pages(inode, pages, num_pages,
 				    pos, write_bytes,
 				    force_page_uptodate);
-		if (ret)
+		if (ret) {
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+						       reserve_bytes);
 			break;
+		}
 
 		ret = lock_and_cleanup_extent_if_need(BTRFS_I(inode), pages,
 				num_pages, pos, write_bytes, &lockstart,
@@ -1688,6 +1692,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 		if (ret < 0) {
 			if (ret == -EAGAIN)
 				goto again;
+			btrfs_delalloc_release_extents(BTRFS_I(inode),
+						       reserve_bytes);
 			break;
 		} else if (ret > 0) {
 			need_unlock = true;
@@ -1718,23 +1724,10 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 						   PAGE_SIZE);
 		}
 
-		/*
-		 * If we had a short copy we need to release the excess delaloc
-		 * bytes we reserved.  We need to increment outstanding_extents
-		 * because btrfs_delalloc_release_space and
-		 * btrfs_delalloc_release_metadata will decrement it, but
-		 * we still have an outstanding extent for the chunk we actually
-		 * managed to copy.
-		 */
 		if (num_sectors > dirty_sectors) {
 			/* release everything except the sectors we dirtied */
 			release_bytes -= dirty_sectors <<
 						fs_info->sb->s_blocksize_bits;
-			if (copied > 0) {
-				spin_lock(&BTRFS_I(inode)->lock);
-				BTRFS_I(inode)->outstanding_extents++;
-				spin_unlock(&BTRFS_I(inode)->lock);
-			}
 			if (only_release_metadata) {
 				btrfs_delalloc_release_metadata(BTRFS_I(inode),
 								release_bytes);
@@ -1760,6 +1753,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 					     lockstart, lockend, &cached_state,
 					     GFP_NOFS);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes);
 		if (ret) {
 			btrfs_drop_pages(pages, num_pages);
 			break;
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index d02019747d00..022b19336fee 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -500,11 +500,12 @@ int btrfs_save_ino_cache(struct btrfs_root *root,
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_metadata(BTRFS_I(inode), prealloc);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), prealloc);
 		goto out_put;
 	}
 
 	ret = btrfs_write_out_ino_cache(root, trans, path, inode);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), prealloc);
 out_put:
 	iput(inode);
 out_release:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b728397ba6e1..33ba258815b2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -67,7 +67,6 @@ struct btrfs_iget_args {
 };
 
 struct btrfs_dio_data {
-	u64 outstanding_extents;
 	u64 reserve;
 	u64 unsubmitted_oe_range_start;
 	u64 unsubmitted_oe_range_end;
@@ -348,7 +347,6 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	}
 
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags);
-	btrfs_delalloc_release_metadata(BTRFS_I(inode), end + 1 - start);
 	btrfs_drop_extent_cache(BTRFS_I(inode), start, aligned_end - 1, 0);
 out:
 	/*
@@ -587,16 +585,21 @@ static noinline void compress_file_range(struct inode *inode,
 		}
 		if (ret <= 0) {
 			unsigned long clear_flags = EXTENT_DELALLOC |
-				EXTENT_DELALLOC_NEW | EXTENT_DEFRAG;
+				EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
+				EXTENT_DO_ACCOUNTING;
 			unsigned long page_error_op;
 
-			clear_flags |= (ret < 0) ? EXTENT_DO_ACCOUNTING : 0;
 			page_error_op = ret < 0 ? PAGE_SET_ERROR : 0;
 
 			/*
 			 * inline extent creation worked or returned error,
 			 * we don't need to create any more async work items.
 			 * Unlock and free up our temp pages.
+			 *
+			 * We use DO_ACCOUNTING here because we need the
+			 * delalloc_release_metadata to be done _after_ we drop
+			 * our outstanding extent for clearing delalloc for this
+			 * range.
 			 */
 			extent_clear_unlock_delalloc(inode, start, end, end,
 						     NULL, clear_flags,
@@ -605,10 +608,6 @@ static noinline void compress_file_range(struct inode *inode,
 						     PAGE_SET_WRITEBACK |
 						     page_error_op |
 						     PAGE_END_WRITEBACK);
-			if (ret == 0)
-				btrfs_free_reserved_data_space_noquota(inode,
-							       start,
-							       end - start + 1);
 			goto free_pages_out;
 		}
 	}
@@ -985,15 +984,19 @@ static noinline int cow_file_range(struct inode *inode,
 		ret = cow_file_range_inline(root, inode, start, end, 0,
 					BTRFS_COMPRESS_NONE, NULL);
 		if (ret == 0) {
+			/*
+			 * We use DO_ACCOUNTING here because we need the
+			 * delalloc_release_metadata to be run _after_ we drop
+			 * our outstanding extent for clearing delalloc for this
+			 * range.
+			 */
 			extent_clear_unlock_delalloc(inode, start, end,
 				     delalloc_end, NULL,
 				     EXTENT_LOCKED | EXTENT_DELALLOC |
-				     EXTENT_DELALLOC_NEW |
-				     EXTENT_DEFRAG, PAGE_UNLOCK |
+				     EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
+				     EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
 				     PAGE_CLEAR_DIRTY | PAGE_SET_WRITEBACK |
 				     PAGE_END_WRITEBACK);
-			btrfs_free_reserved_data_space_noquota(inode, start,
-						end - start + 1);
 			*nr_written = *nr_written +
 			     (end - start + PAGE_SIZE) / PAGE_SIZE;
 			*page_started = 1;
@@ -1631,7 +1634,7 @@ static void btrfs_split_extent_hook(void *private_data,
 	}
 
 	spin_lock(&BTRFS_I(inode)->lock);
-	BTRFS_I(inode)->outstanding_extents++;
+	btrfs_mod_outstanding_extents(BTRFS_I(inode), 1);
 	spin_unlock(&BTRFS_I(inode)->lock);
 }
 
@@ -1661,7 +1664,7 @@ static void btrfs_merge_extent_hook(void *private_data,
 	/* we're not bigger than the max, unreserve the space and go */
 	if (new_size <= BTRFS_MAX_EXTENT_SIZE) {
 		spin_lock(&BTRFS_I(inode)->lock);
-		BTRFS_I(inode)->outstanding_extents--;
+		btrfs_mod_outstanding_extents(BTRFS_I(inode), -1);
 		spin_unlock(&BTRFS_I(inode)->lock);
 		return;
 	}
@@ -1692,7 +1695,7 @@ static void btrfs_merge_extent_hook(void *private_data,
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
-	BTRFS_I(inode)->outstanding_extents--;
+	btrfs_mod_outstanding_extents(BTRFS_I(inode), -1);
 	spin_unlock(&BTRFS_I(inode)->lock);
 }
 
@@ -1762,15 +1765,12 @@ static void btrfs_set_bit_hook(void *private_data,
 	if (!(state->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
 		u64 len = state->end + 1 - state->start;
+		u32 num_extents = count_max_extents(len);
 		bool do_list = !btrfs_is_free_space_inode(BTRFS_I(inode));
 
-		if (*bits & EXTENT_FIRST_DELALLOC) {
-			*bits &= ~EXTENT_FIRST_DELALLOC;
-		} else {
-			spin_lock(&BTRFS_I(inode)->lock);
-			BTRFS_I(inode)->outstanding_extents++;
-			spin_unlock(&BTRFS_I(inode)->lock);
-		}
+		spin_lock(&BTRFS_I(inode)->lock);
+		btrfs_mod_outstanding_extents(BTRFS_I(inode), num_extents);
+		spin_unlock(&BTRFS_I(inode)->lock);
 
 		/* For sanity tests */
 		if (btrfs_is_testing(fs_info))
@@ -1824,13 +1824,9 @@ static void btrfs_clear_bit_hook(void *private_data,
 		struct btrfs_root *root = inode->root;
 		bool do_list = !btrfs_is_free_space_inode(inode);
 
-		if (*bits & EXTENT_FIRST_DELALLOC) {
-			*bits &= ~EXTENT_FIRST_DELALLOC;
-		} else if (!(*bits & EXTENT_CLEAR_META_RESV)) {
-			spin_lock(&inode->lock);
-			inode->outstanding_extents -= num_extents;
-			spin_unlock(&inode->lock);
-		}
+		spin_lock(&inode->lock);
+		btrfs_mod_outstanding_extents(inode, -num_extents);
+		spin_unlock(&inode->lock);
 
 		/*
 		 * We don't reserve metadata space for space cache inodes so we
@@ -2101,6 +2097,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 				  0);
 	ClearPageChecked(page);
 	set_page_dirty(page);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 out:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start, page_end,
 			     &cached_state, GFP_NOFS);
@@ -3054,9 +3051,6 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 				 0, &cached_state, GFP_NOFS);
 	}
 
-	if (root != fs_info->tree_root)
-		btrfs_delalloc_release_metadata(BTRFS_I(inode),
-				ordered_extent->len);
 	if (trans)
 		btrfs_end_transaction(trans);
 
@@ -4797,8 +4791,11 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	    (!len || ((len & (blocksize - 1)) == 0)))
 		goto out;
 
+	block_start = round_down(from, blocksize);
+	block_end = block_start + blocksize - 1;
+
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
-			round_down(from, blocksize), blocksize);
+					   block_start, blocksize);
 	if (ret)
 		goto out;
 
@@ -4806,15 +4803,12 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	page = find_or_create_page(mapping, index, mask);
 	if (!page) {
 		btrfs_delalloc_release_space(inode, data_reserved,
-				round_down(from, blocksize),
-				blocksize);
+					     block_start, blocksize);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize);
 		ret = -ENOMEM;
 		goto out;
 	}
 
-	block_start = round_down(from, blocksize);
-	block_end = block_start + blocksize - 1;
-
 	if (!PageUptodate(page)) {
 		ret = btrfs_readpage(NULL, page);
 		lock_page(page);
@@ -4879,6 +4873,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 	if (ret)
 		btrfs_delalloc_release_space(inode, data_reserved, block_start,
 					     blocksize);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize);
 	unlock_page(page);
 	put_page(page);
 out:
@@ -7793,33 +7788,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 	return em;
 }
 
-static void adjust_dio_outstanding_extents(struct inode *inode,
-					   struct btrfs_dio_data *dio_data,
-					   const u64 len)
-{
-	unsigned num_extents = count_max_extents(len);
-
-	/*
-	 * If we have an outstanding_extents count still set then we're
-	 * within our reservation, otherwise we need to adjust our inode
-	 * counter appropriately.
-	 */
-	if (dio_data->outstanding_extents >= num_extents) {
-		dio_data->outstanding_extents -= num_extents;
-	} else {
-		/*
-		 * If dio write length has been split due to no large enough
-		 * contiguous space, we need to compensate our inode counter
-		 * appropriately.
-		 */
-		u64 num_needed = num_extents - dio_data->outstanding_extents;
-
-		spin_lock(&BTRFS_I(inode)->lock);
-		BTRFS_I(inode)->outstanding_extents += num_needed;
-		spin_unlock(&BTRFS_I(inode)->lock);
-	}
-}
-
 static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 				   struct buffer_head *bh_result, int create)
 {
@@ -7981,7 +7949,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 		if (!dio_data->overwrite && start + len > i_size_read(inode))
 			i_size_write(inode, start + len);
 
-		adjust_dio_outstanding_extents(inode, dio_data, len);
 		WARN_ON(dio_data->reserve < len);
 		dio_data->reserve -= len;
 		dio_data->unsubmitted_oe_range_end = start + len;
@@ -8011,14 +7978,6 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 err:
 	if (dio_data)
 		current->journal_info = dio_data;
-	/*
-	 * Compensate the delalloc release we do in btrfs_direct_IO() when we
-	 * write less data then expected, so that we don't underflow our inode's
-	 * outstanding extents counter.
-	 */
-	if (create && dio_data)
-		adjust_dio_outstanding_extents(inode, dio_data, len);
-
 	return ret;
 }
 
@@ -8863,7 +8822,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 						   offset, count);
 		if (ret)
 			goto out;
-		dio_data.outstanding_extents = count_max_extents(count);
 
 		/*
 		 * We need to know how many extents we reserved so that we can
@@ -8890,6 +8848,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
 		current->journal_info = NULL;
+		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 		if (ret < 0 && ret != -EIOCBQUEUED) {
 			if (dio_data.reserve)
 				btrfs_delalloc_release_space(inode, data_reserved,
@@ -9227,9 +9186,6 @@ int btrfs_page_mkwrite(struct vm_fault *vmf)
 					  fs_info->sectorsize);
 		if (reserved_space < PAGE_SIZE) {
 			end = page_start + reserved_space - 1;
-			spin_lock(&BTRFS_I(inode)->lock);
-			BTRFS_I(inode)->outstanding_extents++;
-			spin_unlock(&BTRFS_I(inode)->lock);
 			btrfs_delalloc_release_space(inode, data_reserved,
 					page_start, PAGE_SIZE - reserved_space);
 		}
@@ -9281,12 +9237,14 @@ int btrfs_page_mkwrite(struct vm_fault *vmf)
 
 out_unlock:
 	if (!ret) {
+		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 		sb_end_pagefault(inode->i_sb);
 		extent_changeset_free(data_reserved);
 		return VM_FAULT_LOCKED;
 	}
 	unlock_page(page);
 out:
+	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	btrfs_delalloc_release_space(inode, data_reserved, page_start,
 				     reserved_space);
 out_noreserve:
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 31407c62da63..17059aa5564f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1204,6 +1204,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 		unlock_page(pages[i]);
 		put_page(pages[i]);
 	}
+	btrfs_delalloc_release_extents(BTRFS_I(inode), page_cnt << PAGE_SHIFT);
 	extent_changeset_free(data_reserved);
 	return i_done;
 out:
@@ -1214,6 +1215,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 	btrfs_delalloc_release_space(inode, data_reserved,
 			start_index << PAGE_SHIFT,
 			page_cnt << PAGE_SHIFT);
+	btrfs_delalloc_release_extents(BTRFS_I(inode), page_cnt << PAGE_SHIFT);
 	extent_changeset_free(data_reserved);
 	return ret;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a3aca495e33e..5b311aeddcc8 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -242,6 +242,15 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	}
 	spin_unlock(&root->ordered_extent_lock);
 
+	/*
+	 * We don't need the count_max_extents here, we can assume that all of
+	 * that work has been done at higher layers, so this is truly the
+	 * smallest the extent is going to get.
+	 */
+	spin_lock(&BTRFS_I(inode)->lock);
+	btrfs_mod_outstanding_extents(BTRFS_I(inode), 1);
+	spin_unlock(&BTRFS_I(inode)->lock);
+
 	return 0;
 }
 
@@ -591,11 +600,19 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct btrfs_ordered_inode_tree *tree;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
+	struct btrfs_root *root = btrfs_inode->root;
 	struct rb_node *node;
 	bool dec_pending_ordered = false;
 
-	tree = &BTRFS_I(inode)->ordered_tree;
+	/* This is paired with btrfs_add_ordered_extent. */
+	spin_lock(&btrfs_inode->lock);
+	btrfs_mod_outstanding_extents(btrfs_inode, -1);
+	spin_unlock(&btrfs_inode->lock);
+	if (root != fs_info->tree_root)
+		btrfs_delalloc_release_metadata(btrfs_inode, entry->len);
+
+	tree = &btrfs_inode->ordered_tree;
 	spin_lock_irq(&tree->lock);
 	node = &entry->rb_node;
 	rb_erase(node, &tree->tree);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9841faef08ea..53e192647339 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3246,6 +3246,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 				put_page(page);
 				btrfs_delalloc_release_metadata(BTRFS_I(inode),
 							PAGE_SIZE);
+				btrfs_delalloc_release_extents(BTRFS_I(inode),
+							       PAGE_SIZE);
 				ret = -EIO;
 				goto out;
 			}
@@ -3275,6 +3277,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		put_page(page);
 
 		index++;
+		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 		balance_dirty_pages_ratelimited(inode->i_mapping);
 		btrfs_throttle(fs_info);
 	}
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 8c91d03cc82d..11c77eafde00 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -968,7 +968,6 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	btrfs_test_inode_set_ops(inode);
 
 	/* [BTRFS_MAX_EXTENT_SIZE] */
-	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
 					NULL, 0);
 	if (ret) {
@@ -983,7 +982,6 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 
 	/* [BTRFS_MAX_EXTENT_SIZE][sectorsize] */
-	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
 					BTRFS_MAX_EXTENT_SIZE + sectorsize - 1,
 					NULL, 0);
@@ -1003,7 +1001,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 			       BTRFS_MAX_EXTENT_SIZE >> 1,
 			       (BTRFS_MAX_EXTENT_SIZE >> 1) + sectorsize - 1,
 			       EXTENT_DELALLOC | EXTENT_DIRTY |
-			       EXTENT_UPTODATE | EXTENT_DO_ACCOUNTING, 0, 0,
+			       EXTENT_UPTODATE, 0, 0,
 			       NULL, GFP_KERNEL);
 	if (ret) {
 		test_msg("clear_extent_bit returned %d\n", ret);
@@ -1017,7 +1015,6 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	}
 
 	/* [BTRFS_MAX_EXTENT_SIZE][sectorsize] */
-	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
 					(BTRFS_MAX_EXTENT_SIZE >> 1)
 					+ sectorsize - 1,
@@ -1035,12 +1032,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 
 	/*
 	 * [BTRFS_MAX_EXTENT_SIZE+sectorsize][sectorsize HOLE][BTRFS_MAX_EXTENT_SIZE+sectorsize]
-	 *
-	 * I'm artificially adding 2 to outstanding_extents because in the
-	 * buffered IO case we'd add things up as we go, but I don't feel like
-	 * doing that here, this isn't the interesting case we want to test.
 	 */
-	BTRFS_I(inode)->outstanding_extents += 2;
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize,
 			(BTRFS_MAX_EXTENT_SIZE << 1) + 3 * sectorsize - 1,
@@ -1059,7 +1051,6 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	/*
 	* [BTRFS_MAX_EXTENT_SIZE+sectorsize][sectorsize][BTRFS_MAX_EXTENT_SIZE+sectorsize]
 	*/
-	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + sectorsize,
 			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL, 0);
@@ -1079,7 +1070,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 			       BTRFS_MAX_EXTENT_SIZE + sectorsize,
 			       BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1,
 			       EXTENT_DIRTY | EXTENT_DELALLOC |
-			       EXTENT_DO_ACCOUNTING | EXTENT_UPTODATE, 0, 0,
+			       EXTENT_UPTODATE, 0, 0,
 			       NULL, GFP_KERNEL);
 	if (ret) {
 		test_msg("clear_extent_bit returned %d\n", ret);
@@ -1096,7 +1087,6 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	 * Refill the hole again just for good measure, because I thought it
 	 * might fail and I'd rather satisfy my paranoia at this point.
 	 */
-	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + sectorsize,
 			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL, 0);
@@ -1114,7 +1104,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	/* Empty */
 	ret = clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
 			       EXTENT_DIRTY | EXTENT_DELALLOC |
-			       EXTENT_DO_ACCOUNTING | EXTENT_UPTODATE, 0, 0,
+			       EXTENT_UPTODATE, 0, 0,
 			       NULL, GFP_KERNEL);
 	if (ret) {
 		test_msg("clear_extent_bit returned %d\n", ret);
@@ -1131,7 +1121,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	if (ret)
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
 				 EXTENT_DIRTY | EXTENT_DELALLOC |
-				 EXTENT_DO_ACCOUNTING | EXTENT_UPTODATE, 0, 0,
+				 EXTENT_UPTODATE, 0, 0,
 				 NULL, GFP_KERNEL);
 	iput(inode);
 	btrfs_free_dummy_root(root);
-- 
2.7.4



* [PATCH 02/21] btrfs: add tracepoints for outstanding extents mods
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
  2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-09-29 19:43 ` [PATCH 03/21] btrfs: make the delalloc block rsv per inode Josef Bacik
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

This is handy for tracing problems with modifying the outstanding
extents counters.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/btrfs_inode.h       |  2 ++
 include/trace/events/btrfs.h | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index a6a22ef41f91..22daa79e77b8 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -274,6 +274,8 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
 	inode->outstanding_extents += mod;
 	if (btrfs_is_free_space_inode(inode))
 		return;
+	trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode),
+						  mod);
 }
 
 static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 77437f545c63..f0a374a720b0 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1698,6 +1698,27 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert,
 	TP_ARGS(fs_info, oldref, newref, tree_size)
 );
 
+TRACE_EVENT(btrfs_inode_mod_outstanding_extents,
+	TP_PROTO(struct btrfs_root *root, u64 ino, int mod),
+
+	TP_ARGS(root, ino, mod),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64, root_objectid	)
+		__field(	u64, ino		)
+		__field(	int, mod		)
+	),
+
+	TP_fast_assign_btrfs(root->fs_info,
+		__entry->root_objectid	= root->objectid;
+		__entry->ino		= ino;
+		__entry->mod		= mod;
+	),
+
+	TP_printk_btrfs("root = %llu(%s) ino = %llu mod = %d",
+			show_root_type(__entry->root_objectid),
+			(unsigned long long)__entry->ino, __entry->mod)
+);
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
2.7.4



* [PATCH 03/21] btrfs: make the delalloc block rsv per inode
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
  2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
  2017-09-29 19:43 ` [PATCH 02/21] btrfs: add tracepoints for outstanding extents mods Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 11:47   ` Nikolay Borisov
  2017-09-29 19:43 ` [PATCH 04/21] btrfs: add ref-verify mount option Josef Bacik
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years.  There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.

Fix this by making the delalloc block rsv per-inode.  This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.
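
A userspace sketch of that model (the size formula below is a simplified
assumption, not the exact btrfs_calc_trans_metadata_size() calculation): the
rsv's target size is recomputed from the current counters, and only the delta
against what is already held is reserved or released.

#include <stdio.h>

#define NODESIZE	16384ULL	/* assumed metadata node size */

struct inode_rsv {
	unsigned long long size;	/* what we should have reserved */
	unsigned long long reserved;	/* what we actually have reserved */
};

static unsigned long long calc_metadata_size(unsigned long long items)
{
	/* stand-in for btrfs_calc_trans_metadata_size(): items * some factor */
	return items * NODESIZE * 8;
}

/* recompute the target size from the current counters */
static void recalc_rsv_size(struct inode_rsv *rsv, unsigned outstanding_extents,
			    unsigned long long csum_leaves)
{
	unsigned long long size = 0;

	if (outstanding_extents)
		size = calc_metadata_size(outstanding_extents + 1);
	size += calc_metadata_size(csum_leaves);
	rsv->size = size;
}

/* reserve or release only the delta against what we already hold */
static void refill_or_release(struct inode_rsv *rsv)
{
	if (rsv->reserved < rsv->size) {
		printf("reserve %llu more bytes\n", rsv->size - rsv->reserved);
		rsv->reserved = rsv->size;
	} else if (rsv->reserved > rsv->size) {
		printf("release %llu excess bytes\n", rsv->reserved - rsv->size);
		rsv->reserved = rsv->size;
	}
}

int main(void)
{
	struct inode_rsv rsv = { 0, 0 };

	recalc_rsv_size(&rsv, 2, 1);	/* two outstanding extents, one csum leaf */
	refill_or_release(&rsv);
	recalc_rsv_size(&rsv, 1, 1);	/* one extent completed */
	refill_or_release(&rsv);
	return 0;
}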

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/btrfs_inode.h   |  27 ++--
 fs/btrfs/ctree.h         |   5 +-
 fs/btrfs/delayed-inode.c |  46 +------
 fs/btrfs/disk-io.c       |  18 ++-
 fs/btrfs/extent-tree.c   | 320 ++++++++++++++++-------------------------------
 fs/btrfs/inode.c         |  18 +--
 6 files changed, 141 insertions(+), 293 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 22daa79e77b8..f9c6887a8b6c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -36,14 +36,13 @@
 #define BTRFS_INODE_ORPHAN_META_RESERVED	1
 #define BTRFS_INODE_DUMMY			2
 #define BTRFS_INODE_IN_DEFRAG			3
-#define BTRFS_INODE_DELALLOC_META_RESERVED	4
-#define BTRFS_INODE_HAS_ORPHAN_ITEM		5
-#define BTRFS_INODE_HAS_ASYNC_EXTENT		6
-#define BTRFS_INODE_NEEDS_FULL_SYNC		7
-#define BTRFS_INODE_COPY_EVERYTHING		8
-#define BTRFS_INODE_IN_DELALLOC_LIST		9
-#define BTRFS_INODE_READDIO_NEED_LOCK		10
-#define BTRFS_INODE_HAS_PROPS		        11
+#define BTRFS_INODE_HAS_ORPHAN_ITEM		4
+#define BTRFS_INODE_HAS_ASYNC_EXTENT		5
+#define BTRFS_INODE_NEEDS_FULL_SYNC		6
+#define BTRFS_INODE_COPY_EVERYTHING		7
+#define BTRFS_INODE_IN_DELALLOC_LIST		8
+#define BTRFS_INODE_READDIO_NEED_LOCK		9
+#define BTRFS_INODE_HAS_PROPS		        10
 
 /* in memory btrfs inode */
 struct btrfs_inode {
@@ -176,7 +175,8 @@ struct btrfs_inode {
 	 * of extent items we've reserved metadata for.
 	 */
 	unsigned outstanding_extents;
-	unsigned reserved_extents;
+
+	struct btrfs_block_rsv block_rsv;
 
 	/*
 	 * Cached values of inode properties
@@ -278,15 +278,6 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
 						  mod);
 }
 
-static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
-					      int mod)
-{
-	ASSERT(spin_is_locked(&inode->lock));
-	inode->reserved_extents += mod;
-	if (btrfs_is_free_space_inode(inode))
-		return;
-}
-
 static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
 {
 	int ret = 0;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1262612fbf78..93e767e16a43 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -763,8 +763,6 @@ struct btrfs_fs_info {
 	 * delayed dir index item
 	 */
 	struct btrfs_block_rsv global_block_rsv;
-	/* block reservation for delay allocation */
-	struct btrfs_block_rsv delalloc_block_rsv;
 	/* block reservation for metadata operations */
 	struct btrfs_block_rsv trans_block_rsv;
 	/* block reservation for chunk tree */
@@ -2751,6 +2749,9 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
 					      unsigned short type);
+void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_rsv *rsv,
+				   unsigned short type);
 void btrfs_free_block_rsv(struct btrfs_fs_info *fs_info,
 			  struct btrfs_block_rsv *rsv);
 void __btrfs_free_block_rsv(struct btrfs_block_rsv *rsv);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 19e4ad2f3f2e..5d73f79ded8b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -581,7 +581,6 @@ static int btrfs_delayed_inode_reserve_metadata(
 	struct btrfs_block_rsv *dst_rsv;
 	u64 num_bytes;
 	int ret;
-	bool release = false;
 
 	src_rsv = trans->block_rsv;
 	dst_rsv = &fs_info->delayed_block_rsv;
@@ -589,36 +588,13 @@ static int btrfs_delayed_inode_reserve_metadata(
 	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
 
 	/*
-	 * If our block_rsv is the delalloc block reserve then check and see if
-	 * we have our extra reservation for updating the inode.  If not fall
-	 * through and try to reserve space quickly.
-	 *
-	 * We used to try and steal from the delalloc block rsv or the global
-	 * reserve, but we'd steal a full reservation, which isn't kind.  We are
-	 * here through delalloc which means we've likely just cowed down close
-	 * to the leaf that contains the inode, so we would steal less just
-	 * doing the fallback inode update, so if we do end up having to steal
-	 * from the global block rsv we hopefully only steal one or two blocks
-	 * worth which is less likely to hurt us.
-	 */
-	if (src_rsv && src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
-		spin_lock(&inode->lock);
-		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-				       &inode->runtime_flags))
-			release = true;
-		else
-			src_rsv = NULL;
-		spin_unlock(&inode->lock);
-	}
-
-	/*
 	 * btrfs_dirty_inode will update the inode under btrfs_join_transaction
 	 * which doesn't reserve space for speed.  This is a problem since we
 	 * still need to reserve space for this update, so try to reserve the
 	 * space.
 	 *
 	 * Now if src_rsv == delalloc_block_rsv we'll let it just steal since
-	 * we're accounted for.
+	 * we always reserve enough to update the inode item.
 	 */
 	if (!src_rsv || (!trans->bytes_reserved &&
 			 src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
@@ -643,32 +619,12 @@ static int btrfs_delayed_inode_reserve_metadata(
 	}
 
 	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
-
-	/*
-	 * Migrate only takes a reservation, it doesn't touch the size of the
-	 * block_rsv.  This is to simplify people who don't normally have things
-	 * migrated from their block rsv.  If they go to release their
-	 * reservation, that will decrease the size as well, so if migrate
-	 * reduced size we'd end up with a negative size.  But for the
-	 * delalloc_meta_reserved stuff we will only know to drop 1 reservation,
-	 * but we could in fact do this reserve/migrate dance several times
-	 * between the time we did the original reservation and we'd clean it
-	 * up.  So to take care of this, release the space for the meta
-	 * reservation here.  I think it may be time for a documentation page on
-	 * how block rsvs. work.
-	 */
 	if (!ret) {
 		trace_btrfs_space_reservation(fs_info, "delayed_inode",
 					      btrfs_ino(inode), num_bytes, 1);
 		node->bytes_reserved = num_bytes;
 	}
 
-	if (release) {
-		trace_btrfs_space_reservation(fs_info, "delalloc",
-					      btrfs_ino(inode), num_bytes, 0);
-		btrfs_block_rsv_release(fs_info, src_rsv, num_bytes);
-	}
-
 	return ret;
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f3cc4aa24e8a..1307907e19d8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2723,14 +2723,6 @@ int open_ctree(struct super_block *sb,
 		goto fail_delalloc_bytes;
 	}
 
-	fs_info->btree_inode = new_inode(sb);
-	if (!fs_info->btree_inode) {
-		err = -ENOMEM;
-		goto fail_bio_counter;
-	}
-
-	mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
-
 	INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
 	INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
 	INIT_LIST_HEAD(&fs_info->trans_list);
@@ -2763,8 +2755,6 @@ int open_ctree(struct super_block *sb,
 	btrfs_mapping_init(&fs_info->mapping_tree);
 	btrfs_init_block_rsv(&fs_info->global_block_rsv,
 			     BTRFS_BLOCK_RSV_GLOBAL);
-	btrfs_init_block_rsv(&fs_info->delalloc_block_rsv,
-			     BTRFS_BLOCK_RSV_DELALLOC);
 	btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
 	btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
 	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
@@ -2792,6 +2782,14 @@ int open_ctree(struct super_block *sb,
 
 	INIT_LIST_HEAD(&fs_info->ordered_roots);
 	spin_lock_init(&fs_info->ordered_root_lock);
+
+	fs_info->btree_inode = new_inode(sb);
+	if (!fs_info->btree_inode) {
+		err = -ENOMEM;
+		goto fail_bio_counter;
+	}
+	mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
+
 	fs_info->delayed_root = kmalloc(sizeof(struct btrfs_delayed_root),
 					GFP_KERNEL);
 	if (!fs_info->delayed_root) {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index aa0f5c8953b0..e32ad9fc93a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/ratelimit.h>
 #include <linux/percpu_counter.h>
+#include <linux/lockdep.h>
 #include "hash.h"
 #include "tree-log.h"
 #include "disk-io.h"
@@ -4831,7 +4832,6 @@ static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
 static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 			    u64 orig, bool wait_ordered)
 {
-	struct btrfs_block_rsv *block_rsv;
 	struct btrfs_space_info *space_info;
 	struct btrfs_trans_handle *trans;
 	u64 delalloc_bytes;
@@ -4847,8 +4847,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 	to_reclaim = items * EXTENT_SIZE_PER_ITEM;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
-	block_rsv = &fs_info->delalloc_block_rsv;
-	space_info = block_rsv->space_info;
+	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 
 	delalloc_bytes = percpu_counter_sum_positive(
 						&fs_info->delalloc_bytes);
@@ -5584,11 +5583,12 @@ static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
 	}
 }
 
-static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
+static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_rsv *block_rsv,
 				    struct btrfs_block_rsv *dest, u64 num_bytes)
 {
 	struct btrfs_space_info *space_info = block_rsv->space_info;
+	u64 ret;
 
 	spin_lock(&block_rsv->lock);
 	if (num_bytes == (u64)-1)
@@ -5603,6 +5603,7 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 	}
 	spin_unlock(&block_rsv->lock);
 
+	ret = num_bytes;
 	if (num_bytes > 0) {
 		if (dest) {
 			spin_lock(&dest->lock);
@@ -5622,6 +5623,7 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 			space_info_add_old_bytes(fs_info, space_info,
 						 num_bytes);
 	}
+	return ret;
 }
 
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
@@ -5645,6 +5647,15 @@ void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type)
 	rsv->type = type;
 }
 
+void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_rsv *rsv,
+				   unsigned short type)
+{
+	btrfs_init_block_rsv(rsv, type);
+	rsv->space_info = __find_space_info(fs_info,
+					    BTRFS_BLOCK_GROUP_METADATA);
+}
+
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
 					      unsigned short type)
 {
@@ -5654,9 +5665,7 @@ struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
 	if (!block_rsv)
 		return NULL;
 
-	btrfs_init_block_rsv(block_rsv, type);
-	block_rsv->space_info = __find_space_info(fs_info,
-						  BTRFS_BLOCK_GROUP_METADATA);
+	btrfs_init_metadata_block_rsv(fs_info, block_rsv, type);
 	return block_rsv;
 }
 
@@ -5739,6 +5748,66 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
+/**
+ * btrfs_inode_rsv_refill - refill the inode block rsv.
+ * @inode - the inode we are refilling.
+ * @flush - the flushing restriction.
+ *
+ * Essentially the same as btrfs_block_rsv_refill, except it uses the
+ * block_rsv->size as the minimum size.  We'll either refill the missing amount
+ * or return if we already have enough space.  This will also handle the reserve
+ * tracepoint for the reserved amount.
+ */
+int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
+			   enum btrfs_reserve_flush_enum flush)
+{
+	struct btrfs_root *root = inode->root;
+	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	u64 num_bytes = 0;
+	int ret = -ENOSPC;
+
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size)
+		num_bytes = block_rsv->size - block_rsv->reserved;
+	spin_unlock(&block_rsv->lock);
+
+	if (num_bytes == 0)
+		return 0;
+
+	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+	if (!ret) {
+		block_rsv_add_bytes(block_rsv, num_bytes, 0);
+		trace_btrfs_space_reservation(root->fs_info, "delalloc",
+					      btrfs_ino(inode), num_bytes, 1);
+	}
+	return ret;
+}
+
+/**
+ * btrfs_inode_rsv_release - release any excessive reservation.
+ * @inode - the inode we need to release from.
+ *
+ * This is the same as btrfs_block_rsv_release, except that it handles the
+ * tracepoint for the reservation.
+ */
+void btrfs_inode_rsv_release(struct btrfs_inode *inode)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	u64 released = 0;
+
+	/*
+	 * Since we statically set the block_rsv->size we just want to say we
+	 * are releasing 0 bytes, and then we'll just get the reservation over
+	 * the size free'd.
+	 */
+	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0);
+	if (released > 0)
+		trace_btrfs_space_reservation(fs_info, "delalloc",
+					      btrfs_ino(inode), released, 0);
+}
+
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 			     struct btrfs_block_rsv *block_rsv,
 			     u64 num_bytes)
@@ -5810,7 +5879,6 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
 
 	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
 	fs_info->global_block_rsv.space_info = space_info;
-	fs_info->delalloc_block_rsv.space_info = space_info;
 	fs_info->trans_block_rsv.space_info = space_info;
 	fs_info->empty_block_rsv.space_info = space_info;
 	fs_info->delayed_block_rsv.space_info = space_info;
@@ -5830,8 +5898,6 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
 {
 	block_rsv_release_bytes(fs_info, &fs_info->global_block_rsv, NULL,
 				(u64)-1);
-	WARN_ON(fs_info->delalloc_block_rsv.size > 0);
-	WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
 	WARN_ON(fs_info->trans_block_rsv.size > 0);
 	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
 	WARN_ON(fs_info->chunk_block_rsv.size > 0);
@@ -5970,95 +6036,37 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
 	btrfs_block_rsv_release(fs_info, rsv, (u64)-1);
 }
 
-/**
- * drop_over_reserved_extents - drop our extra extent reservations
- * @inode: the inode we're dropping the extent for
- *
- * We reserve extents we may use, but they may have been merged with other
- * extents and we may not need the extra reservation.
- *
- * We also call this when we've completed io to an extent or had an error and
- * cleared the outstanding extent, in either case we no longer need our
- * reservation and can drop the excess.
- */
-static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
-{
-	unsigned num_extents = 0;
-
-	if (inode->reserved_extents > inode->outstanding_extents) {
-		num_extents = inode->reserved_extents -
-			inode->outstanding_extents;
-		btrfs_mod_reserved_extents(inode, -num_extents);
-	}
-
-	if (inode->outstanding_extents == 0 &&
-	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-			       &inode->runtime_flags))
-		num_extents++;
-	return num_extents;
-}
-
-/**
- * calc_csum_metadata_size - return the amount of metadata space that must be
- *	reserved/freed for the given bytes.
- * @inode: the inode we're manipulating
- * @num_bytes: the number of bytes in question
- * @reserve: 1 if we are reserving space, 0 if we are freeing space
- *
- * This adjusts the number of csum_bytes in the inode and then returns the
- * correct amount of metadata that must either be reserved or freed.  We
- * calculate how many checksums we can fit into one leaf and then divide the
- * number of bytes that will need to be checksumed by this value to figure out
- * how many checksums will be required.  If we are adding bytes then the number
- * may go up and we will return the number of additional bytes that must be
- * reserved.  If it is going down we will return the number of bytes that must
- * be freed.
- *
- * This must be called with BTRFS_I(inode)->lock held.
- */
-static u64 calc_csum_metadata_size(struct btrfs_inode *inode, u64 num_bytes,
-				   int reserve)
+static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
+						 struct btrfs_inode *inode)
 {
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
-	u64 old_csums, num_csums;
-
-	if (inode->flags & BTRFS_INODE_NODATASUM && inode->csum_bytes == 0)
-		return 0;
-
-	old_csums = btrfs_csum_bytes_to_leaves(fs_info, inode->csum_bytes);
-	if (reserve)
-		inode->csum_bytes += num_bytes;
-	else
-		inode->csum_bytes -= num_bytes;
-	num_csums = btrfs_csum_bytes_to_leaves(fs_info, inode->csum_bytes);
-
-	/* No change, no need to reserve more */
-	if (old_csums == num_csums)
-		return 0;
+	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	u64 reserve_size = 0;
+	u64 csum_leaves;
+	unsigned outstanding_extents;
 
-	if (reserve)
-		return btrfs_calc_trans_metadata_size(fs_info,
-						      num_csums - old_csums);
+	lockdep_assert_held(&inode->lock);
+	outstanding_extents = inode->outstanding_extents;
+	if (outstanding_extents)
+		reserve_size = btrfs_calc_trans_metadata_size(fs_info,
+						outstanding_extents + 1);
+	csum_leaves = btrfs_csum_bytes_to_leaves(fs_info,
+						 inode->csum_bytes);
+	reserve_size += btrfs_calc_trans_metadata_size(fs_info,
+						       csum_leaves);
 
-	return btrfs_calc_trans_metadata_size(fs_info, old_csums - num_csums);
+	spin_lock(&block_rsv->lock);
+	block_rsv->size = reserve_size;
+	spin_unlock(&block_rsv->lock);
 }
 
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
 	struct btrfs_root *root = inode->root;
-	struct btrfs_block_rsv *block_rsv = &fs_info->delalloc_block_rsv;
-	u64 to_reserve = 0;
-	u64 csum_bytes;
-	unsigned nr_extents, reserve_extents;
+	unsigned nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
 	bool delalloc_lock = true;
-	u64 to_free = 0;
-	unsigned dropped;
-	bool release_extra = false;
-	bool underflow = false;
-	bool did_retry = false;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -6083,33 +6091,13 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 		mutex_lock(&inode->delalloc_mutex);
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
-retry:
+
+	/* Add our new extents and calculate the new rsv size. */
 	spin_lock(&inode->lock);
-	reserve_extents = nr_extents = count_max_extents(num_bytes);
+	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
-
-	/*
-	 * Because we add an outstanding extent for ordered before we clear
-	 * delalloc we will double count our outstanding extents slightly.  This
-	 * could mean that we transiently over-reserve, which could result in an
-	 * early ENOSPC if our timing is unlucky.  Keep track of the case that
-	 * we had a reservation underflow so we can retry if we fail.
-	 *
-	 * Keep in mind we can legitimately have more outstanding extents than
-	 * reserved because of fragmentation, so only allow a retry once.
-	 */
-	if (inode->outstanding_extents >
-	    inode->reserved_extents + nr_extents) {
-		reserve_extents = inode->outstanding_extents -
-			inode->reserved_extents;
-		underflow = true;
-	}
-
-	/* We always want to reserve a slot for updating the inode. */
-	to_reserve = btrfs_calc_trans_metadata_size(fs_info,
-						    reserve_extents + 1);
-	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
-	csum_bytes = inode->csum_bytes;
+	inode->csum_bytes += num_bytes;
+	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
 	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
@@ -6119,100 +6107,26 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 			goto out_fail;
 	}
 
-	ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
+	ret = btrfs_inode_rsv_refill(inode, flush);
 	if (unlikely(ret)) {
 		btrfs_qgroup_free_meta(root,
 				       nr_extents * fs_info->nodesize);
 		goto out_fail;
 	}
 
-	spin_lock(&inode->lock);
-	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
-			     &inode->runtime_flags)) {
-		to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
-		release_extra = true;
-	}
-	btrfs_mod_reserved_extents(inode, reserve_extents);
-	spin_unlock(&inode->lock);
-
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
-
-	if (to_reserve)
-		trace_btrfs_space_reservation(fs_info, "delalloc",
-					      btrfs_ino(inode), to_reserve, 1);
-	if (release_extra)
-		btrfs_block_rsv_release(fs_info, block_rsv,
-				btrfs_calc_trans_metadata_size(fs_info, 1));
 	return 0;
 
 out_fail:
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, -nr_extents);
-
-	dropped = drop_over_reserved_extents(inode);
-	/*
-	 * If the inodes csum_bytes is the same as the original
-	 * csum_bytes then we know we haven't raced with any free()ers
-	 * so we can just reduce our inodes csum bytes and carry on.
-	 */
-	if (inode->csum_bytes == csum_bytes) {
-		calc_csum_metadata_size(inode, num_bytes, 0);
-	} else {
-		u64 orig_csum_bytes = inode->csum_bytes;
-		u64 bytes;
-
-		/*
-		 * This is tricky, but first we need to figure out how much we
-		 * freed from any free-ers that occurred during this
-		 * reservation, so we reset ->csum_bytes to the csum_bytes
-		 * before we dropped our lock, and then call the free for the
-		 * number of bytes that were freed while we were trying our
-		 * reservation.
-		 */
-		bytes = csum_bytes - inode->csum_bytes;
-		inode->csum_bytes = csum_bytes;
-		to_free = calc_csum_metadata_size(inode, bytes, 0);
-
-
-		/*
-		 * Now we need to see how much we would have freed had we not
-		 * been making this reservation and our ->csum_bytes were not
-		 * artificially inflated.
-		 */
-		inode->csum_bytes = csum_bytes - num_bytes;
-		bytes = csum_bytes - orig_csum_bytes;
-		bytes = calc_csum_metadata_size(inode, bytes, 0);
-
-		/*
-		 * Now reset ->csum_bytes to what it should be.  If bytes is
-		 * more than to_free then we would have freed more space had we
-		 * not had an artificially high ->csum_bytes, so we need to free
-		 * the remainder.  If bytes is the same or less then we don't
-		 * need to do anything, the other free-ers did the correct
-		 * thing.
-		 */
-		inode->csum_bytes = orig_csum_bytes - num_bytes;
-		if (bytes > to_free)
-			to_free = bytes - to_free;
-		else
-			to_free = 0;
-	}
+	inode->csum_bytes -= num_bytes;
+	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
-	if (dropped)
-		to_free += btrfs_calc_trans_metadata_size(fs_info, dropped);
 
-	if (to_free) {
-		btrfs_block_rsv_release(fs_info, block_rsv, to_free);
-		trace_btrfs_space_reservation(fs_info, "delalloc",
-					      btrfs_ino(inode), to_free, 0);
-	}
-	if (underflow && !did_retry) {
-		did_retry = true;
-		underflow = false;
-		goto retry;
-	}
+	btrfs_inode_rsv_release(inode);
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
 	return ret;
@@ -6230,25 +6144,17 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
-	u64 to_free = 0;
-	unsigned dropped;
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
 	spin_lock(&inode->lock);
-	dropped = drop_over_reserved_extents(inode);
-	if (num_bytes)
-		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
+	inode->csum_bytes -= num_bytes;
+	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
-	if (dropped > 0)
-		to_free += btrfs_calc_trans_metadata_size(fs_info, dropped);
 
 	if (btrfs_is_testing(fs_info))
 		return;
 
-	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
-				      to_free, 0);
-
-	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
+	btrfs_inode_rsv_release(inode);
 }
 
 /**
@@ -6266,25 +6172,17 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
 	unsigned num_extents;
-	u64 to_free;
-	unsigned dropped;
 
 	spin_lock(&inode->lock);
 	num_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
-	dropped = drop_over_reserved_extents(inode);
+	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
-	if (!dropped)
-		return;
-
 	if (btrfs_is_testing(fs_info))
 		return;
 
-	to_free = btrfs_calc_trans_metadata_size(fs_info, dropped);
-	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
-				      to_free, 0);
-	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
+	btrfs_inode_rsv_release(inode);
 }
 
 /**
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 33ba258815b2..4e092e799f0a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -42,6 +42,7 @@
 #include <linux/blkdev.h>
 #include <linux/posix_acl_xattr.h>
 #include <linux/uio.h>
+#include <linux/magic.h>
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -315,7 +316,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 		btrfs_free_path(path);
 		return PTR_ERR(trans);
 	}
-	trans->block_rsv = &fs_info->delalloc_block_rsv;
+	trans->block_rsv = &BTRFS_I(inode)->block_rsv;
 
 	if (compressed_size && compressed_pages)
 		extent_item_size = btrfs_file_extent_calc_inline_size(
@@ -2957,7 +2958,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 			trans = NULL;
 			goto out;
 		}
-		trans->block_rsv = &fs_info->delalloc_block_rsv;
+		trans->block_rsv = &BTRFS_I(inode)->block_rsv;
 		ret = btrfs_update_inode_fallback(trans, root, inode);
 		if (ret) /* -ENOMEM or corruption */
 			btrfs_abort_transaction(trans, ret);
@@ -2993,7 +2994,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
-	trans->block_rsv = &fs_info->delalloc_block_rsv;
+	trans->block_rsv = &BTRFS_I(inode)->block_rsv;
 
 	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
 		compress_type = ordered_extent->compress_type;
@@ -8848,7 +8849,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
 		current->journal_info = NULL;
-		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 		if (ret < 0 && ret != -EIOCBQUEUED) {
 			if (dio_data.reserve)
 				btrfs_delalloc_release_space(inode, data_reserved,
@@ -8869,6 +8869,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, data_reserved,
 					offset, count - (size_t)ret);
+		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 	}
 out:
 	if (wakeup)
@@ -9433,6 +9434,7 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 
 struct inode *btrfs_alloc_inode(struct super_block *sb)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
 	struct btrfs_inode *ei;
 	struct inode *inode;
 
@@ -9459,8 +9461,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 
 	spin_lock_init(&ei->lock);
 	ei->outstanding_extents = 0;
-	ei->reserved_extents = 0;
-
+	if (sb->s_magic != BTRFS_TEST_MAGIC)
+		btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
+					      BTRFS_BLOCK_RSV_DELALLOC);
 	ei->runtime_flags = 0;
 	ei->prop_compress = BTRFS_COMPRESS_NONE;
 	ei->defrag_compress = BTRFS_COMPRESS_NONE;
@@ -9510,8 +9513,9 @@ void btrfs_destroy_inode(struct inode *inode)
 
 	WARN_ON(!hlist_empty(&inode->i_dentry));
 	WARN_ON(inode->i_data.nrpages);
+	WARN_ON(BTRFS_I(inode)->block_rsv.reserved);
+	WARN_ON(BTRFS_I(inode)->block_rsv.size);
 	WARN_ON(BTRFS_I(inode)->outstanding_extents);
-	WARN_ON(BTRFS_I(inode)->reserved_extents);
 	WARN_ON(BTRFS_I(inode)->delalloc_bytes);
 	WARN_ON(BTRFS_I(inode)->new_delalloc_bytes);
 	WARN_ON(BTRFS_I(inode)->csum_bytes);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 04/21] btrfs: add ref-verify mount option
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (2 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 03/21] btrfs: make the delalloc block rsv per inode Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 13:53   ` David Sterba
  2017-09-29 19:43 ` [PATCH 05/21] btrfs: pass root to various extent ref mod functions Josef Bacik
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.
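
For illustration, with CONFIG_BTRFS_FS_REF_VERIFY built in (the Kconfig symbol
arrives with the verifier itself in a later patch), the option is enabled per
mount, e.g. mount -o ref_verify, and is reported back by btrfs_show_options()
as ",ref_verify", as wired up in the hunks below.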

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 93e767e16a43..f1bd12f5f2d5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1334,6 +1334,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FRAGMENT_METADATA	(1 << 25)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
+#define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f88ac4ebe01a..8e74f7029e12 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,9 @@ enum {
 #ifdef CONFIG_BTRFS_DEBUG
 	Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
 #endif
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+	Opt_ref_verify,
+#endif
 	Opt_err,
 };
 
@@ -387,6 +390,9 @@ static const match_table_t tokens = {
 	{Opt_fragment_metadata, "fragment=metadata"},
 	{Opt_fragment_all, "fragment=all"},
 #endif
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+	{Opt_ref_verify, "ref_verify"},
+#endif
 	{Opt_err, NULL},
 };
 
@@ -827,6 +833,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, FRAGMENT_DATA);
 			break;
 #endif
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+		case Opt_ref_verify:
+			btrfs_info(info, "doing ref verification");
+			btrfs_set_opt(info->mount_opt,
+				      REF_VERIFY);
+			break;
+#endif
 		case Opt_err:
 			btrfs_info(info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -1309,6 +1322,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(info, FRAGMENT_METADATA))
 		seq_puts(seq, ",fragment=metadata");
 #endif
+	if (btrfs_test_opt(info, REF_VERIFY))
+		seq_puts(seq, ",ref_verify");
 	seq_printf(seq, ",subvolid=%llu",
 		  BTRFS_I(d_inode(dentry))->root->root_key.objectid);
 	seq_puts(seq, ",subvol=");
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 05/21] btrfs: pass root to various extent ref mod functions
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (3 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 04/21] btrfs: add ref-verify mount option Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 14:01   ` David Sterba
  2017-09-29 19:43 ` [PATCH 06/21] Btrfs: add a extent ref verify tool Josef Bacik
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead.  This will be used in
a subsequent patch.
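
As a rough illustration of the conversion at a call site (a sketch, not a
literal hunk from this patch), callers that used to hand over fs_info now pass
the root, and the helpers derive fs_info from root->fs_info internally:

	/* before: callers passed fs_info directly */
	ret = btrfs_free_extent(trans, fs_info, bytenr, num_bytes, parent,
				root->root_key.objectid, owner, offset);

	/* after: callers pass the root, which also gives the ref verifier
	 * the per-root context it needs
	 */
	ret = btrfs_free_extent(trans, root, bytenr, num_bytes, parent,
				root->root_key.objectid, owner, offset);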

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ctree.c       |  2 +-
 fs/btrfs/ctree.h       |  7 ++++---
 fs/btrfs/extent-tree.c | 24 +++++++++++++-----------
 fs/btrfs/file.c        | 10 +++++-----
 fs/btrfs/inode.c       |  9 +++++----
 fs/btrfs/ioctl.c       |  2 +-
 fs/btrfs/relocation.c  | 14 +++++++-------
 fs/btrfs/tree-log.c    |  2 +-
 8 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 19b9c5131745..531e0a8645b0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -192,7 +192,7 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
  * tree until you end up with a lock on the root.  A locked buffer
  * is returned, with a reference held.
  */
-static struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
+struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
 {
 	struct extent_buffer *eb;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f1bd12f5f2d5..a7484a744ef0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2636,7 +2636,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			   struct extent_buffer *buf,
 			   u64 parent, int last_ref);
 int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
-				     u64 root_objectid, u64 owner,
+				     struct btrfs_root *root, u64 owner,
 				     u64 offset, u64 ram_bytes,
 				     struct btrfs_key *ins);
 int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
@@ -2655,7 +2655,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 				u64 bytenr, u64 num_bytes, u64 flags,
 				int level, int is_data);
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
-		      struct btrfs_fs_info *fs_info,
+		      struct btrfs_root *root,
 		      u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
 		      u64 owner, u64 offset);
 
@@ -2667,7 +2667,7 @@ void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 			       struct btrfs_fs_info *fs_info);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-			 struct btrfs_fs_info *fs_info,
+			 struct btrfs_root *root,
 			 u64 bytenr, u64 num_bytes, u64 parent,
 			 u64 root_objectid, u64 owner, u64 offset);
 
@@ -2811,6 +2811,7 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
 			     const struct btrfs_key *new_key);
 struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
 struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
+struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root);
 int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
 			struct btrfs_key *key, int lowest_level,
 			u64 min_trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e32ad9fc93a8..2df22cae45b1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2178,10 +2178,11 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 /* Can return -ENOMEM */
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-			 struct btrfs_fs_info *fs_info,
+			 struct btrfs_root *root,
 			 u64 bytenr, u64 num_bytes, u64 parent,
 			 u64 root_objectid, u64 owner, u64 offset)
 {
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int old_ref_mod, new_ref_mod;
 	int ret;
 
@@ -3340,7 +3341,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 	int level;
 	int ret = 0;
 	int (*process_func)(struct btrfs_trans_handle *,
-			    struct btrfs_fs_info *,
+			    struct btrfs_root *,
 			    u64, u64, u64, u64, u64, u64);
 
 
@@ -3380,7 +3381,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 
 			num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
 			key.offset -= btrfs_file_extent_offset(buf, fi);
-			ret = process_func(trans, fs_info, bytenr, num_bytes,
+			ret = process_func(trans, root, bytenr, num_bytes,
 					   parent, ref_root, key.objectid,
 					   key.offset);
 			if (ret)
@@ -3388,7 +3389,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 		} else {
 			bytenr = btrfs_node_blockptr(buf, i);
 			num_bytes = fs_info->nodesize;
-			ret = process_func(trans, fs_info, bytenr, num_bytes,
+			ret = process_func(trans, root, bytenr, num_bytes,
 					   parent, ref_root, level - 1, 0);
 			if (ret)
 				goto fail;
@@ -7274,17 +7275,17 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 /* Can return -ENOMEM */
 int btrfs_free_extent(struct btrfs_trans_handle *trans,
-		      struct btrfs_fs_info *fs_info,
+		      struct btrfs_root *root,
 		      u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
 		      u64 owner, u64 offset)
 {
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int old_ref_mod, new_ref_mod;
 	int ret;
 
 	if (btrfs_is_testing(fs_info))
 		return 0;
 
-
 	/*
 	 * tree log blocks never actually go into the extent allocation
 	 * tree, just update pinning info and exit early.
@@ -8251,17 +8252,18 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
-				     u64 root_objectid, u64 owner,
+				     struct btrfs_root *root, u64 owner,
 				     u64 offset, u64 ram_bytes,
 				     struct btrfs_key *ins)
 {
-	struct btrfs_fs_info *fs_info = trans->fs_info;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 
-	BUG_ON(root_objectid == BTRFS_TREE_LOG_OBJECTID);
+	BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 
 	ret = btrfs_add_delayed_data_ref(fs_info, trans, ins->objectid,
-					 ins->offset, 0, root_objectid, owner,
+					 ins->offset, 0,
+					 root->root_key.objectid, owner,
 					 offset, ram_bytes,
 					 BTRFS_ADD_DELAYED_EXTENT, NULL, NULL);
 	return ret;
@@ -8839,7 +8841,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 					     ret);
 			}
 		}
-		ret = btrfs_free_extent(trans, fs_info, bytenr, blocksize,
+		ret = btrfs_free_extent(trans, root, bytenr, blocksize,
 					parent, root->root_key.objectid,
 					level - 1, 0);
 		if (ret)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2a7f1b27149b..ab1c38f2dd8c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -856,7 +856,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 			btrfs_mark_buffer_dirty(leaf);
 
 			if (update_refs && disk_bytenr > 0) {
-				ret = btrfs_inc_extent_ref(trans, fs_info,
+				ret = btrfs_inc_extent_ref(trans, root,
 						disk_bytenr, num_bytes, 0,
 						root->root_key.objectid,
 						new_key.objectid,
@@ -940,7 +940,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 				extent_end = ALIGN(extent_end,
 						   fs_info->sectorsize);
 			} else if (update_refs && disk_bytenr > 0) {
-				ret = btrfs_free_extent(trans, fs_info,
+				ret = btrfs_free_extent(trans, root,
 						disk_bytenr, num_bytes, 0,
 						root->root_key.objectid,
 						key.objectid, key.offset -
@@ -1234,7 +1234,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 						extent_end - split);
 		btrfs_mark_buffer_dirty(leaf);
 
-		ret = btrfs_inc_extent_ref(trans, fs_info, bytenr, num_bytes,
+		ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes,
 					   0, root->root_key.objectid,
 					   ino, orig_offset);
 		if (ret) {
@@ -1268,7 +1268,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 		extent_end = other_end;
 		del_slot = path->slots[0] + 1;
 		del_nr++;
-		ret = btrfs_free_extent(trans, fs_info, bytenr, num_bytes,
+		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
 					0, root->root_key.objectid,
 					ino, orig_offset);
 		if (ret) {
@@ -1288,7 +1288,7 @@ int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
 		key.offset = other_start;
 		del_slot = path->slots[0];
 		del_nr++;
-		ret = btrfs_free_extent(trans, fs_info, bytenr, num_bytes,
+		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
 					0, root->root_key.objectid,
 					ino, orig_offset);
 		if (ret) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4e092e799f0a..3cbddfc181dc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2223,8 +2223,9 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (ret < 0)
 		goto out;
 	qg_released = ret;
-	ret = btrfs_alloc_reserved_file_extent(trans, root->root_key.objectid,
-			btrfs_ino(BTRFS_I(inode)), file_pos, qg_released, &ins);
+	ret = btrfs_alloc_reserved_file_extent(trans, root,
+					       btrfs_ino(BTRFS_I(inode)),
+					       file_pos, qg_released, &ins);
 out:
 	btrfs_free_path(path);
 
@@ -2676,7 +2677,7 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 	inode_add_bytes(inode, len);
 	btrfs_release_path(path);
 
-	ret = btrfs_inc_extent_ref(trans, fs_info, new->bytenr,
+	ret = btrfs_inc_extent_ref(trans, root, new->bytenr,
 			new->disk_len, 0,
 			backref->root_id, backref->inum,
 			new->file_pos);	/* start - extent_offset */
@@ -4667,7 +4668,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 		     root == fs_info->tree_root)) {
 			btrfs_set_path_blocking(path);
 			bytes_deleted += extent_num_bytes;
-			ret = btrfs_free_extent(trans, fs_info, extent_start,
+			ret = btrfs_free_extent(trans, root, extent_start,
 						extent_num_bytes, 0,
 						btrfs_header_owner(leaf),
 						ino, extent_offset);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 17059aa5564f..398495f79c83 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3678,7 +3678,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 				if (disko) {
 					inode_add_bytes(inode, datal);
 					ret = btrfs_inc_extent_ref(trans,
-							fs_info,
+							root,
 							disko, diskl, 0,
 							root->root_key.objectid,
 							btrfs_ino(BTRFS_I(inode)),
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 53e192647339..4cf2eb67eba6 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1742,7 +1742,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 		dirty = 1;
 
 		key.offset -= btrfs_file_extent_offset(leaf, fi);
-		ret = btrfs_inc_extent_ref(trans, fs_info, new_bytenr,
+		ret = btrfs_inc_extent_ref(trans, root, new_bytenr,
 					   num_bytes, parent,
 					   btrfs_header_owner(leaf),
 					   key.objectid, key.offset);
@@ -1751,7 +1751,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 			break;
 		}
 
-		ret = btrfs_free_extent(trans, fs_info, bytenr, num_bytes,
+		ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
 					parent, btrfs_header_owner(leaf),
 					key.objectid, key.offset);
 		if (ret) {
@@ -1952,21 +1952,21 @@ int replace_path(struct btrfs_trans_handle *trans,
 					      path->slots[level], old_ptr_gen);
 		btrfs_mark_buffer_dirty(path->nodes[level]);
 
-		ret = btrfs_inc_extent_ref(trans, fs_info, old_bytenr,
+		ret = btrfs_inc_extent_ref(trans, src, old_bytenr,
 					blocksize, path->nodes[level]->start,
 					src->root_key.objectid, level - 1, 0);
 		BUG_ON(ret);
-		ret = btrfs_inc_extent_ref(trans, fs_info, new_bytenr,
+		ret = btrfs_inc_extent_ref(trans, dest, new_bytenr,
 					blocksize, 0, dest->root_key.objectid,
 					level - 1, 0);
 		BUG_ON(ret);
 
-		ret = btrfs_free_extent(trans, fs_info, new_bytenr, blocksize,
+		ret = btrfs_free_extent(trans, src, new_bytenr, blocksize,
 					path->nodes[level]->start,
 					src->root_key.objectid, level - 1, 0);
 		BUG_ON(ret);
 
-		ret = btrfs_free_extent(trans, fs_info, old_bytenr, blocksize,
+		ret = btrfs_free_extent(trans, dest, old_bytenr, blocksize,
 					0, dest->root_key.objectid, level - 1,
 					0);
 		BUG_ON(ret);
@@ -2808,7 +2808,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 						      trans->transid);
 			btrfs_mark_buffer_dirty(upper->eb);
 
-			ret = btrfs_inc_extent_ref(trans, root->fs_info,
+			ret = btrfs_inc_extent_ref(trans, root,
 						node->eb->start, blocksize,
 						upper->eb->start,
 						btrfs_header_owner(upper->eb),
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1036ac7313a7..aa7c71cff575 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -717,7 +717,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 			ret = btrfs_lookup_data_extent(fs_info, ins.objectid,
 						ins.offset);
 			if (ret == 0) {
-				ret = btrfs_inc_extent_ref(trans, fs_info,
+				ret = btrfs_inc_extent_ref(trans, root,
 						ins.objectid, ins.offset,
 						0, root->root_key.objectid,
 						key->objectid, offset);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 06/21] Btrfs: add a extent ref verify tool
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (4 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 05/21] btrfs: pass root to various extent ref mod functions Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 14:23   ` David Sterba
  2017-09-29 19:43 ` [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction Josef Bacik
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We were having corruption issues that were tied back to problems with the extent
tree.  In order to track them down I built this tool to try and find the
culprit, which was pretty successful.  If you compile with this tool enabled, it
will live-verify every ref update that the fs makes and make sure it is consistent
and valid.  I've run this through with xfstests and haven't gotten any false
positives.  Thanks,
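
For context, the hooks this patch adds to extent-tree.c all funnel into
btrfs_ref_tree_mod() with the same arguments as the delayed ref being queued,
along the lines of (an illustrative sketch, not an extra hunk):

	/* record the ref add in the verifier before queueing the delayed ref */
	btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
			   owner, offset, BTRFS_ADD_DELAYED_REF);

The checks only run when CONFIG_BTRFS_FS_REF_VERIFY is built in and the
filesystem is mounted with the ref_verify option added in the earlier patch.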

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/Kconfig       |   11 +
 fs/btrfs/Makefile      |    1 +
 fs/btrfs/ctree.h       |    5 +
 fs/btrfs/disk-io.c     |    6 +
 fs/btrfs/extent-tree.c |   22 ++
 fs/btrfs/ref-verify.c  | 1012 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ref-verify.h  |   59 +++
 7 files changed, 1116 insertions(+)
 create mode 100644 fs/btrfs/ref-verify.c
 create mode 100644 fs/btrfs/ref-verify.h

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index a26c63b4ad68..2e558227931a 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -91,3 +91,14 @@ config BTRFS_ASSERT
 	  any of the assertions trip.  This is meant for btrfs developers only.
 
 	  If unsure, say N.
+
+config BTRFS_FS_REF_VERIFY
+	bool "Btrfs with the ref verify tool compiled in"
+	depends on BTRFS_FS
+	default n
+	help
+	  Enable run-time extent reference verification instrumentation.  This
+	  is meant to be used by btrfs developers for tracking down extent
+	  reference problems or verifying they didn't break something.
+
+	  If unsure, say N.
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 962a95aefb81..72c60f54f962 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -13,6 +13,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
+btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a7484a744ef0..4ffbe9f07cf7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1096,6 +1096,11 @@ struct btrfs_fs_info {
 	u32 nodesize;
 	u32 sectorsize;
 	u32 stripesize;
+
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+	spinlock_t ref_verify_lock;
+	struct rb_root block_tree;
+#endif
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1307907e19d8..778dc7682966 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -50,6 +50,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "ref-verify.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -2776,6 +2777,7 @@ int open_ctree(struct super_block *sb,
 	/* readahead state */
 	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	spin_lock_init(&fs_info->reada_lock);
+	btrfs_init_ref_verify(fs_info);
 
 	fs_info->thread_pool_size = min_t(unsigned long,
 					  num_online_cpus() + 2, 8);
@@ -3195,6 +3197,9 @@ int open_ctree(struct super_block *sb,
 	if (ret)
 		goto fail_trans_kthread;
 
+	if (btrfs_build_ref_tree(fs_info))
+		btrfs_err(fs_info, "BTRFS: couldn't build ref tree\n");
+
 	/* do not make disk changes in broken FS or nologreplay is given */
 	if (btrfs_super_log_root(disk_super) != 0 &&
 	    !btrfs_test_opt(fs_info, NOLOGREPLAY)) {
@@ -4060,6 +4065,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	cleanup_srcu_struct(&fs_info->subvol_srcu);
 
 	btrfs_free_stripe_hash_table(fs_info);
+	btrfs_free_ref_cache(fs_info);
 
 	__btrfs_free_block_rsv(root->orphan_block_rsv);
 	root->orphan_block_rsv = NULL;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2df22cae45b1..00d86c8afaef 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -39,6 +39,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "ref-verify.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2189,6 +2190,9 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
 	       root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
+	btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
+			   owner, offset, BTRFS_ADD_DELAYED_REF);
+
 	if (owner < BTRFS_FIRST_FREE_OBJECTID) {
 		ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr,
 						 num_bytes, parent,
@@ -7223,6 +7227,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 		int old_ref_mod, new_ref_mod;
 
+		btrfs_ref_tree_mod(root, buf->start, buf->len, parent,
+				   root->root_key.objectid,
+				   btrfs_header_level(buf), 0,
+				   BTRFS_DROP_DELAYED_REF);
 		ret = btrfs_add_delayed_tree_ref(fs_info, trans, buf->start,
 						 buf->len, parent,
 						 root->root_key.objectid,
@@ -7286,6 +7294,11 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 	if (btrfs_is_testing(fs_info))
 		return 0;
 
+	if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
+		btrfs_ref_tree_mod(root, bytenr, num_bytes, parent,
+				   root_objectid, owner, offset,
+				   BTRFS_DROP_DELAYED_REF);
+
 	/*
 	 * tree log blocks never actually go into the extent allocation
 	 * tree, just update pinning info and exit early.
@@ -8261,6 +8274,10 @@ int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 
 	BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 
+	btrfs_ref_tree_mod(root, ins->objectid, ins->offset, 0,
+			   root->root_key.objectid, owner, offset,
+			   BTRFS_ADD_DELAYED_EXTENT);
+
 	ret = btrfs_add_delayed_data_ref(fs_info, trans, ins->objectid,
 					 ins->offset, 0,
 					 root->root_key.objectid, owner,
@@ -8485,6 +8502,9 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		extent_op->is_data = false;
 		extent_op->level = level;
 
+		btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
+				   root_objectid, level, 0,
+				   BTRFS_ADD_DELAYED_EXTENT);
 		ret = btrfs_add_delayed_tree_ref(fs_info, trans, ins.objectid,
 						 ins.offset, parent,
 						 root_objectid, level,
@@ -10334,6 +10354,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * remove it.
 	 */
 	free_excluded_extents(fs_info, block_group);
+	btrfs_free_ref_tree_range(fs_info, block_group->key.objectid,
+				  block_group->key.offset);
 
 	memcpy(&key, &block_group->key, sizeof(key));
 	index = get_block_group_index(block_group);
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
new file mode 100644
index 000000000000..16640398e2ef
--- /dev/null
+++ b/fs/btrfs/ref-verify.c
@@ -0,0 +1,1012 @@
+/*
+ * Copyright (C) 2014 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include <linux/sched.h>
+#include <linux/stacktrace.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "locking.h"
+#include "delayed-ref.h"
+#include "ref-verify.h"
+
+/*
+ * Used to keep track of the roots and number of refs each root has for a given
+ * bytenr.  This just tracks the number of direct references, no shared
+ * references.
+ */
+struct root_entry {
+	u64 root_objectid;
+	u64 num_refs;
+	struct rb_node node;
+};
+
+/*
+ * These are meant to represent what should exist in the extent tree; they can
+ * be used to verify the extent tree is consistent, as they should all match
+ * what the extent tree says.
+ */
+struct ref_entry {
+	u64 root_objectid;
+	u64 parent;
+	u64 owner;
+	u64 offset;
+	u64 num_refs;
+	struct rb_node node;
+};
+
+#define MAX_TRACE	16
+
+/*
+ * Whenever we add/remove a reference we record the action.  The action maps
+ * back to the delayed ref action.  We hold the ref we are changing in the
+ * action so we can account for the history properly, and we record the root we
+ * were called with since it could be different from ref_root.  We also store
+ * stack traces because that's how I roll.
+ */
+struct ref_action {
+	int action;
+	u64 root;
+	struct ref_entry ref;
+	struct list_head list;
+	unsigned long trace[MAX_TRACE];
+	unsigned int trace_len;
+};
+
+/*
+ * One of these for every block we reference; it holds the roots and references
+ * to it as well as all of the ref actions that have occurred to it.  We never
+ * free it until we unmount the file system in order to make sure re-allocations
+ * are happening properly.
+ */
+struct block_entry {
+	u64 bytenr;
+	u64 len;
+	u64 num_refs;
+	int metadata;
+	int from_disk;
+	struct rb_root roots;
+	struct rb_root refs;
+	struct rb_node node;
+	struct list_head actions;
+};
+
+static struct block_entry *insert_block_entry(struct rb_root *root,
+					      struct block_entry *be)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent_node = NULL;
+	struct block_entry *entry;
+
+	while (*p) {
+		parent_node = *p;
+		entry = rb_entry(parent_node, struct block_entry, node);
+		if (entry->bytenr > be->bytenr)
+			p = &(*p)->rb_left;
+		else if (entry->bytenr < be->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	rb_link_node(&be->node, parent_node, p);
+	rb_insert_color(&be->node, root);
+	return NULL;
+}
+
+static struct block_entry *lookup_block_entry(struct rb_root *root, u64 bytenr)
+{
+	struct rb_node *n;
+	struct block_entry *entry = NULL;
+
+	n = root->rb_node;
+	while (n) {
+		entry = rb_entry(n, struct block_entry, node);
+		if (entry->bytenr < bytenr)
+			n = n->rb_right;
+		else if (entry->bytenr > bytenr)
+			n = n->rb_left;
+		else
+			return entry;
+	}
+	return NULL;
+}
+
+static struct root_entry *insert_root_entry(struct rb_root *root,
+					    struct root_entry *re)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent_node = NULL;
+	struct root_entry *entry;
+
+	while (*p) {
+		parent_node = *p;
+		entry = rb_entry(parent_node, struct root_entry, node);
+		if (entry->root_objectid > re->root_objectid)
+			p = &(*p)->rb_left;
+		else if (entry->root_objectid < re->root_objectid)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	rb_link_node(&re->node, parent_node, p);
+	rb_insert_color(&re->node, root);
+	return NULL;
+
+}
+
+static int comp_refs(struct ref_entry *ref1, struct ref_entry *ref2)
+{
+	if (ref1->root_objectid < ref2->root_objectid)
+		return -1;
+	if (ref1->root_objectid > ref2->root_objectid)
+		return 1;
+	if (ref1->parent < ref2->parent)
+		return -1;
+	if (ref1->parent > ref2->parent)
+		return 1;
+	if (ref1->owner < ref2->owner)
+		return -1;
+	if (ref1->owner > ref2->owner)
+		return 1;
+	if (ref1->offset < ref2->offset)
+		return -1;
+	if (ref1->offset > ref2->offset)
+		return 1;
+	return 0;
+}
+
+static struct ref_entry *insert_ref_entry(struct rb_root *root,
+					  struct ref_entry *ref)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent_node = NULL;
+	struct ref_entry *entry;
+	int cmp;
+
+	while (*p) {
+		parent_node = *p;
+		entry = rb_entry(parent_node, struct ref_entry, node);
+		cmp = comp_refs(entry, ref);
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else if (cmp < 0)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	rb_link_node(&ref->node, parent_node, p);
+	rb_insert_color(&ref->node, root);
+	return NULL;
+
+}
+
+static struct root_entry *lookup_root_entry(struct rb_root *root, u64 objectid)
+{
+	struct rb_node *n;
+	struct root_entry *entry = NULL;
+
+	n = root->rb_node;
+	while (n) {
+		entry = rb_entry(n, struct root_entry, node);
+		if (entry->root_objectid < objectid)
+			n = n->rb_right;
+		else if (entry->root_objectid > objectid)
+			n = n->rb_left;
+		else
+			return entry;
+	}
+	return NULL;
+}
+
+#ifdef CONFIG_STACKTRACE
+static void __save_stack_trace(struct ref_action *ra)
+{
+	struct stack_trace stack_trace;
+
+	stack_trace.max_entries = MAX_TRACE;
+	stack_trace.nr_entries = 0;
+	stack_trace.entries = ra->trace;
+	stack_trace.skip = 2;
+	save_stack_trace(&stack_trace);
+	ra->trace_len = stack_trace.nr_entries;
+}
+
+static void __print_stack_trace(struct btrfs_fs_info *fs_info,
+				struct ref_action *ra)
+{
+	struct stack_trace trace;
+	if (ra->trace_len == 0) {
+		btrfs_err(fs_info, "  No Stacktrace\n");
+		return;
+	}
+	trace.nr_entries = ra->trace_len;
+	trace.entries = ra->trace;
+	print_stack_trace(&trace, 2);
+}
+#else
+static void inline __save_stack_trace(struct ref_action *ra) {}
+static void inline __print_stack_trace(struct btrfs_fs_info *fs_info,
+				       struct ref_action *ra)
+{
+	btrfs_err(fs_info, "  No stacktrace support\n");
+}
+#endif
+
+static void free_block_entry(struct block_entry *be)
+{
+	struct root_entry *re;
+	struct ref_entry *ref;
+	struct ref_action *ra;
+	struct rb_node *n;
+
+	while ((n = rb_first(&be->roots))) {
+		re = rb_entry(n, struct root_entry, node);
+		rb_erase(&re->node, &be->roots);
+		kfree(re);
+	}
+
+	while((n = rb_first(&be->refs))) {
+		ref = rb_entry(n, struct ref_entry, node);
+		rb_erase(&ref->node, &be->refs);
+		kfree(ref);
+	}
+
+	while (!list_empty(&be->actions)) {
+		ra = list_first_entry(&be->actions, struct ref_action,
+				      list);
+		list_del(&ra->list);
+		kfree(ra);
+	}
+	kfree(be);
+}
+
+static struct block_entry *add_block_entry(struct btrfs_fs_info *fs_info,
+					   u64 bytenr, u64 len,
+					   u64 root_objectid)
+{
+	struct block_entry *be = NULL, *exist;
+	struct root_entry *re = NULL;
+
+	re = kzalloc(sizeof(struct root_entry), GFP_NOFS);
+	be = kzalloc(sizeof(struct block_entry), GFP_NOFS);
+	if (!be || !re) {
+		kfree(re);
+		kfree(be);
+		return ERR_PTR(-ENOMEM);
+	}
+	be->bytenr = bytenr;
+	be->len = len;
+
+	re->root_objectid = root_objectid;
+	re->num_refs = 0;
+
+	spin_lock(&fs_info->ref_verify_lock);
+	exist = insert_block_entry(&fs_info->block_tree, be);
+	if (exist) {
+		if (root_objectid) {
+			struct root_entry *exist_re;
+			exist_re = insert_root_entry(&exist->roots, re);
+			if (exist_re)
+				kfree(re);
+		}
+		kfree(be);
+		return exist;
+	}
+
+	be->num_refs = 0;
+	be->metadata = 0;
+	be->from_disk = 0;
+	be->roots = RB_ROOT;
+	be->refs = RB_ROOT;
+	INIT_LIST_HEAD(&be->actions);
+	if (root_objectid)
+		insert_root_entry(&be->roots, re);
+	else
+		kfree(re);
+	return be;
+}
+
+static int add_tree_block(struct btrfs_fs_info *fs_info, u64 ref_root,
+			  u64 parent, u64 bytenr, int level)
+{
+	struct block_entry *be;
+	struct root_entry *re;
+	struct ref_entry *ref = NULL, *exist;
+
+	ref = kmalloc(sizeof(struct ref_entry), GFP_KERNEL);
+	if (!ref)
+		return -ENOMEM;
+
+	if (parent)
+		ref->root_objectid = 0;
+	else
+		ref->root_objectid = ref_root;
+	ref->parent = parent;
+	ref->owner = level;
+	ref->offset = 0;
+	ref->num_refs = 1;
+
+	be = add_block_entry(fs_info, bytenr, fs_info->nodesize, ref_root);
+	if (IS_ERR(be)) {
+		kfree(ref);
+		return PTR_ERR(be);
+	}
+	be->num_refs++;
+	be->from_disk = 1;
+	be->metadata = 1;
+
+	if (!parent) {
+		ASSERT(ref_root);
+		re = lookup_root_entry(&be->roots, ref_root);
+		ASSERT(re);
+		re->num_refs++;
+	}
+	exist = insert_ref_entry(&be->refs, ref);
+	if (exist) {
+		exist->num_refs++;
+		kfree(ref);
+	}
+	spin_unlock(&fs_info->ref_verify_lock);
+
+	return 0;
+}
+
+static int add_shared_data_ref(struct btrfs_fs_info *fs_info,
+			       u64 parent, u32 num_refs, u64 bytenr,
+			       u64 num_bytes)
+{
+	struct block_entry *be;
+	struct ref_entry *ref;
+
+	ref = kzalloc(sizeof(struct ref_entry), GFP_KERNEL);
+	if (!ref)
+		return -ENOMEM;
+	be = add_block_entry(fs_info, bytenr, num_bytes, 0);
+	if (IS_ERR(be)) {
+		kfree(ref);
+		return PTR_ERR(be);
+	}
+	be->num_refs += num_refs;
+
+	ref->parent = parent;
+	ref->num_refs = num_refs;
+	if (insert_ref_entry(&be->refs, ref)) {
+		spin_unlock(&fs_info->ref_verify_lock);
+		btrfs_err(fs_info, "Existing shared ref when reading from disk?\n");
+		kfree(ref);
+		return -EINVAL;
+	}
+	spin_unlock(&fs_info->ref_verify_lock);
+	return 0;
+}
+
+static int add_extent_data_ref(struct btrfs_fs_info *fs_info,
+			       struct extent_buffer *leaf,
+			       struct btrfs_extent_data_ref *dref,
+			       u64 bytenr, u64 num_bytes)
+{
+	struct block_entry *be;
+	struct ref_entry *ref;
+	struct root_entry *re;
+	u64 ref_root = btrfs_extent_data_ref_root(leaf, dref);
+	u64 owner = btrfs_extent_data_ref_objectid(leaf, dref);
+	u64 offset = btrfs_extent_data_ref_offset(leaf, dref);
+	u32 num_refs = btrfs_extent_data_ref_count(leaf, dref);
+
+	ref = kzalloc(sizeof(struct ref_entry), GFP_KERNEL);
+	if (!ref)
+		return -ENOMEM;
+	be = add_block_entry(fs_info, bytenr, num_bytes, ref_root);
+	if (IS_ERR(be)) {
+		kfree(ref);
+		return PTR_ERR(be);
+	}
+	be->num_refs += num_refs;
+
+	ref->parent = 0;
+	ref->owner = owner;
+	ref->root_objectid = ref_root;
+	ref->offset = offset;
+	ref->num_refs = num_refs;
+	if (insert_ref_entry(&be->refs, ref)) {
+		spin_unlock(&fs_info->ref_verify_lock);
+		btrfs_err(fs_info, "Existing ref when reading from disk?\n");
+		kfree(ref);
+		return -EINVAL;
+	}
+
+	re = lookup_root_entry(&be->roots, ref_root);
+	if (!re) {
+		spin_unlock(&fs_info->ref_verify_lock);
+		btrfs_err(fs_info, "Missing root in new block entry?\n");
+		return -EINVAL;
+	}
+	re->num_refs += num_refs;
+	spin_unlock(&fs_info->ref_verify_lock);
+	return 0;
+}
+
+static int process_extent_item(struct btrfs_fs_info *fs_info,
+			       struct btrfs_path *path, struct btrfs_key *key,
+			       int slot, int *tree_block_level)
+{
+	struct btrfs_extent_item *ei;
+	struct btrfs_extent_inline_ref *iref;
+	struct btrfs_extent_data_ref *dref;
+	struct btrfs_shared_data_ref *sref;
+	struct extent_buffer *leaf = path->nodes[0];
+	u32 item_size = btrfs_item_size_nr(leaf, slot);
+	unsigned long end, ptr;
+	u64 offset, flags, count;
+	int type, ret;
+
+	ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+	flags = btrfs_extent_flags(leaf, ei);
+
+	if ((key->type == BTRFS_EXTENT_ITEM_KEY) &&
+	    flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
+		struct btrfs_tree_block_info *info;
+		info = (struct btrfs_tree_block_info *)(ei + 1);
+		*tree_block_level = btrfs_tree_block_level(leaf, info);
+		iref = (struct btrfs_extent_inline_ref *)(info + 1);
+	} else {
+		if (key->type == BTRFS_METADATA_ITEM_KEY)
+			*tree_block_level = key->offset;
+		iref = (struct btrfs_extent_inline_ref *)(ei + 1);
+	}
+
+	ptr = (unsigned long)iref;
+	end = (unsigned long)ei + item_size;
+	while (ptr < end) {
+		iref = (struct btrfs_extent_inline_ref *)ptr;
+		type = btrfs_extent_inline_ref_type(leaf, iref);
+		offset = btrfs_extent_inline_ref_offset(leaf, iref);
+		switch (type) {
+		case BTRFS_TREE_BLOCK_REF_KEY:
+			ret = add_tree_block(fs_info, offset, 0, key->objectid,
+					     *tree_block_level);
+			break;
+		case BTRFS_SHARED_BLOCK_REF_KEY:
+			ret = add_tree_block(fs_info, 0, offset, key->objectid,
+					     *tree_block_level);
+			break;
+		case BTRFS_EXTENT_DATA_REF_KEY:
+			dref = (struct btrfs_extent_data_ref *)(&iref->offset);
+			ret = add_extent_data_ref(fs_info, leaf, dref,
+						  key->objectid, key->offset);
+			break;
+		case BTRFS_SHARED_DATA_REF_KEY:
+			sref = (struct btrfs_shared_data_ref *)(iref + 1);
+			count = btrfs_shared_data_ref_count(leaf, sref);
+			ret = add_shared_data_ref(fs_info, offset, count,
+						  key->objectid, key->offset);
+			break;
+		default:
+			btrfs_err(fs_info, "Invalid key type in iref\n");
+			ret = -EINVAL;
+			break;
+		}
+		if (ret)
+			break;
+		ptr += btrfs_extent_inline_ref_size(type);
+	}
+	return ret;
+}
+
+static int process_leaf(struct btrfs_root *root,
+			struct btrfs_path *path, u64 *bytenr, u64 *num_bytes)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct extent_buffer *leaf = path->nodes[0];
+	struct btrfs_extent_data_ref *dref;
+	struct btrfs_shared_data_ref *sref;
+	u32 count;
+	int i = 0, tree_block_level = 0, ret = 0;
+	struct btrfs_key key;
+	int nritems = btrfs_header_nritems(leaf);
+
+	for (i = 0; i < nritems; i++) {
+		btrfs_item_key_to_cpu(leaf, &key, i);
+		switch (key.type) {
+		case BTRFS_EXTENT_ITEM_KEY:
+			*num_bytes = key.offset;
+		case BTRFS_METADATA_ITEM_KEY:
+			*bytenr = key.objectid;
+			ret = process_extent_item(fs_info, path, &key, i,
+						  &tree_block_level);
+			break;
+		case BTRFS_TREE_BLOCK_REF_KEY:
+			ret = add_tree_block(fs_info, key.offset, 0,
+					     key.objectid, tree_block_level);
+			break;
+		case BTRFS_SHARED_BLOCK_REF_KEY:
+			ret = add_tree_block(fs_info, 0, key.offset,
+					     key.objectid, tree_block_level);
+			break;
+		case BTRFS_EXTENT_DATA_REF_KEY:
+			dref = btrfs_item_ptr(leaf, i,
+					      struct btrfs_extent_data_ref);
+			ret = add_extent_data_ref(fs_info, leaf, dref, *bytenr,
+						  *num_bytes);
+			break;
+		case BTRFS_SHARED_DATA_REF_KEY:
+			sref = btrfs_item_ptr(leaf, i,
+					      struct btrfs_shared_data_ref);
+			count = btrfs_shared_data_ref_count(leaf, sref);
+			ret = add_shared_data_ref(fs_info, key.offset, count,
+						  *bytenr, *num_bytes);
+			break;
+		default:
+			break;
+		}
+		if (ret)
+			break;
+	}
+	return ret;
+}
+
+/* Walk down to the leaf from the given level */
+static int walk_down_tree(struct btrfs_root *root, struct btrfs_path *path,
+			  int level, u64 *bytenr, u64 *num_bytes)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct extent_buffer *eb;
+	u64 block_bytenr, gen;
+	int ret = 0;
+
+	while (level >= 0) {
+		if (level) {
+			block_bytenr = btrfs_node_blockptr(path->nodes[level],
+							   path->slots[level]);
+			gen = btrfs_node_ptr_generation(path->nodes[level],
+							path->slots[level]);
+			eb = read_tree_block(fs_info, block_bytenr, gen);
+			if (!eb || !extent_buffer_uptodate(eb)) {
+				free_extent_buffer(eb);
+				return -EIO;
+			}
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			path->nodes[level-1] = eb;
+			path->slots[level-1] = 0;
+			path->locks[level-1] = BTRFS_READ_LOCK_BLOCKING;
+		} else {
+			ret = process_leaf(root, path, bytenr, num_bytes);
+			if (ret)
+				break;
+		}
+		level--;
+	}
+	return ret;
+}
+
+/* Walk up to the next node that needs to be processed */
+static int walk_up_tree(struct btrfs_root *root, struct btrfs_path *path,
+			int *level)
+{
+	int l;
+
+	for (l = 0; l < BTRFS_MAX_LEVEL; l++) {
+		if (!path->nodes[l])
+			continue;
+		if (l) {
+			path->slots[l]++;
+			if (path->slots[l] <
+			    btrfs_header_nritems(path->nodes[l])) {
+				*level = l;
+				return 0;
+			}
+		}
+		btrfs_tree_unlock_rw(path->nodes[l], path->locks[l]);
+		free_extent_buffer(path->nodes[l]);
+		path->nodes[l] = NULL;
+		path->slots[l] = 0;
+		path->locks[l] = 0;
+	}
+
+	return 1;
+}
+
+static void dump_ref_action(struct btrfs_fs_info *fs_info,
+			    struct ref_action *ra)
+{
+	btrfs_err(fs_info, "  Ref action %d, root %llu, ref_root %llu, parent %llu, owner %llu, offset %llu, num_refs %llu\n",
+		  ra->action, ra->root, ra->ref.root_objectid, ra->ref.parent,
+		  ra->ref.owner, ra->ref.offset, ra->ref.num_refs);
+	__print_stack_trace(fs_info, ra);
+}
+
+/*
+ * Dumps all the information from the block entry to printk; it's going to be
+ * awesome.
+ */
+static void dump_block_entry(struct btrfs_fs_info *fs_info,
+			     struct block_entry *be)
+{
+	struct ref_entry *ref;
+	struct root_entry *re;
+	struct ref_action *ra;
+	struct rb_node *n;
+
+	btrfs_err(fs_info, "Dumping block entry [%llu %llu], num_refs %llu, metadata %d, from disk %d\n",
+		  be->bytenr, be->len, be->num_refs, be->metadata,
+		  be->from_disk);
+
+	for (n = rb_first(&be->refs); n; n = rb_next(n)) {
+		ref = rb_entry(n, struct ref_entry, node);
+		btrfs_err(fs_info, "  Ref root %llu, parent %llu, owner %llu, offset %llu, num_refs %llu\n",
+			  ref->root_objectid, ref->parent, ref->owner,
+			  ref->offset, ref->num_refs);
+	}
+
+	for (n = rb_first(&be->roots); n; n = rb_next(n)) {
+		re = rb_entry(n, struct root_entry, node);
+		btrfs_err(fs_info, "  Root entry %llu, num_refs %llu\n",
+			  re->root_objectid, re->num_refs);
+	}
+
+	list_for_each_entry(ra, &be->actions, list)
+		dump_ref_action(fs_info, ra);
+}
+
+/*
+ * btrfs_ref_tree_mod: called when we modify a ref for a bytenr
+ * @root: the root we are making this modification from.
+ * @bytenr: the bytenr we are modifying.
+ * @num_bytes: number of bytes.
+ * @parent: the parent bytenr.
+ * @ref_root: the original root owner of the bytenr.
+ * @owner: level in the case of metadata, inode in the case of data.
+ * @offset: 0 for metadata, file offset for data.
+ * @action: the action that we are doing, this is the same as the delayed ref
+ *	action.
+ *
+ * This will add an action item to the given bytenr and do sanity checks to make
+ * sure we haven't messed something up.  If we are making a new allocation and
+ * this block entry has history, we will delete all previous actions (as long as
+ * our sanity checks pass), since they are no longer needed.
+ */
+int btrfs_ref_tree_mod(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
+		       u64 parent, u64 ref_root, u64 owner, u64 offset,
+		       int action)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct ref_entry *ref = NULL, *exist;
+	struct ref_action *ra = NULL;
+	struct block_entry *be = NULL;
+	struct root_entry *re = NULL;
+	int ret = 0;
+	bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
+
+	if (!btrfs_test_opt(root->fs_info, REF_VERIFY))
+		return 0;
+
+	ref = kzalloc(sizeof(struct ref_entry), GFP_NOFS);
+	ra = kmalloc(sizeof(struct ref_action), GFP_NOFS);
+	if (!ra || !ref) {
+		kfree(ref);
+		kfree(ra);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (parent) {
+		ref->parent = parent;
+	} else {
+		ref->root_objectid = ref_root;
+		ref->owner = owner;
+		ref->offset = offset;
+	}
+	ref->num_refs = (action == BTRFS_DROP_DELAYED_REF) ? -1 : 1;
+
+	memcpy(&ra->ref, ref, sizeof(struct ref_entry));
+	/*
+	 * Save the extra info from the delayed ref in the ref action to make it
+	 * easier to figure out what is happening.  The real refs we add to the
+	 * ref tree need to reflect what we save on disk so it matches any
+	 * on-disk refs we pre-loaded.
+	 */
+	ra->ref.owner = owner;
+	ra->ref.offset = offset;
+	ra->ref.root_objectid = ref_root;
+	__save_stack_trace(ra);
+
+	INIT_LIST_HEAD(&ra->list);
+	ra->action = action;
+	ra->root = root->objectid;
+
+	/*
+	 * This is an allocation, preallocate the block_entry in case we haven't
+	 * used it before.
+	 */
+	ret = -EINVAL;
+	if (action == BTRFS_ADD_DELAYED_EXTENT) {
+		/*
+		 * For subvol_create we'll just pass in whatever the parent root
+		 * is and the new root objectid, so let's not treat the passed
+		 * in root as if it really has a ref for this bytenr.
+		 */
+		be = add_block_entry(root->fs_info, bytenr, num_bytes,
+				     ref_root);
+		if (IS_ERR(be)) {
+			kfree(ra);
+			ret = PTR_ERR(be);
+			goto out;
+		}
+		be->num_refs++;
+		if (metadata)
+			be->metadata = 1;
+
+		if (be->num_refs != 1) {
+			btrfs_err(fs_info, "re-allocated a block that still has references to it!\n");
+			dump_block_entry(fs_info, be);
+			dump_ref_action(fs_info, ra);
+			goto out_unlock;
+		}
+
+		while (!list_empty(&be->actions)) {
+			struct ref_action *tmp;
+			tmp = list_first_entry(&be->actions, struct ref_action,
+					       list);
+			list_del(&tmp->list);
+			kfree(tmp);
+		}
+	} else {
+		struct root_entry *tmp;
+
+		if (!parent) {
+			re = kmalloc(sizeof(struct root_entry), GFP_NOFS);
+			if (!re) {
+				kfree(ref);
+				kfree(ra);
+				ret = -ENOMEM;
+				goto out;
+			}
+			/*
+			 * This is the root that is modifying us, so it's the
+			 * one we want to lookup below when we modify the
+			 * re->num_refs.
+			 */
+			ref_root = root->objectid;
+			re->root_objectid = root->objectid;
+			re->num_refs = 0;
+		}
+
+		spin_lock(&root->fs_info->ref_verify_lock);
+		be = lookup_block_entry(&root->fs_info->block_tree, bytenr);
+		if (!be) {
+			btrfs_err(fs_info, "Trying to do action %d to bytenr %llu num_bytes %llu but there is no existing entry!\n",
+				  action, (unsigned long long)bytenr,
+				  (unsigned long long)num_bytes);
+			dump_ref_action(fs_info, ra);
+			kfree(ref);
+			kfree(ra);
+			goto out_unlock;
+		}
+
+		if (!parent) {
+			tmp = insert_root_entry(&be->roots, re);
+			if (tmp) {
+				kfree(re);
+				re = tmp;
+			}
+		}
+	}
+
+	exist = insert_ref_entry(&be->refs, ref);
+	if (exist) {
+		if (action == BTRFS_DROP_DELAYED_REF) {
+			if (exist->num_refs == 0) {
+				btrfs_err(fs_info, "Dropping a ref for an existing root that doesn't have a ref on the block\n");
+				dump_block_entry(fs_info, be);
+				dump_ref_action(fs_info, ra);
+				kfree(ra);
+				goto out_unlock;
+			}
+			exist->num_refs--;
+			if (exist->num_refs == 0) {
+				rb_erase(&exist->node, &be->refs);
+				kfree(exist);
+			}
+		} else if (!be->metadata) {
+			exist->num_refs++;
+		} else {
+			btrfs_err(fs_info, "Attempting to add another ref for an existing ref on a tree block\n");
+			dump_block_entry(fs_info, be);
+			dump_ref_action(fs_info, ra);
+			kfree(ra);
+			goto out_unlock;
+		}
+		kfree(ref);
+	} else {
+		if (action == BTRFS_DROP_DELAYED_REF) {
+			btrfs_err(fs_info, "Dropping a ref for a root that doesn't have a ref on the block\n");
+			dump_block_entry(fs_info, be);
+			dump_ref_action(fs_info, ra);
+			kfree(ra);
+			goto out_unlock;
+		}
+	}
+
+	if (!parent && !re) {
+		re = lookup_root_entry(&be->roots, ref_root);
+		if (!re) {
+			/*
+			 * This shouldn't happen because we will add our re
+			 * above when we lookup the be with !parent, but just in
+			 * case catch this case so we don't panic because I
+			 * didn't think of some other corner case.
+			 */
+			btrfs_err(fs_info, "Failed to find root %llu for %llu",
+				  root->objectid, be->bytenr);
+			dump_block_entry(fs_info, be);
+			dump_ref_action(fs_info, ra);
+			kfree(ra);
+			goto out_unlock;
+		}
+	}
+	if (action == BTRFS_DROP_DELAYED_REF) {
+		if (re)
+			re->num_refs--;
+		be->num_refs--;
+	} else if (action == BTRFS_ADD_DELAYED_REF) {
+		be->num_refs++;
+		if (re)
+			re->num_refs++;
+	}
+	list_add_tail(&ra->list, &be->actions);
+	ret = 0;
+out_unlock:
+	spin_unlock(&root->fs_info->ref_verify_lock);
+out:
+	if (ret)
+		btrfs_clear_opt(fs_info->mount_opt, REF_VERIFY);
+	return ret;
+}
+
+/* Free up the ref cache */
+void btrfs_free_ref_cache(struct btrfs_fs_info *fs_info)
+{
+	struct block_entry *be;
+	struct rb_node *n;
+
+	if (!btrfs_test_opt(fs_info, REF_VERIFY))
+		return;
+
+	spin_lock(&fs_info->ref_verify_lock);
+	while ((n = rb_first(&fs_info->block_tree))) {
+		be = rb_entry(n, struct block_entry, node);
+		rb_erase(&be->node, &fs_info->block_tree);
+		free_block_entry(be);
+		cond_resched_lock(&fs_info->ref_verify_lock);
+	}
+	spin_unlock(&fs_info->ref_verify_lock);
+}
+
+void btrfs_free_ref_tree_range(struct btrfs_fs_info *fs_info, u64 start,
+			       u64 len)
+{
+	struct block_entry *be = NULL, *entry;
+	struct rb_node *n;
+
+	if (!btrfs_test_opt(fs_info, REF_VERIFY))
+		return;
+
+	spin_lock(&fs_info->ref_verify_lock);
+	n = fs_info->block_tree.rb_node;
+	while (n) {
+		entry = rb_entry(n, struct block_entry, node);
+		if (entry->bytenr < start) {
+			n = n->rb_right;
+		} else if (entry->bytenr > start) {
+			n = n->rb_left;
+		} else {
+			be = entry;
+			break;
+		}
+		/* We want to get as close to start as possible */
+		if (be == NULL ||
+		    (entry->bytenr < start && be->bytenr > start) ||
+		    (entry->bytenr < start && entry->bytenr > be->bytenr))
+			be = entry;
+	}
+
+	/*
+	 * Could have an empty block group, maybe have something to check for
+	 * this case to verify we were actually empty?
+	 */
+	if (!be) {
+		spin_unlock(&fs_info->ref_verify_lock);
+		return;
+	}
+
+	n = &be->node;
+	while (n) {
+		be = rb_entry(n, struct block_entry, node);
+		n = rb_next(n);
+		if (be->bytenr < start && be->bytenr + be->len > start) {
+			btrfs_err(fs_info, "Block entry overlaps a block group [%llu,%llu]!\n",
+				  start, len);
+			dump_block_entry(fs_info, be);
+			continue;
+		}
+		if (be->bytenr < start)
+			continue;
+		if (be->bytenr >= start + len)
+			break;
+		if (be->bytenr + be->len > start + len) {
+			btrfs_err(fs_info, "Block entry overlaps a block group [%llu,%llu]!\n",
+				  start, len);
+			dump_block_entry(fs_info, be);
+		}
+		rb_erase(&be->node, &fs_info->block_tree);
+		free_block_entry(be);
+	}
+	spin_unlock(&fs_info->ref_verify_lock);
+}
+
+/* Walk down all roots and build the ref tree, meant to be called at mount */
+int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *root = fs_info->extent_root;
+	struct extent_buffer *eb;
+	u64 bytenr = 0, num_bytes = 0;
+	int ret, level;
+
+	if (!btrfs_test_opt(fs_info, REF_VERIFY))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	eb = btrfs_read_lock_root_node(fs_info->extent_root);
+	btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+	level = btrfs_header_level(eb);
+	path->nodes[level] = eb;
+	path->slots[level] = 0;
+	path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+	while (1) {
+		/*
+		 * We have to keep track of the bytenr/num_bytes we last hit
+		 * because we could have run out of space for an inline ref, and
+		 * would have had to add a ref key item which may appear on a
+		 * different leaf from the original extent item.
+		 */
+		ret = walk_down_tree(fs_info->extent_root, path, level,
+				     &bytenr, &num_bytes);
+		if (ret)
+			break;
+		ret = walk_up_tree(root, path, &level);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+	}
+	if (ret) {
+		btrfs_clear_opt(fs_info->mount_opt, REF_VERIFY);
+		btrfs_free_ref_cache(fs_info);
+	}
+	btrfs_free_path(path);
+	return ret;
+}
diff --git a/fs/btrfs/ref-verify.h b/fs/btrfs/ref-verify.h
new file mode 100644
index 000000000000..9026e2ea2919
--- /dev/null
+++ b/fs/btrfs/ref-verify.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2014 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#ifndef __REF_VERIFY__
+#define __REF_VERIFY__
+
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info);
+void btrfs_free_ref_cache(struct btrfs_fs_info *fs_info);
+int btrfs_ref_tree_mod(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
+		       u64 parent, u64 ref_root, u64 owner, u64 offset,
+		       int action);
+void btrfs_free_ref_tree_range(struct btrfs_fs_info *fs_info, u64 start,
+			       u64 len);
+static inline void btrfs_init_ref_verify(struct btrfs_fs_info *fs_info)
+{
+	spin_lock_init(&fs_info->ref_verify_lock);
+	fs_info->block_tree = RB_ROOT;
+}
+#else
+static inline int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info)
+{
+	return 0;
+}
+
+static inline void btrfs_free_ref_cache(struct btrfs_fs_info *fs_info)
+{
+}
+
+static inline int btrfs_ref_tree_mod(struct btrfs_root *root, u64 bytenr,
+				     u64 num_bytes, u64 parent, u64 ref_root,
+				     u64 owner, u64 offset, int action)
+{
+	return 0;
+}
+
+static inline void btrfs_free_ref_tree_range(struct btrfs_fs_info *fs_info,
+					     u64 start, u64 len)
+{
+}
+static inline void btrfs_init_ref_verify(struct btrfs_fs_info *fs_info)
+{
+}
+#endif /* CONFIG_BTRFS_FS_REF_VERIFY */
+#endif /* __REF_VERIFY__ */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread
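For readers skimming the header above, here is a minimal sketch of how the
interface declared in ref-verify.h is meant to be wired up.  Only the function
names and signatures come from the header; the hook names below
(example_mount_hook() and friends) and the exact call sites in the mount,
extent-update and unmount paths are assumptions for illustration, not part of
the patch.

	/* Illustrative only -- assumes the usual btrfs headers plus "ref-verify.h". */

	static int example_mount_hook(struct btrfs_fs_info *fs_info)
	{
		/* Set up the spinlock and the empty rb tree of block entries. */
		btrfs_init_ref_verify(fs_info);

		/*
		 * Walk the extent tree and pre-load every on-disk ref; this
		 * returns 0 immediately when the ref_verify mount option is
		 * not set.
		 */
		return btrfs_build_ref_tree(fs_info);
	}

	static int example_ref_update_hook(struct btrfs_root *root, u64 bytenr,
					   u64 num_bytes, u64 parent, u64 ref_root,
					   u64 owner, u64 offset, int action)
	{
		/*
		 * Record the modification and check it against the tracked
		 * history; a sanity failure dumps the recorded history and
		 * clears REF_VERIFY.
		 */
		return btrfs_ref_tree_mod(root, bytenr, num_bytes, parent,
					  ref_root, owner, offset, action);
	}

	static void example_unmount_hook(struct btrfs_fs_info *fs_info)
	{
		/* Drop every block entry we were still tracking. */
		btrfs_free_ref_cache(fs_info);
	}

Note that the heavyweight entry points all test the ref_verify mount option up
front, so the verifier costs nothing on a normal mount.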

* [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (5 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 06/21] Btrfs: add a extent ref verify tool Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 17:20   ` David Sterba
  2017-09-29 19:43 ` [PATCH 08/21] btrfs: add a helper to return a head ref Josef Bacik
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We were only doing btrfs_check_space_for_delayed_refs() if the metadata
space was full, i.e. we couldn't allocate chunks.  This assumes we'll be
able to allocate chunks during transaction commit, but since nothing
does a LIMIT flush during the transaction commit this won't actually
happen unless we really do run short of space.  We already take
into account a full fs in btrfs_check_space_for_delayed_refs() so just
kill this extra check to make sure we're ending the transaction when we
need to.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/transaction.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9c5f126064bd..68c3e1c04bca 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -797,8 +797,7 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 
-	if (fs_info->global_block_rsv.space_info->full &&
-	    btrfs_check_space_for_delayed_refs(trans, fs_info))
+	if (btrfs_check_space_for_delayed_refs(trans, fs_info))
 		return 1;
 
 	return !!btrfs_block_rsv_check(&fs_info->global_block_rsv, 5);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 08/21] btrfs: add a helper to return a head ref
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (6 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 14:39   ` David Sterba
  2017-09-29 19:43 ` [PATCH 09/21] btrfs: move extent_op cleanup to a helper Josef Bacik
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Simplify the error handling in __btrfs_run_delayed_refs by breaking out
the code used to return a head back to the delayed_refs tree for
processing into a helper function.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 00d86c8afaef..f356b4a66186 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2576,6 +2576,17 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
 	return ref;
 }
 
+static void
+unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			  struct btrfs_delayed_ref_head *head)
+{
+	spin_lock(&delayed_refs->lock);
+	head->processing = 0;
+	delayed_refs->num_heads_ready++;
+	spin_unlock(&delayed_refs->lock);
+	btrfs_delayed_ref_unlock(head);
+}
+
 /*
  * Returns 0 on success or if called with an already aborted transaction.
  * Returns -ENOMEM or -EIO on failure and will abort the transaction.
@@ -2649,11 +2660,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		if (ref && ref->seq &&
 		    btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
 			spin_unlock(&locked_ref->lock);
-			spin_lock(&delayed_refs->lock);
-			locked_ref->processing = 0;
-			delayed_refs->num_heads_ready++;
-			spin_unlock(&delayed_refs->lock);
-			btrfs_delayed_ref_unlock(locked_ref);
+			unselect_delayed_ref_head(delayed_refs, locked_ref);
 			locked_ref = NULL;
 			cond_resched();
 			count++;
@@ -2699,14 +2706,11 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 					 */
 					if (must_insert_reserved)
 						locked_ref->must_insert_reserved = 1;
-					spin_lock(&delayed_refs->lock);
-					locked_ref->processing = 0;
-					delayed_refs->num_heads_ready++;
-					spin_unlock(&delayed_refs->lock);
+					unselect_delayed_ref_head(delayed_refs,
+								  locked_ref);
 					btrfs_debug(fs_info,
 						    "run_delayed_extent_op returned %d",
 						    ret);
-					btrfs_delayed_ref_unlock(locked_ref);
 					return ret;
 				}
 				continue;
@@ -2764,11 +2768,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 		btrfs_free_delayed_extent_op(extent_op);
 		if (ret) {
-			spin_lock(&delayed_refs->lock);
-			locked_ref->processing = 0;
-			delayed_refs->num_heads_ready++;
-			spin_unlock(&delayed_refs->lock);
-			btrfs_delayed_ref_unlock(locked_ref);
+			unselect_delayed_ref_head(delayed_refs, locked_ref);
 			btrfs_put_delayed_ref(ref);
 			btrfs_debug(fs_info, "run_one_delayed_ref returned %d",
 				    ret);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 09/21] btrfs: move extent_op cleanup to a helper
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (7 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 08/21] btrfs: add a helper to return a head ref Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 14:50   ` David Sterba
  2017-10-16 14:05   ` Nikolay Borisov
  2017-09-29 19:43 ` [PATCH 10/21] btrfs: breakout empty head " Josef Bacik
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Move the extent_op cleanup for an empty head ref to a helper function to
help simplify __btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 77 ++++++++++++++++++++++++++------------------------
 1 file changed, 40 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f356b4a66186..f4048b23c7be 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2587,6 +2587,26 @@ unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
 	btrfs_delayed_ref_unlock(head);
 }
 
+static int cleanup_extent_op(struct btrfs_trans_handle *trans,
+			     struct btrfs_fs_info *fs_info,
+			     struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
+	int ret;
+
+	if (!extent_op)
+		return 0;
+	head->extent_op = NULL;
+	if (head->must_insert_reserved) {
+		btrfs_free_delayed_extent_op(extent_op);
+		return 0;
+	}
+	spin_unlock(&head->lock);
+	ret = run_delayed_extent_op(trans, fs_info, &head->node, extent_op);
+	btrfs_free_delayed_extent_op(extent_op);
+	return ret ? ret : 1;
+}
+
 /*
  * Returns 0 on success or if called with an already aborted transaction.
  * Returns -ENOMEM or -EIO on failure and will abort the transaction.
@@ -2667,16 +2687,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
-		/*
-		 * record the must insert reserved flag before we
-		 * drop the spin lock.
-		 */
-		must_insert_reserved = locked_ref->must_insert_reserved;
-		locked_ref->must_insert_reserved = 0;
-
-		extent_op = locked_ref->extent_op;
-		locked_ref->extent_op = NULL;
-
 		if (!ref) {
 
 
@@ -2686,33 +2696,17 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			 */
 			ref = &locked_ref->node;
 
-			if (extent_op && must_insert_reserved) {
-				btrfs_free_delayed_extent_op(extent_op);
-				extent_op = NULL;
-			}
-
-			if (extent_op) {
-				spin_unlock(&locked_ref->lock);
-				ret = run_delayed_extent_op(trans, fs_info,
-							    ref, extent_op);
-				btrfs_free_delayed_extent_op(extent_op);
-
-				if (ret) {
-					/*
-					 * Need to reset must_insert_reserved if
-					 * there was an error so the abort stuff
-					 * can cleanup the reserved space
-					 * properly.
-					 */
-					if (must_insert_reserved)
-						locked_ref->must_insert_reserved = 1;
-					unselect_delayed_ref_head(delayed_refs,
-								  locked_ref);
-					btrfs_debug(fs_info,
-						    "run_delayed_extent_op returned %d",
-						    ret);
-					return ret;
-				}
+			ret = cleanup_extent_op(trans, fs_info, locked_ref);
+			if (ret < 0) {
+				unselect_delayed_ref_head(delayed_refs,
+							  locked_ref);
+				btrfs_debug(fs_info,
+					    "run_delayed_extent_op returned %d",
+					    ret);
+				return ret;
+			} else if (ret > 0) {
+				/* We dropped our lock, we need to loop. */
+				ret = 0;
 				continue;
 			}
 
@@ -2761,6 +2755,15 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 				WARN_ON(1);
 			}
 		}
+		/*
+		 * record the must insert reserved flag before we
+		 * drop the spin lock.
+		 */
+		must_insert_reserved = locked_ref->must_insert_reserved;
+		locked_ref->must_insert_reserved = 0;
+
+		extent_op = locked_ref->extent_op;
+		locked_ref->extent_op = NULL;
 		spin_unlock(&locked_ref->lock);
 
 		ret = run_one_delayed_ref(trans, fs_info, ref, extent_op,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread
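One thing worth calling out about cleanup_extent_op() above is its return
convention: < 0 is an error from run_delayed_extent_op(), 0 means there was no
extent op to run (head->lock is still held), and 1 means the op ran, which
required dropping head->lock first.  A tiny standalone sketch of that idiom --
plain userspace C, not btrfs code, just to show the shape the callers expect:

	#include <stdio.h>

	/* Same shape as cleanup_extent_op(): <0 error, 0 nothing to do, 1 did work. */
	static int maybe_do_deferred_work(int have_work, int simulate_error)
	{
		if (!have_work)
			return 0;	/* nothing pending, caller keeps its lock */
		if (simulate_error)
			return -1;	/* failed; caller unwinds */
		/* pretend we dropped the caller's lock and did the expensive part */
		return 1;		/* done; caller must re-evaluate its state */
	}

	int main(void)
	{
		int ret = maybe_do_deferred_work(1, 0);

		if (ret < 0)
			printf("error\n");
		else if (ret > 0)
			printf("work done, loop again\n");
		else
			printf("nothing to do\n");
		return 0;
	}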

* [PATCH 10/21] btrfs: breakout empty head cleanup to a helper
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (8 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 09/21] btrfs: move extent_op cleanup to a helper Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 14:57   ` David Sterba
  2017-10-16 14:07   ` Nikolay Borisov
  2017-09-29 19:43 ` [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic Josef Bacik
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Move this code out to a helper function to further simplify
__btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 80 ++++++++++++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4048b23c7be..bae2eac11db7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2607,6 +2607,43 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
 	return ret ? ret : 1;
 }
 
+static int cleanup_ref_head(struct btrfs_trans_handle *trans,
+			    struct btrfs_fs_info *fs_info,
+			    struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_ref_root *delayed_refs;
+	int ret;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	ret = cleanup_extent_op(trans, fs_info, head);
+	if (ret < 0) {
+		unselect_delayed_ref_head(delayed_refs, head);
+		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
+		return ret;
+	} else if (ret) {
+		return ret;
+	}
+
+	/*
+	 * Need to drop our head ref lock and re-acquire the delayed ref lock
+	 * and then re-check to make sure nobody got added.
+	 */
+	spin_unlock(&head->lock);
+	spin_lock(&delayed_refs->lock);
+	spin_lock(&head->lock);
+	if (!list_empty(&head->ref_list) || head->extent_op) {
+		spin_unlock(&head->lock);
+		spin_unlock(&delayed_refs->lock);
+		return 1;
+	}
+	head->node.in_tree = 0;
+	delayed_refs->num_heads--;
+	rb_erase(&head->href_node, &delayed_refs->href_root);
+	spin_unlock(&delayed_refs->lock);
+	return 0;
+}
+
 /*
  * Returns 0 on success or if called with an already aborted transaction.
  * Returns -ENOMEM or -EIO on failure and will abort the transaction.
@@ -2688,47 +2725,20 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		}
 
 		if (!ref) {
-
-
-			/* All delayed refs have been processed, Go ahead
-			 * and send the head node to run_one_delayed_ref,
-			 * so that any accounting fixes can happen
-			 */
-			ref = &locked_ref->node;
-
-			ret = cleanup_extent_op(trans, fs_info, locked_ref);
-			if (ret < 0) {
-				unselect_delayed_ref_head(delayed_refs,
-							  locked_ref);
-				btrfs_debug(fs_info,
-					    "run_delayed_extent_op returned %d",
-					    ret);
-				return ret;
-			} else if (ret > 0) {
+			ret = cleanup_ref_head(trans, fs_info, locked_ref);
+			if (ret > 0 ) {
 				/* We dropped our lock, we need to loop. */
 				ret = 0;
 				continue;
+			} else if (ret) {
+				return ret;
 			}
 
-			/*
-			 * Need to drop our head ref lock and re-acquire the
-			 * delayed ref lock and then re-check to make sure
-			 * nobody got added.
+			/* All delayed refs have been processed, Go ahead
+			 * and send the head node to run_one_delayed_ref,
+			 * so that any accounting fixes can happen
 			 */
-			spin_unlock(&locked_ref->lock);
-			spin_lock(&delayed_refs->lock);
-			spin_lock(&locked_ref->lock);
-			if (!list_empty(&locked_ref->ref_list) ||
-			    locked_ref->extent_op) {
-				spin_unlock(&locked_ref->lock);
-				spin_unlock(&delayed_refs->lock);
-				continue;
-			}
-			ref->in_tree = 0;
-			delayed_refs->num_heads--;
-			rb_erase(&locked_ref->href_node,
-				 &delayed_refs->href_root);
-			spin_unlock(&delayed_refs->lock);
+			ref = &locked_ref->node;
 		} else {
 			actual_count++;
 			ref->in_tree = 0;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread
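The subtle part of cleanup_ref_head() above is the unlock/relock dance:
head->lock is dropped so that delayed_refs->lock can be taken first, and once
both are held again the ref list and extent_op are re-checked, because new
work may have been queued in the window where neither lock was held.  Below is
a loose userspace analogy of that pattern, using pthread mutexes instead of
the kernel spinlocks and assuming nothing beyond libc and pthreads; the names
and structs are invented for the example:

	#include <pthread.h>
	#include <stdbool.h>

	struct toy_head {
		pthread_mutex_t lock;
		int nr_pending;		/* stand-in for head->ref_list / extent_op */
	};

	struct toy_tree {
		pthread_mutex_t lock;
		int num_heads;
	};

	/*
	 * Called with head->lock held.  Returns true with head->lock still
	 * held if the head was removed from the tree, false if new work raced
	 * in and the caller has to keep processing the head instead.
	 */
	static bool toy_remove_head(struct toy_tree *tree, struct toy_head *head)
	{
		/* Drop the inner lock so the outer one can be taken first. */
		pthread_mutex_unlock(&head->lock);
		pthread_mutex_lock(&tree->lock);
		pthread_mutex_lock(&head->lock);

		/* Re-check: something may have been queued while no lock was held. */
		if (head->nr_pending) {
			pthread_mutex_unlock(&head->lock);
			pthread_mutex_unlock(&tree->lock);
			return false;
		}

		tree->num_heads--;	/* stand-in for the rb_erase() of the head */
		pthread_mutex_unlock(&tree->lock);
		/* head->lock is intentionally still held; the caller finishes up. */
		return true;
	}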

* [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (9 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 10/21] btrfs: breakout empty head " Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 15:05   ` David Sterba
  2017-09-29 19:43 ` [PATCH 12/21] btrfs: move all ref head cleanup to the helper function Josef Bacik
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We only use this logic if our ref isn't a ref_head, so move it up into
the if (ref) case since we know that this is a normal ref and not a
delayed ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bae2eac11db7..ac1196af6b7e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2745,10 +2745,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			list_del(&ref->list);
 			if (!list_empty(&ref->add_list))
 				list_del(&ref->add_list);
-		}
-		atomic_dec(&delayed_refs->num_entries);
-
-		if (!btrfs_delayed_ref_is_head(ref)) {
 			/*
 			 * when we play the delayed ref, also correct the
 			 * ref_mod on head
@@ -2765,6 +2761,8 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 				WARN_ON(1);
 			}
 		}
+		atomic_dec(&delayed_refs->num_entries);
+
 		/*
 		 * record the must insert reserved flag before we
 		 * drop the spin lock.
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 12/21] btrfs: move all ref head cleanup to the helper function
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (10 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 15:39   ` David Sterba
  2017-09-29 19:43 ` [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head Josef Bacik
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We do a couple of different cleanup operations on the ref head.  We
adjust counters, we free any reserved space if we didn't end up using
the ref, and we clear the pending csum bytes.  Move all these disparate
things into cleanup_ref_head and clean up the logic in
__btrfs_run_delayed_refs so that it handles the !ref case much more
cleanly, as well as making run_one_delayed_ref() only deal with real
refs and not the ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 144 ++++++++++++++++++++++---------------------------
 1 file changed, 64 insertions(+), 80 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ac1196af6b7e..260414bd86e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2501,44 +2501,6 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 		return 0;
 	}
 
-	if (btrfs_delayed_ref_is_head(node)) {
-		struct btrfs_delayed_ref_head *head;
-		/*
-		 * we've hit the end of the chain and we were supposed
-		 * to insert this extent into the tree.  But, it got
-		 * deleted before we ever needed to insert it, so all
-		 * we have to do is clean up the accounting
-		 */
-		BUG_ON(extent_op);
-		head = btrfs_delayed_node_to_head(node);
-		trace_run_delayed_ref_head(fs_info, node, head, node->action);
-
-		if (head->total_ref_mod < 0) {
-			struct btrfs_block_group_cache *cache;
-
-			cache = btrfs_lookup_block_group(fs_info, node->bytenr);
-			ASSERT(cache);
-			percpu_counter_add(&cache->space_info->total_bytes_pinned,
-					   -node->num_bytes);
-			btrfs_put_block_group(cache);
-		}
-
-		if (insert_reserved) {
-			btrfs_pin_extent(fs_info, node->bytenr,
-					 node->num_bytes, 1);
-			if (head->is_data) {
-				ret = btrfs_del_csums(trans, fs_info,
-						      node->bytenr,
-						      node->num_bytes);
-			}
-		}
-
-		/* Also free its reserved qgroup space */
-		btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
-					      head->qgroup_reserved);
-		return ret;
-	}
-
 	if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
 	    node->type == BTRFS_SHARED_BLOCK_REF_KEY)
 		ret = run_delayed_tree_ref(trans, fs_info, node, extent_op,
@@ -2641,6 +2603,43 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	delayed_refs->num_heads--;
 	rb_erase(&head->href_node, &delayed_refs->href_root);
 	spin_unlock(&delayed_refs->lock);
+	spin_unlock(&head->lock);
+	atomic_dec(&delayed_refs->num_entries);
+
+	trace_run_delayed_ref_head(fs_info, &head->node, head,
+				   head->node.action);
+
+	if (head->total_ref_mod < 0) {
+		struct btrfs_block_group_cache *cache;
+
+		cache = btrfs_lookup_block_group(fs_info, head->node.bytenr);
+		ASSERT(cache);
+		percpu_counter_add(&cache->space_info->total_bytes_pinned,
+				   -head->node.num_bytes);
+		btrfs_put_block_group(cache);
+
+		if (head->is_data) {
+			spin_lock(&delayed_refs->lock);
+			delayed_refs->pending_csums -= head->node.num_bytes;
+			spin_unlock(&delayed_refs->lock);
+		}
+	}
+
+	if (head->must_insert_reserved) {
+		btrfs_pin_extent(fs_info, head->node.bytenr,
+				 head->node.num_bytes, 1);
+		if (head->is_data) {
+			ret = btrfs_del_csums(trans, fs_info,
+					      head->node.bytenr,
+					      head->node.num_bytes);
+		}
+	}
+
+	/* Also free its reserved qgroup space */
+	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+				      head->qgroup_reserved);
+	btrfs_delayed_ref_unlock(head);
+	btrfs_put_delayed_ref(&head->node);
 	return 0;
 }
 
@@ -2724,6 +2723,10 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
+		/*
+		 * We're done processing refs in this ref_head, clean everything
+		 * up and move on to the next ref_head.
+		 */
 		if (!ref) {
 			ret = cleanup_ref_head(trans, fs_info, locked_ref);
 			if (ret > 0 ) {
@@ -2733,33 +2736,30 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			} else if (ret) {
 				return ret;
 			}
+			locked_ref = NULL;
+			count++;
+			continue;
+		}
 
-			/* All delayed refs have been processed, Go ahead
-			 * and send the head node to run_one_delayed_ref,
-			 * so that any accounting fixes can happen
-			 */
-			ref = &locked_ref->node;
-		} else {
-			actual_count++;
-			ref->in_tree = 0;
-			list_del(&ref->list);
-			if (!list_empty(&ref->add_list))
-				list_del(&ref->add_list);
-			/*
-			 * when we play the delayed ref, also correct the
-			 * ref_mod on head
-			 */
-			switch (ref->action) {
-			case BTRFS_ADD_DELAYED_REF:
-			case BTRFS_ADD_DELAYED_EXTENT:
-				locked_ref->node.ref_mod -= ref->ref_mod;
-				break;
-			case BTRFS_DROP_DELAYED_REF:
-				locked_ref->node.ref_mod += ref->ref_mod;
-				break;
-			default:
-				WARN_ON(1);
-			}
+		actual_count++;
+		ref->in_tree = 0;
+		list_del(&ref->list);
+		if (!list_empty(&ref->add_list))
+			list_del(&ref->add_list);
+		/*
+		 * when we play the delayed ref, also correct the
+		 * ref_mod on head
+		 */
+		switch (ref->action) {
+		case BTRFS_ADD_DELAYED_REF:
+		case BTRFS_ADD_DELAYED_EXTENT:
+			locked_ref->node.ref_mod -= ref->ref_mod;
+			break;
+		case BTRFS_DROP_DELAYED_REF:
+			locked_ref->node.ref_mod += ref->ref_mod;
+			break;
+		default:
+			WARN_ON(1);
 		}
 		atomic_dec(&delayed_refs->num_entries);
 
@@ -2786,22 +2786,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			return ret;
 		}
 
-		/*
-		 * If this node is a head, that means all the refs in this head
-		 * have been dealt with, and we will pick the next head to deal
-		 * with, so we must unlock the head and drop it from the cluster
-		 * list before we release it.
-		 */
-		if (btrfs_delayed_ref_is_head(ref)) {
-			if (locked_ref->is_data &&
-			    locked_ref->total_ref_mod < 0) {
-				spin_lock(&delayed_refs->lock);
-				delayed_refs->pending_csums -= ref->num_bytes;
-				spin_unlock(&delayed_refs->lock);
-			}
-			btrfs_delayed_ref_unlock(locked_ref);
-			locked_ref = NULL;
-		}
 		btrfs_put_delayed_ref(ref);
 		count++;
 		cond_resched();
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (11 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 12/21] btrfs: move all ref head cleanup to the helper function Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 16:05   ` David Sterba
  2017-10-16 14:41   ` Nikolay Borisov
  2017-09-29 19:43 ` [PATCH 14/21] btrfs: remove type argument from comp_tree_refs Josef Bacik
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

This is just excessive information in the ref_head, and makes the code
complicated.  It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case.  With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/backref.c           |   4 +-
 fs/btrfs/delayed-ref.c       | 126 +++++++++++++++++++------------------------
 fs/btrfs/delayed-ref.h       |  49 ++++++-----------
 fs/btrfs/disk-io.c           |  12 ++---
 fs/btrfs/extent-tree.c       |  90 ++++++++++++-------------------
 include/trace/events/btrfs.h |  15 +++---
 6 files changed, 120 insertions(+), 176 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..33cba1abf8b6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1178,7 +1178,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 		head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
 		if (head) {
 			if (!mutex_trylock(&head->mutex)) {
-				refcount_inc(&head->node.refs);
+				refcount_inc(&head->refs);
 				spin_unlock(&delayed_refs->lock);
 
 				btrfs_release_path(path);
@@ -1189,7 +1189,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 				 */
 				mutex_lock(&head->mutex);
 				mutex_unlock(&head->mutex);
-				btrfs_put_delayed_ref(&head->node);
+				btrfs_put_delayed_ref_head(head);
 				goto again;
 			}
 			spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 93ffa898df6d..b9b41c838da4 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -96,15 +96,15 @@ static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
 	u64 bytenr;
 
 	ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node);
-	bytenr = ins->node.bytenr;
+	bytenr = ins->bytenr;
 	while (*p) {
 		parent_node = *p;
 		entry = rb_entry(parent_node, struct btrfs_delayed_ref_head,
 				 href_node);
 
-		if (bytenr < entry->node.bytenr)
+		if (bytenr < entry->bytenr)
 			p = &(*p)->rb_left;
-		else if (bytenr > entry->node.bytenr)
+		else if (bytenr > entry->bytenr)
 			p = &(*p)->rb_right;
 		else
 			return entry;
@@ -133,15 +133,15 @@ find_ref_head(struct rb_root *root, u64 bytenr,
 	while (n) {
 		entry = rb_entry(n, struct btrfs_delayed_ref_head, href_node);
 
-		if (bytenr < entry->node.bytenr)
+		if (bytenr < entry->bytenr)
 			n = n->rb_left;
-		else if (bytenr > entry->node.bytenr)
+		else if (bytenr > entry->bytenr)
 			n = n->rb_right;
 		else
 			return entry;
 	}
 	if (entry && return_bigger) {
-		if (bytenr > entry->node.bytenr) {
+		if (bytenr > entry->bytenr) {
 			n = rb_next(&entry->href_node);
 			if (!n)
 				n = rb_first(root);
@@ -164,17 +164,17 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 	if (mutex_trylock(&head->mutex))
 		return 0;
 
-	refcount_inc(&head->node.refs);
+	refcount_inc(&head->refs);
 	spin_unlock(&delayed_refs->lock);
 
 	mutex_lock(&head->mutex);
 	spin_lock(&delayed_refs->lock);
-	if (!head->node.in_tree) {
+	if (RB_EMPTY_NODE(&head->href_node)) {
 		mutex_unlock(&head->mutex);
-		btrfs_put_delayed_ref(&head->node);
+		btrfs_put_delayed_ref_head(head);
 		return -EAGAIN;
 	}
-	btrfs_put_delayed_ref(&head->node);
+	btrfs_put_delayed_ref_head(head);
 	return 0;
 }
 
@@ -183,15 +183,10 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 				    struct btrfs_delayed_ref_head *head,
 				    struct btrfs_delayed_ref_node *ref)
 {
-	if (btrfs_delayed_ref_is_head(ref)) {
-		head = btrfs_delayed_node_to_head(ref);
-		rb_erase(&head->href_node, &delayed_refs->href_root);
-	} else {
-		assert_spin_locked(&head->lock);
-		list_del(&ref->list);
-		if (!list_empty(&ref->add_list))
-			list_del(&ref->add_list);
-	}
+	assert_spin_locked(&head->lock);
+	list_del(&ref->list);
+	if (!list_empty(&ref->add_list))
+		list_del(&ref->add_list);
 	ref->in_tree = 0;
 	btrfs_put_delayed_ref(ref);
 	atomic_dec(&delayed_refs->num_entries);
@@ -380,8 +375,8 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
 	head->processing = 1;
 	WARN_ON(delayed_refs->num_heads_ready == 0);
 	delayed_refs->num_heads_ready--;
-	delayed_refs->run_delayed_start = head->node.bytenr +
-		head->node.num_bytes;
+	delayed_refs->run_delayed_start = head->bytenr +
+		head->num_bytes;
 	return head;
 }
 
@@ -469,20 +464,16 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
  */
 static noinline void
 update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
-			 struct btrfs_delayed_ref_node *existing,
-			 struct btrfs_delayed_ref_node *update,
+			 struct btrfs_delayed_ref_head *existing,
+			 struct btrfs_delayed_ref_head *update,
 			 int *old_ref_mod_ret)
 {
-	struct btrfs_delayed_ref_head *existing_ref;
-	struct btrfs_delayed_ref_head *ref;
 	int old_ref_mod;
 
-	existing_ref = btrfs_delayed_node_to_head(existing);
-	ref = btrfs_delayed_node_to_head(update);
-	BUG_ON(existing_ref->is_data != ref->is_data);
+	BUG_ON(existing->is_data != update->is_data);
 
-	spin_lock(&existing_ref->lock);
-	if (ref->must_insert_reserved) {
+	spin_lock(&existing->lock);
+	if (update->must_insert_reserved) {
 		/* if the extent was freed and then
 		 * reallocated before the delayed ref
 		 * entries were processed, we can end up
@@ -490,7 +481,7 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 		 * the must_insert_reserved flag set.
 		 * Set it again here
 		 */
-		existing_ref->must_insert_reserved = ref->must_insert_reserved;
+		existing->must_insert_reserved = update->must_insert_reserved;
 
 		/*
 		 * update the num_bytes so we make sure the accounting
@@ -500,22 +491,22 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 
 	}
 
-	if (ref->extent_op) {
-		if (!existing_ref->extent_op) {
-			existing_ref->extent_op = ref->extent_op;
+	if (update->extent_op) {
+		if (!existing->extent_op) {
+			existing->extent_op = update->extent_op;
 		} else {
-			if (ref->extent_op->update_key) {
-				memcpy(&existing_ref->extent_op->key,
-				       &ref->extent_op->key,
-				       sizeof(ref->extent_op->key));
-				existing_ref->extent_op->update_key = true;
+			if (update->extent_op->update_key) {
+				memcpy(&existing->extent_op->key,
+				       &update->extent_op->key,
+				       sizeof(update->extent_op->key));
+				existing->extent_op->update_key = true;
 			}
-			if (ref->extent_op->update_flags) {
-				existing_ref->extent_op->flags_to_set |=
-					ref->extent_op->flags_to_set;
-				existing_ref->extent_op->update_flags = true;
+			if (update->extent_op->update_flags) {
+				existing->extent_op->flags_to_set |=
+					update->extent_op->flags_to_set;
+				existing->extent_op->update_flags = true;
 			}
-			btrfs_free_delayed_extent_op(ref->extent_op);
+			btrfs_free_delayed_extent_op(update->extent_op);
 		}
 	}
 	/*
@@ -523,23 +514,23 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 	 * only need the lock for this case cause we could be processing it
 	 * currently, for refs we just added we know we're a-ok.
 	 */
-	old_ref_mod = existing_ref->total_ref_mod;
+	old_ref_mod = existing->total_ref_mod;
 	if (old_ref_mod_ret)
 		*old_ref_mod_ret = old_ref_mod;
 	existing->ref_mod += update->ref_mod;
-	existing_ref->total_ref_mod += update->ref_mod;
+	existing->total_ref_mod += update->ref_mod;
 
 	/*
 	 * If we are going to from a positive ref mod to a negative or vice
 	 * versa we need to make sure to adjust pending_csums accordingly.
 	 */
-	if (existing_ref->is_data) {
-		if (existing_ref->total_ref_mod >= 0 && old_ref_mod < 0)
+	if (existing->is_data) {
+		if (existing->total_ref_mod >= 0 && old_ref_mod < 0)
 			delayed_refs->pending_csums -= existing->num_bytes;
-		if (existing_ref->total_ref_mod < 0 && old_ref_mod >= 0)
+		if (existing->total_ref_mod < 0 && old_ref_mod >= 0)
 			delayed_refs->pending_csums += existing->num_bytes;
 	}
-	spin_unlock(&existing_ref->lock);
+	spin_unlock(&existing->lock);
 }
 
 /*
@@ -550,14 +541,13 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 		     struct btrfs_trans_handle *trans,
-		     struct btrfs_delayed_ref_node *ref,
+		     struct btrfs_delayed_ref_head *head_ref,
 		     struct btrfs_qgroup_extent_record *qrecord,
 		     u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
 		     int action, int is_data, int *qrecord_inserted_ret,
 		     int *old_ref_mod, int *new_ref_mod)
 {
 	struct btrfs_delayed_ref_head *existing;
-	struct btrfs_delayed_ref_head *head_ref = NULL;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	int count_mod = 1;
 	int must_insert_reserved = 0;
@@ -593,26 +583,21 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
-	/* first set the basic ref node struct up */
-	refcount_set(&ref->refs, 1);
-	ref->bytenr = bytenr;
-	ref->num_bytes = num_bytes;
-	ref->ref_mod = count_mod;
-	ref->type  = 0;
-	ref->action  = 0;
-	ref->is_head = 1;
-	ref->in_tree = 1;
-	ref->seq = 0;
-
-	head_ref = btrfs_delayed_node_to_head(ref);
+	refcount_set(&head_ref->refs, 1);
+	head_ref->bytenr = bytenr;
+	head_ref->num_bytes = num_bytes;
+	head_ref->ref_mod = count_mod;
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->is_data = is_data;
 	INIT_LIST_HEAD(&head_ref->ref_list);
 	INIT_LIST_HEAD(&head_ref->ref_add_list);
+	RB_CLEAR_NODE(&head_ref->href_node);
 	head_ref->processing = 0;
 	head_ref->total_ref_mod = count_mod;
 	head_ref->qgroup_reserved = 0;
 	head_ref->qgroup_ref_root = 0;
+	spin_lock_init(&head_ref->lock);
+	mutex_init(&head_ref->mutex);
 
 	/* Record qgroup extent info if provided */
 	if (qrecord) {
@@ -632,17 +617,14 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 			qrecord_inserted = 1;
 	}
 
-	spin_lock_init(&head_ref->lock);
-	mutex_init(&head_ref->mutex);
-
-	trace_add_delayed_ref_head(fs_info, ref, head_ref, action);
+	trace_add_delayed_ref_head(fs_info, head_ref, action);
 
 	existing = htree_insert(&delayed_refs->href_root,
 				&head_ref->href_node);
 	if (existing) {
 		WARN_ON(ref_root && reserved && existing->qgroup_ref_root
 			&& existing->qgroup_reserved);
-		update_existing_head_ref(delayed_refs, &existing->node, ref,
+		update_existing_head_ref(delayed_refs, existing, head_ref,
 					 old_ref_mod);
 		/*
 		 * we've updated the existing ref, free the newly
@@ -821,7 +803,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
+	head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
 					bytenr, num_bytes, 0, 0, action, 0,
 					&qrecord_inserted, old_ref_mod,
 					new_ref_mod);
@@ -888,7 +870,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
+	head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
 					bytenr, num_bytes, ref_root, reserved,
 					action, 1, &qrecord_inserted,
 					old_ref_mod, new_ref_mod);
@@ -920,7 +902,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
 	delayed_refs = &trans->transaction->delayed_refs;
 	spin_lock(&delayed_refs->lock);
 
-	add_delayed_ref_head(fs_info, trans, &head_ref->node, NULL, bytenr,
+	add_delayed_ref_head(fs_info, trans, head_ref, NULL, bytenr,
 			     num_bytes, 0, 0, BTRFS_UPDATE_DELAYED_HEAD,
 			     extent_op->is_data, NULL, NULL, NULL);
 
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index ce88e4ac5276..5d75f8cd08a9 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -26,15 +26,6 @@
 #define BTRFS_ADD_DELAYED_EXTENT 3 /* record a full extent allocation */
 #define BTRFS_UPDATE_DELAYED_HEAD 4 /* not changing ref count on head ref */
 
-/*
- * XXX: Qu: I really hate the design that ref_head and tree/data ref shares the
- * same ref_node structure.
- * Ref_head is in a higher logic level than tree/data ref, and duplicated
- * bytenr/num_bytes in ref_node is really a waste or memory, they should be
- * referred from ref_head.
- * This gets more disgusting after we use list to store tree/data ref in
- * ref_head. Must clean this mess up later.
- */
 struct btrfs_delayed_ref_node {
 	/*data/tree ref use list, stored in ref_head->ref_list. */
 	struct list_head list;
@@ -91,8 +82,8 @@ struct btrfs_delayed_extent_op {
  * reference count modifications we've queued up.
  */
 struct btrfs_delayed_ref_head {
-	struct btrfs_delayed_ref_node node;
-
+	u64 bytenr, num_bytes;
+	refcount_t refs;
 	/*
 	 * the mutex is held while running the refs, and it is also
 	 * held when checking the sum of reference modifications.
@@ -116,6 +107,14 @@ struct btrfs_delayed_ref_head {
 	int total_ref_mod;
 
 	/*
+	 * This is the current outstanding mod references for this bytenr.  This
+	 * is used with lookup_extent_info to get an accurate reference count
+	 * for a bytenr, so it is adjusted as delayed refs are run so that any
+	 * on disk reference count + ref_mod is accurate.
+	 */
+	int ref_mod;
+
+	/*
 	 * For qgroup reserved space freeing.
 	 *
 	 * ref_root and reserved will be recorded after
@@ -234,15 +233,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 		case BTRFS_SHARED_DATA_REF_KEY:
 			kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 			break;
-		case 0:
-			kmem_cache_free(btrfs_delayed_ref_head_cachep, ref);
-			break;
 		default:
 			BUG();
 		}
 	}
 }
 
+static inline void
+btrfs_put_delayed_ref_head(struct btrfs_delayed_ref_head *head)
+{
+	if (refcount_dec_and_test(&head->refs))
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, head);
+}
+
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes, u64 parent,
@@ -283,35 +286,17 @@ int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
 			    u64 seq);
 
 /*
- * a node might live in a head or a regular ref, this lets you
- * test for the proper type to use.
- */
-static int btrfs_delayed_ref_is_head(struct btrfs_delayed_ref_node *node)
-{
-	return node->is_head;
-}
-
-/*
  * helper functions to cast a node into its container
  */
 static inline struct btrfs_delayed_tree_ref *
 btrfs_delayed_node_to_tree_ref(struct btrfs_delayed_ref_node *node)
 {
-	WARN_ON(btrfs_delayed_ref_is_head(node));
 	return container_of(node, struct btrfs_delayed_tree_ref, node);
 }
 
 static inline struct btrfs_delayed_data_ref *
 btrfs_delayed_node_to_data_ref(struct btrfs_delayed_ref_node *node)
 {
-	WARN_ON(btrfs_delayed_ref_is_head(node));
 	return container_of(node, struct btrfs_delayed_data_ref, node);
 }
-
-static inline struct btrfs_delayed_ref_head *
-btrfs_delayed_node_to_head(struct btrfs_delayed_ref_node *node)
-{
-	WARN_ON(!btrfs_delayed_ref_is_head(node));
-	return container_of(node, struct btrfs_delayed_ref_head, node);
-}
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 778dc7682966..14759e6a8f3c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4396,12 +4396,12 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		head = rb_entry(node, struct btrfs_delayed_ref_head,
 				href_node);
 		if (!mutex_trylock(&head->mutex)) {
-			refcount_inc(&head->node.refs);
+			refcount_inc(&head->refs);
 			spin_unlock(&delayed_refs->lock);
 
 			mutex_lock(&head->mutex);
 			mutex_unlock(&head->mutex);
-			btrfs_put_delayed_ref(&head->node);
+			btrfs_put_delayed_ref_head(head);
 			spin_lock(&delayed_refs->lock);
 			continue;
 		}
@@ -4422,16 +4422,16 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 		if (head->processing == 0)
 			delayed_refs->num_heads_ready--;
 		atomic_dec(&delayed_refs->num_entries);
-		head->node.in_tree = 0;
 		rb_erase(&head->href_node, &delayed_refs->href_root);
+		RB_CLEAR_NODE(&head->href_node);
 		spin_unlock(&head->lock);
 		spin_unlock(&delayed_refs->lock);
 		mutex_unlock(&head->mutex);
 
 		if (pin_bytes)
-			btrfs_pin_extent(fs_info, head->node.bytenr,
-					 head->node.num_bytes, 1);
-		btrfs_put_delayed_ref(&head->node);
+			btrfs_pin_extent(fs_info, head->bytenr,
+					 head->num_bytes, 1);
+		btrfs_put_delayed_ref_head(head);
 		cond_resched();
 		spin_lock(&delayed_refs->lock);
 	}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 260414bd86e7..6492a5e1f2b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -913,7 +913,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 	head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
 	if (head) {
 		if (!mutex_trylock(&head->mutex)) {
-			refcount_inc(&head->node.refs);
+			refcount_inc(&head->refs);
 			spin_unlock(&delayed_refs->lock);
 
 			btrfs_release_path(path);
@@ -924,7 +924,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 			 */
 			mutex_lock(&head->mutex);
 			mutex_unlock(&head->mutex);
-			btrfs_put_delayed_ref(&head->node);
+			btrfs_put_delayed_ref_head(head);
 			goto search_again;
 		}
 		spin_lock(&head->lock);
@@ -933,7 +933,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 		else
 			BUG_ON(num_refs == 0);
 
-		num_refs += head->node.ref_mod;
+		num_refs += head->ref_mod;
 		spin_unlock(&head->lock);
 		mutex_unlock(&head->mutex);
 	}
@@ -2338,7 +2338,7 @@ static void __run_delayed_extent_op(struct btrfs_delayed_extent_op *extent_op,
 
 static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 				 struct btrfs_fs_info *fs_info,
-				 struct btrfs_delayed_ref_node *node,
+				 struct btrfs_delayed_ref_head *head,
 				 struct btrfs_delayed_extent_op *extent_op)
 {
 	struct btrfs_key key;
@@ -2360,14 +2360,14 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	key.objectid = node->bytenr;
+	key.objectid = head->bytenr;
 
 	if (metadata) {
 		key.type = BTRFS_METADATA_ITEM_KEY;
 		key.offset = extent_op->level;
 	} else {
 		key.type = BTRFS_EXTENT_ITEM_KEY;
-		key.offset = node->num_bytes;
+		key.offset = head->num_bytes;
 	}
 
 again:
@@ -2384,17 +2384,17 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 				path->slots[0]--;
 				btrfs_item_key_to_cpu(path->nodes[0], &key,
 						      path->slots[0]);
-				if (key.objectid == node->bytenr &&
+				if (key.objectid == head->bytenr &&
 				    key.type == BTRFS_EXTENT_ITEM_KEY &&
-				    key.offset == node->num_bytes)
+				    key.offset == head->num_bytes)
 					ret = 0;
 			}
 			if (ret > 0) {
 				btrfs_release_path(path);
 				metadata = 0;
 
-				key.objectid = node->bytenr;
-				key.offset = node->num_bytes;
+				key.objectid = head->bytenr;
+				key.offset = head->num_bytes;
 				key.type = BTRFS_EXTENT_ITEM_KEY;
 				goto again;
 			}
@@ -2564,7 +2564,7 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
 		return 0;
 	}
 	spin_unlock(&head->lock);
-	ret = run_delayed_extent_op(trans, fs_info, &head->node, extent_op);
+	ret = run_delayed_extent_op(trans, fs_info, head, extent_op);
 	btrfs_free_delayed_extent_op(extent_op);
 	return ret ? ret : 1;
 }
@@ -2599,39 +2599,37 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		spin_unlock(&delayed_refs->lock);
 		return 1;
 	}
-	head->node.in_tree = 0;
 	delayed_refs->num_heads--;
 	rb_erase(&head->href_node, &delayed_refs->href_root);
+	RB_CLEAR_NODE(&head->href_node);
 	spin_unlock(&delayed_refs->lock);
 	spin_unlock(&head->lock);
 	atomic_dec(&delayed_refs->num_entries);
 
-	trace_run_delayed_ref_head(fs_info, &head->node, head,
-				   head->node.action);
+	trace_run_delayed_ref_head(fs_info, head, 0);
 
 	if (head->total_ref_mod < 0) {
 		struct btrfs_block_group_cache *cache;
 
-		cache = btrfs_lookup_block_group(fs_info, head->node.bytenr);
+		cache = btrfs_lookup_block_group(fs_info, head->bytenr);
 		ASSERT(cache);
 		percpu_counter_add(&cache->space_info->total_bytes_pinned,
-				   -head->node.num_bytes);
+				   -head->num_bytes);
 		btrfs_put_block_group(cache);
 
 		if (head->is_data) {
 			spin_lock(&delayed_refs->lock);
-			delayed_refs->pending_csums -= head->node.num_bytes;
+			delayed_refs->pending_csums -= head->num_bytes;
 			spin_unlock(&delayed_refs->lock);
 		}
 	}
 
 	if (head->must_insert_reserved) {
-		btrfs_pin_extent(fs_info, head->node.bytenr,
-				 head->node.num_bytes, 1);
+		btrfs_pin_extent(fs_info, head->bytenr,
+				 head->num_bytes, 1);
 		if (head->is_data) {
-			ret = btrfs_del_csums(trans, fs_info,
-					      head->node.bytenr,
-					      head->node.num_bytes);
+			ret = btrfs_del_csums(trans, fs_info, head->bytenr,
+					      head->num_bytes);
 		}
 	}
 
@@ -2639,7 +2637,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
 				      head->qgroup_reserved);
 	btrfs_delayed_ref_unlock(head);
-	btrfs_put_delayed_ref(&head->node);
+	btrfs_put_delayed_ref_head(head);
 	return 0;
 }
 
@@ -2753,10 +2751,10 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		switch (ref->action) {
 		case BTRFS_ADD_DELAYED_REF:
 		case BTRFS_ADD_DELAYED_EXTENT:
-			locked_ref->node.ref_mod -= ref->ref_mod;
+			locked_ref->ref_mod -= ref->ref_mod;
 			break;
 		case BTRFS_DROP_DELAYED_REF:
-			locked_ref->node.ref_mod += ref->ref_mod;
+			locked_ref->ref_mod += ref->ref_mod;
 			break;
 		default:
 			WARN_ON(1);
@@ -3089,33 +3087,16 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			spin_unlock(&delayed_refs->lock);
 			goto out;
 		}
+		head = rb_entry(node, struct btrfs_delayed_ref_head,
+				href_node);
+		refcount_inc(&head->refs);
+		spin_unlock(&delayed_refs->lock);
 
-		while (node) {
-			head = rb_entry(node, struct btrfs_delayed_ref_head,
-					href_node);
-			if (btrfs_delayed_ref_is_head(&head->node)) {
-				struct btrfs_delayed_ref_node *ref;
-
-				ref = &head->node;
-				refcount_inc(&ref->refs);
-
-				spin_unlock(&delayed_refs->lock);
-				/*
-				 * Mutex was contended, block until it's
-				 * released and try again
-				 */
-				mutex_lock(&head->mutex);
-				mutex_unlock(&head->mutex);
+		/* Mutex was contended, block until it's released and retry. */
+		mutex_lock(&head->mutex);
+		mutex_unlock(&head->mutex);
 
-				btrfs_put_delayed_ref(ref);
-				cond_resched();
-				goto again;
-			} else {
-				WARN_ON(1);
-			}
-			node = rb_next(node);
-		}
-		spin_unlock(&delayed_refs->lock);
+		btrfs_put_delayed_ref_head(head);
 		cond_resched();
 		goto again;
 	}
@@ -3173,7 +3154,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 	}
 
 	if (!mutex_trylock(&head->mutex)) {
-		refcount_inc(&head->node.refs);
+		refcount_inc(&head->refs);
 		spin_unlock(&delayed_refs->lock);
 
 		btrfs_release_path(path);
@@ -3184,7 +3165,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 		 */
 		mutex_lock(&head->mutex);
 		mutex_unlock(&head->mutex);
-		btrfs_put_delayed_ref(&head->node);
+		btrfs_put_delayed_ref_head(head);
 		return -EAGAIN;
 	}
 	spin_unlock(&delayed_refs->lock);
@@ -7179,9 +7160,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	 * at this point we have a head with no other entries.  Go
 	 * ahead and process it.
 	 */
-	head->node.in_tree = 0;
 	rb_erase(&head->href_node, &delayed_refs->href_root);
-
+	RB_CLEAR_NODE(&head->href_node);
 	atomic_dec(&delayed_refs->num_entries);
 
 	/*
@@ -7200,7 +7180,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 		ret = 1;
 
 	mutex_unlock(&head->mutex);
-	btrfs_put_delayed_ref(&head->node);
+	btrfs_put_delayed_ref_head(head);
 	return ret;
 out:
 	spin_unlock(&head->lock);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index f0a374a720b0..b75436c832d8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -798,22 +798,21 @@ DEFINE_EVENT(btrfs_delayed_data_ref,  run_delayed_data_ref,
 DECLARE_EVENT_CLASS(btrfs_delayed_ref_head,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info,
-		 const struct btrfs_delayed_ref_node *ref,
 		 const struct btrfs_delayed_ref_head *head_ref,
 		 int action),
 
-	TP_ARGS(fs_info, ref, head_ref, action),
+	TP_ARGS(fs_info, head_ref, action),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,  bytenr		)
 		__field(	u64,  num_bytes		)
-		__field(	int,  action		) 
+		__field(	int,  action		)
 		__field(	int,  is_data		)
 	),
 
 	TP_fast_assign_btrfs(fs_info,
-		__entry->bytenr		= ref->bytenr;
-		__entry->num_bytes	= ref->num_bytes;
+		__entry->bytenr		= head_ref->bytenr;
+		__entry->num_bytes	= head_ref->num_bytes;
 		__entry->action		= action;
 		__entry->is_data	= head_ref->is_data;
 	),
@@ -828,21 +827,19 @@ DECLARE_EVENT_CLASS(btrfs_delayed_ref_head,
 DEFINE_EVENT(btrfs_delayed_ref_head,  add_delayed_ref_head,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info,
-		 const struct btrfs_delayed_ref_node *ref,
 		 const struct btrfs_delayed_ref_head *head_ref,
 		 int action),
 
-	TP_ARGS(fs_info, ref, head_ref, action)
+	TP_ARGS(fs_info, head_ref, action)
 );
 
 DEFINE_EVENT(btrfs_delayed_ref_head,  run_delayed_ref_head,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info,
-		 const struct btrfs_delayed_ref_node *ref,
 		 const struct btrfs_delayed_ref_head *head_ref,
 		 int action),
 
-	TP_ARGS(fs_info, ref, head_ref, action)
+	TP_ARGS(fs_info, head_ref, action)
 );
 
 #define show_chunk_type(type)					\
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 14/21] btrfs: remove type argument from comp_tree_refs
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (12 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 16:06   ` David Sterba
  2017-09-29 19:43 ` [PATCH 15/21] btrfs: switch args for comp_*_refs Josef Bacik
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We can get this from the ref we've passed in.
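
To make that concrete, here is a rough sketch of the relevant structures
(simplified, not the verbatim definitions from delayed-ref.h): the tree ref
embeds the generic delayed ref node, which already records the type, so
passing it separately is redundant.

struct btrfs_delayed_ref_node {
	/* ... */
	unsigned int type;	/* e.g. BTRFS_TREE_BLOCK_REF_KEY */
};

struct btrfs_delayed_tree_ref {
	struct btrfs_delayed_ref_node node;	/* carries ->type */
	u64 root;
	u64 parent;
	int level;
};

/* so comp_tree_refs() can simply check ref1->node.type */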

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b9b41c838da4..a2973340a94f 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -41,9 +41,9 @@ struct kmem_cache *btrfs_delayed_extent_op_cachep;
  * compare two delayed tree backrefs with same bytenr and type
  */
 static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2,
-			  struct btrfs_delayed_tree_ref *ref1, int type)
+			  struct btrfs_delayed_tree_ref *ref1)
 {
-	if (type == BTRFS_TREE_BLOCK_REF_KEY) {
+	if (ref1->node.type == BTRFS_TREE_BLOCK_REF_KEY) {
 		if (ref1->root < ref2->root)
 			return -1;
 		if (ref1->root > ref2->root)
@@ -223,8 +223,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
 		if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
 		     ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
 		    comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
-				   btrfs_delayed_node_to_tree_ref(next),
-				   ref->type))
+				   btrfs_delayed_node_to_tree_ref(next)))
 			goto next;
 		if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
 		     ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
@@ -409,8 +408,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
 	if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
 	     exist->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
 	    comp_tree_refs(btrfs_delayed_node_to_tree_ref(exist),
-			   btrfs_delayed_node_to_tree_ref(ref),
-			   ref->type))
+			   btrfs_delayed_node_to_tree_ref(ref)))
 		goto add_tail;
 	if ((exist->type == BTRFS_EXTENT_DATA_REF_KEY ||
 	     exist->type == BTRFS_SHARED_DATA_REF_KEY) &&
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 15/21] btrfs: switch args for comp_*_refs
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (13 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 14/21] btrfs: remove type argument from comp_tree_refs Josef Bacik
@ 2017-09-29 19:43 ` Josef Bacik
  2017-10-13 16:24   ` David Sterba
  2017-09-29 19:44 ` [PATCH 16/21] btrfs: add a comp_refs() helper Josef Bacik
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:43 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Because seriously?  ref2 and then ref1?  Swap the arguments so the
comparison helpers take ref1 before ref2, the order their names suggest.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index a2973340a94f..bc940bb374cf 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -40,8 +40,8 @@ struct kmem_cache *btrfs_delayed_extent_op_cachep;
 /*
  * compare two delayed tree backrefs with same bytenr and type
  */
-static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2,
-			  struct btrfs_delayed_tree_ref *ref1)
+static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref1,
+			  struct btrfs_delayed_tree_ref *ref2)
 {
 	if (ref1->node.type == BTRFS_TREE_BLOCK_REF_KEY) {
 		if (ref1->root < ref2->root)
@@ -60,8 +60,8 @@ static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2,
 /*
  * compare two delayed data backrefs with same bytenr and type
  */
-static int comp_data_refs(struct btrfs_delayed_data_ref *ref2,
-			  struct btrfs_delayed_data_ref *ref1)
+static int comp_data_refs(struct btrfs_delayed_data_ref *ref1,
+			  struct btrfs_delayed_data_ref *ref2)
 {
 	if (ref1->node.type == BTRFS_EXTENT_DATA_REF_KEY) {
 		if (ref1->root < ref2->root)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 16/21] btrfs: add a comp_refs() helper
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (14 preceding siblings ...)
  2017-09-29 19:43 ` [PATCH 15/21] btrfs: switch args for comp_*_refs Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-09-29 19:44 ` [PATCH 17/21] btrfs: track refs in a rb_tree instead of a list Josef Bacik
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Instead of open-coding the delayed ref comparisons, add a helper to do
the comparisons generically and use that everywhere.  The sequence
numbers are compared last (and only when the caller asks for it), which
the following patches rely on.
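
As a quick illustration of how the helper is used below (taken from the call
sites in this patch): the merge path skips the sequence comparison because it
filters on seq separately, while the tail-merge on insert only merges exact
matches, seq included.

	/* merge_ref(): seq is already checked against the tree mod log */
	if (comp_refs(ref, next, false))
		goto next;

	/* add_delayed_ref_tail_merge(): only merge identical refs */
	if (comp_refs(exist, ref, true))
		goto add_tail;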

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 54 ++++++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index bc940bb374cf..c4cfadb9768c 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -85,6 +85,34 @@ static int comp_data_refs(struct btrfs_delayed_data_ref *ref1,
 	return 0;
 }
 
+static int comp_refs(struct btrfs_delayed_ref_node *ref1,
+		     struct btrfs_delayed_ref_node *ref2,
+		     bool check_seq)
+{
+	int ret = 0;
+	if (ref1->type < ref2->type)
+		return -1;
+	if (ref1->type > ref2->type)
+		return 1;
+	if (ref1->type == BTRFS_TREE_BLOCK_REF_KEY ||
+	    ref1->type == BTRFS_SHARED_BLOCK_REF_KEY)
+		ret = comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref1),
+				     btrfs_delayed_node_to_tree_ref(ref2));
+	else
+		ret = comp_data_refs(btrfs_delayed_node_to_data_ref(ref1),
+				     btrfs_delayed_node_to_data_ref(ref2));
+	if (ret)
+		return ret;
+	if (check_seq) {
+		if (ref1->seq < ref2->seq)
+			return -1;
+		if (ref1->seq > ref2->seq)
+			return 1;
+	}
+	return 0;
+}
+
+
 /* insert a new ref to head ref rbtree */
 static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
 						   struct rb_node *node)
@@ -217,18 +245,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
 		if (seq && next->seq >= seq)
 			goto next;
 
-		if (next->type != ref->type)
-			goto next;
-
-		if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
-		     ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
-		    comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
-				   btrfs_delayed_node_to_tree_ref(next)))
-			goto next;
-		if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
-		     ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
-		    comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
-				   btrfs_delayed_node_to_data_ref(next)))
+		if (comp_refs(ref, next, false))
 			goto next;
 
 		if (ref->action == next->action) {
@@ -402,18 +419,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
 	exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
 			   list);
 	/* No need to compare bytenr nor is_head */
-	if (exist->type != ref->type || exist->seq != ref->seq)
-		goto add_tail;
-
-	if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
-	     exist->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
-	    comp_tree_refs(btrfs_delayed_node_to_tree_ref(exist),
-			   btrfs_delayed_node_to_tree_ref(ref)))
-		goto add_tail;
-	if ((exist->type == BTRFS_EXTENT_DATA_REF_KEY ||
-	     exist->type == BTRFS_SHARED_DATA_REF_KEY) &&
-	    comp_data_refs(btrfs_delayed_node_to_data_ref(exist),
-			   btrfs_delayed_node_to_data_ref(ref)))
+	if (comp_refs(exist, ref, true))
 		goto add_tail;
 
 	/* Now we are sure we can merge */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 17/21] btrfs: track refs in a rb_tree instead of a list
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (15 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 16/21] btrfs: add a comp_refs() helper Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-09-29 19:44 ` [PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel Josef Bacik
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

If we get a significant amount of delayed refs for a single block (think
modifying multiple snapshots) we can end up spending an ungodly amount
of time looping through all of the entries trying to see if they can be
merged.  This is because we only add them to a list, so we have O(2n)
for every ref head.  This doesn't make any sense as we likely have refs
for different roots, and so they cannot be merged.  Tracking in a tree
will allow us to break as soon as we hit an entry that doesn't match,
making our worst case O(n).

With this we can also merge entries more easily.  Before we had to hope
that matching refs were on the ends of our list, but with the tree we
can search down to exact matches and merge them at insert time.
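
A condensed sketch of the idea (the full version is in insert_delayed_ref()
in the diff below): the rb_tree insert either links the new ref or hands back
an existing ref that compares equal, which is exactly the candidate we can
merge with on the spot.

	exist = tree_insert(&href->ref_tree, ref);	/* O(log n) walk */
	if (!exist) {
		/* no matching ref, the new node is linked into the tree */
		atomic_inc(&root->num_entries);
		return 0;
	}
	/* a ref with the same type/root/seq exists: fold the ref_mods
	 * together and drop whichever side cancels out */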

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/backref.c     |   5 ++-
 fs/btrfs/delayed-ref.c | 107 +++++++++++++++++++++++++------------------------
 fs/btrfs/delayed-ref.h |   5 +--
 fs/btrfs/disk-io.c     |  10 +++--
 fs/btrfs/extent-tree.c |  21 ++++++----
 5 files changed, 81 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 33cba1abf8b6..9b627b895806 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -769,6 +769,7 @@ static int add_delayed_refs(const struct btrfs_fs_info *fs_info,
 	struct btrfs_key key;
 	struct btrfs_key tmp_op_key;
 	struct btrfs_key *op_key = NULL;
+	struct rb_node *n;
 	int count;
 	int ret = 0;
 
@@ -778,7 +779,9 @@ static int add_delayed_refs(const struct btrfs_fs_info *fs_info,
 	}
 
 	spin_lock(&head->lock);
-	list_for_each_entry(node, &head->ref_list, list) {
+	for (n = rb_first(&head->ref_tree); n; n = rb_next(n)) {
+		node = rb_entry(n, struct btrfs_delayed_ref_node,
+				ref_node);
 		if (node->seq > seq)
 			continue;
 
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index c4cfadb9768c..48a9b23774e6 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -143,6 +143,33 @@ static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
 	return NULL;
 }
 
+static struct btrfs_delayed_ref_node *
+tree_insert(struct rb_root *root, struct btrfs_delayed_ref_node *ins)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *node = &ins->ref_node;
+	struct rb_node *parent_node = NULL;
+	struct btrfs_delayed_ref_node *entry;
+
+	while (*p) {
+		int comp;
+		parent_node = *p;
+		entry = rb_entry(parent_node, struct btrfs_delayed_ref_node,
+				 ref_node);
+		comp = comp_refs(ins, entry, true);
+		if (comp < 0)
+			p = &(*p)->rb_left;
+		else if (comp > 0)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	rb_link_node(node, parent_node, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
 /*
  * find an head entry based on bytenr. This returns the delayed ref
  * head if it was able to find one, or NULL if nothing was in that spot.
@@ -212,7 +239,8 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 				    struct btrfs_delayed_ref_node *ref)
 {
 	assert_spin_locked(&head->lock);
-	list_del(&ref->list);
+	rb_erase(&ref->ref_node, &head->ref_tree);
+	RB_CLEAR_NODE(&ref->ref_node);
 	if (!list_empty(&ref->add_list))
 		list_del(&ref->add_list);
 	ref->in_tree = 0;
@@ -229,24 +257,18 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
 		      u64 seq)
 {
 	struct btrfs_delayed_ref_node *next;
+	struct rb_node *node = rb_next(&ref->ref_node);
 	bool done = false;
 
-	next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
-				list);
-	while (!done && &next->list != &head->ref_list) {
+	while (!done && node) {
 		int mod;
-		struct btrfs_delayed_ref_node *next2;
-
-		next2 = list_next_entry(next, list);
-
-		if (next == ref)
-			goto next;
 
+		next = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
+		node = rb_next(node);
 		if (seq && next->seq >= seq)
-			goto next;
-
+			break;
 		if (comp_refs(ref, next, false))
-			goto next;
+			break;
 
 		if (ref->action == next->action) {
 			mod = next->ref_mod;
@@ -270,8 +292,6 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
 			WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
 				ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
 		}
-next:
-		next = next2;
 	}
 
 	return done;
@@ -283,11 +303,12 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 			      struct btrfs_delayed_ref_head *head)
 {
 	struct btrfs_delayed_ref_node *ref;
+	struct rb_node *node;
 	u64 seq = 0;
 
 	assert_spin_locked(&head->lock);
 
-	if (list_empty(&head->ref_list))
+	if (RB_EMPTY_ROOT(&head->ref_tree))
 		return;
 
 	/* We don't have too many refs to merge for data. */
@@ -304,22 +325,13 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 	}
 	spin_unlock(&fs_info->tree_mod_seq_lock);
 
-	ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
-			       list);
-	while (&ref->list != &head->ref_list) {
+again:
+	for (node = rb_first(&head->ref_tree); node; node = rb_next(node)) {
+		ref = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
 		if (seq && ref->seq >= seq)
-			goto next;
-
-		if (merge_ref(trans, delayed_refs, head, ref, seq)) {
-			if (list_empty(&head->ref_list))
-				break;
-			ref = list_first_entry(&head->ref_list,
-					       struct btrfs_delayed_ref_node,
-					       list);
 			continue;
-		}
-next:
-		ref = list_next_entry(ref, list);
+		if (merge_ref(trans, delayed_refs, head, ref, seq))
+			goto again;
 	}
 }
 
@@ -402,25 +414,19 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
  * Return 0 for insert.
  * Return >0 for merge.
  */
-static int
-add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
-			   struct btrfs_delayed_ref_root *root,
-			   struct btrfs_delayed_ref_head *href,
-			   struct btrfs_delayed_ref_node *ref)
+static int insert_delayed_ref(struct btrfs_trans_handle *trans,
+			      struct btrfs_delayed_ref_root *root,
+			      struct btrfs_delayed_ref_head *href,
+			      struct btrfs_delayed_ref_node *ref)
 {
 	struct btrfs_delayed_ref_node *exist;
 	int mod;
 	int ret = 0;
 
 	spin_lock(&href->lock);
-	/* Check whether we can merge the tail node with ref */
-	if (list_empty(&href->ref_list))
-		goto add_tail;
-	exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
-			   list);
-	/* No need to compare bytenr nor is_head */
-	if (comp_refs(exist, ref, true))
-		goto add_tail;
+	exist = tree_insert(&href->ref_tree, ref);
+	if (!exist)
+		goto inserted;
 
 	/* Now we are sure we can merge */
 	ret = 1;
@@ -451,9 +457,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
 		drop_delayed_ref(trans, root, href, exist);
 	spin_unlock(&href->lock);
 	return ret;
-
-add_tail:
-	list_add_tail(&ref->list, &href->ref_list);
+inserted:
 	if (ref->action == BTRFS_ADD_DELAYED_REF)
 		list_add_tail(&ref->add_list, &href->ref_add_list);
 	atomic_inc(&root->num_entries);
@@ -593,7 +597,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	head_ref->ref_mod = count_mod;
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->is_data = is_data;
-	INIT_LIST_HEAD(&head_ref->ref_list);
+	head_ref->ref_tree = RB_ROOT;
 	INIT_LIST_HEAD(&head_ref->ref_add_list);
 	RB_CLEAR_NODE(&head_ref->href_node);
 	head_ref->processing = 0;
@@ -685,7 +689,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 	ref->seq = seq;
-	INIT_LIST_HEAD(&ref->list);
+	RB_CLEAR_NODE(&ref->ref_node);
 	INIT_LIST_HEAD(&ref->add_list);
 
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -699,7 +703,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 
 	trace_add_delayed_tree_ref(fs_info, ref, full_ref, action);
 
-	ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
+	ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
 
 	/*
 	 * XXX: memory should be freed at the same level allocated.
@@ -742,7 +746,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
 	ref->seq = seq;
-	INIT_LIST_HEAD(&ref->list);
+	RB_CLEAR_NODE(&ref->ref_node);
 	INIT_LIST_HEAD(&ref->add_list);
 
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
@@ -758,8 +762,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 
 	trace_add_delayed_data_ref(fs_info, ref, full_ref, action);
 
-	ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
-
+	ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
 	if (ret > 0)
 		kmem_cache_free(btrfs_delayed_data_ref_cachep, full_ref);
 }
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 5d75f8cd08a9..918a5b1d67d8 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -27,8 +27,7 @@
 #define BTRFS_UPDATE_DELAYED_HEAD 4 /* not changing ref count on head ref */
 
 struct btrfs_delayed_ref_node {
-	/*data/tree ref use list, stored in ref_head->ref_list. */
-	struct list_head list;
+	struct rb_node ref_node;
 	/*
 	 * If action is BTRFS_ADD_DELAYED_REF, also link this node to
 	 * ref_head->ref_add_list, then we do not need to iterate the
@@ -91,7 +90,7 @@ struct btrfs_delayed_ref_head {
 	struct mutex mutex;
 
 	spinlock_t lock;
-	struct list_head ref_list;
+	struct rb_root ref_tree;
 	/* accumulate add BTRFS_ADD_DELAYED_REF nodes to this ref_add_list. */
 	struct list_head ref_add_list;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 14759e6a8f3c..689b9913ccb5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4390,7 +4390,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 
 	while ((node = rb_first(&delayed_refs->href_root)) != NULL) {
 		struct btrfs_delayed_ref_head *head;
-		struct btrfs_delayed_ref_node *tmp;
+		struct rb_node *n;
 		bool pin_bytes = false;
 
 		head = rb_entry(node, struct btrfs_delayed_ref_head,
@@ -4406,10 +4406,12 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
 			continue;
 		}
 		spin_lock(&head->lock);
-		list_for_each_entry_safe_reverse(ref, tmp, &head->ref_list,
-						 list) {
+		while ((n = rb_first(&head->ref_tree)) != NULL) {
+			ref = rb_entry(n, struct btrfs_delayed_ref_node,
+				       ref_node);
 			ref->in_tree = 0;
-			list_del(&ref->list);
+			rb_erase(&ref->ref_node, &head->ref_tree);
+			RB_CLEAR_NODE(&ref->ref_node);
 			if (!list_empty(&ref->add_list))
 				list_del(&ref->add_list);
 			atomic_dec(&delayed_refs->num_entries);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6492a5e1f2b9..dc966978ca7b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2519,7 +2519,7 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
 {
 	struct btrfs_delayed_ref_node *ref;
 
-	if (list_empty(&head->ref_list))
+	if (RB_EMPTY_ROOT(&head->ref_tree))
 		return NULL;
 
 	/*
@@ -2532,8 +2532,8 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
 		return list_first_entry(&head->ref_add_list,
 				struct btrfs_delayed_ref_node, add_list);
 
-	ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node,
-			       list);
+	ref = rb_entry(rb_first(&head->ref_tree),
+		       struct btrfs_delayed_ref_node, ref_node);
 	ASSERT(list_empty(&ref->add_list));
 	return ref;
 }
@@ -2594,7 +2594,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	spin_unlock(&head->lock);
 	spin_lock(&delayed_refs->lock);
 	spin_lock(&head->lock);
-	if (!list_empty(&head->ref_list) || head->extent_op) {
+	if (!RB_EMPTY_ROOT(&head->ref_tree) || head->extent_op) {
 		spin_unlock(&head->lock);
 		spin_unlock(&delayed_refs->lock);
 		return 1;
@@ -2741,7 +2741,8 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 
 		actual_count++;
 		ref->in_tree = 0;
-		list_del(&ref->list);
+		rb_erase(&ref->ref_node, &locked_ref->ref_tree);
+		RB_CLEAR_NODE(&ref->ref_node);
 		if (!list_empty(&ref->add_list))
 			list_del(&ref->add_list);
 		/*
@@ -3139,6 +3140,7 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 	struct btrfs_delayed_data_ref *data_ref;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_transaction *cur_trans;
+	struct rb_node *node;
 	int ret = 0;
 
 	cur_trans = root->fs_info->running_transaction;
@@ -3171,7 +3173,12 @@ static noinline int check_delayed_ref(struct btrfs_root *root,
 	spin_unlock(&delayed_refs->lock);
 
 	spin_lock(&head->lock);
-	list_for_each_entry(ref, &head->ref_list, list) {
+	/*
+	 * XXX: We should replace this with a proper search function in the
+	 * future.
+	 */
+	for (node = rb_first(&head->ref_tree); node; node = rb_next(node)) {
+		ref = rb_entry(node, struct btrfs_delayed_ref_node, ref_node);
 		/* If it's a shared ref we know a cross reference exists */
 		if (ref->type != BTRFS_EXTENT_DATA_REF_KEY) {
 			ret = 1;
@@ -7139,7 +7146,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 		goto out_delayed_unlock;
 
 	spin_lock(&head->lock);
-	if (!list_empty(&head->ref_list))
+	if (!RB_EMPTY_ROOT(&head->ref_tree))
 		goto out;
 
 	if (head->extent_op) {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (16 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 17/21] btrfs: track refs in a rb_tree instead of a list Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-09-29 19:44 ` [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit Josef Bacik
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We pass a pointer in our send arg struct, which means the struct size doesn't
match between 32bit user space and a 64bit kernel.  Fix this by adding a compat
mode and doing the appropriate conversion.
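
To illustrate the mismatch, here is roughly what the existing send args
struct looks like (a sketch; see the btrfs UAPI header for the authoritative
definition): clone_sources is a user pointer, so the layout and size that
32bit user space computes differ from the 64bit kernel's view, and
BTRFS_IOC_SEND, whose ioctl number encodes sizeof the struct, no longer
matches.

struct btrfs_ioctl_send_args {
	__s64 send_fd;			/* in */
	__u64 clone_sources_count;	/* in */
	__u64 __user *clone_sources;	/* 4 bytes on 32bit, 8 on 64bit */
	__u64 parent_root;		/* in */
	__u64 flags;			/* in */
	__u64 reserved[4];		/* in */
};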

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ioctl.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/send.c  | 12 ++----------
 fs/btrfs/send.h  |  2 +-
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 398495f79c83..6a07d4e12fd2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5464,6 +5464,53 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
 	return ret;
 }
 
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+struct btrfs_ioctl_send_args_32 {
+	__s64 send_fd;			/* in */
+	__u64 clone_sources_count;	/* in */
+	compat_uptr_t clone_sources;	/* in */
+	__u64 parent_root;		/* in */
+	__u64 flags;			/* in */
+	__u64 reserved[4];		/* in */
+} __attribute__ ((__packed__));
+#define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \
+			       struct btrfs_ioctl_send_args_32)
+#endif
+
+static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat)
+{
+	struct btrfs_ioctl_send_args *arg;
+	int ret;
+
+	if (compat) {
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+		struct btrfs_ioctl_send_args_32 args32;
+		ret = copy_from_user(&args32, argp, sizeof(args32));
+		if (ret)
+			return -EFAULT;
+		arg = kzalloc(sizeof(*arg), GFP_KERNEL);
+		if (!arg)
+			return -ENOMEM;
+		arg->send_fd = args32.send_fd;
+		arg->clone_sources_count = args32.clone_sources_count;
+		arg->clone_sources = compat_ptr(args32.clone_sources);
+		arg->parent_root = args32.parent_root;
+		arg->flags = args32.flags;
+		memcpy(arg->reserved, args32.reserved,
+		       sizeof(args32.reserved));
+#else
+		return -ENOTTY;
+#endif
+	} else {
+		arg = memdup_user(argp, sizeof(*arg));
+		if (IS_ERR(arg))
+			return PTR_ERR(arg);
+	}
+	ret = btrfs_ioctl_send(file, arg);
+	kfree(arg);
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -5569,7 +5616,11 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_set_received_subvol_32(file, argp);
 #endif
 	case BTRFS_IOC_SEND:
-		return btrfs_ioctl_send(file, argp);
+		return _btrfs_ioctl_send(file, argp, false);
+#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
+	case BTRFS_IOC_SEND_32:
+		return _btrfs_ioctl_send(file, argp, true);
+#endif
 	case BTRFS_IOC_GET_DEV_STATS:
 		return btrfs_ioctl_get_dev_stats(fs_info, argp);
 	case BTRFS_IOC_QUOTA_CTL:
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 0746eda7231d..d9ddcdbdd2e7 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -26,6 +26,7 @@
 #include <linux/radix-tree.h>
 #include <linux/vmalloc.h>
 #include <linux/string.h>
+#include <linux/compat.h>
 
 #include "send.h"
 #include "backref.h"
@@ -6371,13 +6372,12 @@ static void btrfs_root_dec_send_in_progress(struct btrfs_root* root)
 	spin_unlock(&root->root_item_lock);
 }
 
-long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
+long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg)
 {
 	int ret = 0;
 	struct btrfs_root *send_root = BTRFS_I(file_inode(mnt_file))->root;
 	struct btrfs_fs_info *fs_info = send_root->fs_info;
 	struct btrfs_root *clone_root;
-	struct btrfs_ioctl_send_args *arg = NULL;
 	struct btrfs_key key;
 	struct send_ctx *sctx = NULL;
 	u32 i;
@@ -6413,13 +6413,6 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 		goto out;
 	}
 
-	arg = memdup_user(arg_, sizeof(*arg));
-	if (IS_ERR(arg)) {
-		ret = PTR_ERR(arg);
-		arg = NULL;
-		goto out;
-	}
-
 	/*
 	 * Check that we don't overflow at later allocations, we request
 	 * clone_sources_count + 1 items, and compare to unsigned long inside
@@ -6660,7 +6653,6 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	if (sctx && !IS_ERR_OR_NULL(sctx->parent_root))
 		btrfs_root_dec_send_in_progress(sctx->parent_root);
 
-	kfree(arg);
 	kvfree(clone_sources_tmp);
 
 	if (sctx) {
diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h
index 02e00166c4da..3aa4bc55754f 100644
--- a/fs/btrfs/send.h
+++ b/fs/btrfs/send.h
@@ -130,5 +130,5 @@ enum {
 #define BTRFS_SEND_A_MAX (__BTRFS_SEND_A_MAX - 1)
 
 #ifdef __KERNEL__
-long btrfs_ioctl_send(struct file *mnt_file, void __user *arg);
+long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg);
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (17 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-10-13 17:10   ` David Sterba
  2017-09-29 19:44 ` [PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle Josef Bacik
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

We're holding the sb_start_intwrite lock at this point, and doing an async
filemap_flush of the inodes will result in a deadlock if we freeze the
fs during this operation.  This is because the thread we are waiting on
could do a btrfs_join_transaction(), which would block at
sb_start_intwrite and thus deadlock.  Using writeback_inodes_sb()
sidesteps the problem by not introducing all of these extra locking
dependencies.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 68c3e1c04bca..9fed8c67b6e8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1917,7 +1917,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
 {
 	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
-		return btrfs_start_delalloc_roots(fs_info, 1, -1);
+		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
 	return 0;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (18 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-09-29 19:44 ` [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations Josef Bacik
  2017-10-13 17:28 ` [PATCH 00/21] My current btrfs patch queue David Sterba
  21 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

Since we do a delalloc reserve in btrfs_truncate_block we can deadlock
with freeze.  If somebody else is trying to allocate metadata for this
inode and it gets stuck in start_delalloc_inodes because of freeze we
will deadlock.  Be safe and move this outside of a trans handle.  This
also has a side-effect of making sure that we're not leaving stale data
behind in the other_encoding or encryption case.  Not an issue now since
nobody uses it, but it would be a problem in the future.
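
A simplified sketch of the resulting flow in btrfs_truncate() (the diff below
has the full version, including the transaction restart and disk_i_size
update):

	ret = btrfs_truncate_inode_items(trans, root, inode, inode->i_size,
					 BTRFS_EXTENT_DATA_KEY);
	if (ret == NEED_TRUNCATE_BLOCK) {
		/* drop the trans handle before reserving delalloc space */
		btrfs_end_transaction(trans);
		btrfs_btree_balance_dirty(fs_info);
		ret = btrfs_truncate_block(inode, inode->i_size, 0, 0);
		/* then start a fresh handle and update disk_i_size */
	}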

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/inode.c | 119 ++++++++++++++++++++-----------------------------------
 1 file changed, 44 insertions(+), 75 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3cbddfc181dc..46b5632a7c6d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4364,47 +4364,11 @@ static int truncate_space_check(struct btrfs_trans_handle *trans,
 
 }
 
-static int truncate_inline_extent(struct inode *inode,
-				  struct btrfs_path *path,
-				  struct btrfs_key *found_key,
-				  const u64 item_end,
-				  const u64 new_size)
-{
-	struct extent_buffer *leaf = path->nodes[0];
-	int slot = path->slots[0];
-	struct btrfs_file_extent_item *fi;
-	u32 size = (u32)(new_size - found_key->offset);
-	struct btrfs_root *root = BTRFS_I(inode)->root;
-
-	fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
-
-	if (btrfs_file_extent_compression(leaf, fi) != BTRFS_COMPRESS_NONE) {
-		loff_t offset = new_size;
-		loff_t page_end = ALIGN(offset, PAGE_SIZE);
-
-		/*
-		 * Zero out the remaining of the last page of our inline extent,
-		 * instead of directly truncating our inline extent here - that
-		 * would be much more complex (decompressing all the data, then
-		 * compressing the truncated data, which might be bigger than
-		 * the size of the inline extent, resize the extent, etc).
-		 * We release the path because to get the page we might need to
-		 * read the extent item from disk (data not in the page cache).
-		 */
-		btrfs_release_path(path);
-		return btrfs_truncate_block(inode, offset, page_end - offset,
-					0);
-	}
-
-	btrfs_set_file_extent_ram_bytes(leaf, fi, size);
-	size = btrfs_file_extent_calc_inline_size(size);
-	btrfs_truncate_item(root->fs_info, path, size, 1);
-
-	if (test_bit(BTRFS_ROOT_REF_COWS, &root->state))
-		inode_sub_bytes(inode, item_end + 1 - new_size);
-
-	return 0;
-}
+/*
+ * Return this if we need to call truncate_block for the last bit of the
+ * truncate.
+ */
+#define NEED_TRUNCATE_BLOCK 1
 
 /*
  * this can truncate away extent items, csum items and directory items.
@@ -4565,11 +4529,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 		if (found_type != BTRFS_EXTENT_DATA_KEY)
 			goto delete;
 
-		if (del_item)
-			last_size = found_key.offset;
-		else
-			last_size = new_size;
-
 		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
 			u64 num_dec;
 			extent_start = btrfs_file_extent_disk_bytenr(leaf, fi);
@@ -4611,40 +4570,29 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			 */
 			if (!del_item &&
 			    btrfs_file_extent_encryption(leaf, fi) == 0 &&
-			    btrfs_file_extent_other_encoding(leaf, fi) == 0) {
-
+			    btrfs_file_extent_other_encoding(leaf, fi) == 0 &&
+			    btrfs_file_extent_compression(leaf, fi) == 0) {
+				u32 size = (u32)(new_size - found_key.offset);
+				btrfs_set_file_extent_ram_bytes(leaf, fi, size);
+				size = btrfs_file_extent_calc_inline_size(size);
+				btrfs_truncate_item(root->fs_info, path, size, 1);
+			} else if (!del_item) {
 				/*
-				 * Need to release path in order to truncate a
-				 * compressed extent. So delete any accumulated
-				 * extent items so far.
+				 * We have to bail so the last_size is set to
+				 * just before this extent.
 				 */
-				if (btrfs_file_extent_compression(leaf, fi) !=
-				    BTRFS_COMPRESS_NONE && pending_del_nr) {
-					err = btrfs_del_items(trans, root, path,
-							      pending_del_slot,
-							      pending_del_nr);
-					if (err) {
-						btrfs_abort_transaction(trans,
-									err);
-						goto error;
-					}
-					pending_del_nr = 0;
-				}
+				err = NEED_TRUNCATE_BLOCK;
+				break;
+			}
 
-				err = truncate_inline_extent(inode, path,
-							     &found_key,
-							     item_end,
-							     new_size);
-				if (err) {
-					btrfs_abort_transaction(trans, err);
-					goto error;
-				}
-			} else if (test_bit(BTRFS_ROOT_REF_COWS,
-					    &root->state)) {
+			if (test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 				inode_sub_bytes(inode, item_end + 1 - new_size);
-			}
 		}
 delete:
+		if (del_item)
+			last_size = found_key.offset;
+		else
+			last_size = new_size;
 		if (del_item) {
 			if (!pending_del_nr) {
 				/* no pending yet, add ourselves */
@@ -9342,12 +9290,12 @@ static int btrfs_truncate(struct inode *inode)
 		ret = btrfs_truncate_inode_items(trans, root, inode,
 						 inode->i_size,
 						 BTRFS_EXTENT_DATA_KEY);
+		trans->block_rsv = &fs_info->trans_block_rsv;
 		if (ret != -ENOSPC && ret != -EAGAIN) {
 			err = ret;
 			break;
 		}
 
-		trans->block_rsv = &fs_info->trans_block_rsv;
 		ret = btrfs_update_inode(trans, root, inode);
 		if (ret) {
 			err = ret;
@@ -9371,6 +9319,27 @@ static int btrfs_truncate(struct inode *inode)
 		trans->block_rsv = rsv;
 	}
 
+	/*
+	 * We can't call btrfs_truncate_block inside a trans handle as we could
+	 * deadlock with freeze, if we got NEED_TRUNCATE_BLOCK then we know
+	 * we've truncated everything except the last little bit, and can do
+	 * btrfs_truncate_block and then update the disk_i_size.
+	 */
+	if (ret == NEED_TRUNCATE_BLOCK) {
+		btrfs_end_transaction(trans);
+		btrfs_btree_balance_dirty(fs_info);
+
+		ret = btrfs_truncate_block(inode, inode->i_size, 0, 0);
+		if (ret)
+			goto out;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			goto out;
+		}
+		btrfs_ordered_update_i_size(inode, inode->i_size, NULL);
+	}
+
 	if (ret == 0 && inode->i_nlink > 0) {
 		trans->block_rsv = root->orphan_block_rsv;
 		ret = btrfs_orphan_del(trans, BTRFS_I(inode));
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (19 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle Josef Bacik
@ 2017-09-29 19:44 ` Josef Bacik
  2017-10-13 17:17   ` David Sterba
  2017-10-13 17:28 ` [PATCH 00/21] My current btrfs patch queue David Sterba
  21 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2017-09-29 19:44 UTC (permalink / raw)
  To: kernel-team, linux-btrfs

These assertions are useful for debugging problems where we mess with
trans->block_rsv, to make sure we're not screwing something up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dc966978ca7b..0bdc10b453b9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5897,12 +5897,15 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
 				  struct btrfs_fs_info *fs_info)
 {
-	if (!trans->block_rsv)
+	if (!trans->block_rsv) {
+		ASSERT(!trans->bytes_reserved);
 		return;
+	}
 
 	if (!trans->bytes_reserved)
 		return;
 
+	ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
 	trace_btrfs_space_reservation(fs_info, "transaction",
 				      trans->transid, trans->bytes_reserved, 0);
 	btrfs_block_rsv_release(fs_info, trans->block_rsv,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
@ 2017-10-13  8:39   ` Nikolay Borisov
  2017-10-13 13:10     ` Josef Bacik
  2017-10-19  3:14   ` Edmund Nadolski
  1 sibling, 1 reply; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-13  8:39 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 29.09.2017 22:43, Josef Bacik wrote:
>  
> +static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
> +						 int mod)
> +{
> +	ASSERT(spin_is_locked(&inode->lock));
> +	inode->outstanding_extents += mod;
> +	if (btrfs_is_free_space_inode(inode))
> +		return;
> +}
> +
> +static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
> +					      int mod)
> +{
> +	ASSERT(spin_is_locked(&inode->lock));

Use lockdep_assert_held(&inode->lock); in both functions. I've spoken with
Peterz and he said that any other way of checking whether a lock is held,
I quote, "must die".
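
For example, a sketch of the suggested change (not part of the posted patch):

static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
						 int mod)
{
	lockdep_assert_held(&inode->lock);
	inode->outstanding_extents += mod;
	if (btrfs_is_free_space_inode(inode))
		return;
}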

> +	inode->reserved_extents += mod;
> +	if (btrfs_is_free_space_inode(inode))
> +		return;
> +}
> +
>  static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
>  {
>  	int ret = 0;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a7f68c304b4c..1262612fbf78 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2742,6 +2742,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>  				     u64 *qgroup_reserved, bool use_global_rsv);
>  void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
>  				      struct btrfs_block_rsv *rsv);
> +void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
>  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
>  int btrfs_delalloc_reserve_space(struct inode *inode,
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 1a6aced00a19..aa0f5c8953b0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5971,42 +5971,31 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
>  }
>  
>  /**
> - * drop_outstanding_extent - drop an outstanding extent
> + * drop_over_reserved_extents - drop our extra extent reservations
>   * @inode: the inode we're dropping the extent for
> - * @num_bytes: the number of bytes we're releasing.
>   *
> - * This is called when we are freeing up an outstanding extent, either called
> - * after an error or after an extent is written.  This will return the number of
> - * reserved extents that need to be freed.  This must be called with
> - * BTRFS_I(inode)->lock held.
> + * We reserve extents we may use, but they may have been merged with other
> + * extents and we may not need the extra reservation.
> + *
> + * We also call this when we've completed io to an extent or had an error and
> + * cleared the outstanding extent, in either case we no longer need our
> + * reservation and can drop the excess.
>   */
> -static unsigned drop_outstanding_extent(struct btrfs_inode *inode,
> -		u64 num_bytes)
> +static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
>  {
> -	unsigned drop_inode_space = 0;
> -	unsigned dropped_extents = 0;
> -	unsigned num_extents;
> +	unsigned num_extents = 0;
>  
> -	num_extents = count_max_extents(num_bytes);
> -	ASSERT(num_extents);
> -	ASSERT(inode->outstanding_extents >= num_extents);
> -	inode->outstanding_extents -= num_extents;
> +	if (inode->reserved_extents > inode->outstanding_extents) {
> +		num_extents = inode->reserved_extents -
> +			inode->outstanding_extents;
> +		btrfs_mod_reserved_extents(inode, -num_extents);
> +	}
>  
>  	if (inode->outstanding_extents == 0 &&
>  	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
>  			       &inode->runtime_flags))
> -		drop_inode_space = 1;
> -
> -	/*
> -	 * If we have more or the same amount of outstanding extents than we have
> -	 * reserved then we need to leave the reserved extents count alone.
> -	 */
> -	if (inode->outstanding_extents >= inode->reserved_extents)
> -		return drop_inode_space;
> -
> -	dropped_extents = inode->reserved_extents - inode->outstanding_extents;
> -	inode->reserved_extents -= dropped_extents;
> -	return dropped_extents + drop_inode_space;
> +		num_extents++;
> +	return num_extents;

Something bugs me about the handling of this. In
btrfs_delalloc_reserve_metadata we do add the additional bytes necessary
for updating the inode. However, outstanding_extents is modified with the
number of extent items necessary to cover the required byte range, but we
don't account the extra inode item as being an extent. Then why do we
have to actually increment num_extents here? Doesn't this lead to
underflow?

To illustrate: A write comes for 1 mb, we count this as 1 outstanding
extent in delalloc_reserve_metadata:

nr_extents = count_max_extents(num_bytes);
inode->outstanding_extents += nr_extents;

We do account the extra inode item but only when calculating the bytes
to reserve:

to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
And of course we set: test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,

Now, when the time comes to free the outstanding extent and call
drop_over_reserved_extents, we do account the extra inode item as an
extent being freed. Is this correct? In any case it's not something
introduced by your patch but something which has been there since time
immemorial, just wondering.
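
Spelling the same numbers out (a rough sketch, assuming count_max_extents()
rounds the byte count up by BTRFS_MAX_EXTENT_SIZE and that to_reserve is
measured in btrfs_calc_trans_metadata_size() items):

	/* 1MB buffered write */
	nr_extents = count_max_extents(SZ_1M);		/* 1 extent item */
	inode->outstanding_extents += nr_extents;	/* inode item NOT counted */
	to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
							/* extent item + inode item */

	/* later, once outstanding_extents hits 0 and DELALLOC_META_RESERVED
	 * is cleared, drop_over_reserved_extents() returns the dropped
	 * extents + 1, i.e. the inode item is freed as if it were an extent */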

>  }
>  
>  /**
> @@ -6061,13 +6050,15 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  	struct btrfs_block_rsv *block_rsv = &fs_info->delalloc_block_rsv;
>  	u64 to_reserve = 0;
>  	u64 csum_bytes;
> -	unsigned nr_extents;
> +	unsigned nr_extents, reserve_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
>  	bool delalloc_lock = true;
>  	u64 to_free = 0;
>  	unsigned dropped;
>  	bool release_extra = false;
> +	bool underflow = false;
> +	bool did_retry = false;
>  
>  	/* If we are a free space inode we need to not flush since we will be in
>  	 * the middle of a transaction commit.  We also don't need the delalloc
> @@ -6092,18 +6083,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  		mutex_lock(&inode->delalloc_mutex);
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
> -
> +retry:
>  	spin_lock(&inode->lock);
> -	nr_extents = count_max_extents(num_bytes);
> -	inode->outstanding_extents += nr_extents;
> +	reserve_extents = nr_extents = count_max_extents(num_bytes);
> +	btrfs_mod_outstanding_extents(inode, nr_extents);
>  
> -	nr_extents = 0;
> -	if (inode->outstanding_extents > inode->reserved_extents)
> -		nr_extents += inode->outstanding_extents -
> +	/*
> +	 * Because we add an outstanding extent for ordered before we clear
> +	 * delalloc we will double count our outstanding extents slightly.  This
> +	 * could mean that we transiently over-reserve, which could result in an
> +	 * early ENOSPC if our timing is unlucky.  Keep track of the case that
> +	 * we had a reservation underflow so we can retry if we fail.
> +	 *
> +	 * Keep in mind we can legitimately have more outstanding extents than
> +	 * reserved because of fragmentation, so only allow a retry once.
> +	 */
> +	if (inode->outstanding_extents >
> +	    inode->reserved_extents + nr_extents) {
> +		reserve_extents = inode->outstanding_extents -
>  			inode->reserved_extents;
> +		underflow = true;
> +	}
>  
>  	/* We always want to reserve a slot for updating the inode. */
> -	to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
> +	to_reserve = btrfs_calc_trans_metadata_size(fs_info,
> +						    reserve_extents + 1);
>  	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
>  	csum_bytes = inode->csum_bytes;
>  	spin_unlock(&inode->lock);
> @@ -6128,7 +6132,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  		to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
>  		release_extra = true;
>  	}
> -	inode->reserved_extents += nr_extents;
> +	btrfs_mod_reserved_extents(inode, reserve_extents);
>  	spin_unlock(&inode->lock);
>  
>  	if (delalloc_lock)
> @@ -6144,7 +6148,10 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  out_fail:
>  	spin_lock(&inode->lock);
> -	dropped = drop_outstanding_extent(inode, num_bytes);
> +	nr_extents = count_max_extents(num_bytes);

nr_extents isn't changed, so re-calculating it is redundant.

> +	btrfs_mod_outstanding_extents(inode, -nr_extents);
> +
> +	dropped = drop_over_reserved_extents(inode);
>  	/*
>  	 * If the inodes csum_bytes is the same as the original
>  	 * csum_bytes then we know we haven't raced with any free()ers
> @@ -6201,6 +6208,11 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  		trace_btrfs_space_reservation(fs_info, "delalloc",
>  					      btrfs_ino(inode), to_free, 0);
>  	}
> +	if (underflow && !did_retry) {
> +		did_retry = true;
> +		underflow = false;
> +		goto retry;
> +	}
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
>  	return ret;
> @@ -6208,12 +6220,12 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  /**
>   * btrfs_delalloc_release_metadata - release a metadata reservation for an inode
> - * @inode: the inode to release the reservation for
> - * @num_bytes: the number of bytes we're releasing
> + * @inode: the inode to release the reservation for.
> + * @num_bytes: the number of bytes we are releasing.
>   *
>   * This will release the metadata reservation for an inode.  This can be called
>   * once we complete IO for a given set of bytes to release their metadata
> - * reservations.
> + * reservations, or on error for the same reason.
>   */
>  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
> @@ -6223,8 +6235,7 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  	spin_lock(&inode->lock);
> -	dropped = drop_outstanding_extent(inode, num_bytes);
> -
> +	dropped = drop_over_reserved_extents(inode);
>  	if (num_bytes)
>  		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
>  	spin_unlock(&inode->lock);
> @@ -6241,6 +6252,42 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  }
>  
>  /**
> + * btrfs_delalloc_release_extents - release our outstanding_extents
> + * @inode: the inode to balance the reservation for.
> + * @num_bytes: the number of bytes we originally reserved with
> + *
> + * When we reserve space we increase outstanding_extents for the extents we may
> + * add.  Once we've set the range as delalloc or created our ordered extents we
> + * have outstanding_extents to track the real usage, so we use this to free our
> + * temporarily tracked outstanding_extents.  This _must_ be used in conjunction
> + * with btrfs_delalloc_reserve_metadata.
> + */
> +void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
> +	unsigned num_extents;
> +	u64 to_free;
> +	unsigned dropped;
> +
> +	spin_lock(&inode->lock);
> +	num_extents = count_max_extents(num_bytes);
> +	btrfs_mod_outstanding_extents(inode, -num_extents);
> +	dropped = drop_over_reserved_extents(inode);
> +	spin_unlock(&inode->lock);
> +
> +	if (!dropped)
> +		return;
> +
> +	if (btrfs_is_testing(fs_info))
> +		return;
> +
> +	to_free = btrfs_calc_trans_metadata_size(fs_info, dropped);

So what's really happening here is that drop_over_reserved_extents
really returns the number of items (which can consist of extent items +
1 inode item). So perhaps the function should be renamed to
drop_over_reserved_items?

> +	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
> +				      to_free, 0);
> +	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
> +}
> +
> +/**
>   * btrfs_delalloc_reserve_space - reserve data and metadata space for
>   * delalloc
>   * @inode: inode we're writing to
> @@ -6284,10 +6331,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
>   * @inode: inode we're releasing space for
>   * @start: start position of the space already reserved
>   * @len: the len of the space already reserved
> - *
> - * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
> - * called in the case that we don't need the metadata AND data reservations
> - * anymore.  So if there is an error or we insert an inline extent.
> + * @release_bytes: the len of the space we consumed or didn't use
>   *
>   * This function will release the metadata space that was not used and will
>   * decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
> @@ -6295,7 +6339,8 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
>   * Also it will handle the qgroup reserved space.
>   */
>  void btrfs_delalloc_release_space(struct inode *inode,
> -			struct extent_changeset *reserved, u64 start, u64 len)
> +				  struct extent_changeset *reserved,
> +				  u64 start, u64 len)
>  {
>  	btrfs_delalloc_release_metadata(BTRFS_I(inode), len);
>  	btrfs_free_reserved_data_space(inode, reserved, start, len);


<omitted for brevity>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/21] btrfs: make the delalloc block rsv per inode
  2017-09-29 19:43 ` [PATCH 03/21] btrfs: make the delalloc block rsv per inode Josef Bacik
@ 2017-10-13 11:47   ` Nikolay Borisov
  2017-10-13 13:18     ` Josef Bacik
  0 siblings, 1 reply; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-13 11:47 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 29.09.2017 22:43, Josef Bacik wrote:
> The way we handle delalloc metadata reservations has gotten
> progressively more complicated over the years.  There is so much cruft
> and weirdness around keeping the reserved count and outstanding counters
> consistent and handling the error cases that it's impossible to
> understand.
> 
> Fix this by making the delalloc block rsv per-inode.  This way we can
> calculate the actual size of the outstanding metadata reservations every
> time we make a change, and then reserve the delta based on that amount.
> This greatly simplifies the code everywhere, and makes the error
> handling in btrfs_delalloc_reserve_metadata far less terrifying.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/btrfs_inode.h   |  27 ++--
>  fs/btrfs/ctree.h         |   5 +-
>  fs/btrfs/delayed-inode.c |  46 +------
>  fs/btrfs/disk-io.c       |  18 ++-
>  fs/btrfs/extent-tree.c   | 320 ++++++++++++++++-------------------------------
>  fs/btrfs/inode.c         |  18 +--
>  6 files changed, 141 insertions(+), 293 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 22daa79e77b8..f9c6887a8b6c 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -36,14 +36,13 @@
>  #define BTRFS_INODE_ORPHAN_META_RESERVED	1
>  #define BTRFS_INODE_DUMMY			2
>  #define BTRFS_INODE_IN_DEFRAG			3
> -#define BTRFS_INODE_DELALLOC_META_RESERVED	4
> -#define BTRFS_INODE_HAS_ORPHAN_ITEM		5
> -#define BTRFS_INODE_HAS_ASYNC_EXTENT		6
> -#define BTRFS_INODE_NEEDS_FULL_SYNC		7
> -#define BTRFS_INODE_COPY_EVERYTHING		8
> -#define BTRFS_INODE_IN_DELALLOC_LIST		9
> -#define BTRFS_INODE_READDIO_NEED_LOCK		10
> -#define BTRFS_INODE_HAS_PROPS		        11
> +#define BTRFS_INODE_HAS_ORPHAN_ITEM		4
> +#define BTRFS_INODE_HAS_ASYNC_EXTENT		5
> +#define BTRFS_INODE_NEEDS_FULL_SYNC		6
> +#define BTRFS_INODE_COPY_EVERYTHING		7
> +#define BTRFS_INODE_IN_DELALLOC_LIST		8
> +#define BTRFS_INODE_READDIO_NEED_LOCK		9
> +#define BTRFS_INODE_HAS_PROPS		        10
>  
>  /* in memory btrfs inode */
>  struct btrfs_inode {
> @@ -176,7 +175,8 @@ struct btrfs_inode {
>  	 * of extent items we've reserved metadata for.
>  	 */
>  	unsigned outstanding_extents;
> -	unsigned reserved_extents;
> +
> +	struct btrfs_block_rsv block_rsv;
>  
>  	/*
>  	 * Cached values of inode properties
> @@ -278,15 +278,6 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
>  						  mod);
>  }
>  
> -static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
> -					      int mod)
> -{
> -	ASSERT(spin_is_locked(&inode->lock));
> -	inode->reserved_extents += mod;
> -	if (btrfs_is_free_space_inode(inode))
> -		return;
> -}
> -
>  static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
>  {
>  	int ret = 0;
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1262612fbf78..93e767e16a43 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -763,8 +763,6 @@ struct btrfs_fs_info {
>  	 * delayed dir index item
>  	 */
>  	struct btrfs_block_rsv global_block_rsv;
> -	/* block reservation for delay allocation */
> -	struct btrfs_block_rsv delalloc_block_rsv;
>  	/* block reservation for metadata operations */
>  	struct btrfs_block_rsv trans_block_rsv;
>  	/* block reservation for chunk tree */
> @@ -2751,6 +2749,9 @@ int btrfs_delalloc_reserve_space(struct inode *inode,
>  void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
>  struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
>  					      unsigned short type);
> +void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info,
> +				   struct btrfs_block_rsv *rsv,
> +				   unsigned short type);
>  void btrfs_free_block_rsv(struct btrfs_fs_info *fs_info,
>  			  struct btrfs_block_rsv *rsv);
>  void __btrfs_free_block_rsv(struct btrfs_block_rsv *rsv);
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 19e4ad2f3f2e..5d73f79ded8b 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -581,7 +581,6 @@ static int btrfs_delayed_inode_reserve_metadata(
>  	struct btrfs_block_rsv *dst_rsv;
>  	u64 num_bytes;
>  	int ret;
> -	bool release = false;
>  
>  	src_rsv = trans->block_rsv;
>  	dst_rsv = &fs_info->delayed_block_rsv;
> @@ -589,36 +588,13 @@ static int btrfs_delayed_inode_reserve_metadata(
>  	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
>  
>  	/*
> -	 * If our block_rsv is the delalloc block reserve then check and see if
> -	 * we have our extra reservation for updating the inode.  If not fall
> -	 * through and try to reserve space quickly.
> -	 *
> -	 * We used to try and steal from the delalloc block rsv or the global
> -	 * reserve, but we'd steal a full reservation, which isn't kind.  We are
> -	 * here through delalloc which means we've likely just cowed down close
> -	 * to the leaf that contains the inode, so we would steal less just
> -	 * doing the fallback inode update, so if we do end up having to steal
> -	 * from the global block rsv we hopefully only steal one or two blocks
> -	 * worth which is less likely to hurt us.
> -	 */
> -	if (src_rsv && src_rsv->type == BTRFS_BLOCK_RSV_DELALLOC) {
> -		spin_lock(&inode->lock);
> -		if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -				       &inode->runtime_flags))
> -			release = true;
> -		else
> -			src_rsv = NULL;
> -		spin_unlock(&inode->lock);
> -	}
> -
> -	/*
>  	 * btrfs_dirty_inode will update the inode under btrfs_join_transaction
>  	 * which doesn't reserve space for speed.  This is a problem since we
>  	 * still need to reserve space for this update, so try to reserve the
>  	 * space.
>  	 *
>  	 * Now if src_rsv == delalloc_block_rsv we'll let it just steal since
> -	 * we're accounted for.
> +	 * we always reserve enough to update the inode item.
>  	 */
>  	if (!src_rsv || (!trans->bytes_reserved &&
>  			 src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
> @@ -643,32 +619,12 @@ static int btrfs_delayed_inode_reserve_metadata(
>  	}
>  
>  	ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);
> -
> -	/*
> -	 * Migrate only takes a reservation, it doesn't touch the size of the
> -	 * block_rsv.  This is to simplify people who don't normally have things
> -	 * migrated from their block rsv.  If they go to release their
> -	 * reservation, that will decrease the size as well, so if migrate
> -	 * reduced size we'd end up with a negative size.  But for the
> -	 * delalloc_meta_reserved stuff we will only know to drop 1 reservation,
> -	 * but we could in fact do this reserve/migrate dance several times
> -	 * between the time we did the original reservation and we'd clean it
> -	 * up.  So to take care of this, release the space for the meta
> -	 * reservation here.  I think it may be time for a documentation page on
> -	 * how block rsvs. work.
> -	 */
>  	if (!ret) {
>  		trace_btrfs_space_reservation(fs_info, "delayed_inode",
>  					      btrfs_ino(inode), num_bytes, 1);
>  		node->bytes_reserved = num_bytes;
>  	}
>  
> -	if (release) {
> -		trace_btrfs_space_reservation(fs_info, "delalloc",
> -					      btrfs_ino(inode), num_bytes, 0);
> -		btrfs_block_rsv_release(fs_info, src_rsv, num_bytes);
> -	}
> -
>  	return ret;
>  }
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f3cc4aa24e8a..1307907e19d8 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2723,14 +2723,6 @@ int open_ctree(struct super_block *sb,
>  		goto fail_delalloc_bytes;
>  	}
>  
> -	fs_info->btree_inode = new_inode(sb);
> -	if (!fs_info->btree_inode) {
> -		err = -ENOMEM;
> -		goto fail_bio_counter;
> -	}
> -
> -	mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
> -
>  	INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
>  	INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
>  	INIT_LIST_HEAD(&fs_info->trans_list);
> @@ -2763,8 +2755,6 @@ int open_ctree(struct super_block *sb,
>  	btrfs_mapping_init(&fs_info->mapping_tree);
>  	btrfs_init_block_rsv(&fs_info->global_block_rsv,
>  			     BTRFS_BLOCK_RSV_GLOBAL);
> -	btrfs_init_block_rsv(&fs_info->delalloc_block_rsv,
> -			     BTRFS_BLOCK_RSV_DELALLOC);
>  	btrfs_init_block_rsv(&fs_info->trans_block_rsv, BTRFS_BLOCK_RSV_TRANS);
>  	btrfs_init_block_rsv(&fs_info->chunk_block_rsv, BTRFS_BLOCK_RSV_CHUNK);
>  	btrfs_init_block_rsv(&fs_info->empty_block_rsv, BTRFS_BLOCK_RSV_EMPTY);
> @@ -2792,6 +2782,14 @@ int open_ctree(struct super_block *sb,
>  
>  	INIT_LIST_HEAD(&fs_info->ordered_roots);
>  	spin_lock_init(&fs_info->ordered_root_lock);
> +
> +	fs_info->btree_inode = new_inode(sb);
> +	if (!fs_info->btree_inode) {
> +		err = -ENOMEM;
> +		goto fail_bio_counter;
> +	}
> +	mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
> +
>  	fs_info->delayed_root = kmalloc(sizeof(struct btrfs_delayed_root),
>  					GFP_KERNEL);
>  	if (!fs_info->delayed_root) {
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0f5c8953b0..e32ad9fc93a8 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -26,6 +26,7 @@
>  #include <linux/slab.h>
>  #include <linux/ratelimit.h>
>  #include <linux/percpu_counter.h>
> +#include <linux/lockdep.h>
>  #include "hash.h"
>  #include "tree-log.h"
>  #include "disk-io.h"
> @@ -4831,7 +4832,6 @@ static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
>  static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  			    u64 orig, bool wait_ordered)
>  {
> -	struct btrfs_block_rsv *block_rsv;
>  	struct btrfs_space_info *space_info;
>  	struct btrfs_trans_handle *trans;
>  	u64 delalloc_bytes;
> @@ -4847,8 +4847,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
>  	to_reclaim = items * EXTENT_SIZE_PER_ITEM;
>  
>  	trans = (struct btrfs_trans_handle *)current->journal_info;
> -	block_rsv = &fs_info->delalloc_block_rsv;
> -	space_info = block_rsv->space_info;
> +	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
>  
>  	delalloc_bytes = percpu_counter_sum_positive(
>  						&fs_info->delalloc_bytes);
> @@ -5584,11 +5583,12 @@ static void space_info_add_new_bytes(struct btrfs_fs_info *fs_info,
>  	}
>  }
>  
> -static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
> +static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  				    struct btrfs_block_rsv *block_rsv,
>  				    struct btrfs_block_rsv *dest, u64 num_bytes)
>  {
>  	struct btrfs_space_info *space_info = block_rsv->space_info;
> +	u64 ret;
>  
>  	spin_lock(&block_rsv->lock);
>  	if (num_bytes == (u64)-1)
> @@ -5603,6 +5603,7 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  	}
>  	spin_unlock(&block_rsv->lock);
>  
> +	ret = num_bytes;
>  	if (num_bytes > 0) {
>  		if (dest) {
>  			spin_lock(&dest->lock);
> @@ -5622,6 +5623,7 @@ static void block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  			space_info_add_old_bytes(fs_info, space_info,
>  						 num_bytes);
>  	}
> +	return ret;
>  }
>  
>  int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
> @@ -5645,6 +5647,15 @@ void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type)
>  	rsv->type = type;
>  }
>  
> +void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info,
> +				   struct btrfs_block_rsv *rsv,
> +				   unsigned short type)
> +{
> +	btrfs_init_block_rsv(rsv, type);
> +	rsv->space_info = __find_space_info(fs_info,
> +					    BTRFS_BLOCK_GROUP_METADATA);
> +}
> +
>  struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
>  					      unsigned short type)
>  {
> @@ -5654,9 +5665,7 @@ struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info,
>  	if (!block_rsv)
>  		return NULL;
>  
> -	btrfs_init_block_rsv(block_rsv, type);
> -	block_rsv->space_info = __find_space_info(fs_info,
> -						  BTRFS_BLOCK_GROUP_METADATA);
> +	btrfs_init_metadata_block_rsv(fs_info, block_rsv, type);
>  	return block_rsv;
>  }
>  
> @@ -5739,6 +5748,66 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> +/**
> + * btrfs_inode_rsv_refill - refill the inode block rsv.
> + * @inode - the inode we are refilling.
> + * @flush - the flushing restriction.
> + *
> + * Essentially the same as btrfs_block_rsv_refill, except it uses the
> + * block_rsv->size as the minimum size.  We'll either refill the missing amount
> + * or return if we already have enough space.  This will also handle the reserve
> + * tracepoint for the reserved amount.
> + */
> +int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> +			   enum btrfs_reserve_flush_enum flush)
> +{
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 num_bytes = 0;
> +	int ret = -ENOSPC;
> +
> +	spin_lock(&block_rsv->lock);
> +	if (block_rsv->reserved < block_rsv->size)
> +		num_bytes = block_rsv->size - block_rsv->reserved;
> +	spin_unlock(&block_rsv->lock);
> +
> +	if (num_bytes == 0)
> +		return 0;
> +
> +	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> +	if (!ret) {
> +		block_rsv_add_bytes(block_rsv, num_bytes, 0);
> +		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> +					      btrfs_ino(inode), num_bytes, 1);
> +	}
> +	return ret;
> +}
> +
> +/**
> + * btrfs_inode_rsv_release - release any excessive reservation.
> + * @inode - the inode we need to release from.
> + *
> + * This is the same as btrfs_block_rsv_release, except that it handles the
> + * tracepoint for the reservation.
> + */
> +void btrfs_inode_rsv_release(struct btrfs_inode *inode)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 released = 0;
> +
> +	/*
> +	 * Since we statically set the block_rsv->size we just want to say we
> +	 * are releasing 0 bytes, and then we'll just get the reservation over
> +	 * the size free'd.
> +	 */
> +	released = block_rsv_release_bytes(fs_info, block_rsv, global_rsv, 0);
> +	if (released > 0)
> +		trace_btrfs_space_reservation(fs_info, "delalloc",
> +					      btrfs_ino(inode), released, 0);
> +}
> +
>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  			     struct btrfs_block_rsv *block_rsv,
>  			     u64 num_bytes)
> @@ -5810,7 +5879,6 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
>  
>  	space_info = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
>  	fs_info->global_block_rsv.space_info = space_info;
> -	fs_info->delalloc_block_rsv.space_info = space_info;
>  	fs_info->trans_block_rsv.space_info = space_info;
>  	fs_info->empty_block_rsv.space_info = space_info;
>  	fs_info->delayed_block_rsv.space_info = space_info;
> @@ -5830,8 +5898,6 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info)
>  {
>  	block_rsv_release_bytes(fs_info, &fs_info->global_block_rsv, NULL,
>  				(u64)-1);
> -	WARN_ON(fs_info->delalloc_block_rsv.size > 0);
> -	WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->trans_block_rsv.size > 0);
>  	WARN_ON(fs_info->trans_block_rsv.reserved > 0);
>  	WARN_ON(fs_info->chunk_block_rsv.size > 0);
> @@ -5970,95 +6036,37 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
>  	btrfs_block_rsv_release(fs_info, rsv, (u64)-1);
>  }
>  
> -/**
> - * drop_over_reserved_extents - drop our extra extent reservations
> - * @inode: the inode we're dropping the extent for
> - *
> - * We reserve extents we may use, but they may have been merged with other
> - * extents and we may not need the extra reservation.
> - *
> - * We also call this when we've completed io to an extent or had an error and
> - * cleared the outstanding extent, in either case we no longer need our
> - * reservation and can drop the excess.
> - */
> -static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
> -{
> -	unsigned num_extents = 0;
> -
> -	if (inode->reserved_extents > inode->outstanding_extents) {
> -		num_extents = inode->reserved_extents -
> -			inode->outstanding_extents;
> -		btrfs_mod_reserved_extents(inode, -num_extents);
> -	}
> -
> -	if (inode->outstanding_extents == 0 &&
> -	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -			       &inode->runtime_flags))
> -		num_extents++;
> -	return num_extents;
> -}
> -
> -/**
> - * calc_csum_metadata_size - return the amount of metadata space that must be
> - *	reserved/freed for the given bytes.
> - * @inode: the inode we're manipulating
> - * @num_bytes: the number of bytes in question
> - * @reserve: 1 if we are reserving space, 0 if we are freeing space
> - *
> - * This adjusts the number of csum_bytes in the inode and then returns the
> - * correct amount of metadata that must either be reserved or freed.  We
> - * calculate how many checksums we can fit into one leaf and then divide the
> - * number of bytes that will need to be checksumed by this value to figure out
> - * how many checksums will be required.  If we are adding bytes then the number
> - * may go up and we will return the number of additional bytes that must be
> - * reserved.  If it is going down we will return the number of bytes that must
> - * be freed.
> - *
> - * This must be called with BTRFS_I(inode)->lock held.
> - */
> -static u64 calc_csum_metadata_size(struct btrfs_inode *inode, u64 num_bytes,
> -				   int reserve)
> +static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
> +						 struct btrfs_inode *inode)
>  {
> -	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
> -	u64 old_csums, num_csums;
> -
> -	if (inode->flags & BTRFS_INODE_NODATASUM && inode->csum_bytes == 0)
> -		return 0;
> -
> -	old_csums = btrfs_csum_bytes_to_leaves(fs_info, inode->csum_bytes);
> -	if (reserve)
> -		inode->csum_bytes += num_bytes;
> -	else
> -		inode->csum_bytes -= num_bytes;
> -	num_csums = btrfs_csum_bytes_to_leaves(fs_info, inode->csum_bytes);
> -
> -	/* No change, no need to reserve more */
> -	if (old_csums == num_csums)
> -		return 0;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 reserve_size = 0;
> +	u64 csum_leaves;
> +	unsigned outstanding_extents;
>  
> -	if (reserve)
> -		return btrfs_calc_trans_metadata_size(fs_info,
> -						      num_csums - old_csums);
> +	lockdep_assert_held(&inode->lock);
> +	outstanding_extents = inode->outstanding_extents;
> +	if (outstanding_extents)
> +		reserve_size = btrfs_calc_trans_metadata_size(fs_info,
> +						outstanding_extents + 1);
> +	csum_leaves = btrfs_csum_bytes_to_leaves(fs_info,
> +						 inode->csum_bytes);
> +	reserve_size += btrfs_calc_trans_metadata_size(fs_info,
> +						       csum_leaves);
>  
> -	return btrfs_calc_trans_metadata_size(fs_info, old_csums - num_csums);
> +	spin_lock(&block_rsv->lock);
> +	block_rsv->size = reserve_size;
> +	spin_unlock(&block_rsv->lock);
>  }
>  
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
>  	struct btrfs_root *root = inode->root;
> -	struct btrfs_block_rsv *block_rsv = &fs_info->delalloc_block_rsv;
> -	u64 to_reserve = 0;
> -	u64 csum_bytes;
> -	unsigned nr_extents, reserve_extents;
> +	unsigned nr_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
>  	bool delalloc_lock = true;
> -	u64 to_free = 0;
> -	unsigned dropped;
> -	bool release_extra = false;
> -	bool underflow = false;
> -	bool did_retry = false;
>  
>  	/* If we are a free space inode we need to not flush since we will be in
>  	 * the middle of a transaction commit.  We also don't need the delalloc
> @@ -6083,33 +6091,13 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  		mutex_lock(&inode->delalloc_mutex);
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
> -retry:
> +
> +	/* Add our new extents and calculate the new rsv size. */
>  	spin_lock(&inode->lock);
> -	reserve_extents = nr_extents = count_max_extents(num_bytes);
> +	nr_extents = count_max_extents(num_bytes);
>  	btrfs_mod_outstanding_extents(inode, nr_extents);
> -
> -	/*
> -	 * Because we add an outstanding extent for ordered before we clear
> -	 * delalloc we will double count our outstanding extents slightly.  This
> -	 * could mean that we transiently over-reserve, which could result in an
> -	 * early ENOSPC if our timing is unlucky.  Keep track of the case that
> -	 * we had a reservation underflow so we can retry if we fail.
> -	 *
> -	 * Keep in mind we can legitimately have more outstanding extents than
> -	 * reserved because of fragmentation, so only allow a retry once.
> -	 */
> -	if (inode->outstanding_extents >
> -	    inode->reserved_extents + nr_extents) {
> -		reserve_extents = inode->outstanding_extents -
> -			inode->reserved_extents;
> -		underflow = true;
> -	}
> -
> -	/* We always want to reserve a slot for updating the inode. */
> -	to_reserve = btrfs_calc_trans_metadata_size(fs_info,
> -						    reserve_extents + 1);
> -	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
> -	csum_bytes = inode->csum_bytes;
> +	inode->csum_bytes += num_bytes;
> +	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
>  
>  	if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags)) {
> @@ -6119,100 +6107,26 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  			goto out_fail;
>  	}
>  
> -	ret = btrfs_block_rsv_add(root, block_rsv, to_reserve, flush);
> +	ret = btrfs_inode_rsv_refill(inode, flush);
>  	if (unlikely(ret)) {
>  		btrfs_qgroup_free_meta(root,
>  				       nr_extents * fs_info->nodesize);
>  		goto out_fail;
>  	}
>  
> -	spin_lock(&inode->lock);
> -	if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> -			     &inode->runtime_flags)) {
> -		to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
> -		release_extra = true;
> -	}
> -	btrfs_mod_reserved_extents(inode, reserve_extents);
> -	spin_unlock(&inode->lock);
> -
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
> -
> -	if (to_reserve)
> -		trace_btrfs_space_reservation(fs_info, "delalloc",
> -					      btrfs_ino(inode), to_reserve, 1);
> -	if (release_extra)
> -		btrfs_block_rsv_release(fs_info, block_rsv,
> -				btrfs_calc_trans_metadata_size(fs_info, 1));
>  	return 0;
>  
>  out_fail:
>  	spin_lock(&inode->lock);
>  	nr_extents = count_max_extents(num_bytes);
>  	btrfs_mod_outstanding_extents(inode, -nr_extents);
> -
> -	dropped = drop_over_reserved_extents(inode);
> -	/*
> -	 * If the inodes csum_bytes is the same as the original
> -	 * csum_bytes then we know we haven't raced with any free()ers
> -	 * so we can just reduce our inodes csum bytes and carry on.
> -	 */
> -	if (inode->csum_bytes == csum_bytes) {
> -		calc_csum_metadata_size(inode, num_bytes, 0);
> -	} else {
> -		u64 orig_csum_bytes = inode->csum_bytes;
> -		u64 bytes;
> -
> -		/*
> -		 * This is tricky, but first we need to figure out how much we
> -		 * freed from any free-ers that occurred during this
> -		 * reservation, so we reset ->csum_bytes to the csum_bytes
> -		 * before we dropped our lock, and then call the free for the
> -		 * number of bytes that were freed while we were trying our
> -		 * reservation.
> -		 */
> -		bytes = csum_bytes - inode->csum_bytes;
> -		inode->csum_bytes = csum_bytes;
> -		to_free = calc_csum_metadata_size(inode, bytes, 0);
> -
> -
> -		/*
> -		 * Now we need to see how much we would have freed had we not
> -		 * been making this reservation and our ->csum_bytes were not
> -		 * artificially inflated.
> -		 */
> -		inode->csum_bytes = csum_bytes - num_bytes;
> -		bytes = csum_bytes - orig_csum_bytes;
> -		bytes = calc_csum_metadata_size(inode, bytes, 0);
> -
> -		/*
> -		 * Now reset ->csum_bytes to what it should be.  If bytes is
> -		 * more than to_free then we would have freed more space had we
> -		 * not had an artificially high ->csum_bytes, so we need to free
> -		 * the remainder.  If bytes is the same or less then we don't
> -		 * need to do anything, the other free-ers did the correct
> -		 * thing.
> -		 */
> -		inode->csum_bytes = orig_csum_bytes - num_bytes;
> -		if (bytes > to_free)
> -			to_free = bytes - to_free;
> -		else
> -			to_free = 0;
> -	}
> +	inode->csum_bytes -= num_bytes;
> +	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
> -	if (dropped)
> -		to_free += btrfs_calc_trans_metadata_size(fs_info, dropped);
>  
> -	if (to_free) {
> -		btrfs_block_rsv_release(fs_info, block_rsv, to_free);
> -		trace_btrfs_space_reservation(fs_info, "delalloc",
> -					      btrfs_ino(inode), to_free, 0);
> -	}
> -	if (underflow && !did_retry) {
> -		did_retry = true;
> -		underflow = false;
> -		goto retry;
> -	}
> +	btrfs_inode_rsv_release(inode);
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
>  	return ret;
> @@ -6230,25 +6144,17 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
> -	u64 to_free = 0;
> -	unsigned dropped;
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  	spin_lock(&inode->lock);
> -	dropped = drop_over_reserved_extents(inode);
> -	if (num_bytes)
> -		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
> +	inode->csum_bytes -= num_bytes;
> +	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
> -	if (dropped > 0)
> -		to_free += btrfs_calc_trans_metadata_size(fs_info, dropped);
>  
>  	if (btrfs_is_testing(fs_info))
>  		return;
>  
> -	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
> -				      to_free, 0);
> -
> -	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
> +	btrfs_inode_rsv_release(inode);
>  }
>  
>  /**
> @@ -6266,25 +6172,17 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
>  {
>  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
>  	unsigned num_extents;
> -	u64 to_free;
> -	unsigned dropped;
>  
>  	spin_lock(&inode->lock);
>  	num_extents = count_max_extents(num_bytes);
>  	btrfs_mod_outstanding_extents(inode, -num_extents);
> -	dropped = drop_over_reserved_extents(inode);
> +	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
>  
> -	if (!dropped)
> -		return;
> -
>  	if (btrfs_is_testing(fs_info))
>  		return;
>  
> -	to_free = btrfs_calc_trans_metadata_size(fs_info, dropped);
> -	trace_btrfs_space_reservation(fs_info, "delalloc", btrfs_ino(inode),
> -				      to_free, 0);
> -	btrfs_block_rsv_release(fs_info, &fs_info->delalloc_block_rsv, to_free);
> +	btrfs_inode_rsv_release(inode);
>  }
>  
>  /**
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 33ba258815b2..4e092e799f0a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -42,6 +42,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/posix_acl_xattr.h>
>  #include <linux/uio.h>
> +#include <linux/magic.h>
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -315,7 +316,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
>  		btrfs_free_path(path);
>  		return PTR_ERR(trans);
>  	}
> -	trans->block_rsv = &fs_info->delalloc_block_rsv;
> +	trans->block_rsv = &BTRFS_I(inode)->block_rsv;
>  
>  	if (compressed_size && compressed_pages)
>  		extent_item_size = btrfs_file_extent_calc_inline_size(
> @@ -2957,7 +2958,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  			trans = NULL;
>  			goto out;
>  		}
> -		trans->block_rsv = &fs_info->delalloc_block_rsv;
> +		trans->block_rsv = &BTRFS_I(inode)->block_rsv;
>  		ret = btrfs_update_inode_fallback(trans, root, inode);
>  		if (ret) /* -ENOMEM or corruption */
>  			btrfs_abort_transaction(trans, ret);
> @@ -2993,7 +2994,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
>  		goto out;
>  	}
>  
> -	trans->block_rsv = &fs_info->delalloc_block_rsv;
> +	trans->block_rsv = &BTRFS_I(inode)->block_rsv;
>  
>  	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
>  		compress_type = ordered_extent->compress_type;
> @@ -8848,7 +8849,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  	if (iov_iter_rw(iter) == WRITE) {
>  		up_read(&BTRFS_I(inode)->dio_sem);
>  		current->journal_info = NULL;
> -		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>  		if (ret < 0 && ret != -EIOCBQUEUED) {
>  			if (dio_data.reserve)
>  				btrfs_delalloc_release_space(inode, data_reserved,
> @@ -8869,6 +8869,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  		} else if (ret >= 0 && (size_t)ret < count)
>  			btrfs_delalloc_release_space(inode, data_reserved,
>  					offset, count - (size_t)ret);

In case we didn't manage to write everything we are releasing the extra
stuff that wasn't written.

> +		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
In case btrfs_delalloc_release_space triggered wouldn't freeing metadata
here cause some sort of an underflow? Shouldn't we adjust count in case
we have already freed anything beforehand?

>  	}
>  out:
>  	if (wakeup)
> @@ -9433,6 +9434,7 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
>  
>  struct inode *btrfs_alloc_inode(struct super_block *sb)
>  {
> +	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
>  	struct btrfs_inode *ei;
>  	struct inode *inode;
>  
> @@ -9459,8 +9461,9 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
>  
>  	spin_lock_init(&ei->lock);
>  	ei->outstanding_extents = 0;
> -	ei->reserved_extents = 0;
> -
> +	if (sb->s_magic != BTRFS_TEST_MAGIC)
> +		btrfs_init_metadata_block_rsv(fs_info, &ei->block_rsv,
> +					      BTRFS_BLOCK_RSV_DELALLOC);
>  	ei->runtime_flags = 0;
>  	ei->prop_compress = BTRFS_COMPRESS_NONE;
>  	ei->defrag_compress = BTRFS_COMPRESS_NONE;
> @@ -9510,8 +9513,9 @@ void btrfs_destroy_inode(struct inode *inode)
>  
>  	WARN_ON(!hlist_empty(&inode->i_dentry));
>  	WARN_ON(inode->i_data.nrpages);
> +	WARN_ON(BTRFS_I(inode)->block_rsv.reserved);
> +	WARN_ON(BTRFS_I(inode)->block_rsv.size);
>  	WARN_ON(BTRFS_I(inode)->outstanding_extents);
> -	WARN_ON(BTRFS_I(inode)->reserved_extents);
>  	WARN_ON(BTRFS_I(inode)->delalloc_bytes);
>  	WARN_ON(BTRFS_I(inode)->new_delalloc_bytes);
>  	WARN_ON(BTRFS_I(inode)->csum_bytes);
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-10-13  8:39   ` Nikolay Borisov
@ 2017-10-13 13:10     ` Josef Bacik
  2017-10-13 13:33       ` David Sterba
  2017-10-13 13:55       ` Nikolay Borisov
  0 siblings, 2 replies; 51+ messages in thread
From: Josef Bacik @ 2017-10-13 13:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Fri, Oct 13, 2017 at 11:39:15AM +0300, Nikolay Borisov wrote:
> 
> 
> On 29.09.2017 22:43, Josef Bacik wrote:
> >  
> > +static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
> > +						 int mod)
> > +{
> > +	ASSERT(spin_is_locked(&inode->lock));
> > +	inode->outstanding_extents += mod;
> > +	if (btrfs_is_free_space_inode(inode))
> > +		return;
> > +}
> > +
> > +static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
> > +					      int mod)
> > +{
> > +	ASSERT(spin_is_locked(&inode->lock));
> 
> lockdep_assert_held(&inode->lock); for both functions. I've spoken with
> Peterz and he said any other way of checking whether a lock is held, I
> quote, "must die"
> 

Blah I continually forget that's what we're supposed to use now.
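
For reference, a minimal sketch of the first helper with the suggested
assertion swapped in (everything else unchanged from the hunk quoted above):

static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
						 int mod)
{
	/* lockdep-based check instead of open-coding spin_is_locked() */
	lockdep_assert_held(&inode->lock);
	inode->outstanding_extents += mod;
	if (btrfs_is_free_space_inode(inode))
		return;
}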

> > +	inode->reserved_extents += mod;
> > +	if (btrfs_is_free_space_inode(inode))
> > +		return;
> > +}
> > +
> >  static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
> >  {
> >  	int ret = 0;
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index a7f68c304b4c..1262612fbf78 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2742,6 +2742,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
> >  				     u64 *qgroup_reserved, bool use_global_rsv);
> >  void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
> >  				      struct btrfs_block_rsv *rsv);
> > +void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
> > +
> >  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes);
> >  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes);
> >  int btrfs_delalloc_reserve_space(struct inode *inode,
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 1a6aced00a19..aa0f5c8953b0 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -5971,42 +5971,31 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info,
> >  }
> >  
> >  /**
> > - * drop_outstanding_extent - drop an outstanding extent
> > + * drop_over_reserved_extents - drop our extra extent reservations
> >   * @inode: the inode we're dropping the extent for
> > - * @num_bytes: the number of bytes we're releasing.
> >   *
> > - * This is called when we are freeing up an outstanding extent, either called
> > - * after an error or after an extent is written.  This will return the number of
> > - * reserved extents that need to be freed.  This must be called with
> > - * BTRFS_I(inode)->lock held.
> > + * We reserve extents we may use, but they may have been merged with other
> > + * extents and we may not need the extra reservation.
> > + *
> > + * We also call this when we've completed io to an extent or had an error and
> > + * cleared the outstanding extent, in either case we no longer need our
> > + * reservation and can drop the excess.
> >   */
> > -static unsigned drop_outstanding_extent(struct btrfs_inode *inode,
> > -		u64 num_bytes)
> > +static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
> >  {
> > -	unsigned drop_inode_space = 0;
> > -	unsigned dropped_extents = 0;
> > -	unsigned num_extents;
> > +	unsigned num_extents = 0;
> >  
> > -	num_extents = count_max_extents(num_bytes);
> > -	ASSERT(num_extents);
> > -	ASSERT(inode->outstanding_extents >= num_extents);
> > -	inode->outstanding_extents -= num_extents;
> > +	if (inode->reserved_extents > inode->outstanding_extents) {
> > +		num_extents = inode->reserved_extents -
> > +			inode->outstanding_extents;
> > +		btrfs_mod_reserved_extents(inode, -num_extents);
> > +	}
> >  
> >  	if (inode->outstanding_extents == 0 &&
> >  	    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> >  			       &inode->runtime_flags))
> > -		drop_inode_space = 1;
> > -
> > -	/*
> > -	 * If we have more or the same amount of outstanding extents than we have
> > -	 * reserved then we need to leave the reserved extents count alone.
> > -	 */
> > -	if (inode->outstanding_extents >= inode->reserved_extents)
> > -		return drop_inode_space;
> > -
> > -	dropped_extents = inode->reserved_extents - inode->outstanding_extents;
> > -	inode->reserved_extents -= dropped_extents;
> > -	return dropped_extents + drop_inode_space;
> > +		num_extents++;
> > +	return num_extents;
> 
> Something bugs me around the handling of this. In
> btrfs_delalloc_reserve_metadata we do add the additional bytes necessary
> for updating the inode. However outstanding_extents is modified with the
> number of extent items necessary to cover the required byte range, but we
> don't account the extra inode item as being an extent. Then why do we
> have to actually increment the num_extents here? Doesn't this lead to
> underflow?
> 
> To illustrate: A write comes for 1 mb, we count this as 1 outstanding
> extent in delalloc_reserve_metadata:
> 
> nr_extents = count_max_extents(num_bytes);
> inode->outstanding_extents += nr_extents;
> 
> We do account the extra inode item but only when calculating the bytes
> to reserve:
> 
> to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
> And we of course set : test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
> 
> Now, when the time comes to free the outstanding extent and call :
> drop_over_reserved_extents we do account the extra inode item as an
> extent being freed. Is this correct? In any case it's not something
> which is introduced by your patch but something which has been there
> since time immemorial, just wondering?
> 

The outstanding_extents accounting is consistent with only the items needed to
handle the outstanding extent items.  However since changing the inode requires
updating the inode item as well we have to keep this floating reservation for
the inode item until we have 0 outstanding extents.  The way we do this is with
the BTRFS_INODE_DELALLOC_META_RESERVED flag.  So if it isn't set we will
allocate nr_extents + 1 in btrfs_delalloc_reserve_metadata() and then set our
bit.  If we ever steal this reservation we make sure to clear the flag so we
know we don't have to clean it up when outstanding_extents goes to 0.  It's not
super intuitive but needs to be done under the BTRFS_I(inode)->lock so this was
the best place to put it.  I suppose we could move the logic out of here and put
it somewhere else to make it more clear.
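
Condensed from the hunks quoted above (sketch only, not complete functions),
the two halves of that floating inode item reservation are:

/* reserve side, btrfs_delalloc_reserve_metadata() */
to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
/* ... reserve to_reserve from the block rsv, then under inode->lock ... */
if (test_and_set_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
		     &inode->runtime_flags)) {
	/* the inode item slot was already reserved earlier, give the +1 back */
	to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
	release_extra = true;
}

/* release side, drop_over_reserved_extents() */
if (inode->outstanding_extents == 0 &&
    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
		       &inode->runtime_flags))
	num_extents++;	/* drop the floating inode item reservation last */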

> >  }
> >  
> >  /**
> > @@ -6061,13 +6050,15 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  	struct btrfs_block_rsv *block_rsv = &fs_info->delalloc_block_rsv;
> >  	u64 to_reserve = 0;
> >  	u64 csum_bytes;
> > -	unsigned nr_extents;
> > +	unsigned nr_extents, reserve_extents;
> >  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
> >  	int ret = 0;
> >  	bool delalloc_lock = true;
> >  	u64 to_free = 0;
> >  	unsigned dropped;
> >  	bool release_extra = false;
> > +	bool underflow = false;
> > +	bool did_retry = false;
> >  
> >  	/* If we are a free space inode we need to not flush since we will be in
> >  	 * the middle of a transaction commit.  We also don't need the delalloc
> > @@ -6092,18 +6083,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  		mutex_lock(&inode->delalloc_mutex);
> >  
> >  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
> > -
> > +retry:
> >  	spin_lock(&inode->lock);
> > -	nr_extents = count_max_extents(num_bytes);
> > -	inode->outstanding_extents += nr_extents;
> > +	reserve_extents = nr_extents = count_max_extents(num_bytes);
> > +	btrfs_mod_outstanding_extents(inode, nr_extents);
> >  
> > -	nr_extents = 0;
> > -	if (inode->outstanding_extents > inode->reserved_extents)
> > -		nr_extents += inode->outstanding_extents -
> > +	/*
> > +	 * Because we add an outstanding extent for ordered before we clear
> > +	 * delalloc we will double count our outstanding extents slightly.  This
> > +	 * could mean that we transiently over-reserve, which could result in an
> > +	 * early ENOSPC if our timing is unlucky.  Keep track of the case that
> > +	 * we had a reservation underflow so we can retry if we fail.
> > +	 *
> > +	 * Keep in mind we can legitimately have more outstanding extents than
> > +	 * reserved because of fragmentation, so only allow a retry once.
> > +	 */
> > +	if (inode->outstanding_extents >
> > +	    inode->reserved_extents + nr_extents) {
> > +		reserve_extents = inode->outstanding_extents -
> >  			inode->reserved_extents;
> > +		underflow = true;
> > +	}
> >  
> >  	/* We always want to reserve a slot for updating the inode. */
> > -	to_reserve = btrfs_calc_trans_metadata_size(fs_info, nr_extents + 1);
> > +	to_reserve = btrfs_calc_trans_metadata_size(fs_info,
> > +						    reserve_extents + 1);
> >  	to_reserve += calc_csum_metadata_size(inode, num_bytes, 1);
> >  	csum_bytes = inode->csum_bytes;
> >  	spin_unlock(&inode->lock);
> > @@ -6128,7 +6132,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  		to_reserve -= btrfs_calc_trans_metadata_size(fs_info, 1);
> >  		release_extra = true;
> >  	}
> > -	inode->reserved_extents += nr_extents;
> > +	btrfs_mod_reserved_extents(inode, reserve_extents);
> >  	spin_unlock(&inode->lock);
> >  
> >  	if (delalloc_lock)
> > @@ -6144,7 +6148,10 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  
> >  out_fail:
> >  	spin_lock(&inode->lock);
> > -	dropped = drop_outstanding_extent(inode, num_bytes);
> > +	nr_extents = count_max_extents(num_bytes);
> 
> nr_extents isn't changed, so re-calculating it is redundant.
> 
> > +	btrfs_mod_outstanding_extents(inode, -nr_extents);
> > +
> > +	dropped = drop_over_reserved_extents(inode);
> >  	/*
> >  	 * If the inodes csum_bytes is the same as the original
> >  	 * csum_bytes then we know we haven't raced with any free()ers
> > @@ -6201,6 +6208,11 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  		trace_btrfs_space_reservation(fs_info, "delalloc",
> >  					      btrfs_ino(inode), to_free, 0);
> >  	}
> > +	if (underflow && !did_retry) {
> > +		did_retry = true;
> > +		underflow = false;
> > +		goto retry;
> > +	}
> >  	if (delalloc_lock)
> >  		mutex_unlock(&inode->delalloc_mutex);
> >  	return ret;
> > @@ -6208,12 +6220,12 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  
> >  /**
> >   * btrfs_delalloc_release_metadata - release a metadata reservation for an inode
> > - * @inode: the inode to release the reservation for
> > - * @num_bytes: the number of bytes we're releasing
> > + * @inode: the inode to release the reservation for.
> > + * @num_bytes: the number of bytes we are releasing.
> >   *
> >   * This will release the metadata reservation for an inode.  This can be called
> >   * once we complete IO for a given set of bytes to release their metadata
> > - * reservations.
> > + * reservations, or on error for the same reason.
> >   */
> >  void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  {
> > @@ -6223,8 +6235,7 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  
> >  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
> >  	spin_lock(&inode->lock);
> > -	dropped = drop_outstanding_extent(inode, num_bytes);
> > -
> > +	dropped = drop_over_reserved_extents(inode);
> >  	if (num_bytes)
> >  		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
> >  	spin_unlock(&inode->lock);
> > @@ -6241,6 +6252,42 @@ void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes)
> >  }
> >  
> >  /**
> > + * btrfs_delalloc_release_extents - release our outstanding_extents
> > + * @inode: the inode to balance the reservation for.
> > + * @num_bytes: the number of bytes we originally reserved with
> > + *
> > + * When we reserve space we increase outstanding_extents for the extents we may
> > + * add.  Once we've set the range as delalloc or created our ordered extents we
> > + * have outstanding_extents to track the real usage, so we use this to free our
> > + * temporarily tracked outstanding_extents.  This _must_ be used in conjunction
> > + * with btrfs_delalloc_reserve_metadata.
> > + */
> > +void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
> > +{
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
> > +	unsigned num_extents;
> > +	u64 to_free;
> > +	unsigned dropped;
> > +
> > +	spin_lock(&inode->lock);
> > +	num_extents = count_max_extents(num_bytes);
> > +	btrfs_mod_outstanding_extents(inode, -num_extents);
> > +	dropped = drop_over_reserved_extents(inode);
> > +	spin_unlock(&inode->lock);
> > +
> > +	if (!dropped)
> > +		return;
> > +
> > +	if (btrfs_is_testing(fs_info))
> > +		return;
> > +
> > +	to_free = btrfs_calc_trans_metadata_size(fs_info, dropped);
> 
> So what's really happening here is that drop_over_reserved_extents
> really returns the number of items (which can consist of extent items +
> 1 inode item). So perhaps the function should be renamed to
> drop_over_reserved_items?
> 

I'll just move the cleaning up of the inode item reservation out of the helper.
Thanks,

Josef
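
Purely as an illustration of that idea (hypothetical sketch, not what was
merged), the helper could be reduced to the extent item delta, with the
caller handling the inode item bit itself:

static unsigned drop_over_reserved_extents(struct btrfs_inode *inode)
{
	unsigned num_extents = 0;

	if (inode->reserved_extents > inode->outstanding_extents) {
		num_extents = inode->reserved_extents -
			inode->outstanding_extents;
		btrfs_mod_reserved_extents(inode, -num_extents);
	}
	return num_extents;
}

/* caller, still under inode->lock */
dropped = drop_over_reserved_extents(inode);
if (inode->outstanding_extents == 0 &&
    test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
		       &inode->runtime_flags))
	dropped++;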

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/21] btrfs: make the delalloc block rsv per inode
  2017-10-13 11:47   ` Nikolay Borisov
@ 2017-10-13 13:18     ` Josef Bacik
  0 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-10-13 13:18 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Fri, Oct 13, 2017 at 02:47:32PM +0300, Nikolay Borisov wrote:
> 
> 
> > @@ -8848,7 +8849,6 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> >  	if (iov_iter_rw(iter) == WRITE) {
> >  		up_read(&BTRFS_I(inode)->dio_sem);
> >  		current->journal_info = NULL;
> > -		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
> >  		if (ret < 0 && ret != -EIOCBQUEUED) {
> >  			if (dio_data.reserve)
> >  				btrfs_delalloc_release_space(inode, data_reserved,
> > @@ -8869,6 +8869,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> >  		} else if (ret >= 0 && (size_t)ret < count)
> >  			btrfs_delalloc_release_space(inode, data_reserved,
> >  					offset, count - (size_t)ret);
> 
> In case we didn't manage to write everything we are releasing the extra
> stuff that wasn't written.
> 
> > +		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
> In case btrfs_delalloc_release_space triggered wouldn't freeing metadata
> here cause some sort of an underflow? Shouldn't we adjust count in case
> we have already freed anything beforehand?
> 

No, btrfs_delalloc_release_extents() is only for modifying the
outstanding_extents and adjusting the reservation;
btrfs_delalloc_release_space() takes care of the outstanding csum_bytes and
adjusts the reservation.  Thanks,

Josef
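
In other words, taking the two calls from the btrfs_direct_IO hunk quoted
above (condensed, comments added only for illustration):

/*
 * Frees the unwritten part of the data reservation plus the csum_bytes
 * metadata that went with those bytes.
 */
btrfs_delalloc_release_space(inode, data_reserved, offset,
			     count - (size_t)ret);

/*
 * Drops only the temporary outstanding_extents added for this write; it
 * never touches csum_bytes, so the two calls don't overlap.
 */
btrfs_delalloc_release_extents(BTRFS_I(inode), count);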

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-10-13 13:10     ` Josef Bacik
@ 2017-10-13 13:33       ` David Sterba
  2017-10-13 13:55       ` Nikolay Borisov
  1 sibling, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 13:33 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, kernel-team, linux-btrfs

On Fri, Oct 13, 2017 at 09:10:52AM -0400, Josef Bacik wrote:
> On Fri, Oct 13, 2017 at 11:39:15AM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 29.09.2017 22:43, Josef Bacik wrote:
> > >  
> > > +static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
> > > +						 int mod)
> > > +{
> > > +	ASSERT(spin_is_locked(&inode->lock));
> > > +	inode->outstanding_extents += mod;
> > > +	if (btrfs_is_free_space_inode(inode))
> > > +		return;
> > > +}
> > > +
> > > +static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode,
> > > +					      int mod)
> > > +{
> > > +	ASSERT(spin_is_locked(&inode->lock));
> > 
> > lockdep_assert_held(&inode->lock); for both functions. I've spoken with
> > Peterz and he said any other way of checking whether a lock is held, I
> > quote, "must die"
> > 
> 
> Blah I continually forget that's what we're supposed to use now.

I try to keep such information on
https://btrfs.wiki.kernel.org/index.php/Development_notes#BCP

Just started the section with best practices, feel free to add more.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/21] btrfs: add ref-verify mount option
  2017-09-29 19:43 ` [PATCH 04/21] btrfs: add ref-verify mount option Josef Bacik
@ 2017-10-13 13:53   ` David Sterba
  2017-10-13 13:57     ` David Sterba
  0 siblings, 1 reply; 51+ messages in thread
From: David Sterba @ 2017-10-13 13:53 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:48PM -0400, Josef Bacik wrote:
> This adds the infrastructure for turning ref verify on and off for a
> mount, to be used by a later patch.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-10-13 13:10     ` Josef Bacik
  2017-10-13 13:33       ` David Sterba
@ 2017-10-13 13:55       ` Nikolay Borisov
  2017-10-19 18:10         ` Josef Bacik
  1 sibling, 1 reply; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-13 13:55 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs


> 
> The outstanding_extents accounting is consistent with only the items needed to
> handle the outstanding extent items.  However since changing the inode requires
> updating the inode item as well we have to keep this floating reservation for
> the inode item until we have 0 outstanding extents.  The way we do this is with
> the BTRFS_INODE_DELALLOC_META_RESERVED flag.  So if it isn't set we will
> allocate nr_extents + 1 in btrfs_delalloc_reserve_metadata() and then set our
> bit.  If we ever steal this reservation we make sure to clear the flag so we
> know we don't have to clean it up when outstanding_extents goes to 0.  It's not
> super intuitive but needs to be done under the BTRFS_I(inode)->lock so this was
> the best place to put it.  I suppose we could move the logic out of here and put
> it somewhere else to make it more clear.

I think defining this logic in its own, discrete block of code would be
best w.r.t. readability. It's not super obvious.


I'm slowly going through your patchkit so expect more questions, but
otherwise the delalloc stuff after this and patch 03 really starts
looking a lot more obvious!



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/21] btrfs: add ref-verify mount option
  2017-10-13 13:53   ` David Sterba
@ 2017-10-13 13:57     ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 13:57 UTC (permalink / raw)
  To: David Sterba; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Fri, Oct 13, 2017 at 03:53:36PM +0200, David Sterba wrote:
> On Fri, Sep 29, 2017 at 03:43:48PM -0400, Josef Bacik wrote:
> > This adds the infrastructure for turning ref verify on and off for a
> > mount, to be used by a later patch.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> 
> Reviewed-by: David Sterba <dsterba@suse.com>

I've added this hunk so we get the config options recorded at module load time.

@@ -2332,6 +2332,9 @@ static void btrfs_print_mod_info(void)
 #endif
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
                        ", integrity-checker=on"
+#endif
+#ifdef CONFIG_BTRFS_FS_REF_VERIFY
+                       ", ref-verify=on"
 #endif
                        "\n",
                        btrfs_crc32c_impl());

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/21] btrfs: pass root to various extent ref mod functions
  2017-09-29 19:43 ` [PATCH 05/21] btrfs: pass root to various extent ref mod functions Josef Bacik
@ 2017-10-13 14:01   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 14:01 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:49PM -0400, Josef Bacik wrote:
> We need the actual root for the ref verifier tool to work, so change
> these functions to pass the root around instead.  This will be used in
> a subsequent patch.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/21] Btrfs: add a extent ref verify tool
  2017-09-29 19:43 ` [PATCH 06/21] Btrfs: add a extent ref verify tool Josef Bacik
@ 2017-10-13 14:23   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 14:23 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:50PM -0400, Josef Bacik wrote:
> We were having corruption issues that were tied back to problems with the extent
> tree.  In order to track them down I built this tool to try and find the
> culprit, which was pretty successful.  If you compile with this tool on it will
> live verify every ref update that the fs makes and make sure it is consistent
> and valid.  I've run this through with xfstests and haven't gotten any false
> positives.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

I've fixed the error messages; they should not start with an uppercase
letter and must not end with \n, and can be un-indented so they fit as
much as possible under 80.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 08/21] btrfs: add a helper to return a head ref
  2017-09-29 19:43 ` [PATCH 08/21] btrfs: add a helper to return a head ref Josef Bacik
@ 2017-10-13 14:39   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 14:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:52PM -0400, Josef Bacik wrote:
> Simplify the error handling in __btrfs_run_delayed_refs by breaking out
> the code used to return a head back to the delayed_refs tree for
> processing into a helper function.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 09/21] btrfs: move extent_op cleanup to a helper
  2017-09-29 19:43 ` [PATCH 09/21] btrfs: move extent_op cleanup to a helper Josef Bacik
@ 2017-10-13 14:50   ` David Sterba
  2017-10-16 14:05   ` Nikolay Borisov
  1 sibling, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 14:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:53PM -0400, Josef Bacik wrote:
> Move the extent_op cleanup for an empty head ref to a helper function to
> help simplify __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/21] btrfs: breakout empty head cleanup to a helper
  2017-09-29 19:43 ` [PATCH 10/21] btrfs: breakout empty head " Josef Bacik
@ 2017-10-13 14:57   ` David Sterba
  2017-10-16 14:07   ` Nikolay Borisov
  1 sibling, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 14:57 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:54PM -0400, Josef Bacik wrote:
> Move this code out to a helper function to further simplify
> __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic
  2017-09-29 19:43 ` [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic Josef Bacik
@ 2017-10-13 15:05   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 15:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:55PM -0400, Josef Bacik wrote:
> We only use this logic if our ref isn't a ref_head, so move it up into
> the if (ref) case since we know that this is a normal ref and not a
> delayed ref head.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 12/21] btrfs: move all ref head cleanup to the helper function
  2017-09-29 19:43 ` [PATCH 12/21] btrfs: move all ref head cleanup to the helper function Josef Bacik
@ 2017-10-13 15:39   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 15:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:56PM -0400, Josef Bacik wrote:
> We do a couple different cleanup operations on the ref head.  We adjust
> counters, we'll free any reserved space if we didn't end up using the
> ref, and we clear the pending csum bytes.  Move all these disparate
> things into cleanup_ref_head and clean up the logic in
> __btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner,
> as well as making run_one_delayed_ref() only deal with real refs and not
> the ref head.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

This one was tough.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head
  2017-09-29 19:43 ` [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head Josef Bacik
@ 2017-10-13 16:05   ` David Sterba
  2017-10-16 14:41   ` Nikolay Borisov
  1 sibling, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 16:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:57PM -0400, Josef Bacik wrote:
> This is just excessive information in the ref_head, and makes the code
> complicated.  It is a relic from when we had the heads and the refs in
> the same tree, which is no longer the case.  With this removal I've
> cleaned up a bunch of the cruft around this old assumption as well.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

> @@ -91,8 +82,8 @@ struct btrfs_delayed_extent_op {
>   * reference count modifications we've queued up.
>   */
>  struct btrfs_delayed_ref_head {
> -	struct btrfs_delayed_ref_node node;
> -
> +	u64 bytenr, num_bytes;

In structures, one declaration per line is recommended. Fixed.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 14/21] btrfs: remove type argument from comp_tree_refs
  2017-09-29 19:43 ` [PATCH 14/21] btrfs: remove type argument from comp_tree_refs Josef Bacik
@ 2017-10-13 16:06   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 16:06 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:58PM -0400, Josef Bacik wrote:
> We can get this from the ref we've passed in.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 15/21] btrfs: switch args for comp_*_refs
  2017-09-29 19:43 ` [PATCH 15/21] btrfs: switch args for comp_*_refs Josef Bacik
@ 2017-10-13 16:24   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 16:24 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:59PM -0400, Josef Bacik wrote:
> Because seriously?  ref2 and then ref1?

Brought to you by the infamous commit 5d4f98a28c7d334091c1b with
diffstat: 20 files changed, 6958 insertions(+), 2073 deletions(-).

But we still need some description of the impact of the change, as this
effectively reverses the sorting order.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
  2017-09-29 19:44 ` [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit Josef Bacik
@ 2017-10-13 17:10   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 17:10 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:44:03PM -0400, Josef Bacik wrote:
> We're holding the sb_start_intwrite lock at this point, and doing async
> filemap_flush of the inodes will result in a deadlock if we freeze the
> fs during this operation.  This is because we could do a
> btrfs_join_transaction() in the thread we are waiting on which would
> block at sb_start_intwrite, and thus deadlock.  Using
> writeback_inodes_sb() side steps the problem by not introducing all of
> these extra locking dependencies.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/transaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 68c3e1c04bca..9fed8c67b6e8 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1917,7 +1917,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
>  static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
>  {
>  	if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> -		return btrfs_start_delalloc_roots(fs_info, 1, -1);
> +		writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);

This really needs a comment in the code; it's calling an external
interface to do some internal stuff. I was not able to trace it from
writeback_inodes_sb back to the delalloc in a reasonable time, so some
pointers in the comment are highly appreciated. The changelog sort of
explains what's going on, but it's still hard to find it in the code.

With writeback_inodes_sb, the function does not return any error and can
be moved to the single caller.
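Something along these lines, as a rough sketch of the commented, void
version being asked for (an illustration, not the applied patch):

	/*
	 * FLUSHONCOMMIT: flush all delalloc before the commit.  Use the
	 * generic writeback path instead of btrfs_start_delalloc_roots():
	 * the async filemap_flush done there can end up waiting on a thread
	 * that does btrfs_join_transaction() and blocks at sb_start_intwrite,
	 * which deadlocks against a freeze since we already hold
	 * sb_start_intwrite at this point.
	 */
	static inline void btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
	{
		if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
			writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
	}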

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations
  2017-09-29 19:44 ` [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations Josef Bacik
@ 2017-10-13 17:17   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 17:17 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:44:05PM -0400, Josef Bacik wrote:
> These are useful for debugging problems where we mess with
> trans->block_rsv to make sure we're not screwing something up.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction
  2017-09-29 19:43 ` [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction Josef Bacik
@ 2017-10-13 17:20   ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 17:20 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:51PM -0400, Josef Bacik wrote:
> We were only doing btrfs_check_space_for_delayed_refs() if the metadata
> space was full, ie we couldn't allocate chunks.  This assumes we'll be
> able to allocate chunks during transaction commit, but since nothing
> does a LIMIT flush during the transaction commit this won't actually
> happen unless we happen to run shy of actual space.  We already take
> into account a full fs in btrfs_check_space_for_delayed_refs() so just
> kill this extra check to make sure we're ending the transaction when we
> need to.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 00/21] My current btrfs patch queue
  2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
                   ` (20 preceding siblings ...)
  2017-09-29 19:44 ` [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations Josef Bacik
@ 2017-10-13 17:28 ` David Sterba
  21 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-13 17:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, linux-btrfs

On Fri, Sep 29, 2017 at 03:43:44PM -0400, Josef Bacik wrote:
> This is my current set of outstanding patches.  A lot of these had reviews and
> I've incorporated the feedback.  They have been pretty thorougly tested and are
> pretty solid.
> 
> [PATCH 01/21] Btrfs: rework outstanding_extents
> [PATCH 02/21] btrfs: add tracepoints for outstanding extents mods
> [PATCH 03/21] btrfs: make the delalloc block rsv per inode

Left out of the main patch pile for now, as there's still ongoing review.

> [PATCH 04/21] btrfs: add ref-verify mount option
> [PATCH 05/21] btrfs: pass root to various extent ref mod functions
> [PATCH 06/21] Btrfs: add a extent ref verify tool
> [PATCH 07/21] Btrfs: only check delayed ref usage in
> [PATCH 08/21] btrfs: add a helper to return a head ref
> [PATCH 09/21] btrfs: move extent_op cleanup to a helper
> [PATCH 10/21] btrfs: breakout empty head cleanup to a helper
> [PATCH 11/21] btrfs: move ref_mod modification into the if (ref)
> [PATCH 12/21] btrfs: move all ref head cleanup to the helper function
> [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head
> [PATCH 14/21] btrfs: remove type argument from comp_tree_refs

Added to misc-next.

> [PATCH 15/21] btrfs: switch args for comp_*_refs
> [PATCH 16/21] btrfs: add a comp_refs() helper
> [PATCH 17/21] btrfs: track refs in a rb_tree instead of a list

15 still needs some clarification and the other two depend on it, so
they're not merged yet.

> [PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel

Merged outside of this queue earlier.

> [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in
> [PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle

Not enough brain power to review them today.

> [PATCH 21/21] btrfs: add assertions for releasing trans handle

Added.

Feel free to send updates as new patches. I'll reorder the for-next
branch so the unmerged patches are still there in a separate branch and
will update them as needed.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 09/21] btrfs: move extent_op cleanup to a helper
  2017-09-29 19:43 ` [PATCH 09/21] btrfs: move extent_op cleanup to a helper Josef Bacik
  2017-10-13 14:50   ` David Sterba
@ 2017-10-16 14:05   ` Nikolay Borisov
  2017-10-16 15:02     ` David Sterba
  1 sibling, 1 reply; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-16 14:05 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 29.09.2017 22:43, Josef Bacik wrote:
> Move the extent_op cleanup for an empty head ref to a helper function to
> help simplify __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 77 ++++++++++++++++++++++++++------------------------
>  1 file changed, 40 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f356b4a66186..f4048b23c7be 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2587,6 +2587,26 @@ unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
>  	btrfs_delayed_ref_unlock(head);
>  }
>  
> +static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> +			     struct btrfs_fs_info *fs_info,

Do we really need the fs_info as a separate argument, since we can
obtain it from trans->fs_info? Admittedly this would require refactoring
run_delayed_extent_op as well. Looking at the other patches which
simplify delayed refs, there are numerous functions which take both a
transaction and an fs_info. Is there a case where the trans' fs_info is
not the same as the fs_info passed in? I'd say no.

> +			     struct btrfs_delayed_ref_head *head)
> +{
> +	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> +	int ret;
> +
> +	if (!extent_op)
> +		return 0;
> +	head->extent_op = NULL;
> +	if (head->must_insert_reserved) {
> +		btrfs_free_delayed_extent_op(extent_op);
> +		return 0;
> +	}
> +	spin_unlock(&head->lock);
> +	ret = run_delayed_extent_op(trans, fs_info, &head->node, extent_op);
> +	btrfs_free_delayed_extent_op(extent_op);
> +	return ret ? ret : 1;
> +}
> +
>  /*
>   * Returns 0 on success or if called with an already aborted transaction.
>   * Returns -ENOMEM or -EIO on failure and will abort the transaction.
> @@ -2667,16 +2687,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  			continue;
>  		}
>  
> -		/*
> -		 * record the must insert reserved flag before we
> -		 * drop the spin lock.
> -		 */
> -		must_insert_reserved = locked_ref->must_insert_reserved;
> -		locked_ref->must_insert_reserved = 0;
> -
> -		extent_op = locked_ref->extent_op;
> -		locked_ref->extent_op = NULL;
> -
>  		if (!ref) {
>  
>  
> @@ -2686,33 +2696,17 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  			 */
>  			ref = &locked_ref->node;
>  
> -			if (extent_op && must_insert_reserved) {
> -				btrfs_free_delayed_extent_op(extent_op);
> -				extent_op = NULL;
> -			}
> -
> -			if (extent_op) {
> -				spin_unlock(&locked_ref->lock);
> -				ret = run_delayed_extent_op(trans, fs_info,
> -							    ref, extent_op);
> -				btrfs_free_delayed_extent_op(extent_op);
> -
> -				if (ret) {
> -					/*
> -					 * Need to reset must_insert_reserved if
> -					 * there was an error so the abort stuff
> -					 * can cleanup the reserved space
> -					 * properly.
> -					 */
> -					if (must_insert_reserved)
> -						locked_ref->must_insert_reserved = 1;
> -					unselect_delayed_ref_head(delayed_refs,
> -								  locked_ref);
> -					btrfs_debug(fs_info,
> -						    "run_delayed_extent_op returned %d",
> -						    ret);
> -					return ret;
> -				}
> +			ret = cleanup_extent_op(trans, fs_info, locked_ref);
> +			if (ret < 0) {
> +				unselect_delayed_ref_head(delayed_refs,
> +							  locked_ref);
> +				btrfs_debug(fs_info,
> +					    "run_delayed_extent_op returned %d",
> +					    ret);
> +				return ret;
> +			} else if (ret > 0) {
> +				/* We dropped our lock, we need to loop. */
> +				ret = 0;
>  				continue;
>  			}
>  
> @@ -2761,6 +2755,15 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  				WARN_ON(1);
>  			}
>  		}
> +		/*
> +		 * record the must insert reserved flag before we
> +		 * drop the spin lock.
> +		 */
> +		must_insert_reserved = locked_ref->must_insert_reserved;
> +		locked_ref->must_insert_reserved = 0;
> +
> +		extent_op = locked_ref->extent_op;
> +		locked_ref->extent_op = NULL;
>  		spin_unlock(&locked_ref->lock);
>  
>  		ret = run_one_delayed_ref(trans, fs_info, ref, extent_op,
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/21] btrfs: breakout empty head cleanup to a helper
  2017-09-29 19:43 ` [PATCH 10/21] btrfs: breakout empty head " Josef Bacik
  2017-10-13 14:57   ` David Sterba
@ 2017-10-16 14:07   ` Nikolay Borisov
  2017-10-16 14:55     ` David Sterba
  1 sibling, 1 reply; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-16 14:07 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 29.09.2017 22:43, Josef Bacik wrote:
> Move this code out to a helper function to further simplify
> __btrfs_run_delayed_refs.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 80 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 45 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f4048b23c7be..bae2eac11db7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2607,6 +2607,43 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
>  	return ret ? ret : 1;
>  }
>  
> +static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> +			    struct btrfs_fs_info *fs_info,
> +			    struct btrfs_delayed_ref_head *head)
> +{
> +	struct btrfs_delayed_ref_root *delayed_refs;
> +	int ret;
> +
> +	delayed_refs = &trans->transaction->delayed_refs;
> +
> +	ret = cleanup_extent_op(trans, fs_info, head);
> +	if (ret < 0) {
> +		unselect_delayed_ref_head(delayed_refs, head);
> +		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> +		return ret;
> +	} else if (ret) {
> +		return ret;
> +	}
nit: Omit the { } when there is a single statement after a conditional
clause.

> +
> +	/*
> +	 * Need to drop our head ref lock and re-acquire the delayed ref lock
> +	 * and then re-check to make sure nobody got added.
> +	 */
> +	spin_unlock(&head->lock);
> +	spin_lock(&delayed_refs->lock);
> +	spin_lock(&head->lock);
> +	if (!list_empty(&head->ref_list) || head->extent_op) {
> +		spin_unlock(&head->lock);
> +		spin_unlock(&delayed_refs->lock);
> +		return 1;
> +	}
> +	head->node.in_tree = 0;
> +	delayed_refs->num_heads--;
> +	rb_erase(&head->href_node, &delayed_refs->href_root);
> +	spin_unlock(&delayed_refs->lock);
> +	return 0;
> +}
> +
>  /*
>   * Returns 0 on success or if called with an already aborted transaction.
>   * Returns -ENOMEM or -EIO on failure and will abort the transaction.
> @@ -2688,47 +2725,20 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
>  		}
>  
>  		if (!ref) {
> -
> -
> -			/* All delayed refs have been processed, Go ahead
> -			 * and send the head node to run_one_delayed_ref,
> -			 * so that any accounting fixes can happen
> -			 */
> -			ref = &locked_ref->node;
> -
> -			ret = cleanup_extent_op(trans, fs_info, locked_ref);
> -			if (ret < 0) {
> -				unselect_delayed_ref_head(delayed_refs,
> -							  locked_ref);
> -				btrfs_debug(fs_info,
> -					    "run_delayed_extent_op returned %d",
> -					    ret);
> -				return ret;
> -			} else if (ret > 0) {
> +			ret = cleanup_ref_head(trans, fs_info, locked_ref);
> +			if (ret > 0 ) {
>  				/* We dropped our lock, we need to loop. */
>  				ret = 0;
>  				continue;
> +			} else if (ret) {
> +				return ret;
>  			}
>  
> -			/*
> -			 * Need to drop our head ref lock and re-acquire the
> -			 * delayed ref lock and then re-check to make sure
> -			 * nobody got added.
> +			/* All delayed refs have been processed, Go ahead
> +			 * and send the head node to run_one_delayed_ref,
> +			 * so that any accounting fixes can happen
>  			 */
> -			spin_unlock(&locked_ref->lock);
> -			spin_lock(&delayed_refs->lock);
> -			spin_lock(&locked_ref->lock);
> -			if (!list_empty(&locked_ref->ref_list) ||
> -			    locked_ref->extent_op) {
> -				spin_unlock(&locked_ref->lock);
> -				spin_unlock(&delayed_refs->lock);
> -				continue;
> -			}
> -			ref->in_tree = 0;
> -			delayed_refs->num_heads--;
> -			rb_erase(&locked_ref->href_node,
> -				 &delayed_refs->href_root);
> -			spin_unlock(&delayed_refs->lock);
> +			ref = &locked_ref->node;
>  		} else {
>  			actual_count++;
>  			ref->in_tree = 0;
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head
  2017-09-29 19:43 ` [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head Josef Bacik
  2017-10-13 16:05   ` David Sterba
@ 2017-10-16 14:41   ` Nikolay Borisov
  1 sibling, 0 replies; 51+ messages in thread
From: Nikolay Borisov @ 2017-10-16 14:41 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs



On 29.09.2017 22:43, Josef Bacik wrote:
> This is just excessive information in the ref_head, and makes the code
> complicated.  It is a relic from when we had the heads and the refs in
> the same tree, which is no longer the case.  With this removal I've
> cleaned up a bunch of the cruft around this old assumption as well.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/backref.c           |   4 +-
>  fs/btrfs/delayed-ref.c       | 126 +++++++++++++++++++------------------------
>  fs/btrfs/delayed-ref.h       |  49 ++++++-----------
>  fs/btrfs/disk-io.c           |  12 ++---
>  fs/btrfs/extent-tree.c       |  90 ++++++++++++-------------------
>  include/trace/events/btrfs.h |  15 +++---
>  6 files changed, 120 insertions(+), 176 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index b517ef1477ea..33cba1abf8b6 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1178,7 +1178,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  		head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
>  		if (head) {
>  			if (!mutex_trylock(&head->mutex)) {
> -				refcount_inc(&head->node.refs);
> +				refcount_inc(&head->refs);
>  				spin_unlock(&delayed_refs->lock);
>  
>  				btrfs_release_path(path);
> @@ -1189,7 +1189,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
>  				 */
>  				mutex_lock(&head->mutex);
>  				mutex_unlock(&head->mutex);
> -				btrfs_put_delayed_ref(&head->node);
> +				btrfs_put_delayed_ref_head(head);
>  				goto again;
>  			}
>  			spin_unlock(&delayed_refs->lock);
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 93ffa898df6d..b9b41c838da4 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -96,15 +96,15 @@ static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
>  	u64 bytenr;
>  
>  	ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node);
> -	bytenr = ins->node.bytenr;
> +	bytenr = ins->bytenr;
>  	while (*p) {
>  		parent_node = *p;
>  		entry = rb_entry(parent_node, struct btrfs_delayed_ref_head,
>  				 href_node);
>  
> -		if (bytenr < entry->node.bytenr)
> +		if (bytenr < entry->bytenr)
>  			p = &(*p)->rb_left;
> -		else if (bytenr > entry->node.bytenr)
> +		else if (bytenr > entry->bytenr)
>  			p = &(*p)->rb_right;
>  		else
>  			return entry;
> @@ -133,15 +133,15 @@ find_ref_head(struct rb_root *root, u64 bytenr,
>  	while (n) {
>  		entry = rb_entry(n, struct btrfs_delayed_ref_head, href_node);
>  
> -		if (bytenr < entry->node.bytenr)
> +		if (bytenr < entry->bytenr)
>  			n = n->rb_left;
> -		else if (bytenr > entry->node.bytenr)
> +		else if (bytenr > entry->bytenr)
>  			n = n->rb_right;
>  		else
>  			return entry;
>  	}
>  	if (entry && return_bigger) {
> -		if (bytenr > entry->node.bytenr) {
> +		if (bytenr > entry->bytenr) {
>  			n = rb_next(&entry->href_node);
>  			if (!n)
>  				n = rb_first(root);
> @@ -164,17 +164,17 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
>  	if (mutex_trylock(&head->mutex))
>  		return 0;
>  
> -	refcount_inc(&head->node.refs);
> +	refcount_inc(&head->refs);
>  	spin_unlock(&delayed_refs->lock);
>  
>  	mutex_lock(&head->mutex);
>  	spin_lock(&delayed_refs->lock);
> -	if (!head->node.in_tree) {
> +	if (RB_EMPTY_NODE(&head->href_node)) {
>  		mutex_unlock(&head->mutex);
> -		btrfs_put_delayed_ref(&head->node);
> +		btrfs_put_delayed_ref_head(head);
>  		return -EAGAIN;
>  	}
> -	btrfs_put_delayed_ref(&head->node);
> +	btrfs_put_delayed_ref_head(head);
>  	return 0;
>  }
>  
> @@ -183,15 +183,10 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>  				    struct btrfs_delayed_ref_head *head,
>  				    struct btrfs_delayed_ref_node *ref)
>  {
> -	if (btrfs_delayed_ref_is_head(ref)) {
> -		head = btrfs_delayed_node_to_head(ref);
> -		rb_erase(&head->href_node, &delayed_refs->href_root);
> -	} else {
> -		assert_spin_locked(&head->lock);
> -		list_del(&ref->list);
> -		if (!list_empty(&ref->add_list))
> -			list_del(&ref->add_list);
> -	}
> +	assert_spin_locked(&head->lock);

lockdep_assert_held
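i.e., presumably the suggestion is to use the lockdep annotation instead of
the spinlock-specific assert, something like:

	lockdep_assert_held(&head->lock);

which, as far as I can tell, verifies that the current task actually holds
the lock when lockdep is enabled, rather than only checking that the lock
is locked by someone.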


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/21] btrfs: breakout empty head cleanup to a helper
  2017-10-16 14:07   ` Nikolay Borisov
@ 2017-10-16 14:55     ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-16 14:55 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Mon, Oct 16, 2017 at 05:07:21PM +0300, Nikolay Borisov wrote:
> 
> 
> On 29.09.2017 22:43, Josef Bacik wrote:
> > Move this code out to a helper function to further simplify
> > __btrfs_run_delayed_refs.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/extent-tree.c | 80 ++++++++++++++++++++++++++++----------------------
> >  1 file changed, 45 insertions(+), 35 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index f4048b23c7be..bae2eac11db7 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2607,6 +2607,43 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> >  	return ret ? ret : 1;
> >  }
> >  
> > +static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> > +			    struct btrfs_fs_info *fs_info,
> > +			    struct btrfs_delayed_ref_head *head)
> > +{
> > +	struct btrfs_delayed_ref_root *delayed_refs;
> > +	int ret;
> > +
> > +	delayed_refs = &trans->transaction->delayed_refs;
> > +
> > +	ret = cleanup_extent_op(trans, fs_info, head);
> > +	if (ret < 0) {
> > +		unselect_delayed_ref_head(delayed_refs, head);
> > +		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
> > +		return ret;
> > +	} else if (ret) {
> > +		return ret;
> > +	}
> nit: Omit the { } when there is a single statement after a conditional
> clause.

Unless it's part of an if / else if / ... / else chain, in which case all
branches should use { }.
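For illustration only (a contrived sketch, not from the patch): the hunk
above keeps the braces on the single-statement branch because it sits in
an if / else if chain:

	if (ret < 0) {
		unselect_delayed_ref_head(delayed_refs, head);
		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
		return ret;
	} else if (ret) {
		/* single statement, but braced to match the rest of the chain */
		return ret;
	}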

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 09/21] btrfs: move extent_op cleanup to a helper
  2017-10-16 14:05   ` Nikolay Borisov
@ 2017-10-16 15:02     ` David Sterba
  0 siblings, 0 replies; 51+ messages in thread
From: David Sterba @ 2017-10-16 15:02 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Mon, Oct 16, 2017 at 05:05:19PM +0300, Nikolay Borisov wrote:
> 
> 
> On 29.09.2017 22:43, Josef Bacik wrote:
> > Move the extent_op cleanup for an empty head ref to a helper function to
> > help simplify __btrfs_run_delayed_refs.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/btrfs/extent-tree.c | 77 ++++++++++++++++++++++++++------------------------
> >  1 file changed, 40 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index f356b4a66186..f4048b23c7be 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2587,6 +2587,26 @@ unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> >  	btrfs_delayed_ref_unlock(head);
> >  }
> >  
> > +static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> > +			     struct btrfs_fs_info *fs_info,
> 
> Do we really need the fs_info as a separate argument, since we can
> obtain it from trans->fs_info, admittedly this would require refactoring
> run_delayed_extent_op as well? Looking at the other patches which
> simplify delayed refs there are numerous function which take both a
> transaction and a fs_info. Is there a case where trans' fs_info is not
> the same as the fs_info based, I'd say no?

I've noticed the trans/fs_info redundancy, but it does not seem
important enough compared to the non-trivial changes the patch makes. The
function group (e.g. run_delayed_extent_op) uses both trans and fs_info, so
it conforms to the current use. I'd suggest doing the argument cleanup
later, once the delayed ref series is merged.

Historically the fs_info pointer was not in the transaction, so it had to
be passed separately, and many functions still do that. There are some
exceptions where the transaction can be NULL.

For all internal helpers it would be good to use just trans and reuse
fs_info from there. For functions that represent an API for other code,
like when transaction commit calls into delayed refs, it may look better
when both trans and fs_info are passed. This should be decided case by
case.
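For the record, a rough sketch of what the later cleanup could look like
for an internal helper, assuming trans->fs_info is always valid in this
path (which is what the thread concludes):

	static int cleanup_extent_op(struct btrfs_trans_handle *trans,
				     struct btrfs_delayed_ref_head *head)
	{
		struct btrfs_fs_info *fs_info = trans->fs_info;

		/* ... body as in the patch, with fs_info derived from trans ... */
	}

Callers would then pass only the transaction handle.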

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
  2017-10-13  8:39   ` Nikolay Borisov
@ 2017-10-19  3:14   ` Edmund Nadolski
  1 sibling, 0 replies; 51+ messages in thread
From: Edmund Nadolski @ 2017-10-19  3:14 UTC (permalink / raw)
  To: Josef Bacik, kernel-team, linux-btrfs

just a few quick things for the changelog:

On 09/29/2017 01:43 PM, Josef Bacik wrote:
> Right now we do a lot of weird hoops around outstanding_extents in order
> to keep the extent count consistent.  This is because we logically
> transfer the outstanding_extent count from the initial reservation
> through the set_delalloc_bits.  This makes it pretty difficult to get a
> handle on how and when we need to mess with outstanding_extents.
> 
> Fix this by revamping the rules of how we deal with outstanding_extents.
> Now instead everybody that is holding on to a delalloc extent is
> required to increase the outstanding extents count for itself.  This
> means we'll have something like this
> 
> btrfs_dealloc_reserve_metadata	- outstanding_extents = 1

s/dealloc/delalloc/


>  btrfs_set_delalloc		- outstanding_extents = 2

should be btrfs_set_extent_delalloc?


> btrfs_release_delalloc_extents	- outstanding_extents = 1
> 
> for an initial file write.  Now take the append write where we extend an
> existing delalloc range but still under the maximum extent size
> 
> btrfs_delalloc_reserve_metadata - outstanding_extents = 2
>   btrfs_set_delalloc

btrfs_set_extent_delalloc?


>     btrfs_set_bit_hook		- outstanding_extents = 3
>     btrfs_merge_bit_hook	- outstanding_extents = 2

should be btrfs_clear_bit_hook? (or btrfs_merge_extent_hook?)


> btrfs_release_delalloc_extents	- outstanding_extnets = 1

btrfs_delalloc_release_metadata?


> 
> In order to make the ordered extent transition we of course must now
> make ordered extents carry their own outstanding_extent reservation, so
> for cow_file_range we end up with
> 
> btrfs_add_ordered_extent	- outstanding_extents = 2
> clear_extent_bit		- outstanding_extents = 1
> btrfs_remove_ordered_extent	- outstanding_extents = 0
> 
> This makes all manipulations of outstanding_extents much more explicit.
> Every successful call to btrfs_reserve_delalloc_metadata _must_ now be
                           ^
btrfs_delalloc_reserve_metadata?


Thanks,
Ed


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/21] Btrfs: rework outstanding_extents
  2017-10-13 13:55       ` Nikolay Borisov
@ 2017-10-19 18:10         ` Josef Bacik
  0 siblings, 0 replies; 51+ messages in thread
From: Josef Bacik @ 2017-10-19 18:10 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, kernel-team, linux-btrfs

On Fri, Oct 13, 2017 at 04:55:58PM +0300, Nikolay Borisov wrote:
> 
> > 
> > The outstanding_extents accounting is consistent with only the items needed to
> > handle the outstanding extent items.  However since changing the inode requires
> > updating the inode item as well we have to keep this floating reservation for
> > the inode item until we have 0 outstanding extents.  The way we do this is with
> > the BTRFS_INODE_DELALLOC_META_RESERVED flag.  So if it isn't set we will
> > allocate nr_exntents + 1 in btrfs_delalloc_reserve_metadata() and then set our
> > bit.  If we ever steal this reservation we make sure to clear the flag so we
> > know we don't have to clean it up when outstanding_extents goes to 0.  It's not
> > super intuitive but needs to be done under the BTRFS_I(inode)->lock so this was
> > the best place to put it.  I suppose we could move the logic out of here and put
> > it somewhere else to make it more clear.
> 
> I think defining this logic in its own, discrete block of code would be
> best w.r.t. readability. It's not super obvious.
> 

I went to do this and realized that I rip all of this out when we switch to
per-inode block rsvs, so I'm just going to leave this patch as is.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread

Thread overview: 51+ messages
2017-09-29 19:43 [PATCH 00/21] My current btrfs patch queue Josef Bacik
2017-09-29 19:43 ` [PATCH 01/21] Btrfs: rework outstanding_extents Josef Bacik
2017-10-13  8:39   ` Nikolay Borisov
2017-10-13 13:10     ` Josef Bacik
2017-10-13 13:33       ` David Sterba
2017-10-13 13:55       ` Nikolay Borisov
2017-10-19 18:10         ` Josef Bacik
2017-10-19  3:14   ` Edmund Nadolski
2017-09-29 19:43 ` [PATCH 02/21] btrfs: add tracepoints for outstanding extents mods Josef Bacik
2017-09-29 19:43 ` [PATCH 03/21] btrfs: make the delalloc block rsv per inode Josef Bacik
2017-10-13 11:47   ` Nikolay Borisov
2017-10-13 13:18     ` Josef Bacik
2017-09-29 19:43 ` [PATCH 04/21] btrfs: add ref-verify mount option Josef Bacik
2017-10-13 13:53   ` David Sterba
2017-10-13 13:57     ` David Sterba
2017-09-29 19:43 ` [PATCH 05/21] btrfs: pass root to various extent ref mod functions Josef Bacik
2017-10-13 14:01   ` David Sterba
2017-09-29 19:43 ` [PATCH 06/21] Btrfs: add a extent ref verify tool Josef Bacik
2017-10-13 14:23   ` David Sterba
2017-09-29 19:43 ` [PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction Josef Bacik
2017-10-13 17:20   ` David Sterba
2017-09-29 19:43 ` [PATCH 08/21] btrfs: add a helper to return a head ref Josef Bacik
2017-10-13 14:39   ` David Sterba
2017-09-29 19:43 ` [PATCH 09/21] btrfs: move extent_op cleanup to a helper Josef Bacik
2017-10-13 14:50   ` David Sterba
2017-10-16 14:05   ` Nikolay Borisov
2017-10-16 15:02     ` David Sterba
2017-09-29 19:43 ` [PATCH 10/21] btrfs: breakout empty head " Josef Bacik
2017-10-13 14:57   ` David Sterba
2017-10-16 14:07   ` Nikolay Borisov
2017-10-16 14:55     ` David Sterba
2017-09-29 19:43 ` [PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic Josef Bacik
2017-10-13 15:05   ` David Sterba
2017-09-29 19:43 ` [PATCH 12/21] btrfs: move all ref head cleanup to the helper function Josef Bacik
2017-10-13 15:39   ` David Sterba
2017-09-29 19:43 ` [PATCH 13/21] btrfs: remove delayed_ref_node from ref_head Josef Bacik
2017-10-13 16:05   ` David Sterba
2017-10-16 14:41   ` Nikolay Borisov
2017-09-29 19:43 ` [PATCH 14/21] btrfs: remove type argument from comp_tree_refs Josef Bacik
2017-10-13 16:06   ` David Sterba
2017-09-29 19:43 ` [PATCH 15/21] btrfs: switch args for comp_*_refs Josef Bacik
2017-10-13 16:24   ` David Sterba
2017-09-29 19:44 ` [PATCH 16/21] btrfs: add a comp_refs() helper Josef Bacik
2017-09-29 19:44 ` [PATCH 17/21] btrfs: track refs in a rb_tree instead of a list Josef Bacik
2017-09-29 19:44 ` [PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel Josef Bacik
2017-09-29 19:44 ` [PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit Josef Bacik
2017-10-13 17:10   ` David Sterba
2017-09-29 19:44 ` [PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle Josef Bacik
2017-09-29 19:44 ` [PATCH 21/21] btrfs: add assertions for releasing trans handle reservations Josef Bacik
2017-10-13 17:17   ` David Sterba
2017-10-13 17:28 ` [PATCH 00/21] My current btrfs patch queue David Sterba
