* [PATCH 0/6] free space B-tree
@ 2015-09-01 19:01 Omar Sandoval
  2015-09-01 19:01 ` [PATCH 1/6] Btrfs: add extent buffer bitmap operations Omar Sandoval
                   ` (8 more replies)
  0 siblings, 9 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

Hi,

At Facebook, we are still running into the issue of long commit stalls
on large filesystems (say, 10s of TBs). 1bbc621ef284 ("Btrfs: allow
block group cache writeout outside critical section in commit") was a
stopgap, but it wasn't enough, as it's still possible for a very busy
filesystem to have a lot of block groups dirtied between the time we do
the initial free space cache writeout and redo it in the critical
section. As Chris mentioned at LPC, I've been working on another
solution to put these issues behind us.

The solution we came up with is to track free space in a separate B-tree
and update it in tandem with the extent tree. Using a B-tree rather than
another ad-hoc mechanism has the advantage of being well-understood and
giving us the proper metadata allocation profile by default, even if it
could be slightly less efficient than something else purpose-built.

In any case, the scalability win is clear. All of the tests below were
run on a Fusion-io card. My stress testing workload, fallocating 50 TB
worth of space on a 100 TB sparse filesystem image and then freeing it
all, could cause stalls in the critical section for tens of seconds
writing out the free space cache. Using no free space cache or the free
space tree, commits spend only about a tenth of a second in the critical
section.

The time to load the free space tree is still reasonable as well. To
test this, I created extremely fragmented block groups and then ran a
workload that dirtied every inode in the filesystem, measuring how long
we spent loading free space. The free space cache comes out on top,
costing only ~30 ms. Using no cache is much worse, costing about 3-5
seconds. The free space tree is in between, taking 100-500 ms total
(keep in mind that this is for the whole test lasting several minutes,
not just for one block group). A lot of this overhead is actually spent
manipulating the in-memory free space structures, so there's room for
improvement in the future.

Finally, we keep the disk usage under control by switching to a bitmap
format when it becomes more efficient than using extents. Using 256-byte
bitmaps and 4096-byte blocks, a 1 GB block group in the worst case requires
1 GB / (256 * 8 * 4096 B) = 128 bitmaps, at 256 + sizeof(btrfs_item) =
256 + 25 = 281 bytes per bitmap for a grand total of ~35 KB of overhead
per block group, comparable to the free space cache. This incurs the
cost of converting between the two formats while running delayed refs,
but I found that this takes <5 ms on my device.
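
For reference, the worst-case arithmetic above can be checked with a little
userspace C (the constants are the example values from this letter, not
anything read from an actual filesystem):

```c
#include <assert.h>
#include <stdint.h>

/* Example sizes from the cover letter: 256-byte bitmap items, 4096-byte
 * blocks, and a 25-byte on-disk struct btrfs_item header. */
#define BITMAP_BYTES   256ULL
#define BLOCK_SIZE     4096ULL
#define ITEM_OVERHEAD  25ULL

/* Number of bitmap items needed to cover a block group: each bitmap
 * tracks BITMAP_BYTES * 8 blocks of BLOCK_SIZE bytes each. */
static uint64_t bitmaps_per_block_group(uint64_t bg_size)
{
	uint64_t covered = BITMAP_BYTES * 8 * BLOCK_SIZE;

	return (bg_size + covered - 1) / covered;
}

/* Worst-case metadata bytes spent on bitmap items for one block group. */
static uint64_t bitmap_overhead(uint64_t bg_size)
{
	return bitmaps_per_block_group(bg_size) * (BITMAP_BYTES + ITEM_OVERHEAD);
}
```

For a 1 GB block group this gives 128 bitmaps and 35968 bytes, i.e. the
~35 KB quoted above.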

There are a couple of things that I wouldn't mind some comments on.
Firstly, when doing the conversion between the extent and bitmap
formats, we vmalloc a chunk of memory to buffer the free space in. For a
1 GB block group, this is 32 KB of memory. This happens during a
transaction commit, so I'm worried about whether this will cause
problems in low-memory situations. I chose to do this instead of
handling the extent or bitmap items one by one because this is the
simplest way to guarantee that the free space tree does not become
larger at any point during the conversion (for example, imagine that our
metadata space is almost full and we try to process a bitmap that
becomes several extents). I'm wondering if anyone has any better ideas
about how to handle this. Secondly, I *think* that using the commit root
in load_free_space_tree(), as is done in caching_thread(), is correct,
but I'm not 100% sure.
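
For what it's worth, the buffer size works out like this (a
back-of-the-envelope sketch of the sizing described above, not the kernel
allocation code):

```c
#include <assert.h>
#include <stdint.h>

/* Size of the temporary buffer used when converting a block group
 * between the extent and bitmap formats: one bit per block, so a
 * 1 GB block group with 4096-byte blocks needs 32 KB. */
static uint64_t conversion_buffer_bytes(uint64_t bg_size, uint64_t sectorsize)
{
	return bg_size / sectorsize / 8;
}
```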

This series is on top of v4.2. I've run it through xfstests and some
manual stress tests as well. I'm sending it from my personal email
address because soon I'll be back to finish up school, but I'll still be
looking at any comments that anyone might have.

Thanks!

Omar Sandoval (6):
  Btrfs: add extent buffer bitmap operations
  Btrfs: add helpers for read-only compat bits
  Btrfs: introduce the free space B-tree on-disk format
  Btrfs: implement the free space B-tree
  Btrfs: wire up the free space tree to the extent tree
  Btrfs: add free space tree mount option

 fs/btrfs/Makefile            |    2 +-
 fs/btrfs/ctree.h             |  104 ++-
 fs/btrfs/disk-io.c           |   26 +
 fs/btrfs/extent-tree.c       |   88 ++-
 fs/btrfs/extent_io.c         |  101 +++
 fs/btrfs/extent_io.h         |    6 +
 fs/btrfs/free-space-tree.c   | 1468 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-tree.h   |   39 ++
 fs/btrfs/super.c             |   21 +-
 include/trace/events/btrfs.h |    3 +-
 10 files changed, 1843 insertions(+), 15 deletions(-)
 create mode 100644 fs/btrfs/free-space-tree.c
 create mode 100644 fs/btrfs/free-space-tree.h

-- 
2.5.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/6] Btrfs: add extent buffer bitmap operations
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
@ 2015-09-01 19:01 ` Omar Sandoval
  2015-09-01 19:25   ` Josef Bacik
  2015-09-01 19:01 ` [PATCH 2/6] Btrfs: add helpers for read-only compat bits Omar Sandoval
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

These are going to be used for the free space tree bitmap items.
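
As an aside, the byte-granularity masking below can be modeled in userspace
like this (a simplified sketch over a flat array; the kernel version
additionally has to cross page boundaries within the extent buffer, which a
flat array doesn't need):

```c
#include <assert.h>

#define BITS_PER_BYTE 8
#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
#define BITMAP_FIRST_BYTE_MASK(start) \
	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
#define BITMAP_LAST_BYTE_MASK(nbits) \
	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))

/* Set bits [pos, pos + len) in a byte array, one byte at a time:
 * a partial mask for the first byte, whole bytes in the middle, and
 * a partial mask for the last byte. */
static void bitmap_set_bytewise(unsigned char *map, unsigned long pos,
				unsigned long len)
{
	unsigned char *p = map + pos / BITS_PER_BYTE;
	const unsigned long size = pos + len;
	unsigned long bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);

	while (len >= bits_to_set) {
		*p++ |= mask_to_set;
		len -= bits_to_set;
		bits_to_set = BITS_PER_BYTE;
		mask_to_set = BYTE_MASK;
	}
	if (len) {
		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
		*p |= mask_to_set;
	}
}
```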

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |   6 +++
 2 files changed, 107 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 02d05817cbdf..649e3b4eeb1b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5475,6 +5475,107 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
 	}
 }
 
+/*
+ * The extent buffer bitmap operations are done with byte granularity because
+ * bitmap items are not guaranteed to be aligned to a word and therefore a
+ * single word in a bitmap may straddle two pages in the extent buffer.
+ */
+#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
+#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
+#define BITMAP_FIRST_BYTE_MASK(start) \
+	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
+#define BITMAP_LAST_BYTE_MASK(nbits) \
+	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
+
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long nr)
+{
+	size_t offset;
+	char *kaddr;
+	struct page *page;
+	size_t byte_offset = BIT_BYTE(nr);
+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
+
+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
+}
+
+void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
+			      unsigned long pos, unsigned long len)
+{
+	size_t offset;
+	char *kaddr;
+	struct page *page;
+	size_t byte_offset = BIT_BYTE(pos);
+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
+	const unsigned int size = pos + len;
+	int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
+	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
+
+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+
+	while (len >= bits_to_set) {
+		kaddr[offset] |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_BYTE;
+		mask_to_set = ~0U;
+		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
+			offset = 0;
+			page = eb->pages[++i];
+			WARN_ON(!PageUptodate(page));
+			kaddr = page_address(page);
+		}
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
+		kaddr[offset] |= mask_to_set;
+	}
+}
+
+void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
+				unsigned long pos, unsigned long len)
+{
+	size_t offset;
+	char *kaddr;
+	struct page *page;
+	size_t byte_offset = BIT_BYTE(pos);
+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
+	const unsigned int size = pos + len;
+	int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
+	unsigned int mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
+
+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+
+	while (len >= bits_to_clear) {
+		kaddr[offset] &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_BYTE;
+		mask_to_clear = ~0U;
+		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
+			offset = 0;
+			page = eb->pages[++i];
+			WARN_ON(!PageUptodate(page));
+			kaddr = page_address(page);
+		}
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
+		kaddr[offset] &= ~mask_to_clear;
+	}
+}
+
 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
 {
 	unsigned long distance = (src > dst) ? src - dst : dst - src;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c668f36898d3..9185a20081d7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -309,6 +309,12 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long pos);
+void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
+			      unsigned long pos, unsigned long len);
+void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
+				unsigned long pos, unsigned long len);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
-- 
2.5.1



* [PATCH 2/6] Btrfs: add helpers for read-only compat bits
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
  2015-09-01 19:01 ` [PATCH 1/6] Btrfs: add extent buffer bitmap operations Omar Sandoval
@ 2015-09-01 19:01 ` Omar Sandoval
  2015-09-01 19:26   ` Josef Bacik
  2015-09-01 19:01 ` [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.
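
Roughly, the helpers behave like this simplified userspace model (the names
below are illustrative, not the kernel API; the real setter also rechecks
and writes under fs_info->super_lock and logs when the bit is first set):

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_COMPAT_RO_FREE_SPACE_TREE (1ULL << 0)

/* Stand-in for the compat_ro flags word in the superblock copy. */
static uint64_t compat_ro_flags;

static void set_fs_compat_ro(uint64_t flag)
{
	/* Cheap unlocked check first; the kernel rechecks under a
	 * spinlock before writing, so the flag is set exactly once. */
	if (!(compat_ro_flags & flag))
		compat_ro_flags |= flag;
}

static int fs_compat_ro(uint64_t flag)
{
	return !!(compat_ro_flags & flag);
}
```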

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e14188..10388ac041b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4113,6 +4113,40 @@ static inline int __btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
 	return !!(btrfs_super_incompat_flags(disk_super) & flag);
 }
 
+#define btrfs_set_fs_compat_ro(__fs_info, opt) \
+	__btrfs_set_fs_compat_ro((__fs_info), BTRFS_FEATURE_COMPAT_RO_##opt)
+
+static inline void __btrfs_set_fs_compat_ro(struct btrfs_fs_info *fs_info,
+					    u64 flag)
+{
+	struct btrfs_super_block *disk_super;
+	u64 features;
+
+	disk_super = fs_info->super_copy;
+	features = btrfs_super_compat_ro_flags(disk_super);
+	if (!(features & flag)) {
+		spin_lock(&fs_info->super_lock);
+		features = btrfs_super_compat_ro_flags(disk_super);
+		if (!(features & flag)) {
+			features |= flag;
+			btrfs_set_super_compat_ro_flags(disk_super, features);
+			btrfs_info(fs_info, "setting %llu ro feature flag",
+				   flag);
+		}
+		spin_unlock(&fs_info->super_lock);
+	}
+}
+
+#define btrfs_fs_compat_ro(fs_info, opt) \
+	__btrfs_fs_compat_ro((fs_info), BTRFS_FEATURE_COMPAT_RO_##opt)
+
+static inline int __btrfs_fs_compat_ro(struct btrfs_fs_info *fs_info, u64 flag)
+{
+	struct btrfs_super_block *disk_super;
+	disk_super = fs_info->super_copy;
+	return !!(btrfs_super_compat_ro_flags(disk_super) & flag);
+}
+
 /*
  * Call btrfs_abort_transaction as early as possible when an error condition is
  * detected, that way the exact line number is reported.
-- 
2.5.1



* [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
  2015-09-01 19:01 ` [PATCH 1/6] Btrfs: add extent buffer bitmap operations Omar Sandoval
  2015-09-01 19:01 ` [PATCH 2/6] Btrfs: add helpers for read-only compat bits Omar Sandoval
@ 2015-09-01 19:01 ` Omar Sandoval
  2015-09-01 19:28   ` Josef Bacik
  2015-09-01 19:05 ` [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, while bitmaps use the FREE_SPACE_BITMAP key
type with a bitmap item attached, which is just an array of bytes.
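
To illustrate (a userspace sketch of the item layout and sizing; uint32_t
stands in for __le32, and the function name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the on-disk accounting item added below: the free space
 * extent count and the flags word that records whether this block
 * group's free space is stored as extents or bitmaps. */
struct free_space_info {
	uint32_t extent_count;
	uint32_t flags;
} __attribute__((__packed__));

/* A FREE_SPACE_BITMAP item keyed (start, FREE_SPACE_BITMAP, length)
 * carries one bit per block, i.e. length / sectorsize bits. */
static uint64_t bitmap_item_bits(uint64_t length, uint64_t sectorsize)
{
	return length / sectorsize;
}
```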

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h             | 38 ++++++++++++++++++++++++++++++++++++++
 include/trace/events/btrfs.h |  3 ++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 10388ac041b6..34a81a79f5b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -96,6 +96,9 @@ struct btrfs_ordered_sum;
 /* for storing items that use the BTRFS_UUID_KEY* types */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* tracks free space in block groups. */
+#define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -500,6 +503,8 @@ struct btrfs_super_block {
  * Compat flags that we support.  If any incompat flags are set other than the
  * ones specified below then we will fail to mount
  */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
@@ -1061,6 +1066,13 @@ struct btrfs_block_group_item {
 	__le64 flags;
 } __attribute__ ((__packed__));
 
+struct btrfs_free_space_info {
+	__le32 extent_count;
+	__le32 flags;
+} __attribute__ ((__packed__));
+
+#define BTRFS_FREE_SPACE_USING_BITMAPS (1ULL << 0)
+
 #define BTRFS_QGROUP_LEVEL_SHIFT		48
 static inline u64 btrfs_qgroup_level(u64 qgroupid)
 {
@@ -2061,6 +2073,27 @@ struct btrfs_ioctl_defrag_range_args {
  */
 #define BTRFS_BLOCK_GROUP_ITEM_KEY 192
 
+/*
+ * Every block group is represented in the free space tree by a free space info
+ * item, which stores some accounting information. It is keyed on
+ * (block_group_start, FREE_SPACE_INFO, block_group_length).
+ */
+#define BTRFS_FREE_SPACE_INFO_KEY 198
+
+/*
+ * A free space extent tracks an extent of space that is free in a block group.
+ * It is keyed on (start, FREE_SPACE_EXTENT, length).
+ */
+#define BTRFS_FREE_SPACE_EXTENT_KEY 199
+
+/*
+ * When a block group becomes very fragmented, we convert it to use bitmaps
+ * instead of extents. A free space bitmap is keyed on
+ * (start, FREE_SPACE_BITMAP, length); the corresponding item is a bitmap with
+ * (length / sectorsize) bits.
+ */
+#define BTRFS_FREE_SPACE_BITMAP_KEY 200
+
 #define BTRFS_DEV_EXTENT_KEY	204
 #define BTRFS_DEV_ITEM_KEY	216
 #define BTRFS_CHUNK_ITEM_KEY	228
@@ -2461,6 +2494,11 @@ BTRFS_SETGET_FUNCS(disk_block_group_flags,
 BTRFS_SETGET_STACK_FUNCS(block_group_flags,
 			struct btrfs_block_group_item, flags, 64);
 
+/* struct btrfs_free_space_info */
+BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
+		   extent_count, 32);
+BTRFS_SETGET_FUNCS(free_space_flags, struct btrfs_free_space_info, flags, 32);
+
 /* struct btrfs_inode_ref */
 BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
 BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9be12f..e6289e62a2a8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -45,7 +45,8 @@ struct btrfs_qgroup_operation;
 		{ BTRFS_TREE_LOG_OBJECTID,	"TREE_LOG"	},	\
 		{ BTRFS_QUOTA_TREE_OBJECTID,	"QUOTA_TREE"	},	\
 		{ BTRFS_TREE_RELOC_OBJECTID,	"TREE_RELOC"	},	\
-		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_RELOC"	},	\
+		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_TREE"	},	\
+		{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },	\
 		{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" })
 
 #define show_root_type(obj)						\
-- 
2.5.1



* [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
                   ` (2 preceding siblings ...)
  2015-09-01 19:01 ` [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
@ 2015-09-01 19:05 ` Omar Sandoval
  2015-09-01 19:48   ` Josef Bacik
  2015-09-01 19:05 ` [PATCH 6/6] Btrfs: add free space tree mount option Omar Sandoval
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

The free space tree is updated in tandem with the extent tree. There are
only a handful of places where we need to hook in:

1. Block group creation
2. Block group deletion
3. Delayed refs (extent creation and deletion)
4. Block group caching

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent-tree.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 37179a569f40..3f10df3932f0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@
 #include "raid56.h"
 #include "locking.h"
 #include "free-space-cache.h"
+#include "free-space-tree.h"
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
@@ -589,7 +590,41 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 	cache->cached = BTRFS_CACHE_FAST;
 	spin_unlock(&cache->lock);
 
-	if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
+		if (load_cache_only) {
+			spin_lock(&cache->lock);
+			cache->caching_ctl = NULL;
+			cache->cached = BTRFS_CACHE_NO;
+			spin_unlock(&cache->lock);
+			wake_up(&caching_ctl->wait);
+		} else {
+			mutex_lock(&caching_ctl->mutex);
+			ret = load_free_space_tree(fs_info, cache);
+			if (ret) {
+				btrfs_warn(fs_info, "failed to load free space tree for %llu: %d",
+					   cache->key.objectid, ret);
+				spin_lock(&cache->lock);
+				cache->caching_ctl = NULL;
+				cache->cached = BTRFS_CACHE_ERROR;
+				spin_unlock(&cache->lock);
+				goto tree_out;
+			}
+
+			spin_lock(&cache->lock);
+			cache->caching_ctl = NULL;
+			cache->cached = BTRFS_CACHE_FINISHED;
+			cache->last_byte_to_unpin = (u64)-1;
+			caching_ctl->progress = (u64)-1;
+			spin_unlock(&cache->lock);
+			mutex_unlock(&caching_ctl->mutex);
+
+tree_out:
+			wake_up(&caching_ctl->wait);
+			put_caching_control(caching_ctl);
+			free_excluded_extents(fs_info->extent_root, cache);
+			return 0;
+		}
+	} else if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
 		mutex_lock(&caching_ctl->mutex);
 		ret = load_free_space_cache(fs_info, cache);
 
@@ -619,8 +654,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 		}
 	} else {
 		/*
-		 * We are not going to do the fast caching, set cached to the
-		 * appropriate value and wakeup any waiters.
+		 * We're here either because we're not using the space cache or
+		 * free space tree, or because we're currently building the free
+		 * space tree. Set cached to the appropriate value and wakeup
+		 * any waiters.
 		 */
 		spin_lock(&cache->lock);
 		if (load_cache_only) {
@@ -6378,6 +6415,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		ret = add_to_free_space_tree(trans, root->fs_info, bytenr,
+					     num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, extent_root, ret);
+			goto out;
+		}
+
 		ret = update_block_group(trans, root, bytenr, num_bytes, 0);
 		if (ret) {
 			btrfs_abort_transaction(trans, extent_root, ret);
@@ -7321,6 +7365,11 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 	btrfs_free_path(path);
 
+	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
+					  ins->offset);
+	if (ret)
+		return ret;
+
 	ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
 	if (ret) { /* -ENOENT, logic error */
 		btrfs_err(fs_info, "update block group failed for %llu %llu",
@@ -7402,6 +7451,11 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_free_path(path);
 
+	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
+					  num_bytes);
+	if (ret)
+		return ret;
+
 	ret = update_block_group(trans, root, ins->objectid, root->nodesize,
 				 1);
 	if (ret) { /* -ENOENT, logic error */
@@ -9272,6 +9326,8 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	cache->full_stripe_len = btrfs_full_stripe_len(root,
 					       &root->fs_info->mapping_tree,
 					       start);
+	set_free_space_tree_thresholds(cache);
+
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
 	init_rwsem(&cache->data_rwsem);
@@ -9535,6 +9591,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	add_new_free_space(cache, root->fs_info, chunk_offset,
 			   chunk_offset + size);
 
+	ret = add_block_group_free_space(trans, root->fs_info, cache);
+	if (ret) {
+		btrfs_remove_free_space_cache(cache);
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	free_excluded_extents(root, cache);
 
 	/*
@@ -9878,6 +9941,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	unlock_chunks(root);
 
+	ret = remove_block_group_free_space(trans, root->fs_info, block_group);
+	if (ret)
+		goto out;
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
-- 
2.5.1



* [PATCH 6/6] Btrfs: add free space tree mount option
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
                   ` (3 preceding siblings ...)
  2015-09-01 19:05 ` [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
@ 2015-09-01 19:05 ` Omar Sandoval
  2015-09-01 19:49   ` Josef Bacik
  2015-09-01 19:13 ` [PATCH 4/6] Btrfs: implement the free space B-tree Omar Sandoval
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

Now we can finally hook up everything so we can actually use the free space
tree. On the first mount with the free_space_tree mount option, the free
space tree will be created and the FREE_SPACE_TREE read-only compat bit
will be set. Any time the filesystem is mounted from then on, we will
use the free space tree.

Having both the free space cache and free space trees enabled is
nonsense, so we don't allow that to happen. Since mkfs sets the
superblock cache generation to -1, this means that the filesystem will
have to be mounted with nospace_cache,free_space_tree to create the free
space tree on first mount. Once the FREE_SPACE_TREE bit is set, the
cache generation is ignored when mounting. This is all a little more
complicated than would be ideal, but at some point we can presumably
make the free space tree the default and stop setting the cache
generation in mkfs.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h   |  7 ++++++-
 fs/btrfs/disk-io.c | 26 ++++++++++++++++++++++++++
 fs/btrfs/super.c   | 21 +++++++++++++++++++--
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d49181d35f08..bf4ca5a5496a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,7 +531,10 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
-#define BTRFS_FEATURE_COMPAT_RO_SUPP		0ULL
+
+#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
 
@@ -2200,6 +2203,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
+#define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 24)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(8192)
@@ -3743,6 +3747,7 @@ static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->csum_root);
 	kfree(fs_info->quota_root);
 	kfree(fs_info->uuid_root);
+	kfree(fs_info->free_space_root);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	security_free_mnt_opts(&fs_info->security_opts);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f556c3732c2c..e88674c594da 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "locking.h"
 #include "tree-log.h"
 #include "free-space-cache.h"
+#include "free-space-tree.h"
 #include "inode-map.h"
 #include "check-integrity.h"
 #include "rcu-string.h"
@@ -1641,6 +1642,9 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_UUID_TREE_OBJECTID)
 		return fs_info->uuid_root ? fs_info->uuid_root :
 					    ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
+		return fs_info->free_space_root ? fs_info->free_space_root :
+						  ERR_PTR(-ENOENT);
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
@@ -2138,6 +2142,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 	free_root_extent_buffers(info->uuid_root);
 	if (chunk_root)
 		free_root_extent_buffers(info->chunk_root);
+	free_root_extent_buffers(info->free_space_root);
 }
 
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info)
@@ -2439,6 +2444,15 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
 		fs_info->uuid_root = root;
 	}
 
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
+		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root))
+			return PTR_ERR(root);
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->free_space_root = root;
+	}
+
 	return 0;
 }
 
@@ -3063,6 +3077,18 @@ retry_root_backup:
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	if (btrfs_test_opt(tree_root, FREE_SPACE_TREE) &&
+	    !btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
+		pr_info("BTRFS: creating free space tree\n");
+		ret = btrfs_create_free_space_tree(fs_info);
+		if (ret) {
+			pr_warn("BTRFS: failed to create free space tree %d\n",
+				ret);
+			close_ctree(tree_root);
+			return ret;
+		}
+	}
+
 	if (!fs_info->uuid_root) {
 		pr_info("BTRFS: creating UUID tree\n");
 		ret = btrfs_create_uuid_tree(fs_info);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index cd7ef34d2dce..60135e53f4b9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -319,7 +319,7 @@ enum {
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-	Opt_datasum, Opt_treelog, Opt_noinode_cache,
+	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_free_space_tree,
 	Opt_err,
 };
 
@@ -372,6 +372,7 @@ static match_table_t tokens = {
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_commit_interval, "commit=%d"},
+	{Opt_free_space_tree, "free_space_tree"},
 	{Opt_err, NULL},
 };
 
@@ -392,7 +393,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 	bool compress_force = false;
 
 	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
-	if (cache_gen)
+	if (btrfs_fs_compat_ro(root->fs_info, FREE_SPACE_TREE))
+		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
+	else if (cache_gen)
 		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
 
 	if (!options)
@@ -738,6 +741,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 			}
 			break;
+		case Opt_free_space_tree:
+			btrfs_set_and_info(root, FREE_SPACE_TREE,
+					   "enabling free space tree");
+			break;
 		case Opt_err:
 			btrfs_info(root->fs_info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -747,8 +754,16 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		}
 	}
 out:
+	if (btrfs_test_opt(root, SPACE_CACHE) &&
+	    btrfs_test_opt(root, FREE_SPACE_TREE)) {
+		btrfs_err(root->fs_info,
+			  "cannot use both free space cache and free space tree");
+		ret = -EINVAL;
+	}
 	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
 		btrfs_info(root->fs_info, "disk space caching is enabled");
+	if (!ret && btrfs_test_opt(root, FREE_SPACE_TREE))
+		btrfs_info(root->fs_info, "using free space tree");
 	kfree(orig);
 	return ret;
 }
@@ -1152,6 +1167,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",discard");
 	if (!(root->fs_info->sb->s_flags & MS_POSIXACL))
 		seq_puts(seq, ",noacl");
+	if (btrfs_test_opt(root, FREE_SPACE_TREE))
+		seq_puts(seq, ",free_space_tree");
 	if (btrfs_test_opt(root, SPACE_CACHE))
 		seq_puts(seq, ",space_cache");
 	else
-- 
2.5.1



* [PATCH 4/6] Btrfs: implement the free space B-tree
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
                   ` (4 preceding siblings ...)
  2015-09-01 19:05 ` [PATCH 6/6] Btrfs: add free space tree mount option Omar Sandoval
@ 2015-09-01 19:13 ` Omar Sandoval
  2015-09-01 19:44   ` Josef Bacik
  2015-09-01 19:17 ` [PATCH 0/6] " Omar Sandoval
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for many block groups needs to
be written out, we can get extremely long commit times; if this happens
in the critical section, things are especially bad because we block new
transactions from starting.

The main problem with the free space cache is that it must be written
out in its entirety and is managed in an ad hoc fashion. Storing free
space in a B-tree fixes this: updates can be made incrementally, and we
get all of the usual B-tree benefits of checksumming, RAID handling,
and well-understood behavior.

With the free space tree, commit times are about the same as in the
no-cache case. Load times are slower than with the free space cache but
still much faster than with no cache at all. Free space is represented
as extents until it becomes more space-efficient to use bitmaps, giving
us space overhead similar to that of the free space cache.

The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/Makefile          |    2 +-
 fs/btrfs/ctree.h           |   25 +-
 fs/btrfs/extent-tree.c     |   15 +-
 fs/btrfs/free-space-tree.c | 1468 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-tree.h |   39 ++
 5 files changed, 1541 insertions(+), 8 deletions(-)
 create mode 100644 fs/btrfs/free-space-tree.c
 create mode 100644 fs/btrfs/free-space-tree.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6d1d0b93b1aa..766169709146 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o
+	   uuid-tree.o props.o hash.o free-space-tree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 34a81a79f5b6..d49181d35f08 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1299,8 +1299,20 @@ struct btrfs_block_group_cache {
 	u64 delalloc_bytes;
 	u64 bytes_super;
 	u64 flags;
-	u64 sectorsize;
 	u64 cache_generation;
+	u32 sectorsize;
+
+	/*
+	 * If the free space extent count exceeds this number, convert the block
+	 * group to bitmaps.
+	 */
+	u32 bitmap_high_thresh;
+
+	/*
+	 * If the free space extent count drops below this number, convert the
+	 * block group back to extents.
+	 */
+	u32 bitmap_low_thresh;
 
 	/*
 	 * It is just used for the delayed data space allocation because
@@ -1356,6 +1368,9 @@ struct btrfs_block_group_cache {
 	struct list_head io_list;
 
 	struct btrfs_io_ctl io_ctl;
+
+	/* Lock for free space tree operations. */
+	struct mutex free_space_lock;
 };
 
 /* delayed seq elem */
@@ -1407,6 +1422,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *csum_root;
 	struct btrfs_root *quota_root;
 	struct btrfs_root *uuid_root;
+	struct btrfs_root *free_space_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
@@ -3556,6 +3572,13 @@ void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
 void check_system_chunk(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root,
 			const u64 type);
+void free_excluded_extents(struct btrfs_root *root,
+			   struct btrfs_block_group_cache *cache);
+int exclude_super_stripes(struct btrfs_root *root,
+			  struct btrfs_block_group_cache *cache);
+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
+		       struct btrfs_fs_info *info, u64 start, u64 end);
+
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
 		     int level, int *slot);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07204bf601ed..37179a569f40 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -237,8 +237,8 @@ static int add_excluded_extent(struct btrfs_root *root,
 	return 0;
 }
 
-static void free_excluded_extents(struct btrfs_root *root,
-				  struct btrfs_block_group_cache *cache)
+void free_excluded_extents(struct btrfs_root *root,
+			   struct btrfs_block_group_cache *cache)
 {
 	u64 start, end;
 
@@ -251,14 +251,16 @@ static void free_excluded_extents(struct btrfs_root *root,
 			  start, end, EXTENT_UPTODATE, GFP_NOFS);
 }
 
-static int exclude_super_stripes(struct btrfs_root *root,
-				 struct btrfs_block_group_cache *cache)
+int exclude_super_stripes(struct btrfs_root *root,
+			  struct btrfs_block_group_cache *cache)
 {
 	u64 bytenr;
 	u64 *logical;
 	int stripe_len;
 	int i, nr, ret;
 
+	cache->bytes_super = 0;
+
 	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
 		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
 		cache->bytes_super += stripe_len;
@@ -337,8 +339,8 @@ static void put_caching_control(struct btrfs_caching_control *ctl)
  * we need to check the pinned_extents for any extents that can't be used yet
  * since their free space will be released as soon as the transaction commits.
  */
-static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
-			      struct btrfs_fs_info *info, u64 start, u64 end)
+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
+		       struct btrfs_fs_info *info, u64 start, u64 end)
 {
 	u64 extent_start, extent_end, size, total_added = 0;
 	int ret;
@@ -9281,6 +9283,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
+	mutex_init(&cache->free_space_lock);
 
 	return cache;
 }
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
new file mode 100644
index 000000000000..bbb4f731f948
--- /dev/null
+++ b/fs/btrfs/free-space-tree.c
@@ -0,0 +1,1468 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/vmalloc.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "locking.h"
+#include "free-space-tree.h"
+#include "transaction.h"
+
+/*
+ * The default size for new free space bitmap items. The last bitmap in a block
+ * group may be truncated, and none of the free space tree code assumes that
+ * existing bitmaps are this size.
+ */
+#define BTRFS_FREE_SPACE_BITMAP_SIZE 256
+#define BTRFS_FREE_SPACE_BITMAP_BITS (BTRFS_FREE_SPACE_BITMAP_SIZE * BITS_PER_BYTE)
+
+void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache)
+{
+	u32 bitmap_range;
+	size_t bitmap_size;
+	u64 num_bitmaps, total_bitmap_size;
+
+	/*
+	 * We convert to bitmaps when the disk space required for using extents
+	 * exceeds that required for using bitmaps.
+	 */
+	bitmap_range = cache->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
+	num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
+			      bitmap_range);
+	bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
+	total_bitmap_size = num_bitmaps * bitmap_size;
+	cache->bitmap_high_thresh = div_u64(total_bitmap_size,
+					    sizeof(struct btrfs_item));
+
+	/*
+	 * We allow for a small buffer between the high threshold and low
+	 * threshold to avoid thrashing back and forth between the two formats.
+	 */
+	if (cache->bitmap_high_thresh > 100)
+		cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
+	else
+		cache->bitmap_low_thresh = 0;
+}
+
+static int add_new_free_space_info(struct btrfs_trans_handle *trans,
+				   struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int ret;
+
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
+	key.offset = block_group->key.offset;
+
+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*info));
+	if (ret)
+		goto out;
+
+	leaf = path->nodes[0];
+	info = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_free_space_info);
+	btrfs_set_free_space_extent_count(leaf, info, 0);
+	btrfs_set_free_space_flags(leaf, info, 0);
+	btrfs_mark_buffer_dirty(leaf);
+
+	ret = 0;
+out:
+	btrfs_release_path(path);
+	return ret;
+}
+
+static struct btrfs_free_space_info *
+search_free_space_info(struct btrfs_trans_handle *trans,
+		       struct btrfs_fs_info *fs_info,
+		       struct btrfs_block_group_cache *block_group,
+		       struct btrfs_path *path, int cow)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
+	key.offset = block_group->key.offset;
+
+	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (ret != 0) {
+		btrfs_warn(fs_info, "missing free space info for %llu",
+			   block_group->key.objectid);
+		ASSERT(0);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return btrfs_item_ptr(path->nodes[0], path->slots[0],
+			      struct btrfs_free_space_info);
+}
+
+/*
+ * btrfs_search_slot() but we're looking for the greatest key less than the
+ * passed key.
+ */
+static int btrfs_search_prev_slot(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct btrfs_key *key, struct btrfs_path *p,
+				  int ins_len, int cow)
+{
+	int ret;
+
+	ret = btrfs_search_slot(trans, root, key, p, ins_len, cow);
+	if (ret < 0)
+		return ret;
+
+	if (ret == 0) {
+		ASSERT(0);
+		return -EIO;
+	}
+
+	if (p->slots[0] == 0) {
+		ASSERT(0);
+		return -EIO;
+	}
+	p->slots[0]--;
+
+	return 0;
+}
+
+static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
+{
+	return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
+}
+
+static unsigned long *alloc_bitmap(u32 bitmap_size)
+{
+	return __vmalloc(bitmap_size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
+			 PAGE_KERNEL);
+}
+
+static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
+					 struct btrfs_fs_info *fs_info,
+					 struct btrfs_block_group_cache *block_group,
+					 struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	unsigned long *bitmap;
+	char *bitmap_cursor;
+	u64 start, end;
+	u64 bitmap_range, i;
+	u32 bitmap_size, flags, expected_extent_count;
+	u32 extent_count = 0;
+	int done = 0, nr;
+	int ret;
+
+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
+					     block_group->sectorsize);
+	bitmap = alloc_bitmap(bitmap_size);
+	if (!bitmap)
+		return -ENOMEM;
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
+				u64 first, last;
+
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+
+				first = div_u64(found_key.objectid - start,
+						block_group->sectorsize);
+				last = div_u64(found_key.objectid + found_key.offset - start,
+					       block_group->sectorsize);
+				bitmap_set(bitmap, first, last - first);
+
+				extent_count++;
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	leaf = path->nodes[0];
+	flags = btrfs_free_space_flags(leaf, info);
+	flags |= BTRFS_FREE_SPACE_USING_BITMAPS;
+	btrfs_set_free_space_flags(leaf, info, flags);
+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
+	btrfs_mark_buffer_dirty(leaf);
+	btrfs_release_path(path);
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	bitmap_cursor = (char *)bitmap;
+	bitmap_range = block_group->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
+	i = start;
+	while (i < end) {
+		unsigned long ptr;
+		u64 extent_size;
+		u32 data_size;
+
+		extent_size = min(end - i, bitmap_range);
+		data_size = free_space_bitmap_size(extent_size,
+						   block_group->sectorsize);
+
+		key.objectid = i;
+		key.type = BTRFS_FREE_SPACE_BITMAP_KEY;
+		key.offset = extent_size;
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key,
+					      data_size);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+		write_extent_buffer(leaf, bitmap_cursor, ptr,
+				    data_size);
+		btrfs_mark_buffer_dirty(leaf);
+		btrfs_release_path(path);
+
+		i += extent_size;
+		bitmap_cursor += data_size;
+	}
+
+	ret = 0;
+out:
+	vfree(bitmap);
+	return ret;
+}
+
+static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
+					 struct btrfs_fs_info *fs_info,
+					 struct btrfs_block_group_cache *block_group,
+					 struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	unsigned long *bitmap;
+	u64 start, end;
+	/* Initialize to silence GCC. */
+	u64 extent_start = 0;
+	u64 offset;
+	u32 bitmap_size, flags, expected_extent_count;
+	int prev_bit = 0, bit, bitnr;
+	u32 extent_count = 0;
+	int done = 0, nr;
+	int ret;
+
+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
+					     block_group->sectorsize);
+	bitmap = alloc_bitmap(bitmap_size);
+	if (!bitmap)
+		return -ENOMEM;
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+				unsigned long ptr;
+				char *bitmap_cursor;
+				u32 bitmap_pos, data_size;
+
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+
+				bitmap_pos = div_u64(found_key.objectid - start,
+						     block_group->sectorsize *
+						     BITS_PER_BYTE);
+				bitmap_cursor = ((char *)bitmap) + bitmap_pos;
+				data_size = free_space_bitmap_size(found_key.offset,
+								   block_group->sectorsize);
+
+				ptr = btrfs_item_ptr_offset(leaf, path->slots[0] - 1);
+				read_extent_buffer(leaf, bitmap_cursor, ptr,
+						   data_size);
+
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	leaf = path->nodes[0];
+	flags = btrfs_free_space_flags(leaf, info);
+	flags &= ~BTRFS_FREE_SPACE_USING_BITMAPS;
+	btrfs_set_free_space_flags(leaf, info, flags);
+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
+	btrfs_mark_buffer_dirty(leaf);
+	btrfs_release_path(path);
+
+	offset = start;
+	bitnr = 0;
+	while (offset < end) {
+		bit = !!test_bit(bitnr, bitmap);
+		if (prev_bit == 0 && bit == 1) {
+			extent_start = offset;
+		} else if (prev_bit == 1 && bit == 0) {
+			key.objectid = extent_start;
+			key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+			key.offset = offset - extent_start;
+
+			ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+			if (ret)
+				goto out;
+			btrfs_release_path(path);
+
+			extent_count++;
+		}
+		prev_bit = bit;
+		offset += block_group->sectorsize;
+		bitnr++;
+	}
+	if (prev_bit == 1) {
+		key.objectid = extent_start;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = end - extent_start;
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	vfree(bitmap);
+	return ret;
+}
+
+static int update_free_space_extent_count(struct btrfs_trans_handle *trans,
+					  struct btrfs_fs_info *fs_info,
+					  struct btrfs_block_group_cache *block_group,
+					  struct btrfs_path *path,
+					  int new_extents)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	u32 extent_count;
+	int ret = 0;
+
+	if (new_extents == 0)
+		return 0;
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+
+	extent_count += new_extents;
+	btrfs_set_free_space_extent_count(path->nodes[0], info, extent_count);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+	btrfs_release_path(path);
+
+	if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
+	    extent_count > block_group->bitmap_high_thresh) {
+		ret = convert_free_space_to_bitmaps(trans, fs_info, block_group,
+						    path);
+	} else if ((flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
+		   extent_count < block_group->bitmap_low_thresh) {
+		ret = convert_free_space_to_extents(trans, fs_info, block_group,
+						    path);
+	}
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
+			       struct btrfs_path *path, u64 offset)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 found_start, found_end;
+	unsigned long ptr, i;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(offset >= found_start && offset < found_end);
+
+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	i = div_u64(offset - found_start, block_group->sectorsize);
+	return !!extent_buffer_test_bit(leaf, ptr, i);
+}
+
+static void free_space_set_bits(struct btrfs_block_group_cache *block_group,
+				struct btrfs_path *path, u64 *start, u64 *size,
+				int bit)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 end = *start + *size;
+	u64 found_start, found_end;
+	unsigned long ptr, first, last;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(*start >= found_start && *start < found_end);
+	ASSERT(end > found_start);
+
+	if (end > found_end)
+		end = found_end;
+
+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	first = div_u64(*start - found_start, block_group->sectorsize);
+	last = div_u64(end - found_start, block_group->sectorsize);
+	if (bit)
+		extent_buffer_bitmap_set(leaf, ptr, first, last - first);
+	else
+		extent_buffer_bitmap_clear(leaf, ptr, first, last - first);
+	btrfs_mark_buffer_dirty(leaf);
+
+	*size -= end - *start;
+	*start = end;
+}
+
+/*
+ * We can't use btrfs_next_item() in modify_free_space_bitmap() because
+ * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
+ * tree walking in btrfs_next_leaf() anyway because we know exactly what we're
+ * looking for.
+ */
+static int free_space_next_bitmap(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root, struct btrfs_path *p)
+{
+	struct btrfs_key key;
+
+	if (p->slots[0] + 1 < btrfs_header_nritems(p->nodes[0])) {
+		p->slots[0]++;
+		return 0;
+	}
+
+	btrfs_item_key_to_cpu(p->nodes[0], &key, p->slots[0]);
+	btrfs_release_path(p);
+
+	key.objectid += key.offset;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	return btrfs_search_prev_slot(trans, root, &key, p, 0, 1);
+}
+
+/*
+ * If remove is 1, then we are removing free space, thus clearing bits in the
+ * bitmap. If remove is 0, then we are adding free space, thus setting bits in
+ * the bitmap.
+ */
+static int modify_free_space_bitmap(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group,
+				    struct btrfs_path *path,
+				    u64 start, u64 size, int remove)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	u64 end = start + size;
+	u64 cur_start, cur_size;
+	int prev_bit, next_bit;
+	int new_extents;
+	int ret;
+
+	/*
+	 * Read the bit for the block immediately before the extent of space if
+	 * that block is within the block group.
+	 */
+	if (start > block_group->key.objectid) {
+		u64 prev_block = start - block_group->sectorsize;
+
+		key.objectid = prev_block;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
+		if (ret)
+			goto out;
+
+		prev_bit = free_space_test_bit(block_group, path, prev_block);
+
+		/* The previous block may have been in the previous bitmap. */
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (start >= key.objectid + key.offset) {
+			ret = free_space_next_bitmap(trans, root, path);
+			if (ret)
+				goto out;
+		}
+	} else {
+		key.objectid = start;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
+		if (ret)
+			goto out;
+
+		prev_bit = -1;
+	}
+
+	/*
+	 * Iterate over all of the bitmaps overlapped by the extent of space,
+	 * clearing/setting bits as required.
+	 */
+	cur_start = start;
+	cur_size = size;
+	while (1) {
+		free_space_set_bits(block_group, path, &cur_start, &cur_size,
+				    !remove);
+		if (cur_size == 0)
+			break;
+		ret = free_space_next_bitmap(trans, root, path);
+		if (ret)
+			goto out;
+	}
+
+	/*
+	 * Read the bit for the block immediately after the extent of space if
+	 * that block is within the block group.
+	 */
+	if (end < block_group->key.objectid + block_group->key.offset) {
+		/* The next block may be in the next bitmap. */
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (end >= key.objectid + key.offset) {
+			ret = free_space_next_bitmap(trans, root, path);
+			if (ret)
+				goto out;
+		}
+
+		next_bit = free_space_test_bit(block_group, path, end);
+	} else {
+		next_bit = -1;
+	}
+
+	if (remove) {
+		new_extents = -1;
+		if (prev_bit == 1) {
+			/* Leftover on the left. */
+			new_extents++;
+		}
+		if (next_bit == 1) {
+			/* Leftover on the right. */
+			new_extents++;
+		}
+	} else {
+		new_extents = 1;
+		if (prev_bit == 1) {
+			/* Merging with neighbor on the left. */
+			new_extents--;
+		}
+		if (next_bit == 1) {
+			/* Merging with neighbor on the right. */
+			new_extents--;
+		}
+	}
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int remove_free_space_extent(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group,
+				    struct btrfs_path *path,
+				    u64 start, u64 size)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	u64 found_start, found_end;
+	u64 end = start + size;
+	int new_extents = -1;
+	int ret;
+
+	key.objectid = start;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(start >= found_start && end <= found_end);
+
+	/*
+	 * Okay, now that we've found the free space extent which contains the
+	 * free space that we are removing, there are four cases:
+	 *
+	 * 1. We're using the whole extent: delete the key we found and
+	 * decrement the free space extent count.
+	 * 2. We are using part of the extent starting at the beginning: delete
+	 * the key we found and insert a new key representing the leftover at
+	 * the end. There is no net change in the number of extents.
+	 * 3. We are using part of the extent ending at the end: delete the key
+	 * we found and insert a new key representing the leftover at the
+	 * beginning. There is no net change in the number of extents.
+	 * 4. We are using part of the extent in the middle: delete the key we
+	 * found and insert two new keys representing the leftovers on each
+	 * side. Where we used to have one extent, we now have two, so increment
+	 * the extent count. We may need to convert the block group to bitmaps
+	 * as a result.
+	 */
+
+	/* Delete the existing key (cases 1-4). */
+	ret = btrfs_del_item(trans, root, path);
+	if (ret)
+		goto out;
+
+	/* Add a key for leftovers at the beginning (cases 3 and 4). */
+	if (start > found_start) {
+		key.objectid = found_start;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = start - found_start;
+
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		new_extents++;
+	}
+
+	/* Add a key for leftovers at the end (cases 2 and 4). */
+	if (end < found_end) {
+		key.objectid = end;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = found_end - end;
+
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		new_extents++;
+	}
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				struct btrfs_fs_info *fs_info,
+				u64 start, u64 size)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_free_space_info *info;
+	struct btrfs_path *path;
+	u32 flags;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	block_group = btrfs_lookup_block_group(fs_info, start);
+	if (!block_group) {
+		ASSERT(0);
+		ret = -ENOENT;
+		goto out_nobg;
+	}
+
+	mutex_lock(&block_group->free_space_lock);
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	btrfs_release_path(path);
+
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
+					       path, start, size, 1);
+	} else {
+		ret = remove_free_space_extent(trans, fs_info, block_group,
+					       path, start, size);
+	}
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	mutex_unlock(&block_group->free_space_lock);
+	btrfs_put_block_group(block_group);
+out_nobg:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int add_free_space_extent(struct btrfs_trans_handle *trans,
+				 struct btrfs_fs_info *fs_info,
+				 struct btrfs_block_group_cache *block_group,
+				 struct btrfs_path *path,
+				 u64 start, u64 size)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key, new_key;
+	u64 found_start, found_end;
+	u64 end = start + size;
+	int new_extents = 1;
+	int ret;
+
+	/*
+	 * We are adding a new extent of free space, but we need to merge
+	 * extents. There are four cases here:
+	 *
+	 * 1. The new extent does not have any immediate neighbors to merge
+	 * with: add the new key and increment the free space extent count. We
+	 * may need to convert the block group to bitmaps as a result.
+	 * 2. The new extent has an immediate neighbor before it: remove the
+	 * previous key and insert a new key combining both of them. There is no
+	 * net change in the number of extents.
+	 * 3. The new extent has an immediate neighbor after it: remove the next
+	 * key and insert a new key combining both of them. There is no net
+	 * change in the number of extents.
+	 * 4. The new extent has immediate neighbors on both sides: remove both
+	 * of the keys and insert a new key combining all of them. Where we used
+	 * to have two extents, we now have one, so decrement the extent count.
+	 */
+
+	new_key.objectid = start;
+	new_key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+	new_key.offset = size;
+
+	/* Search for a neighbor on the left. */
+	if (start == block_group->key.objectid)
+		goto right;
+	key.objectid = start - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
+		btrfs_release_path(path);
+		goto right;
+	}
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(found_start >= block_group->key.objectid &&
+	       found_end > block_group->key.objectid);
+	ASSERT(found_start < start && found_end <= start);
+
+	/*
+	 * Delete the neighbor on the left and absorb it into the new key (cases
+	 * 2 and 4).
+	 */
+	if (found_end == start) {
+		ret = btrfs_del_item(trans, root, path);
+		if (ret)
+			goto out;
+		new_key.objectid = found_start;
+		new_key.offset += key.offset;
+		new_extents--;
+	}
+	btrfs_release_path(path);
+
+right:
+	/* Search for a neighbor on the right. */
+	if (end == block_group->key.objectid + block_group->key.offset)
+		goto insert;
+	key.objectid = end;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
+		btrfs_release_path(path);
+		goto insert;
+	}
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(found_start >= block_group->key.objectid &&
+	       found_end > block_group->key.objectid);
+	ASSERT((found_start < start && found_end <= start) ||
+	       (found_start >= end && found_end > end));
+
+	/*
+	 * Delete the neighbor on the right and absorb it into the new key
+	 * (cases 3 and 4).
+	 */
+	if (found_start == end) {
+		ret = btrfs_del_item(trans, root, path);
+		if (ret)
+			goto out;
+		new_key.offset += key.offset;
+		new_extents--;
+	}
+	btrfs_release_path(path);
+
+insert:
+	/* Insert the new key (cases 1-4). */
+	ret = btrfs_insert_empty_item(trans, root, path, &new_key, 0);
+	if (ret)
+		goto out;
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group,
+				    struct btrfs_path *path,
+				    u64 start, u64 size)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	int ret;
+
+	mutex_lock(&block_group->free_space_lock);
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	btrfs_release_path(path);
+
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
+					       path, start, size, 0);
+	} else {
+		ret = add_free_space_extent(trans, fs_info, block_group, path,
+					    start, size);
+	}
+
+out:
+	mutex_unlock(&block_group->free_space_lock);
+	return ret;
+}
+
+int add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   u64 start, u64 size)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_path *path;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	block_group = btrfs_lookup_block_group(fs_info, start);
+	if (!block_group) {
+		ASSERT(0);
+		ret = -ENOENT;
+		goto out_nobg;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, block_group, path, start,
+				       size);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	btrfs_put_block_group(block_group);
+out_nobg:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int add_new_free_space_extent(struct btrfs_trans_handle *trans,
+				     struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group_cache *block_group,
+				     struct btrfs_path *path,
+				     u64 start, u64 end)
+{
+	u64 extent_start, extent_end;
+	int ret;
+
+	while (start < end) {
+		ret = find_first_extent_bit(fs_info->pinned_extents, start,
+					    &extent_start, &extent_end,
+					    EXTENT_DIRTY | EXTENT_UPTODATE,
+					    NULL);
+		if (ret)
+			break;
+
+		if (extent_start <= start) {
+			start = extent_end + 1;
+		} else if (extent_start > start && extent_start < end) {
+			ret = __add_to_free_space_tree(trans, fs_info,
+						       block_group, path, start,
+						       extent_start - start);
+			btrfs_release_path(path);
+			if (ret)
+				return ret;
+			start = extent_end + 1;
+		} else {
+			break;
+		}
+	}
+	if (start < end) {
+		ret = __add_to_free_space_tree(trans, fs_info, block_group,
+					       path, start, end - start);
+		btrfs_release_path(path);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Populate the free space tree by walking the extent tree, avoiding the super
+ * block mirrors. Operations on the extent tree that happen as a result of
+ * writes to the free space tree will go through the normal add/remove hooks.
+ */
+static int populate_free_space_tree(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_path *path, *path2;
+	struct btrfs_key key;
+	u64 start, end;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	path->reada = 1;
+
+	path2 = btrfs_alloc_path();
+	if (!path2) {
+		btrfs_free_path(path);
+		return -ENOMEM;
+	}
+
+	ret = add_new_free_space_info(trans, fs_info, block_group, path2);
+	if (ret)
+		goto out;
+
+	ret = exclude_super_stripes(extent_root, block_group);
+	if (ret)
+		goto out;
+
+	/*
+	 * Iterate through all of the extent and metadata items in this block
+	 * group, adding the free space between them and the free space at the
+	 * end. Note that EXTENT_ITEM and METADATA_ITEM are less than
+	 * BLOCK_GROUP_ITEM, so an extent may precede the block group that it's
+	 * contained in.
+	 */
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_EXTENT_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot_for_read(extent_root, &key, path, 1, 0);
+	if (ret < 0)
+		goto out;
+	ASSERT(ret == 0);
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+	while (1) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_EXTENT_ITEM_KEY ||
+		    key.type == BTRFS_METADATA_ITEM_KEY) {
+			if (key.objectid >= end)
+				break;
+
+			ret = add_new_free_space_extent(trans, fs_info,
+							block_group, path2,
+							start, key.objectid);
+			start = key.objectid;
+			if (key.type == BTRFS_METADATA_ITEM_KEY)
+				start += fs_info->tree_root->nodesize;
+			else
+				start += key.offset;
+		} else if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+			if (key.objectid != block_group->key.objectid)
+				break;
+		}
+
+		ret = btrfs_next_item(extent_root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+	}
+	ret = add_new_free_space_extent(trans, fs_info, block_group, path2,
+					start, end);
+	if (ret)
+		goto out;
+
+out:
+	free_excluded_extents(extent_root, block_group);
+	btrfs_free_path(path2);
+	btrfs_free_path(path);
+	return ret;
+}
+
+int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_root *free_space_root;
+	struct btrfs_block_group_cache *block_group;
+	struct rb_node *node;
+	int ret;
+
+	trans = btrfs_start_transaction(tree_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	free_space_root = btrfs_create_tree(trans, fs_info,
+					    BTRFS_FREE_SPACE_TREE_OBJECTID);
+	if (IS_ERR(free_space_root)) {
+		ret = PTR_ERR(free_space_root);
+		btrfs_abort_transaction(trans, tree_root, ret);
+		btrfs_end_transaction(trans, tree_root);
+		return ret;
+	}
+	fs_info->free_space_root = free_space_root;
+
+	node = rb_first(&fs_info->block_group_cache_tree);
+	while (node) {
+		block_group = rb_entry(node, struct btrfs_block_group_cache,
+				       cache_node);
+		ret = populate_free_space_tree(trans, fs_info, block_group);
+		if (ret) {
+			btrfs_abort_transaction(trans, tree_root, ret);
+			btrfs_end_transaction(trans, tree_root);
+			return ret;
+		}
+		node = rb_next(node);
+	}
+
+	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
+
+	ret = btrfs_commit_transaction(trans, tree_root);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+int add_block_group_free_space(struct btrfs_trans_handle *trans,
+			       struct btrfs_fs_info *fs_info,
+			       struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_path *path;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = add_new_free_space_info(trans, fs_info, block_group, path);
+	if (ret)
+		goto out;
+
+	ret = add_new_free_space_extent(trans, fs_info, block_group, path,
+					block_group->key.objectid,
+					block_group->key.objectid +
+					block_group->key.offset);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+int remove_block_group_free_space(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_path *path;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	u64 start, end;
+	int done = 0, nr;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				nr++;
+				path->slots[0]--;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY ||
+				   found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static int load_free_space_bitmaps(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	int prev_bit = 0, bit;
+	/* Initialize to silence GCC. */
+	u64 extent_start = 0;
+	u64 end, offset;
+	u32 extent_count = 0;
+	int ret;
+
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
+
+		offset = key.objectid;
+		while (offset < key.objectid + key.offset) {
+			bit = free_space_test_bit(block_group, path, offset);
+			if (prev_bit == 0 && bit == 1) {
+				extent_start = offset;
+			} else if (prev_bit == 1 && bit == 0) {
+				add_new_free_space(block_group, fs_info,
+						   extent_start, offset);
+				extent_count++;
+			}
+			prev_bit = bit;
+			offset += block_group->sectorsize;
+		}
+	}
+	if (prev_bit == 1) {
+		add_new_free_space(block_group, fs_info, extent_start, end);
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int load_free_space_extents(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	u64 end;
+	u32 extent_count = 0;
+	int ret;
+
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
+
+		add_new_free_space(block_group, fs_info, key.objectid,
+				   key.objectid + key.offset);
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+int load_free_space_tree(struct btrfs_fs_info *fs_info,
+			 struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_free_space_info *info;
+	struct btrfs_path *path;
+	u32 extent_count, flags;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Just like caching_thread() doesn't want to deadlock on the extent
+	 * tree, we don't want to deadlock on the free space tree.
+	 */
+	path->skip_locking = 1;
+	path->search_commit_root = 1;
+	path->reada = 1;
+
+	down_read(&fs_info->commit_root_sem);
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+
+	/*
+	 * We left path pointing to the free space info item, so now
+	 * load_free_space_foo can just iterate through the free space tree from
+	 * there.
+	 */
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = load_free_space_bitmaps(fs_info, block_group, path,
+					      extent_count);
+	} else {
+		ret = load_free_space_extents(fs_info, block_group, path,
+					      extent_count);
+	}
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	up_read(&fs_info->commit_root_sem);
+	btrfs_free_path(path);
+	return ret;
+}
diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
new file mode 100644
index 000000000000..a0c2494a054e
--- /dev/null
+++ b/fs/btrfs/free-space-tree.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_FREE_SPACE_TREE
+#define __BTRFS_FREE_SPACE_TREE
+
+void set_free_space_tree_thresholds(struct btrfs_block_group_cache *block_group);
+int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info);
+int load_free_space_tree(struct btrfs_fs_info *fs_info,
+			 struct btrfs_block_group_cache *block_group);
+int add_block_group_free_space(struct btrfs_trans_handle *trans,
+			       struct btrfs_fs_info *fs_info,
+			       struct btrfs_block_group_cache *block_group);
+int remove_block_group_free_space(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group);
+int add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   u64 start, u64 size);
+int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				struct btrfs_fs_info *fs_info,
+				u64 start, u64 size);
+
+#endif
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] free space B-tree
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
                   ` (5 preceding siblings ...)
  2015-09-01 19:13 ` [PATCH 4/6] Btrfs: implement the free space B-tree Omar Sandoval
@ 2015-09-01 19:17 ` Omar Sandoval
  2015-09-01 19:22 ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots Omar Sandoval
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
  8 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:17 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs

Urgh, sorry about the duplicates, either git send-email or Gmail is
being weird...

-- 
Omar


* [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots
  2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
                   ` (6 preceding siblings ...)
  2015-09-01 19:17 ` [PATCH 0/6] " Omar Sandoval
@ 2015-09-01 19:22 ` Omar Sandoval
  2015-09-01 19:22   ` [PATCH 2/3] btrfs-progs: add basic awareness of the free space tree Omar Sandoval
                     ` (2 more replies)
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
  8 siblings, 3 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 disk-io.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 1d4889322411..8496aded31c4 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -833,13 +833,13 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
 
 	memset(fs_info, 0, sizeof(struct btrfs_fs_info));
 
-	fs_info->tree_root = malloc(sizeof(struct btrfs_root));
-	fs_info->extent_root = malloc(sizeof(struct btrfs_root));
-	fs_info->chunk_root = malloc(sizeof(struct btrfs_root));
-	fs_info->dev_root = malloc(sizeof(struct btrfs_root));
-	fs_info->csum_root = malloc(sizeof(struct btrfs_root));
-	fs_info->quota_root = malloc(sizeof(struct btrfs_root));
-	fs_info->super_copy = malloc(BTRFS_SUPER_INFO_SIZE);
+	fs_info->tree_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->extent_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->chunk_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->dev_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->csum_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->quota_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->super_copy = calloc(1, BTRFS_SUPER_INFO_SIZE);
 
 	if (!fs_info->tree_root || !fs_info->extent_root ||
 	    !fs_info->chunk_root || !fs_info->dev_root ||
@@ -847,14 +847,6 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
 	    !fs_info->super_copy)
 		goto free_all;
 
-	memset(fs_info->super_copy, 0, BTRFS_SUPER_INFO_SIZE);
-	memset(fs_info->tree_root, 0, sizeof(struct btrfs_root));
-	memset(fs_info->extent_root, 0, sizeof(struct btrfs_root));
-	memset(fs_info->chunk_root, 0, sizeof(struct btrfs_root));
-	memset(fs_info->dev_root, 0, sizeof(struct btrfs_root));
-	memset(fs_info->csum_root, 0, sizeof(struct btrfs_root));
-	memset(fs_info->quota_root, 0, sizeof(struct btrfs_root));
-
 	extent_io_tree_init(&fs_info->extent_cache);
 	extent_io_tree_init(&fs_info->free_space_cache);
 	extent_io_tree_init(&fs_info->block_group_cache);
-- 
2.5.1



* [PATCH 2/3] btrfs-progs: add basic awareness of the free space tree
  2015-09-01 19:22 ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots Omar Sandoval
@ 2015-09-01 19:22   ` Omar Sandoval
  2015-09-01 19:22   ` [PATCH 3/3] btrfs-progs: check the free space tree in btrfsck Omar Sandoval
  2015-09-02 15:02   ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots David Sterba
  2 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

To start, teach btrfs-progs to read the free space root and to print the
on-disk format of the free space tree. However, we're not adding the
FREE_SPACE_TREE read-only compat bit to the set of supported bits, because
progs doesn't yet know how to keep the free space tree consistent.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 btrfs-debug-tree.c |  4 ++++
 ctree.h            | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 disk-io.c          | 16 +++++++++++++++-
 print-tree.c       | 25 +++++++++++++++++++++++++
 4 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 7d8e876f1a2d..2aa5bd11d1b6 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -375,6 +375,10 @@ again:
 				if (!skip)
 					printf("uuid");
 				break;
+			case BTRFS_FREE_SPACE_TREE_OBJECTID:
+				if (!skip)
+					printf("free space");
+				break;
 			case BTRFS_MULTIPLE_OBJECTIDS:
 				if (!skip) {
 					printf("multiple");
diff --git a/ctree.h b/ctree.h
index bcad2b98d5c7..6339ad50a412 100644
--- a/ctree.h
+++ b/ctree.h
@@ -76,6 +76,9 @@ struct btrfs_free_space_ctl;
 /* for storing items that use the BTRFS_UUID_KEY* */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* tracks free space in block groups. */
+#define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -453,6 +456,8 @@ struct btrfs_super_block {
  * Compat flags that we support.  If any incompat flags are set other than the
  * ones specified below then we will fail to mount
  */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
@@ -476,9 +481,10 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 
-
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
+
 #define BTRFS_FEATURE_COMPAT_RO_SUPP		0ULL
+
 #define BTRFS_FEATURE_INCOMPAT_SUPP			\
 	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
 	 BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |	\
@@ -898,6 +904,13 @@ struct btrfs_block_group_item {
 	__le64 flags;
 } __attribute__ ((__packed__));
 
+struct btrfs_free_space_info {
+	__le32 extent_count;
+	__le32 flags;
+} __attribute__ ((__packed__));
+
+#define BTRFS_FREE_SPACE_USING_BITMAPS (1ULL << 0)
+
 struct btrfs_qgroup_info_item {
 	__le64 generation;
 	__le64 referenced;
@@ -965,6 +978,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *dev_root;
 	struct btrfs_root *csum_root;
 	struct btrfs_root *quota_root;
+	struct btrfs_root *free_space_root;
 
 	struct rb_root fs_root_tree;
 
@@ -1157,6 +1171,27 @@ struct btrfs_root {
  */
 #define BTRFS_BLOCK_GROUP_ITEM_KEY 192
 
+/*
+ * Every block group is represented in the free space tree by a free space info
+ * item, which stores some accounting information. It is keyed on
+ * (block_group_start, FREE_SPACE_INFO, block_group_length).
+ */
+#define BTRFS_FREE_SPACE_INFO_KEY 198
+
+/*
+ * A free space extent tracks an extent of space that is free in a block group.
+ * It is keyed on (start, FREE_SPACE_EXTENT, length).
+ */
+#define BTRFS_FREE_SPACE_EXTENT_KEY 199
+
+/*
+ * When a block group becomes very fragmented, we convert it to use bitmaps
+ * instead of extents. A free space bitmap is keyed on
+ * (start, FREE_SPACE_BITMAP, length); the corresponding item is a bitmap with
+ * (length / sectorsize) bits.
+ */
+#define BTRFS_FREE_SPACE_BITMAP_KEY 200
+
 #define BTRFS_DEV_EXTENT_KEY	204
 #define BTRFS_DEV_ITEM_KEY	216
 #define BTRFS_CHUNK_ITEM_KEY	228
@@ -1394,6 +1429,11 @@ BTRFS_SETGET_FUNCS(disk_block_group_flags,
 BTRFS_SETGET_STACK_FUNCS(block_group_flags,
 			struct btrfs_block_group_item, flags, 64);
 
+/* struct btrfs_free_space_info */
+BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
+		   extent_count, 32);
+BTRFS_SETGET_FUNCS(free_space_flags, struct btrfs_free_space_info, flags, 32);
+
 /* struct btrfs_inode_ref */
 BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
@@ -2191,6 +2231,13 @@ static inline int btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
 	return !!(btrfs_super_incompat_flags(disk_super) & flag);
 }
 
+static inline int btrfs_fs_compat_ro(struct btrfs_fs_info *fs_info, u64 flag)
+{
+	struct btrfs_super_block *disk_super;
+	disk_super = fs_info->super_copy;
+	return !!(btrfs_super_compat_ro_flags(disk_super) & flag);
+}
+
 /* helper function to cast into the data area of the leaf. */
 #define btrfs_item_ptr(leaf, slot, type) \
 	((type *)(btrfs_leaf_data(leaf) + \
diff --git a/disk-io.c b/disk-io.c
index 8496aded31c4..ae9d6e1abb23 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -818,6 +818,7 @@ void btrfs_free_fs_info(struct btrfs_fs_info *fs_info)
 	free(fs_info->dev_root);
 	free(fs_info->csum_root);
 	free(fs_info->quota_root);
+	free(fs_info->free_space_root);
 	free(fs_info->super_copy);
 	free(fs_info->log_root_tree);
 	free(fs_info);
@@ -839,12 +840,13 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
 	fs_info->dev_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->csum_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->quota_root = calloc(1, sizeof(struct btrfs_root));
+	fs_info->free_space_root = calloc(1, sizeof(struct btrfs_root));
 	fs_info->super_copy = calloc(1, BTRFS_SUPER_INFO_SIZE);
 
 	if (!fs_info->tree_root || !fs_info->extent_root ||
 	    !fs_info->chunk_root || !fs_info->dev_root ||
 	    !fs_info->csum_root || !fs_info->quota_root ||
-	    !fs_info->super_copy)
+	    !fs_info->free_space_root || !fs_info->super_copy)
 		goto free_all;
 
 	extent_io_tree_init(&fs_info->extent_cache);
@@ -1025,6 +1027,16 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
 	if (ret == 0)
 		fs_info->quota_enabled = 1;
 
+	if (btrfs_fs_compat_ro(fs_info, BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)) {
+		ret = find_and_setup_root(root, fs_info, BTRFS_FREE_SPACE_TREE_OBJECTID,
+					  fs_info->free_space_root);
+		if (ret) {
+			printk("Couldn't read free space tree\n");
+			return -EIO;
+		}
+		fs_info->free_space_root->track_dirty = 1;
+	}
+
 	ret = find_and_setup_log_root(root, fs_info, sb);
 	if (ret) {
 		printk("Couldn't setup log root tree\n");
@@ -1050,6 +1062,8 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, u64 root_tree_bytenr,
 
 void btrfs_release_all_roots(struct btrfs_fs_info *fs_info)
 {
+	if (fs_info->free_space_root)
+		free_extent_buffer(fs_info->free_space_root->node);
 	if (fs_info->quota_root)
 		free_extent_buffer(fs_info->quota_root->node);
 	if (fs_info->csum_root)
diff --git a/print-tree.c b/print-tree.c
index dc1d2764ae91..9dc058b58bde 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -610,6 +610,15 @@ static void print_key_type(u64 objectid, u8 type)
 	case BTRFS_BLOCK_GROUP_ITEM_KEY:
 		printf("BLOCK_GROUP_ITEM");
 		break;
+	case BTRFS_FREE_SPACE_INFO_KEY:
+		printf("FREE_SPACE_INFO");
+		break;
+	case BTRFS_FREE_SPACE_EXTENT_KEY:
+		printf("FREE_SPACE_EXTENT");
+		break;
+	case BTRFS_FREE_SPACE_BITMAP_KEY:
+		printf("FREE_SPACE_BITMAP");
+		break;
 	case BTRFS_CHUNK_ITEM_KEY:
 		printf("CHUNK_ITEM");
 		break;
@@ -728,6 +737,9 @@ static void print_objectid(u64 objectid, u8 type)
 	case BTRFS_UUID_TREE_OBJECTID:
 		printf("UUID_TREE");
 		break;
+	case BTRFS_FREE_SPACE_TREE_OBJECTID:
+		printf("FREE_SPACE_TREE");
+		break;
 	case BTRFS_MULTIPLE_OBJECTIDS:
 		printf("MULTIPLE");
 		break;
@@ -810,6 +822,7 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 	struct btrfs_dev_extent *dev_extent;
 	struct btrfs_disk_key disk_key;
 	struct btrfs_block_group_item bg_item;
+	struct btrfs_free_space_info *free_info;
 	struct btrfs_dir_log_item *dlog;
 	struct btrfs_qgroup_info_item *qg_info;
 	struct btrfs_qgroup_limit_item *qg_limit;
@@ -947,6 +960,18 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 			       (unsigned long long)btrfs_block_group_chunk_objectid(&bg_item),
 			       flags_str);
 			break;
+		case BTRFS_FREE_SPACE_INFO_KEY:
+			free_info = btrfs_item_ptr(l, i, struct btrfs_free_space_info);
+			printf("\t\tfree space info extent count %u flags %u\n",
+			       (unsigned)btrfs_free_space_extent_count(l, free_info),
+			       (unsigned)btrfs_free_space_flags(l, free_info));
+			break;
+		case BTRFS_FREE_SPACE_EXTENT_KEY:
+			printf("\t\tfree space extent\n");
+			break;
+		case BTRFS_FREE_SPACE_BITMAP_KEY:
+			printf("\t\tfree space bitmap\n");
+			break;
 		case BTRFS_CHUNK_ITEM_KEY:
 			print_chunk(l, btrfs_item_ptr(l, i, struct btrfs_chunk));
 			break;
-- 
2.5.1



* [PATCH 3/3] btrfs-progs: check the free space tree in btrfsck
  2015-09-01 19:22 ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots Omar Sandoval
  2015-09-01 19:22   ` [PATCH 2/3] btrfs-progs: add basic awareness of the free space tree Omar Sandoval
@ 2015-09-01 19:22   ` Omar Sandoval
  2015-09-02 15:02   ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots David Sterba
  2 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:22 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

From: Omar Sandoval <osandov@fb.com>

This reuses the existing code for checking the free space cache; we just
need to load the free space tree instead. While we do that, we also check
a couple of invariants on the free space tree itself.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 Makefile.in        |   2 +-
 cmds-check.c       |  24 ++++-
 extent_io.c        |   6 ++
 extent_io.h        |   2 +
 free-space-cache.c |   4 +-
 free-space-cache.h |   2 +
 free-space-tree.c  | 277 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 free-space-tree.h  |  25 +++++
 8 files changed, 335 insertions(+), 7 deletions(-)
 create mode 100644 free-space-tree.c
 create mode 100644 free-space-tree.h

diff --git a/Makefile.in b/Makefile.in
index 665f83c04d00..cac9a269ee8f 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -38,7 +38,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
 	  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
-	  inode.o file.o find-root.o
+	  inode.o file.o find-root.o free-space-tree.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/cmds-check.c b/cmds-check.c
index 0694a3bc5646..2a3910c254a4 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -34,6 +34,7 @@
 #include "utils.h"
 #include "commands.h"
 #include "free-space-cache.h"
+#include "free-space-tree.h"
 #include "btrfsck.h"
 #include "qgroup-verify.h"
 #include "rbtree-utils.h"
@@ -5285,9 +5286,21 @@ static int check_space_cache(struct btrfs_root *root)
 			btrfs_remove_free_space_cache(cache);
 		}
 
-		ret = load_free_space_cache(root->fs_info, cache);
-		if (!ret)
-			continue;
+		if (btrfs_fs_compat_ro(root->fs_info,
+				       BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)) {
+			ret = load_free_space_tree(root->fs_info, cache);
+			if (ret < 0) {
+				fprintf(stderr, "could not load free space tree: %s\n",
+					strerror(-ret));
+				error++;
+				continue;
+			}
+			error += ret;
+		} else {
+			ret = load_free_space_cache(root->fs_info, cache);
+			if (!ret)
+				continue;
+		}
 
 		ret = verify_space_cache(root, cache);
 		if (ret) {
@@ -9495,7 +9508,10 @@ int cmd_check(int argc, char **argv)
 		goto close_out;
 	}
 
-	fprintf(stderr, "checking free space cache\n");
+	if (btrfs_fs_compat_ro(info, BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE))
+		fprintf(stderr, "checking free space tree\n");
+	else
+		fprintf(stderr, "checking free space cache\n");
 	ret = check_space_cache(root);
 	if (ret)
 		goto out;
diff --git a/extent_io.c b/extent_io.c
index 07695ef8a097..079ab7f2fb7d 100644
--- a/extent_io.c
+++ b/extent_io.c
@@ -885,3 +885,9 @@ void memset_extent_buffer(struct extent_buffer *eb, char c,
 {
 	memset(eb->data + start, c, len);
 }
+
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long nr)
+{
+	return test_bit(nr, (unsigned long *)(eb->data + start));
+}
diff --git a/extent_io.h b/extent_io.h
index 27c4b6931c97..a9a7353556a7 100644
--- a/extent_io.h
+++ b/extent_io.h
@@ -148,6 +148,8 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long nr);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int clear_extent_buffer_dirty(struct extent_buffer *eb);
 int read_data_from_disk(struct btrfs_fs_info *info, void *buf, u64 offset,
diff --git a/free-space-cache.c b/free-space-cache.c
index 19ab0c904a71..d10a5f517b10 100644
--- a/free-space-cache.c
+++ b/free-space-cache.c
@@ -802,8 +802,8 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
 	__btrfs_remove_free_space_cache(block_group->free_space_ctl);
 }
 
-static int btrfs_add_free_space(struct btrfs_free_space_ctl *ctl, u64 offset,
-				u64 bytes)
+int btrfs_add_free_space(struct btrfs_free_space_ctl *ctl, u64 offset,
+			 u64 bytes)
 {
 	struct btrfs_free_space *info;
 	int ret = 0;
diff --git a/free-space-cache.h b/free-space-cache.h
index 85411a10e64b..9214077a1b27 100644
--- a/free-space-cache.h
+++ b/free-space-cache.h
@@ -57,4 +57,6 @@ int btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group,
 			      int sectorsize);
 void unlink_free_space(struct btrfs_free_space_ctl *ctl,
 		       struct btrfs_free_space *info);
+int btrfs_add_free_space(struct btrfs_free_space_ctl *ctl, u64 offset,
+			 u64 bytes);
 #endif
diff --git a/free-space-tree.c b/free-space-tree.c
new file mode 100644
index 000000000000..527af6e5f2a0
--- /dev/null
+++ b/free-space-tree.c
@@ -0,0 +1,277 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "free-space-cache.h"
+#include "free-space-tree.h"
+
+static struct btrfs_free_space_info *
+search_free_space_info(struct btrfs_trans_handle *trans,
+		       struct btrfs_fs_info *fs_info,
+		       struct btrfs_block_group_cache *block_group,
+		       struct btrfs_path *path, int cow)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
+	key.offset = block_group->key.offset;
+
+	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (ret != 0)
+		return ERR_PTR(-ENOENT);
+
+	return btrfs_item_ptr(path->nodes[0], path->slots[0],
+			      struct btrfs_free_space_info);
+}
+
+static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
+			       struct btrfs_path *path, u64 offset,
+			       u64 sectorsize)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 found_start, found_end;
+	unsigned long ptr, i;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(offset >= found_start && offset < found_end);
+
+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	i = (offset - found_start) / sectorsize;
+	return !!extent_buffer_test_bit(leaf, ptr, i);
+}
+
+static int load_free_space_bitmaps(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count,
+				   int *errors)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	int prev_bit = 0, bit;
+	u64 extent_start = 0;
+	u64 start, end, offset;
+	u32 extent_count = 0;
+	int ret;
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		if (key.type != BTRFS_FREE_SPACE_BITMAP_KEY) {
+			fprintf(stderr, "unexpected key of type %u\n", key.type);
+			(*errors)++;
+			break;
+		}
+		if (key.objectid >= end) {
+			fprintf(stderr, "free space bitmap starts at %Lu, beyond end of block group %Lu-%Lu\n",
+				key.objectid, start, end);
+			(*errors)++;
+			break;
+		}
+		if (key.objectid + key.offset > end) {
+			fprintf(stderr, "free space bitmap ends at %Lu, beyond end of block group %Lu-%Lu\n",
+				key.objectid + key.offset, start, end);
+			(*errors)++;
+			break;
+		}
+
+		offset = key.objectid;
+		while (offset < key.objectid + key.offset) {
+			bit = free_space_test_bit(block_group, path, offset,
+						  root->sectorsize);
+			if (prev_bit == 0 && bit == 1) {
+				extent_start = offset;
+			} else if (prev_bit == 1 && bit == 0) {
+				ret = btrfs_add_free_space(block_group->free_space_ctl,
+							   extent_start,
+							   offset - extent_start);
+				if (ret)
+					goto out;
+				extent_count++;
+			}
+			prev_bit = bit;
+			offset += root->sectorsize;
+		}
+	}
+
+	if (prev_bit == 1) {
+		ret = btrfs_add_free_space(block_group->free_space_ctl,
+					   extent_start, end - extent_start);
+		if (ret)
+			goto out;
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		fprintf(stderr, "free space info recorded %u extents, counted %u\n",
+			expected_extent_count, extent_count);
+		(*errors)++;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int load_free_space_extents(struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count,
+				   int *errors)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key, prev_key;
+	int have_prev = 0;
+	u64 start, end;
+	u32 extent_count = 0;
+	int ret;
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
+			fprintf(stderr, "unexpected key of type %u\n", key.type);
+			(*errors)++;
+			break;
+		}
+		if (key.objectid >= end) {
+			fprintf(stderr, "free space extent starts at %Lu, beyond end of block group %Lu-%Lu\n",
+				key.objectid, start, end);
+			(*errors)++;
+			break;
+		}
+		if (key.objectid + key.offset > end) {
+			fprintf(stderr, "free space extent ends at %Lu, beyond end of block group %Lu-%Lu\n",
+				key.objectid + key.offset, start, end);
+			(*errors)++;
+			break;
+		}
+
+		if (have_prev) {
+			u64 cur_start = key.objectid;
+			u64 cur_end = cur_start + key.offset;
+			u64 prev_start = prev_key.objectid;
+			u64 prev_end = prev_start + prev_key.offset;
+
+			if (cur_start < prev_end) {
+				fprintf(stderr, "free space extent %Lu-%Lu overlaps with previous %Lu-%Lu\n",
+					cur_start, cur_end,
+					prev_start, prev_end);
+				(*errors)++;
+			} else if (cur_start == prev_end) {
+				fprintf(stderr, "free space extent %Lu-%Lu is unmerged with previous %Lu-%Lu\n",
+					cur_start, cur_end,
+					prev_start, prev_end);
+				(*errors)++;
+			}
+		}
+
+		ret = btrfs_add_free_space(block_group->free_space_ctl,
+					   key.objectid, key.offset);
+		if (ret)
+			goto out;
+		extent_count++;
+
+		prev_key = key;
+		have_prev = 1;
+	}
+
+	if (extent_count != expected_extent_count) {
+		fprintf(stderr, "free space info recorded %u extents, counted %u\n",
+			expected_extent_count, extent_count);
+		(*errors)++;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+int load_free_space_tree(struct btrfs_fs_info *fs_info,
+			 struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_free_space_info *info;
+	struct btrfs_path *path;
+	u32 extent_count, flags;
+	int errors = 0;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	path->reada = 1;
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = load_free_space_bitmaps(fs_info, block_group, path,
+					      extent_count, &errors);
+	} else {
+		ret = load_free_space_extents(fs_info, block_group, path,
+					      extent_count, &errors);
+	}
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	return ret ? ret : errors;
+}
diff --git a/free-space-tree.h b/free-space-tree.h
new file mode 100644
index 000000000000..7529a46890da
--- /dev/null
+++ b/free-space-tree.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_FREE_SPACE_TREE_H__
+#define __BTRFS_FREE_SPACE_TREE_H__
+
+int load_free_space_tree(struct btrfs_fs_info *fs_info,
+			 struct btrfs_block_group_cache *block_group);
+
+#endif
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread
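
[Editor's note: the bitmap-loading loop in load_free_space_bitmaps() above
reconstructs free-space extents from runs of set bits. Below is a standalone
sketch of that run-length decoding, using a flat byte array in place of the
extent-buffer bitmap and output arrays in place of btrfs_add_free_space();
the function name, the output-array interface, and the 4K sector size in the
usage are illustrative assumptions, not part of the patch.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode runs of 1-bits in `bitmap` (nbits bits, one bit per sector of
 * `sectorsize` bytes, starting at logical address `base`) into free-space
 * extents, mirroring the prev_bit/extent_start loop in
 * load_free_space_bitmaps(). Found extents are written to starts[]/lens[]
 * (assumed large enough); the return value is the extent count, which the
 * patch compares against the count recorded in the free space info item. */
static unsigned decode_free_space_bitmap(const uint8_t *bitmap, size_t nbits,
					 uint64_t base, uint64_t sectorsize,
					 uint64_t *starts, uint64_t *lens)
{
	unsigned extent_count = 0;
	int prev_bit = 0;
	uint64_t extent_start = 0;
	size_t i;

	for (i = 0; i < nbits; i++) {
		int bit = (bitmap[i / 8] >> (i % 8)) & 1;
		uint64_t offset = base + i * sectorsize;

		if (prev_bit == 0 && bit == 1) {
			extent_start = offset;		/* a free run begins */
		} else if (prev_bit == 1 && bit == 0) {
			starts[extent_count] = extent_start;
			lens[extent_count] = offset - extent_start;
			extent_count++;			/* a free run ends */
		}
		prev_bit = bit;
	}
	/* A run that reaches the end of the bitmap is closed after the loop,
	 * just as the patch does after its btrfs_next_item() loop. */
	if (prev_bit == 1) {
		starts[extent_count] = extent_start;
		lens[extent_count] = base + nbits * sectorsize - extent_start;
		extent_count++;
	}
	return extent_count;
}
```

For example, the two-byte bitmap {0x0E, 0x01} at a 4K sector size decodes to
two extents: one of three sectors starting at bit 1, and one single sector at
bit 8.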

* Re: [PATCH 1/6] Btrfs: add extent buffer bitmap operations
  2015-09-01 19:01 ` [PATCH 1/6] Btrfs: add extent buffer bitmap operations Omar Sandoval
@ 2015-09-01 19:25   ` Josef Bacik
  2015-09-01 19:37     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:25 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:01 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> These are going to be used for the free space tree bitmap items.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Can we get sanity tests for these operations so we know they are 
properly unit tested?

> ---
>   fs/btrfs/extent_io.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/extent_io.h |   6 +++
>   2 files changed, 107 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 02d05817cbdf..649e3b4eeb1b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5475,6 +5475,107 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
>   	}
>   }
>
> +/*
> + * The extent buffer bitmap operations are done with byte granularity because
> + * bitmap items are not guaranteed to be aligned to a word and therefore a
> + * single word in a bitmap may straddle two pages in the extent buffer.
> + */
> +#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
> +#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
> +#define BITMAP_FIRST_BYTE_MASK(start) \
> +	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
> +#define BITMAP_LAST_BYTE_MASK(nbits) \
> +	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
> +
> +int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
> +			   unsigned long nr)
> +{
> +	size_t offset;
> +	char *kaddr;
> +	struct page *page;
> +	size_t byte_offset = BIT_BYTE(nr);
> +	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> +	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> +
> +	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> +	page = eb->pages[i];
> +	WARN_ON(!PageUptodate(page));
> +	kaddr = page_address(page);
> +	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
> +}
> +
> +void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
> +			      unsigned long pos, unsigned long len)
> +{
> +	size_t offset;
> +	char *kaddr;
> +	struct page *page;
> +	size_t byte_offset = BIT_BYTE(pos);
> +	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> +	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> +	const unsigned int size = pos + len;
> +	int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> +	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
> +
> +	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> +	page = eb->pages[i];
> +	WARN_ON(!PageUptodate(page));
> +	kaddr = page_address(page);
> +
> +	while (len >= bits_to_set) {
> +		kaddr[offset] |= mask_to_set;
> +		len -= bits_to_set;
> +		bits_to_set = BITS_PER_BYTE;
> +		mask_to_set = ~0U;
> +		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
> +			offset = 0;
> +			page = eb->pages[++i];
> +			WARN_ON(!PageUptodate(page));
> +			kaddr = page_address(page);
> +		}
> +	}
> +	if (len) {
> +		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> +		kaddr[offset] |= mask_to_set;
> +	}
> +}
> +
> +void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
> +				unsigned long pos, unsigned long len)
> +{
> +	size_t offset;
> +	char *kaddr;
> +	struct page *page;
> +	size_t byte_offset = BIT_BYTE(pos);
> +	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> +	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> +	const unsigned int size = pos + len;
> +	int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> +	unsigned int mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
> +
> +	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> +	page = eb->pages[i];
> +	WARN_ON(!PageUptodate(page));
> +	kaddr = page_address(page);
> +

Abstract this offset finding logic to a helper function and then comment 
the hell out of it, I now have a migraine trying to figure out what is 
going on.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread
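
[Editor's note: a sketch of the kind of offset-finding helper requested in
the review above. It isolates the start_offset/byte_offset arithmetic that is
repeated in extent_buffer_test_bit(), extent_buffer_bitmap_set(), and
extent_buffer_bitmap_clear(). The helper name and signature are hypothetical,
and PAGE_CACHE_SIZE is fixed at 4 KiB here purely for illustration.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_CACHE_SHIFT 12			/* 4 KiB pages, for illustration */
#define PAGE_CACHE_SIZE (1UL << PAGE_CACHE_SHIFT)
#define BITS_PER_BYTE 8

/* Given the logical start of an extent buffer (eb_start), the byte offset
 * of a bitmap item within it (item_start), and a bit number (nr), compute
 * which page of the buffer holds that bit and the byte offset within that
 * page. */
static void eb_bitmap_offset(uint64_t eb_start, unsigned long item_start,
			     unsigned long nr, unsigned long *page_index,
			     size_t *page_offset)
{
	/* The extent buffer need not start on a page boundary; this is how
	 * far into its first page it begins. */
	size_t start_offset = eb_start & (PAGE_CACHE_SIZE - 1);
	/* The byte within the buffer that holds bit `nr` of the item. */
	size_t byte = item_start + nr / BITS_PER_BYTE;

	*page_index = (start_offset + byte) >> PAGE_CACHE_SHIFT;
	*page_offset = (start_offset + byte) & (PAGE_CACHE_SIZE - 1);
}
```

With the pair computed in one place, the callers reduce to
`kaddr = page_address(eb->pages[page_index])` plus a mask on
`kaddr[page_offset]`.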

* Re: [PATCH 2/6] Btrfs: add helpers for read-only compat bits
  2015-09-01 19:01 ` [PATCH 2/6] Btrfs: add helpers for read-only compat bits Omar Sandoval
@ 2015-09-01 19:26   ` Josef Bacik
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:26 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:01 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> We're finally going to add one of these for the free space tree, so
> let's add the same nice helpers that we have for the incompat bits.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format
  2015-09-01 19:01 ` [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
@ 2015-09-01 19:28   ` Josef Bacik
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:28 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:02 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The on-disk format for the free space tree is straightforward. Each
> block group is represented in the free space tree by a free space info
> item that stores accounting information: whether the free space for this
> block group is stored as bitmaps or extents and how many extents of free
> space exist for this block group (regardless of which format is being
> used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
> with no corresponding item, and bitmaps instead have the
> FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
> array of bytes.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread
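
[Editor's note: the commit message above says extents are keys with no item
body while bitmaps carry a byte-array item, which implies a crossover point
where bitmaps become cheaper in leaf space. The sketch below mirrors the
threshold arithmetic from set_free_space_tree_thresholds() in patch 4.
The 25-byte on-disk btrfs_item header (17-byte disk key plus two __le32s)
and the 256-byte default bitmap size are assumptions of this sketch, stated
here rather than taken from this patch.]

```c
#include <assert.h>
#include <stdint.h>

#define ITEM_SIZE 25ULL		/* assumed sizeof(struct btrfs_item) on disk */
#define BITMAP_BYTES 256ULL	/* default bitmap item size from patch 4 */
#define BITS_PER_BYTE 8ULL
#define BITMAP_BITS (BITMAP_BYTES * BITS_PER_BYTE)

/* Number of free-space extents at which storing the block group as bitmaps
 * stops costing more leaf space than storing it as extent keys: each extent
 * costs one item header (no item body), while covering the whole block group
 * with bitmaps costs a fixed number of (header + bitmap) items. */
static uint32_t bitmap_high_thresh(uint64_t bg_size, uint32_t sectorsize)
{
	uint64_t bitmap_range = (uint64_t)sectorsize * BITMAP_BITS;
	uint64_t num_bitmaps = (bg_size + bitmap_range - 1) / bitmap_range;
	uint64_t total = num_bitmaps * (ITEM_SIZE + BITMAP_BYTES);

	return (uint32_t)(total / ITEM_SIZE);
}
```

Under these assumed sizes, a 1 GiB block group with 4 KiB sectors needs 128
bitmaps of 8 MiB coverage each, so the format flips once the extent count
climbs into the low thousands.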

* Re: [PATCH 1/6] Btrfs: add extent buffer bitmap operations
  2015-09-01 19:25   ` Josef Bacik
@ 2015-09-01 19:37     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 19:37 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Omar Sandoval

On Tue, Sep 01, 2015 at 03:25:54PM -0400, Josef Bacik wrote:
> On 09/01/2015 03:01 PM, Omar Sandoval wrote:
> >From: Omar Sandoval <osandov@fb.com>
> >
> >These are going to be used for the free space tree bitmap items.
> >
> >Signed-off-by: Omar Sandoval <osandov@fb.com>
> 
> Can we get sanity tests for these operations so we know they are properly
> unit tested?
> 

No problem, I'll do that.

> >---
> >  fs/btrfs/extent_io.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/extent_io.h |   6 +++
> >  2 files changed, 107 insertions(+)
> >
> >diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >index 02d05817cbdf..649e3b4eeb1b 100644
> >--- a/fs/btrfs/extent_io.c
> >+++ b/fs/btrfs/extent_io.c
> >@@ -5475,6 +5475,107 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
> >  	}
> >  }
> >
> >+/*
> >+ * The extent buffer bitmap operations are done with byte granularity because
> >+ * bitmap items are not guaranteed to be aligned to a word and therefore a
> >+ * single word in a bitmap may straddle two pages in the extent buffer.
> >+ */
> >+#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
> >+#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
> >+#define BITMAP_FIRST_BYTE_MASK(start) \
> >+	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
> >+#define BITMAP_LAST_BYTE_MASK(nbits) \
> >+	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
> >+
> >+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
> >+			   unsigned long nr)
> >+{
> >+	size_t offset;
> >+	char *kaddr;
> >+	struct page *page;
> >+	size_t byte_offset = BIT_BYTE(nr);
> >+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> >+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> >+
> >+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> >+	page = eb->pages[i];
> >+	WARN_ON(!PageUptodate(page));
> >+	kaddr = page_address(page);
> >+	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
> >+}
> >+
> >+void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
> >+			      unsigned long pos, unsigned long len)
> >+{
> >+	size_t offset;
> >+	char *kaddr;
> >+	struct page *page;
> >+	size_t byte_offset = BIT_BYTE(pos);
> >+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> >+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> >+	const unsigned int size = pos + len;
> >+	int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> >+	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
> >+
> >+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> >+	page = eb->pages[i];
> >+	WARN_ON(!PageUptodate(page));
> >+	kaddr = page_address(page);
> >+
> >+	while (len >= bits_to_set) {
> >+		kaddr[offset] |= mask_to_set;
> >+		len -= bits_to_set;
> >+		bits_to_set = BITS_PER_BYTE;
> >+		mask_to_set = ~0U;
> >+		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
> >+			offset = 0;
> >+			page = eb->pages[++i];
> >+			WARN_ON(!PageUptodate(page));
> >+			kaddr = page_address(page);
> >+		}
> >+	}
> >+	if (len) {
> >+		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
> >+		kaddr[offset] |= mask_to_set;
> >+	}
> >+}
> >+
> >+void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
> >+				unsigned long pos, unsigned long len)
> >+{
> >+	size_t offset;
> >+	char *kaddr;
> >+	struct page *page;
> >+	size_t byte_offset = BIT_BYTE(pos);
> >+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
> >+	unsigned long i = (start_offset + start + byte_offset) >> PAGE_CACHE_SHIFT;
> >+	const unsigned int size = pos + len;
> >+	int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
> >+	unsigned int mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
> >+
> >+	offset = (start_offset + start + byte_offset) & (PAGE_CACHE_SIZE - 1);
> >+	page = eb->pages[i];
> >+	WARN_ON(!PageUptodate(page));
> >+	kaddr = page_address(page);
> >+
> 
> Abstract this offset finding logic to a helper function and then comment the
> hell out of it, I now have a migraine trying to figure out what is going on.
> Thanks,
> 
> Josef

Will do, thanks.

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread
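
[Editor's note: the first/last-byte masks discussed in this subthread are the
core of the byte-granularity bitmap operations. Below is a sketch of the set
path over a flat byte array, using the same macros as the patch but dropping
the page-crossing bookkeeping, which is the only btrfs-specific part; the
function name is hypothetical.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_BYTE 8
#define BYTE_MASK ((1U << BITS_PER_BYTE) - 1)
#define BITMAP_FIRST_BYTE_MASK(start) \
	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
#define BITMAP_LAST_BYTE_MASK(nbits) \
	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))

/* Set bits [pos, pos + len) in `map`, one byte at a time, following the
 * structure of extent_buffer_bitmap_set(): a possibly-partial leading byte,
 * whole middle bytes, then a possibly-partial trailing byte. */
static void bitmap_set_bytes(uint8_t *map, unsigned long pos, unsigned long len)
{
	size_t offset = pos / BITS_PER_BYTE;
	const unsigned long size = pos + len;
	unsigned long bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);

	/* The leading byte and any whole bytes after it... */
	while (len >= bits_to_set) {
		map[offset] |= mask_to_set;
		len -= bits_to_set;
		bits_to_set = BITS_PER_BYTE;
		mask_to_set = BYTE_MASK;
		offset++;
	}
	/* ...then the trailing partial byte, if any bits remain. */
	if (len) {
		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
		map[offset] |= mask_to_set;
	}
}
```

Setting bits 3 through 9 of a two-byte map, for instance, exercises both the
first-byte mask (0xF8 in byte 0) and the last-byte mask (0x03 in byte 1).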

* Re: [PATCH 4/6] Btrfs: implement the free space B-tree
  2015-09-01 19:13 ` [PATCH 4/6] Btrfs: implement the free space B-tree Omar Sandoval
@ 2015-09-01 19:44   ` Josef Bacik
  2015-09-01 20:06     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:44 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:13 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The free space cache has turned out to be a scalability bottleneck on
> large, busy filesystems. When the cache for a lot of block groups needs
> to be written out, we can get extremely long commit times; if this
> happens in the critical section, things are especially bad because we
> block new transactions from happening.
>
> The main problem with the free space cache is that it has to be written
> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
> to store free space fixes this: updates can be done as needed and we get
> all of the benefits of using a B-tree: checksumming, RAID handling,
> well-understood behavior.
>
> With the free space tree, we get commit times that are about the same as
> the no cache case with load times slower than the free space cache case
> but still much faster than the no cache case. Free space is represented
> with extents until it becomes more space-efficient to use bitmaps,
> giving us similar space overhead to the free space cache.
>
> The operations on the free space tree are: adding and removing free
> space, handling the creation and deletion of block groups, and loading
> the free space for a block group. We can also create the free space tree
> by walking the extent tree.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>   fs/btrfs/Makefile          |    2 +-
>   fs/btrfs/ctree.h           |   25 +-
>   fs/btrfs/extent-tree.c     |   15 +-
>   fs/btrfs/free-space-tree.c | 1468 ++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/free-space-tree.h |   39 ++
>   5 files changed, 1541 insertions(+), 8 deletions(-)
>   create mode 100644 fs/btrfs/free-space-tree.c
>   create mode 100644 fs/btrfs/free-space-tree.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 6d1d0b93b1aa..766169709146 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>   	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>   	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
> -	   uuid-tree.o props.o hash.o
> +	   uuid-tree.o props.o hash.o free-space-tree.o
>
>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 34a81a79f5b6..d49181d35f08 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1299,8 +1299,20 @@ struct btrfs_block_group_cache {
>   	u64 delalloc_bytes;
>   	u64 bytes_super;
>   	u64 flags;
> -	u64 sectorsize;
>   	u64 cache_generation;
> +	u32 sectorsize;
> +
> +	/*
> +	 * If the free space extent count exceeds this number, convert the block
> +	 * group to bitmaps.
> +	 */
> +	u32 bitmap_high_thresh;
> +
> +	/*
> +	 * If the free space extent count drops below this number, convert the
> +	 * block group back to extents.
> +	 */
> +	u32 bitmap_low_thresh;
>
>   	/*
>   	 * It is just used for the delayed data space allocation because
> @@ -1356,6 +1368,9 @@ struct btrfs_block_group_cache {
>   	struct list_head io_list;
>
>   	struct btrfs_io_ctl io_ctl;
> +
> +	/* Lock for free space tree operations. */
> +	struct mutex free_space_lock;
>   };
>
>   /* delayed seq elem */
> @@ -1407,6 +1422,7 @@ struct btrfs_fs_info {
>   	struct btrfs_root *csum_root;
>   	struct btrfs_root *quota_root;
>   	struct btrfs_root *uuid_root;
> +	struct btrfs_root *free_space_root;
>
>   	/* the log root tree is a directory of all the other log roots */
>   	struct btrfs_root *log_root_tree;
> @@ -3556,6 +3572,13 @@ void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
>   void check_system_chunk(struct btrfs_trans_handle *trans,
>   			struct btrfs_root *root,
>   			const u64 type);
> +void free_excluded_extents(struct btrfs_root *root,
> +			   struct btrfs_block_group_cache *cache);
> +int exclude_super_stripes(struct btrfs_root *root,
> +			  struct btrfs_block_group_cache *cache);
> +u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> +		       struct btrfs_fs_info *info, u64 start, u64 end);
> +
>   /* ctree.c */
>   int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
>   		     int level, int *slot);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 07204bf601ed..37179a569f40 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -237,8 +237,8 @@ static int add_excluded_extent(struct btrfs_root *root,
>   	return 0;
>   }
>
> -static void free_excluded_extents(struct btrfs_root *root,
> -				  struct btrfs_block_group_cache *cache)
> +void free_excluded_extents(struct btrfs_root *root,
> +			   struct btrfs_block_group_cache *cache)
>   {
>   	u64 start, end;
>
> @@ -251,14 +251,16 @@ static void free_excluded_extents(struct btrfs_root *root,
>   			  start, end, EXTENT_UPTODATE, GFP_NOFS);
>   }
>
> -static int exclude_super_stripes(struct btrfs_root *root,
> -				 struct btrfs_block_group_cache *cache)
> +int exclude_super_stripes(struct btrfs_root *root,
> +			  struct btrfs_block_group_cache *cache)
>   {
>   	u64 bytenr;
>   	u64 *logical;
>   	int stripe_len;
>   	int i, nr, ret;
>
> +	cache->bytes_super = 0;
> +
>   	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
>   		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
>   		cache->bytes_super += stripe_len;
> @@ -337,8 +339,8 @@ static void put_caching_control(struct btrfs_caching_control *ctl)
>    * we need to check the pinned_extents for any extents that can't be used yet
>    * since their free space will be released as soon as the transaction commits.
>    */
> -static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> -			      struct btrfs_fs_info *info, u64 start, u64 end)
> +u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> +		       struct btrfs_fs_info *info, u64 start, u64 end)
>   {
>   	u64 extent_start, extent_end, size, total_added = 0;
>   	int ret;
> @@ -9281,6 +9283,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>   	INIT_LIST_HEAD(&cache->io_list);
>   	btrfs_init_free_space_ctl(cache);
>   	atomic_set(&cache->trimming, 0);
> +	mutex_init(&cache->free_space_lock);
>
>   	return cache;
>   }
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> new file mode 100644
> index 000000000000..bbb4f731f948
> --- /dev/null
> +++ b/fs/btrfs/free-space-tree.c
> @@ -0,0 +1,1468 @@
> +/*
> + * Copyright (C) 2015 Facebook.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/vmalloc.h>
> +#include "ctree.h"
> +#include "disk-io.h"
> +#include "locking.h"
> +#include "free-space-tree.h"
> +#include "transaction.h"
> +
> +/*
> + * The default size for new free space bitmap items. The last bitmap in a block
> + * group may be truncated, and none of the free space tree code assumes that
> + * existing bitmaps are this size.
> + */
> +#define BTRFS_FREE_SPACE_BITMAP_SIZE 256
> +#define BTRFS_FREE_SPACE_BITMAP_BITS (BTRFS_FREE_SPACE_BITMAP_SIZE * BITS_PER_BYTE)
> +
> +void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache)
> +{
> +	u32 bitmap_range;
> +	size_t bitmap_size;
> +	u64 num_bitmaps, total_bitmap_size;
> +
> +	/*
> +	 * We convert to bitmaps when the disk space required for using extents
> +	 * exceeds that required for using bitmaps.
> +	 */
> +	bitmap_range = cache->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
> +	num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
> +			      bitmap_range);
> +	bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
> +	total_bitmap_size = num_bitmaps * bitmap_size;
> +	cache->bitmap_high_thresh = div_u64(total_bitmap_size,
> +					    sizeof(struct btrfs_item));
> +
> +	/*
> +	 * We allow for a small buffer between the high threshold and low
> +	 * threshold to avoid thrashing back and forth between the two formats.
> +	 */
> +	if (cache->bitmap_high_thresh > 100)
> +		cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
> +	else
> +		cache->bitmap_low_thresh = 0;
> +}
> +
> +static int add_new_free_space_info(struct btrfs_trans_handle *trans,
> +				   struct btrfs_fs_info *fs_info,
> +				   struct btrfs_block_group_cache *block_group,
> +				   struct btrfs_path *path)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_free_space_info *info;
> +	struct btrfs_key key;
> +	struct extent_buffer *leaf;
> +	int ret;
> +
> +	key.objectid = block_group->key.objectid;
> +	key.type = BTRFS_FREE_SPACE_INFO_KEY;
> +	key.offset = block_group->key.offset;
> +
> +	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*info));
> +	if (ret)
> +		goto out;
> +
> +	leaf = path->nodes[0];
> +	info = btrfs_item_ptr(leaf, path->slots[0],
> +			      struct btrfs_free_space_info);
> +	btrfs_set_free_space_extent_count(leaf, info, 0);
> +	btrfs_set_free_space_flags(leaf, info, 0);
> +	btrfs_mark_buffer_dirty(leaf);
> +
> +	ret = 0;
> +out:
> +	btrfs_release_path(path);
> +	return ret;
> +}
> +
> +static struct btrfs_free_space_info *
> +search_free_space_info(struct btrfs_trans_handle *trans,
> +		       struct btrfs_fs_info *fs_info,
> +		       struct btrfs_block_group_cache *block_group,
> +		       struct btrfs_path *path, int cow)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key;
> +	int ret;
> +
> +	key.objectid = block_group->key.objectid;
> +	key.type = BTRFS_FREE_SPACE_INFO_KEY;
> +	key.offset = block_group->key.offset;
> +
> +	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +	if (ret != 0) {
> +		btrfs_warn(fs_info, "missing free space info for %llu",
> +			   block_group->key.objectid);
> +		ASSERT(0);
> +		return ERR_PTR(-ENOENT);
> +	}
> +
> +	return btrfs_item_ptr(path->nodes[0], path->slots[0],
> +			      struct btrfs_free_space_info);
> +}
> +
> +/*
> + * btrfs_search_slot() but we're looking for the greatest key less than the
> + * passed key.
> + */
> +static int btrfs_search_prev_slot(struct btrfs_trans_handle *trans,
> +				  struct btrfs_root *root,
> +				  struct btrfs_key *key, struct btrfs_path *p,
> +				  int ins_len, int cow)
> +{
> +	int ret;
> +
> +	ret = btrfs_search_slot(trans, root, key, p, ins_len, cow);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (ret == 0) {
> +		ASSERT(0);
> +		return -EIO;
> +	}
> +
> +	if (p->slots[0] == 0) {
> +		ASSERT(0);
> +		return -EIO;
> +	}
> +	p->slots[0]--;
> +
> +	return 0;
> +}
> +
> +static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
> +{
> +	return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
> +}
> +
> +static unsigned long *alloc_bitmap(u32 bitmap_size)
> +{
> +	return __vmalloc(bitmap_size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
> +			 PAGE_KERNEL);
> +}
> +
> +static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
> +					 struct btrfs_fs_info *fs_info,
> +					 struct btrfs_block_group_cache *block_group,
> +					 struct btrfs_path *path)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_free_space_info *info;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	unsigned long *bitmap;
> +	char *bitmap_cursor;
> +	u64 start, end;
> +	u64 bitmap_range, i;
> +	u32 bitmap_size, flags, expected_extent_count;
> +	u32 extent_count = 0;
> +	int done = 0, nr;
> +	int ret;
> +
> +	bitmap_size = free_space_bitmap_size(block_group->key.offset,
> +					     block_group->sectorsize);
> +	bitmap = alloc_bitmap(bitmap_size);
> +	if (!bitmap)
> +		return -ENOMEM;
> +
> +	start = block_group->key.objectid;
> +	end = block_group->key.objectid + block_group->key.offset;
> +
> +	key.objectid = end - 1;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	while (!done) {
> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +		if (ret)
> +			goto out;
> +
> +		leaf = path->nodes[0];
> +		nr = 0;
> +		path->slots[0]++;
> +		while (path->slots[0] > 0) {
> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> +
> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> +				ASSERT(found_key.objectid == block_group->key.objectid);
> +				ASSERT(found_key.offset == block_group->key.offset);
> +				done = 1;
> +				break;
> +			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
> +				u64 first, last;
> +
> +				ASSERT(found_key.objectid >= start);
> +				ASSERT(found_key.objectid < end);
> +				ASSERT(found_key.objectid + found_key.offset <= end);
> +
> +				first = div_u64(found_key.objectid - start,
> +						block_group->sectorsize);
> +				last = div_u64(found_key.objectid + found_key.offset - start,
> +					       block_group->sectorsize);
> +				bitmap_set(bitmap, first, last - first);
> +
> +				extent_count++;
> +				nr++;
> +				path->slots[0]--;
> +			} else {
> +				ASSERT(0);
> +			}
> +		}
> +
> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> +		if (ret)

We could have deleted items on previous iterations of this loop, so we
need to abort the transaction here as well.

> +			goto out;
> +		btrfs_release_path(path);
> +	}
> +
> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	leaf = path->nodes[0];
> +	flags = btrfs_free_space_flags(leaf, info);
> +	flags |= BTRFS_FREE_SPACE_USING_BITMAPS;
> +	btrfs_set_free_space_flags(leaf, info, flags);
> +	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
> +	btrfs_mark_buffer_dirty(leaf);
> +	btrfs_release_path(path);
> +
> +	if (extent_count != expected_extent_count) {
> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> +			  block_group->key.objectid, extent_count,
> +			  expected_extent_count);

We should also abort the transaction here, since we will have already 
deleted the normal entries and would leave a corrupted fs if we were 
allowed to continue.

> +		ASSERT(0);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	bitmap_cursor = (char *)bitmap;
> +	bitmap_range = block_group->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
> +	i = start;
> +	while (i < end) {
> +		unsigned long ptr;
> +		u64 extent_size;
> +		u32 data_size;
> +
> +		extent_size = min(end - i, bitmap_range);
> +		data_size = free_space_bitmap_size(extent_size,
> +						   block_group->sectorsize);
> +
> +		key.objectid = i;
> +		key.type = BTRFS_FREE_SPACE_BITMAP_KEY;
> +		key.offset = extent_size;
> +
> +		ret = btrfs_insert_empty_item(trans, root, path, &key,
> +					      data_size);
> +		if (ret)

Need to abort here as well.

> +			goto out;
> +
> +		leaf = path->nodes[0];
> +		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +		write_extent_buffer(leaf, bitmap_cursor, ptr,
> +				    data_size);
> +		btrfs_mark_buffer_dirty(leaf);
> +		btrfs_release_path(path);
> +
> +		i += extent_size;
> +		bitmap_cursor += data_size;
> +	}
> +
> +	ret = 0;
> +out:

Maybe have the if (ret) btrfs_abort_transaction() here.

> +	vfree(bitmap);
> +	return ret;
> +}
> +
> +static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
> +					 struct btrfs_fs_info *fs_info,
> +					 struct btrfs_block_group_cache *block_group,
> +					 struct btrfs_path *path)
> +{

You need to abort in the appropriate places here as well.

> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_free_space_info *info;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	unsigned long *bitmap;
> +	u64 start, end;
> +	/* Initialize to silence GCC. */
> +	u64 extent_start = 0;
> +	u64 offset;
> +	u32 bitmap_size, flags, expected_extent_count;
> +	int prev_bit = 0, bit, bitnr;
> +	u32 extent_count = 0;
> +	int done = 0, nr;
> +	int ret;
> +
> +	bitmap_size = free_space_bitmap_size(block_group->key.offset,
> +					     block_group->sectorsize);
> +	bitmap = alloc_bitmap(bitmap_size);
> +	if (!bitmap)
> +		return -ENOMEM;
> +
> +	start = block_group->key.objectid;
> +	end = block_group->key.objectid + block_group->key.offset;
> +
> +	key.objectid = end - 1;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	while (!done) {
> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +		if (ret)
> +			goto out;
> +
> +		leaf = path->nodes[0];
> +		nr = 0;
> +		path->slots[0]++;
> +		while (path->slots[0] > 0) {
> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> +
> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> +				ASSERT(found_key.objectid == block_group->key.objectid);
> +				ASSERT(found_key.offset == block_group->key.offset);
> +				done = 1;
> +				break;
> +			} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
> +				unsigned long ptr;
> +				char *bitmap_cursor;
> +				u32 bitmap_pos, data_size;
> +
> +				ASSERT(found_key.objectid >= start);
> +				ASSERT(found_key.objectid < end);
> +				ASSERT(found_key.objectid + found_key.offset <= end);
> +
> +				bitmap_pos = div_u64(found_key.objectid - start,
> +						     block_group->sectorsize *
> +						     BITS_PER_BYTE);
> +				bitmap_cursor = ((char *)bitmap) + bitmap_pos;
> +				data_size = free_space_bitmap_size(found_key.offset,
> +								   block_group->sectorsize);
> +
> +				ptr = btrfs_item_ptr_offset(leaf, path->slots[0] - 1);
> +				read_extent_buffer(leaf, bitmap_cursor, ptr,
> +						   data_size);
> +
> +				nr++;
> +				path->slots[0]--;
> +			} else {
> +				ASSERT(0);
> +			}
> +		}
> +
> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> +		if (ret)
> +			goto out;
> +		btrfs_release_path(path);
> +	}
> +
> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	leaf = path->nodes[0];
> +	flags = btrfs_free_space_flags(leaf, info);
> +	flags &= ~BTRFS_FREE_SPACE_USING_BITMAPS;
> +	btrfs_set_free_space_flags(leaf, info, flags);
> +	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
> +	btrfs_mark_buffer_dirty(leaf);
> +	btrfs_release_path(path);
> +
> +	offset = start;
> +	bitnr = 0;
> +	while (offset < end) {
> +		bit = !!test_bit(bitnr, bitmap);
> +		if (prev_bit == 0 && bit == 1) {
> +			extent_start = offset;
> +		} else if (prev_bit == 1 && bit == 0) {
> +			key.objectid = extent_start;
> +			key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> +			key.offset = offset - extent_start;
> +
> +			ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> +			if (ret)
> +				goto out;
> +			btrfs_release_path(path);
> +
> +			extent_count++;
> +		}
> +		prev_bit = bit;
> +		offset += block_group->sectorsize;
> +		bitnr++;
> +	}
> +	if (prev_bit == 1) {
> +		key.objectid = extent_start;
> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> +		key.offset = end - extent_start;
> +
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> +		if (ret)
> +			goto out;
> +		btrfs_release_path(path);
> +
> +		extent_count++;
> +	}
> +
> +	if (extent_count != expected_extent_count) {
> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> +			  block_group->key.objectid, extent_count,
> +			  expected_extent_count);
> +		ASSERT(0);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +out:
> +	vfree(bitmap);
> +	return ret;
> +}
> +
> +static int update_free_space_extent_count(struct btrfs_trans_handle *trans,
> +					  struct btrfs_fs_info *fs_info,
> +					  struct btrfs_block_group_cache *block_group,
> +					  struct btrfs_path *path,
> +					  int new_extents)
> +{
> +	struct btrfs_free_space_info *info;
> +	u32 flags;
> +	u32 extent_count;
> +	int ret = 0;
> +
> +	if (new_extents == 0)
> +		return 0;
> +
> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	flags = btrfs_free_space_flags(path->nodes[0], info);
> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
> +
> +	extent_count += new_extents;
> +	btrfs_set_free_space_extent_count(path->nodes[0], info, extent_count);
> +	btrfs_mark_buffer_dirty(path->nodes[0]);
> +	btrfs_release_path(path);
> +
> +	if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
> +	    extent_count > block_group->bitmap_high_thresh) {
> +		ret = convert_free_space_to_bitmaps(trans, fs_info, block_group,
> +						    path);
> +	} else if ((flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
> +		   extent_count < block_group->bitmap_low_thresh) {
> +		ret = convert_free_space_to_extents(trans, fs_info, block_group,
> +						    path);
> +	}
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
> +			       struct btrfs_path *path, u64 offset)
> +{
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 found_start, found_end;
> +	unsigned long ptr, i;
> +
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> +
> +	found_start = key.objectid;
> +	found_end = key.objectid + key.offset;
> +	ASSERT(offset >= found_start && offset < found_end);
> +
> +	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +	i = div_u64(offset - found_start, block_group->sectorsize);
> +	return !!extent_buffer_test_bit(leaf, ptr, i);
> +}
> +
> +static void free_space_set_bits(struct btrfs_block_group_cache *block_group,
> +				struct btrfs_path *path, u64 *start, u64 *size,
> +				int bit)
> +{
> +	struct extent_buffer *leaf;
> +	struct btrfs_key key;
> +	u64 end = *start + *size;
> +	u64 found_start, found_end;
> +	unsigned long ptr, first, last;
> +
> +	leaf = path->nodes[0];
> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> +	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> +
> +	found_start = key.objectid;
> +	found_end = key.objectid + key.offset;
> +	ASSERT(*start >= found_start && *start < found_end);
> +	ASSERT(end > found_start);
> +
> +	if (end > found_end)
> +		end = found_end;
> +
> +	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> +	first = div_u64(*start - found_start, block_group->sectorsize);
> +	last = div_u64(end - found_start, block_group->sectorsize);
> +	if (bit)
> +		extent_buffer_bitmap_set(leaf, ptr, first, last - first);
> +	else
> +		extent_buffer_bitmap_clear(leaf, ptr, first, last - first);
> +	btrfs_mark_buffer_dirty(leaf);
> +
> +	*size -= end - *start;
> +	*start = end;
> +}
> +
> +/*
> + * We can't use btrfs_next_item() in modify_free_space_bitmap() because
> + * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
> + * tree walking in btrfs_next_leaf() anyway because we know exactly what we're
> + * looking for.
> + */
> +static int free_space_next_bitmap(struct btrfs_trans_handle *trans,
> +				  struct btrfs_root *root, struct btrfs_path *p)
> +{
> +	struct btrfs_key key;
> +
> +	if (p->slots[0] + 1 < btrfs_header_nritems(p->nodes[0])) {
> +		p->slots[0]++;
> +		return 0;
> +	}
> +
> +	btrfs_item_key_to_cpu(p->nodes[0], &key, p->slots[0]);
> +	btrfs_release_path(p);
> +
> +	key.objectid += key.offset;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	return btrfs_search_prev_slot(trans, root, &key, p, 0, 1);
> +}
> +
> +/*
> + * If remove is 1, then we are removing free space, thus clearing bits in the
> + * bitmap. If remove is 0, then we are adding free space, thus setting bits in
> + * the bitmap.
> + */
> +static int modify_free_space_bitmap(struct btrfs_trans_handle *trans,
> +				    struct btrfs_fs_info *fs_info,
> +				    struct btrfs_block_group_cache *block_group,
> +				    struct btrfs_path *path,
> +				    u64 start, u64 size, int remove)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key;
> +	u64 end = start + size;
> +	u64 cur_start, cur_size;
> +	int prev_bit, next_bit;
> +	int new_extents;
> +	int ret;
> +
> +	/*
> +	 * Read the bit for the block immediately before the extent of space if
> +	 * that block is within the block group.
> +	 */
> +	if (start > block_group->key.objectid) {
> +		u64 prev_block = start - block_group->sectorsize;
> +
> +		key.objectid = prev_block;
> +		key.type = (u8)-1;
> +		key.offset = (u64)-1;
> +
> +		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
> +		if (ret)
> +			goto out;
> +
> +		prev_bit = free_space_test_bit(block_group, path, prev_block);
> +
> +		/* The previous block may have been in the previous bitmap. */
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		if (start >= key.objectid + key.offset) {
> +			ret = free_space_next_bitmap(trans, root, path);
> +			if (ret)
> +				goto out;
> +		}
> +	} else {
> +		key.objectid = start;
> +		key.type = (u8)-1;
> +		key.offset = (u64)-1;
> +
> +		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
> +		if (ret)
> +			goto out;
> +
> +		prev_bit = -1;
> +	}
> +
> +	/*
> +	 * Iterate over all of the bitmaps overlapped by the extent of space,
> +	 * clearing/setting bits as required.
> +	 */
> +	cur_start = start;
> +	cur_size = size;
> +	while (1) {
> +		free_space_set_bits(block_group, path, &cur_start, &cur_size,
> +				    !remove);
> +		if (cur_size == 0)
> +			break;
> +		ret = free_space_next_bitmap(trans, root, path);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	/*
> +	 * Read the bit for the block immediately after the extent of space if
> +	 * that block is within the block group.
> +	 */
> +	if (end < block_group->key.objectid + block_group->key.offset) {
> +		/* The next block may be in the next bitmap. */
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +		if (end >= key.objectid + key.offset) {
> +			ret = free_space_next_bitmap(trans, root, path);
> +			if (ret)
> +				goto out;
> +		}
> +
> +		next_bit = free_space_test_bit(block_group, path, end);
> +	} else {
> +		next_bit = -1;
> +	}
> +
> +	if (remove) {
> +		new_extents = -1;
> +		if (prev_bit == 1) {
> +			/* Leftover on the left. */
> +			new_extents++;
> +		}
> +		if (next_bit == 1) {
> +			/* Leftover on the right. */
> +			new_extents++;
> +		}
> +	} else {
> +		new_extents = 1;
> +		if (prev_bit == 1) {
> +			/* Merging with neighbor on the left. */
> +			new_extents--;
> +		}
> +		if (next_bit == 1) {
> +			/* Merging with neighbor on the right. */
> +			new_extents--;
> +		}
> +	}
> +
> +	btrfs_release_path(path);
> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> +					     new_extents);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +static int remove_free_space_extent(struct btrfs_trans_handle *trans,
> +				    struct btrfs_fs_info *fs_info,
> +				    struct btrfs_block_group_cache *block_group,
> +				    struct btrfs_path *path,
> +				    u64 start, u64 size)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key;
> +	u64 found_start, found_end;
> +	u64 end = start + size;
> +	int new_extents = -1;
> +	int ret;
> +
> +	key.objectid = start;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +	if (ret)
> +		goto out;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
> +
> +	found_start = key.objectid;
> +	found_end = key.objectid + key.offset;
> +	ASSERT(start >= found_start && end <= found_end);
> +
> +	/*
> +	 * Okay, now that we've found the free space extent which contains the
> +	 * free space that we are removing, there are four cases:
> +	 *
> +	 * 1. We're using the whole extent: delete the key we found and
> +	 * decrement the free space extent count.
> +	 * 2. We are using part of the extent starting at the beginning: delete
> +	 * the key we found and insert a new key representing the leftover at
> +	 * the end. There is no net change in the number of extents.
> +	 * 3. We are using part of the extent ending at the end: delete the key
> +	 * we found and insert a new key representing the leftover at the
> +	 * beginning. There is no net change in the number of extents.
> +	 * 4. We are using part of the extent in the middle: delete the key we
> +	 * found and insert two new keys representing the leftovers on each
> +	 * side. Where we used to have one extent, we now have two, so increment
> +	 * the extent count. We may need to convert the block group to bitmaps
> +	 * as a result.
> +	 */
> +
> +	/* Delete the existing key (cases 1-4). */
> +	ret = btrfs_del_item(trans, root, path);
> +	if (ret)
> +		goto out;
> +
> +	/* Add a key for leftovers at the beginning (cases 3 and 4). */
> +	if (start > found_start) {
> +		key.objectid = found_start;
> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> +		key.offset = start - found_start;
> +
> +		btrfs_release_path(path);
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> +		if (ret)
> +			goto out;
> +		new_extents++;
> +	}
> +
> +	/* Add a key for leftovers at the end (cases 2 and 4). */
> +	if (end < found_end) {
> +		key.objectid = end;
> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> +		key.offset = found_end - end;
> +
> +		btrfs_release_path(path);
> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> +		if (ret)
> +			goto out;
> +		new_extents++;
> +	}
> +
> +	btrfs_release_path(path);
> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> +					     new_extents);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}

A sanity test would be good for this.

> +
> +int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
> +				struct btrfs_fs_info *fs_info,
> +				u64 start, u64 size)
> +{
> +	struct btrfs_block_group_cache *block_group;
> +	struct btrfs_free_space_info *info;
> +	struct btrfs_path *path;
> +	u32 flags;
> +	int ret;
> +
> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> +		return 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	block_group = btrfs_lookup_block_group(fs_info, start);
> +	if (!block_group) {
> +		ASSERT(0);
> +		ret = -ENOENT;
> +		goto out_nobg;
> +	}
> +
> +	mutex_lock(&block_group->free_space_lock);
> +
> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	flags = btrfs_free_space_flags(path->nodes[0], info);
> +	btrfs_release_path(path);
> +
> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> +		ret = modify_free_space_bitmap(trans, fs_info, block_group,
> +					       path, start, size, 1);
> +	} else {
> +		ret = remove_free_space_extent(trans, fs_info, block_group,
> +					       path, start, size);
> +	}
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	mutex_unlock(&block_group->free_space_lock);
> +	btrfs_put_block_group(block_group);
> +out_nobg:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int add_free_space_extent(struct btrfs_trans_handle *trans,
> +				 struct btrfs_fs_info *fs_info,
> +				 struct btrfs_block_group_cache *block_group,
> +				 struct btrfs_path *path,
> +				 u64 start, u64 size)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key, new_key;
> +	u64 found_start, found_end;
> +	u64 end = start + size;
> +	int new_extents = 1;
> +	int ret;
> +
> +	/*
> +	 * We are adding a new extent of free space, but we need to merge
> +	 * extents. There are four cases here:
> +	 *
> +	 * 1. The new extent does not have any immediate neighbors to merge
> +	 * with: add the new key and increment the free space extent count. We
> +	 * may need to convert the block group to bitmaps as a result.
> +	 * 2. The new extent has an immediate neighbor before it: remove the
> +	 * previous key and insert a new key combining both of them. There is no
> +	 * net change in the number of extents.
> +	 * 3. The new extent has an immediate neighbor after it: remove the next
> +	 * key and insert a new key combining both of them. There is no net
> +	 * change in the number of extents.
> +	 * 4. The new extent has immediate neighbors on both sides: remove both
> +	 * of the keys and insert a new key combining all of them. Where we used
> +	 * to have two extents, we now have one, so decrement the extent count.
> +	 */
> +
> +	new_key.objectid = start;
> +	new_key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> +	new_key.offset = size;
> +
> +	/* Search for a neighbor on the left. */
> +	if (start == block_group->key.objectid)
> +		goto right;
> +	key.objectid = start - 1;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +	if (ret)
> +		goto out;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
> +		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
> +		btrfs_release_path(path);
> +		goto right;
> +	}
> +
> +	found_start = key.objectid;
> +	found_end = key.objectid + key.offset;
> +	ASSERT(found_start >= block_group->key.objectid &&
> +	       found_end > block_group->key.objectid);
> +	ASSERT(found_start < start && found_end <= start);
> +
> +	/*
> +	 * Delete the neighbor on the left and absorb it into the new key (cases
> +	 * 2 and 4).
> +	 */
> +	if (found_end == start) {
> +		ret = btrfs_del_item(trans, root, path);
> +		if (ret)
> +			goto out;
> +		new_key.objectid = found_start;
> +		new_key.offset += key.offset;
> +		new_extents--;
> +	}
> +	btrfs_release_path(path);
> +
> +right:
> +	/* Search for a neighbor on the right. */
> +	if (end == block_group->key.objectid + block_group->key.offset)
> +		goto insert;
> +	key.objectid = end;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +	if (ret)
> +		goto out;
> +
> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
> +		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
> +		btrfs_release_path(path);
> +		goto insert;
> +	}
> +
> +	found_start = key.objectid;
> +	found_end = key.objectid + key.offset;
> +	ASSERT(found_start >= block_group->key.objectid &&
> +	       found_end > block_group->key.objectid);
> +	ASSERT((found_start < start && found_end <= start) ||
> +	       (found_start >= end && found_end > end));
> +
> +	/*
> +	 * Delete the neighbor on the right and absorb it into the new key
> +	 * (cases 3 and 4).
> +	 */
> +	if (found_start == end) {
> +		ret = btrfs_del_item(trans, root, path);
> +		if (ret)
> +			goto out;
> +		new_key.offset += key.offset;
> +		new_extents--;
> +	}
> +	btrfs_release_path(path);
> +
> +insert:
> +	/* Insert the new key (cases 1-4). */
> +	ret = btrfs_insert_empty_item(trans, root, path, &new_key, 0);
> +	if (ret)
> +		goto out;
> +
> +	btrfs_release_path(path);
> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> +					     new_extents);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}

It would be good to have a sanity test for this to make sure all of your 
cases are covered and proven in a unit test.

> +
> +static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
> +				    struct btrfs_fs_info *fs_info,
> +				    struct btrfs_block_group_cache *block_group,
> +				    struct btrfs_path *path,
> +				    u64 start, u64 size)
> +{
> +	struct btrfs_free_space_info *info;
> +	u32 flags;
> +	int ret;
> +
> +	mutex_lock(&block_group->free_space_lock);
> +
> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	flags = btrfs_free_space_flags(path->nodes[0], info);
> +	btrfs_release_path(path);
> +
> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> +		ret = modify_free_space_bitmap(trans, fs_info, block_group,
> +					       path, start, size, 0);
> +	} else {
> +		ret = add_free_space_extent(trans, fs_info, block_group, path,
> +					    start, size);
> +	}
> +
> +out:
> +	mutex_unlock(&block_group->free_space_lock);
> +	return ret;
> +}
> +
> +int add_to_free_space_tree(struct btrfs_trans_handle *trans,
> +			   struct btrfs_fs_info *fs_info,
> +			   u64 start, u64 size)
> +{
> +	struct btrfs_block_group_cache *block_group;
> +	struct btrfs_path *path;
> +	int ret;
> +
> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> +		return 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	block_group = btrfs_lookup_block_group(fs_info, start);
> +	if (!block_group) {
> +		ASSERT(0);
> +		ret = -ENOENT;
> +		goto out_nobg;
> +	}
> +
> +	ret = __add_to_free_space_tree(trans, fs_info, block_group, path, start,
> +				       size);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	btrfs_put_block_group(block_group);
> +out_nobg:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int add_new_free_space_extent(struct btrfs_trans_handle *trans,
> +				     struct btrfs_fs_info *fs_info,
> +				     struct btrfs_block_group_cache *block_group,
> +				     struct btrfs_path *path,
> +				     u64 start, u64 end)
> +{
> +	u64 extent_start, extent_end;
> +	int ret;
> +
> +	while (start < end) {
> +		ret = find_first_extent_bit(fs_info->pinned_extents, start,
> +					    &extent_start, &extent_end,
> +					    EXTENT_DIRTY | EXTENT_UPTODATE,
> +					    NULL);
> +		if (ret)
> +			break;
> +
> +		if (extent_start <= start) {
> +			start = extent_end + 1;
> +		} else if (extent_start > start && extent_start < end) {
> +			ret = __add_to_free_space_tree(trans, fs_info,
> +						       block_group, path, start,
> +						       extent_start - start);
> +			btrfs_release_path(path);
> +			if (ret)
> +				return ret;
> +			start = extent_end + 1;
> +		} else {
> +			break;
> +		}
> +	}
> +	if (start < end) {
> +		ret = __add_to_free_space_tree(trans, fs_info, block_group,
> +					       path, start, end - start);
> +		btrfs_release_path(path);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Populate the free space tree by walking the extent tree, avoiding the super
> + * block mirrors. Operations on the extent tree that happen as a result of
> + * writes to the free space tree will go through the normal add/remove hooks.
> + */
> +static int populate_free_space_tree(struct btrfs_trans_handle *trans,
> +				    struct btrfs_fs_info *fs_info,
> +				    struct btrfs_block_group_cache *block_group)
> +{
> +	struct btrfs_root *extent_root = fs_info->extent_root;
> +	struct btrfs_path *path, *path2;
> +	struct btrfs_key key;
> +	u64 start, end;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +	path->reada = 1;
> +
> +	path2 = btrfs_alloc_path();
> +	if (!path2) {
> +		btrfs_free_path(path);
> +		return -ENOMEM;
> +	}
> +
> +	ret = add_new_free_space_info(trans, fs_info, block_group, path2);
> +	if (ret)
> +		goto out;
> +
> +	ret = exclude_super_stripes(extent_root, block_group);
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * Iterate through all of the extent and metadata items in this block
> +	 * group, adding the free space between them and the free space at the
> +	 * end. Note that EXTENT_ITEM and METADATA_ITEM are less than
> +	 * BLOCK_GROUP_ITEM, so an extent may precede the block group that it's
> +	 * contained in.
> +	 */
> +	key.objectid = block_group->key.objectid;
> +	key.type = BTRFS_EXTENT_ITEM_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_search_slot_for_read(extent_root, &key, path, 1, 0);
> +	if (ret < 0)
> +		goto out;
> +	ASSERT(ret == 0);
> +
> +	start = block_group->key.objectid;
> +	end = block_group->key.objectid + block_group->key.offset;
> +	while (1) {
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +		if (key.type == BTRFS_EXTENT_ITEM_KEY ||
> +		    key.type == BTRFS_METADATA_ITEM_KEY) {
> +			if (key.objectid >= end)
> +				break;
> +
> +			ret = add_new_free_space_extent(trans, fs_info,
> +							block_group, path2,
> +							start, key.objectid);
> +			start = key.objectid;
> +			if (key.type == BTRFS_METADATA_ITEM_KEY)
> +				start += fs_info->tree_root->nodesize;
> +			else
> +				start += key.offset;
> +		} else if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
> +			if (key.objectid != block_group->key.objectid)
> +				break;
> +		}
> +
> +		ret = btrfs_next_item(extent_root, path);
> +		if (ret < 0)
> +			goto out;
> +		if (ret)
> +			break;
> +	}
> +	ret = add_new_free_space_extent(trans, fs_info, block_group, path2,
> +					start, end);
> +	if (ret)
> +		goto out;
> +
> +out:
> +	free_excluded_extents(extent_root, block_group);
> +	btrfs_free_path(path2);
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *tree_root = fs_info->tree_root;
> +	struct btrfs_root *free_space_root;
> +	struct btrfs_block_group_cache *block_group;
> +	struct rb_node *node;
> +	int ret;
> +
> +	trans = btrfs_start_transaction(tree_root, 0);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	free_space_root = btrfs_create_tree(trans, fs_info,
> +					    BTRFS_FREE_SPACE_TREE_OBJECTID);
> +	if (IS_ERR(free_space_root)) {
> +		ret = PTR_ERR(free_space_root);
> +		btrfs_abort_transaction(trans, tree_root, ret);
> +		return ret;
> +	}
> +	fs_info->free_space_root = free_space_root;
> +
> +	node = rb_first(&fs_info->block_group_cache_tree);
> +	while (node) {
> +		block_group = rb_entry(node, struct btrfs_block_group_cache,
> +				       cache_node);
> +		ret = populate_free_space_tree(trans, fs_info, block_group);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, tree_root, ret);
> +			return ret;
> +		}
> +		node = rb_next(node);
> +	}
> +
> +	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
> +
> +	ret = btrfs_commit_transaction(trans, tree_root);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +int add_block_group_free_space(struct btrfs_trans_handle *trans,
> +			       struct btrfs_fs_info *fs_info,
> +			       struct btrfs_block_group_cache *block_group)
> +{
> +	struct btrfs_path *path;
> +	int ret;
> +
> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> +		return 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	ret = add_new_free_space_info(trans, fs_info, block_group, path);
> +	if (ret)
> +		goto out;
> +
> +	ret = add_new_free_space_extent(trans, fs_info, block_group, path,
> +					block_group->key.objectid,
> +					block_group->key.objectid +
> +					block_group->key.offset);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +int remove_block_group_free_space(struct btrfs_trans_handle *trans,
> +				  struct btrfs_fs_info *fs_info,
> +				  struct btrfs_block_group_cache *block_group)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key, found_key;
> +	struct extent_buffer *leaf;
> +	u64 start, end;
> +	int done = 0, nr;
> +	int ret;
> +
> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> +		return 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	start = block_group->key.objectid;
> +	end = block_group->key.objectid + block_group->key.offset;
> +
> +	key.objectid = end - 1;
> +	key.type = (u8)-1;
> +	key.offset = (u64)-1;
> +
> +	while (!done) {
> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> +		if (ret)
> +			goto out;
> +
> +		leaf = path->nodes[0];
> +		nr = 0;
> +		path->slots[0]++;
> +		while (path->slots[0] > 0) {
> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> +
> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> +				ASSERT(found_key.objectid == block_group->key.objectid);
> +				ASSERT(found_key.offset == block_group->key.offset);
> +				done = 1;
> +				nr++;
> +				path->slots[0]--;
> +				break;
> +			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY ||
> +				   found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
> +				ASSERT(found_key.objectid >= start);
> +				ASSERT(found_key.objectid < end);
> +				ASSERT(found_key.objectid + found_key.offset <= end);
> +				nr++;
> +				path->slots[0]--;
> +			} else {
> +				ASSERT(0);
> +			}
> +		}
> +
> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> +		if (ret)
> +			goto out;
> +		btrfs_release_path(path);
> +	}
> +
> +	ret = 0;
> +out:
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +static int load_free_space_bitmaps(struct btrfs_fs_info *fs_info,
> +				   struct btrfs_block_group_cache *block_group,
> +				   struct btrfs_path *path,
> +				   u32 expected_extent_count)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key;
> +	int prev_bit = 0, bit;
> +	/* Initialize to silence GCC. */
> +	u64 extent_start = 0;
> +	u64 end, offset;
> +	u32 extent_count = 0;
> +	int ret;
> +
> +	end = block_group->key.objectid + block_group->key.offset;
> +
> +	while (1) {
> +		ret = btrfs_next_item(root, path);
> +		if (ret < 0)
> +			goto out;
> +		if (ret)
> +			break;
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
> +			break;
> +
> +		ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> +		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
> +
> +		offset = key.objectid;
> +		while (offset < key.objectid + key.offset) {
> +			bit = free_space_test_bit(block_group, path, offset);
> +			if (prev_bit == 0 && bit == 1) {
> +				extent_start = offset;
> +			} else if (prev_bit == 1 && bit == 0) {
> +				add_new_free_space(block_group, fs_info,
> +						   extent_start, offset);
> +				extent_count++;
> +			}
> +			prev_bit = bit;
> +			offset += block_group->sectorsize;
> +		}
> +	}
> +	if (prev_bit == 1) {
> +		add_new_free_space(block_group, fs_info, extent_start, end);
> +		extent_count++;
> +	}
> +
> +	if (extent_count != expected_extent_count) {
> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> +			  block_group->key.objectid, extent_count,
> +			  expected_extent_count);
> +		ASSERT(0);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +static int load_free_space_extents(struct btrfs_fs_info *fs_info,
> +				   struct btrfs_block_group_cache *block_group,
> +				   struct btrfs_path *path,
> +				   u32 expected_extent_count)
> +{
> +	struct btrfs_root *root = fs_info->free_space_root;
> +	struct btrfs_key key;
> +	u64 end;
> +	u32 extent_count = 0;
> +	int ret;
> +
> +	end = block_group->key.objectid + block_group->key.offset;
> +
> +	while (1) {
> +		ret = btrfs_next_item(root, path);
> +		if (ret < 0)
> +			goto out;
> +		if (ret)
> +			break;
> +
> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> +
> +		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
> +			break;
> +
> +		ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
> +		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
> +
> +		add_new_free_space(block_group, fs_info, key.objectid,
> +				   key.objectid + key.offset);
> +		extent_count++;
> +	}
> +
> +	if (extent_count != expected_extent_count) {
> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> +			  block_group->key.objectid, extent_count,
> +			  expected_extent_count);
> +		ASSERT(0);
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +int load_free_space_tree(struct btrfs_fs_info *fs_info,
> +			 struct btrfs_block_group_cache *block_group)
> +{
> +	struct btrfs_free_space_info *info;
> +	struct btrfs_path *path;
> +	u32 extent_count, flags;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Just like caching_thread() doesn't want to deadlock on the extent
> +	 * tree, we don't want to deadlock on the free space tree.
> +	 */
> +	path->skip_locking = 1;
> +	path->search_commit_root = 1;
> +	path->reada = 1;
> +
> +	down_read(&fs_info->commit_root_sem);
> +
> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> +	if (IS_ERR(info)) {
> +		ret = PTR_ERR(info);
> +		goto out;
> +	}
> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
> +	flags = btrfs_free_space_flags(path->nodes[0], info);
> +
> +	/*
> +	 * We left path pointing to the free space info item, so now
> +	 * load_free_space_foo can just iterate through the free space tree from
> +	 * there.
> +	 */
> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> +		ret = load_free_space_bitmaps(fs_info, block_group, path,
> +					      extent_count);
> +	} else {
> +		ret = load_free_space_extents(fs_info, block_group, path,
> +					      extent_count);
> +	}
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;

This bit isn't needed; just fall through.

> +out:
> +	up_read(&fs_info->commit_root_sem);
> +	btrfs_free_path(path);
> +	return ret;
> +}

So actually there are a lot of places in here where you need to abort
the transaction if there is a failure.  If we can't update the free
space tree for whatever reason, and we aren't a developer who will
immediately panic the box, we need to make sure to abort so the fs
stays consistent.  The only place you don't have to do this is when
loading the free space tree.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree
  2015-09-01 19:05 ` [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
@ 2015-09-01 19:48   ` Josef Bacik
  2015-09-02  4:42     ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:48 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:05 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The free space tree is updated in tandem with the extent tree. There are
> only a handful of places where we need to hook in:
>
> 1. Block group creation
> 2. Block group deletion
> 3. Delayed refs (extent creation and deletion)
> 4. Block group caching
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>   fs/btrfs/extent-tree.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 70 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 37179a569f40..3f10df3932f0 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -33,6 +33,7 @@
>   #include "raid56.h"
>   #include "locking.h"
>   #include "free-space-cache.h"
> +#include "free-space-tree.h"
>   #include "math.h"
>   #include "sysfs.h"
>   #include "qgroup.h"
> @@ -589,7 +590,41 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>   	cache->cached = BTRFS_CACHE_FAST;
>   	spin_unlock(&cache->lock);
>
> -	if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
> +	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
> +		if (load_cache_only) {
> +			spin_lock(&cache->lock);
> +			cache->caching_ctl = NULL;
> +			cache->cached = BTRFS_CACHE_NO;
> +			spin_unlock(&cache->lock);
> +			wake_up(&caching_ctl->wait);
> +		} else {
> +			mutex_lock(&caching_ctl->mutex);
> +			ret = load_free_space_tree(fs_info, cache);
> +			if (ret) {
> +				btrfs_warn(fs_info, "failed to load free space tree for %llu: %d",
> +					   cache->key.objectid, ret);
> +				spin_lock(&cache->lock);
> +				cache->caching_ctl = NULL;
> +				cache->cached = BTRFS_CACHE_ERROR;
> +				spin_unlock(&cache->lock);
> +				goto tree_out;
> +			}
> +
> +			spin_lock(&cache->lock);
> +			cache->caching_ctl = NULL;
> +			cache->cached = BTRFS_CACHE_FINISHED;
> +			cache->last_byte_to_unpin = (u64)-1;
> +			caching_ctl->progress = (u64)-1;
> +			spin_unlock(&cache->lock);
> +			mutex_unlock(&caching_ctl->mutex);
> +
> +tree_out:
> +			wake_up(&caching_ctl->wait);
> +			put_caching_control(caching_ctl);
> +			free_excluded_extents(fs_info->extent_root, cache);
> +			return 0;
> +		}
> +	} else if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
>   		mutex_lock(&caching_ctl->mutex);

So the reason we have this load_cache_only thing is that the free
space cache could be loaded almost instantaneously, since it was
contiguously allocated.  That isn't the case with the free space
tree, and although it is better than the no-space-cache way of doing
things, we are still going to incur a bit of latency seeking through
a large free space tree.  So break this out and make the caching
kthread do either the old-style load or the free space tree load.
Then you can use the add-free-space helpers, which will wake up
anybody waiting on allocations, and you incur less direct latency.
Thanks,

Josef



* Re: [PATCH 6/6] Btrfs: add free space tree mount option
  2015-09-01 19:05 ` [PATCH 6/6] Btrfs: add free space tree mount option Omar Sandoval
@ 2015-09-01 19:49   ` Josef Bacik
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 19:49 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs; +Cc: Omar Sandoval

On 09/01/2015 03:05 PM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> Now we can finally hook up everything so we can actually use free space
> tree. On the first mount with the free_space_tree mount option, the free
> space tree will be created and the FREE_SPACE_TREE read-only compat bit
> will be set. Any time the filesystem is mounted from then on, we will
> use the free space tree.
>
> Having both the free space cache and free space trees enabled is
> nonsense, so we don't allow that to happen. Since mkfs sets the
> superblock cache generation to -1, this means that the filesystem will
> have to be mounted with nospace_cache,free_space_tree to create the free
> space trees on first mount. Once the FREE_SPACE_TREE bit is set, the
> cache generation is ignored when mounting. This is all a little more
> complicated than would be ideal, but at some point we can presumably
> make the free space tree the default and stop setting the cache
> generation in mkfs.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef



* Re: [PATCH 4/6] Btrfs: implement the free space B-tree
  2015-09-01 19:44   ` Josef Bacik
@ 2015-09-01 20:06     ` Omar Sandoval
  2015-09-01 20:08       ` Josef Bacik
  0 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-01 20:06 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Omar Sandoval

On Tue, Sep 01, 2015 at 03:44:27PM -0400, Josef Bacik wrote:
> On 09/01/2015 03:13 PM, Omar Sandoval wrote:
> >From: Omar Sandoval <osandov@fb.com>
> >
> >The free space cache has turned out to be a scalability bottleneck on
> >large, busy filesystems. When the cache for a lot of block groups needs
> >to be written out, we can get extremely long commit times; if this
> >happens in the critical section, things are especially bad because we
> >block new transactions from happening.
> >
> >The main problem with the free space cache is that it has to be written
> >out in its entirety and is managed in an ad hoc fashion. Using a B-tree
> >to store free space fixes this: updates can be done as needed and we get
> >all of the benefits of using a B-tree: checksumming, RAID handling,
> >well-understood behavior.
> >
> >With the free space tree, we get commit times that are about the same as
> >the no cache case with load times slower than the free space cache case
> >but still much faster than the no cache case. Free space is represented
> >with extents until it becomes more space-efficient to use bitmaps,
> >giving us similar space overhead to the free space cache.
> >
> >The operations on the free space tree are: adding and removing free
> >space, handling the creation and deletion of block groups, and loading
> >the free space for a block group. We can also create the free space tree
> >by walking the extent tree.
> >
> >Signed-off-by: Omar Sandoval <osandov@fb.com>
> >---
> >  fs/btrfs/Makefile          |    2 +-
> >  fs/btrfs/ctree.h           |   25 +-
> >  fs/btrfs/extent-tree.c     |   15 +-
> >  fs/btrfs/free-space-tree.c | 1468 ++++++++++++++++++++++++++++++++++++++++++++
> >  fs/btrfs/free-space-tree.h |   39 ++
> >  5 files changed, 1541 insertions(+), 8 deletions(-)
> >  create mode 100644 fs/btrfs/free-space-tree.c
> >  create mode 100644 fs/btrfs/free-space-tree.h
> >
> >diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> >index 6d1d0b93b1aa..766169709146 100644
> >--- a/fs/btrfs/Makefile
> >+++ b/fs/btrfs/Makefile
> >@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
> >  	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
> >  	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
> >  	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
> >-	   uuid-tree.o props.o hash.o
> >+	   uuid-tree.o props.o hash.o free-space-tree.o
> >
> >  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
> >  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> >diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >index 34a81a79f5b6..d49181d35f08 100644
> >--- a/fs/btrfs/ctree.h
> >+++ b/fs/btrfs/ctree.h
> >@@ -1299,8 +1299,20 @@ struct btrfs_block_group_cache {
> >  	u64 delalloc_bytes;
> >  	u64 bytes_super;
> >  	u64 flags;
> >-	u64 sectorsize;
> >  	u64 cache_generation;
> >+	u32 sectorsize;
> >+
> >+	/*
> >+	 * If the free space extent count exceeds this number, convert the block
> >+	 * group to bitmaps.
> >+	 */
> >+	u32 bitmap_high_thresh;
> >+
> >+	/*
> >+	 * If the free space extent count drops below this number, convert the
> >+	 * block group back to extents.
> >+	 */
> >+	u32 bitmap_low_thresh;
> >
> >  	/*
> >  	 * It is just used for the delayed data space allocation because
> >@@ -1356,6 +1368,9 @@ struct btrfs_block_group_cache {
> >  	struct list_head io_list;
> >
> >  	struct btrfs_io_ctl io_ctl;
> >+
> >+	/* Lock for free space tree operations. */
> >+	struct mutex free_space_lock;
> >  };
> >
> >  /* delayed seq elem */
> >@@ -1407,6 +1422,7 @@ struct btrfs_fs_info {
> >  	struct btrfs_root *csum_root;
> >  	struct btrfs_root *quota_root;
> >  	struct btrfs_root *uuid_root;
> >+	struct btrfs_root *free_space_root;
> >
> >  	/* the log root tree is a directory of all the other log roots */
> >  	struct btrfs_root *log_root_tree;
> >@@ -3556,6 +3572,13 @@ void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
> >  void check_system_chunk(struct btrfs_trans_handle *trans,
> >  			struct btrfs_root *root,
> >  			const u64 type);
> >+void free_excluded_extents(struct btrfs_root *root,
> >+			   struct btrfs_block_group_cache *cache);
> >+int exclude_super_stripes(struct btrfs_root *root,
> >+			  struct btrfs_block_group_cache *cache);
> >+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> >+		       struct btrfs_fs_info *info, u64 start, u64 end);
> >+
> >  /* ctree.c */
> >  int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
> >  		     int level, int *slot);
> >diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >index 07204bf601ed..37179a569f40 100644
> >--- a/fs/btrfs/extent-tree.c
> >+++ b/fs/btrfs/extent-tree.c
> >@@ -237,8 +237,8 @@ static int add_excluded_extent(struct btrfs_root *root,
> >  	return 0;
> >  }
> >
> >-static void free_excluded_extents(struct btrfs_root *root,
> >-				  struct btrfs_block_group_cache *cache)
> >+void free_excluded_extents(struct btrfs_root *root,
> >+			   struct btrfs_block_group_cache *cache)
> >  {
> >  	u64 start, end;
> >
> >@@ -251,14 +251,16 @@ static void free_excluded_extents(struct btrfs_root *root,
> >  			  start, end, EXTENT_UPTODATE, GFP_NOFS);
> >  }
> >
> >-static int exclude_super_stripes(struct btrfs_root *root,
> >-				 struct btrfs_block_group_cache *cache)
> >+int exclude_super_stripes(struct btrfs_root *root,
> >+			  struct btrfs_block_group_cache *cache)
> >  {
> >  	u64 bytenr;
> >  	u64 *logical;
> >  	int stripe_len;
> >  	int i, nr, ret;
> >
> >+	cache->bytes_super = 0;
> >+
> >  	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
> >  		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
> >  		cache->bytes_super += stripe_len;
> >@@ -337,8 +339,8 @@ static void put_caching_control(struct btrfs_caching_control *ctl)
> >   * we need to check the pinned_extents for any extents that can't be used yet
> >   * since their free space will be released as soon as the transaction commits.
> >   */
> >-static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> >-			      struct btrfs_fs_info *info, u64 start, u64 end)
> >+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
> >+		       struct btrfs_fs_info *info, u64 start, u64 end)
> >  {
> >  	u64 extent_start, extent_end, size, total_added = 0;
> >  	int ret;
> >@@ -9281,6 +9283,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
> >  	INIT_LIST_HEAD(&cache->io_list);
> >  	btrfs_init_free_space_ctl(cache);
> >  	atomic_set(&cache->trimming, 0);
> >+	mutex_init(&cache->free_space_lock);
> >
> >  	return cache;
> >  }
> >diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> >new file mode 100644
> >index 000000000000..bbb4f731f948
> >--- /dev/null
> >+++ b/fs/btrfs/free-space-tree.c
> >@@ -0,0 +1,1468 @@
> >+/*
> >+ * Copyright (C) 2015 Facebook.  All rights reserved.
> >+ *
> >+ * This program is free software; you can redistribute it and/or
> >+ * modify it under the terms of the GNU General Public
> >+ * License v2 as published by the Free Software Foundation.
> >+ *
> >+ * This program is distributed in the hope that it will be useful,
> >+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >+ * General Public License for more details.
> >+ *
> >+ * You should have received a copy of the GNU General Public
> >+ * License along with this program; if not, write to the
> >+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> >+ * Boston, MA 021110-1307, USA.
> >+ */
> >+
> >+#include <linux/kernel.h>
> >+#include <linux/vmalloc.h>
> >+#include "ctree.h"
> >+#include "disk-io.h"
> >+#include "locking.h"
> >+#include "free-space-tree.h"
> >+#include "transaction.h"
> >+
> >+/*
> >+ * The default size for new free space bitmap items. The last bitmap in a block
> >+ * group may be truncated, and none of the free space tree code assumes that
> >+ * existing bitmaps are this size.
> >+ */
> >+#define BTRFS_FREE_SPACE_BITMAP_SIZE 256
> >+#define BTRFS_FREE_SPACE_BITMAP_BITS (BTRFS_FREE_SPACE_BITMAP_SIZE * BITS_PER_BYTE)
> >+
> >+void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache)
> >+{
> >+	u32 bitmap_range;
> >+	size_t bitmap_size;
> >+	u64 num_bitmaps, total_bitmap_size;
> >+
> >+	/*
> >+	 * We convert to bitmaps when the disk space required for using extents
> >+	 * exceeds that required for using bitmaps.
> >+	 */
> >+	bitmap_range = cache->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
> >+	num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
> >+			      bitmap_range);
> >+	bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
> >+	total_bitmap_size = num_bitmaps * bitmap_size;
> >+	cache->bitmap_high_thresh = div_u64(total_bitmap_size,
> >+					    sizeof(struct btrfs_item));
> >+
> >+	/*
> >+	 * We allow for a small buffer between the high threshold and low
> >+	 * threshold to avoid thrashing back and forth between the two formats.
> >+	 */
> >+	if (cache->bitmap_high_thresh > 100)
> >+		cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
> >+	else
> >+		cache->bitmap_low_thresh = 0;
> >+}
> >+
> >+static int add_new_free_space_info(struct btrfs_trans_handle *trans,
> >+				   struct btrfs_fs_info *fs_info,
> >+				   struct btrfs_block_group_cache *block_group,
> >+				   struct btrfs_path *path)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_free_space_info *info;
> >+	struct btrfs_key key;
> >+	struct extent_buffer *leaf;
> >+	int ret;
> >+
> >+	key.objectid = block_group->key.objectid;
> >+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
> >+	key.offset = block_group->key.offset;
> >+
> >+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*info));
> >+	if (ret)
> >+		goto out;
> >+
> >+	leaf = path->nodes[0];
> >+	info = btrfs_item_ptr(leaf, path->slots[0],
> >+			      struct btrfs_free_space_info);
> >+	btrfs_set_free_space_extent_count(leaf, info, 0);
> >+	btrfs_set_free_space_flags(leaf, info, 0);
> >+	btrfs_mark_buffer_dirty(leaf);
> >+
> >+	ret = 0;
> >+out:
> >+	btrfs_release_path(path);
> >+	return ret;
> >+}
> >+
> >+static struct btrfs_free_space_info *
> >+search_free_space_info(struct btrfs_trans_handle *trans,
> >+		       struct btrfs_fs_info *fs_info,
> >+		       struct btrfs_block_group_cache *block_group,
> >+		       struct btrfs_path *path, int cow)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key;
> >+	int ret;
> >+
> >+	key.objectid = block_group->key.objectid;
> >+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
> >+	key.offset = block_group->key.offset;
> >+
> >+	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
> >+	if (ret < 0)
> >+		return ERR_PTR(ret);
> >+	if (ret != 0) {
> >+		btrfs_warn(fs_info, "missing free space info for %llu\n",
> >+			   block_group->key.objectid);
> >+		ASSERT(0);
> >+		return ERR_PTR(-ENOENT);
> >+	}
> >+
> >+	return btrfs_item_ptr(path->nodes[0], path->slots[0],
> >+			      struct btrfs_free_space_info);
> >+}
> >+
> >+/*
> >+ * btrfs_search_slot() but we're looking for the greatest key less than the
> >+ * passed key.
> >+ */
> >+static int btrfs_search_prev_slot(struct btrfs_trans_handle *trans,
> >+				  struct btrfs_root *root,
> >+				  struct btrfs_key *key, struct btrfs_path *p,
> >+				  int ins_len, int cow)
> >+{
> >+	int ret;
> >+
> >+	ret = btrfs_search_slot(trans, root, key, p, ins_len, cow);
> >+	if (ret < 0)
> >+		return ret;
> >+
> >+	if (ret == 0) {
> >+		ASSERT(0);
> >+		return -EIO;
> >+	}
> >+
> >+	if (p->slots[0] == 0) {
> >+		ASSERT(0);
> >+		return -EIO;
> >+	}
> >+	p->slots[0]--;
> >+
> >+	return 0;
> >+}
> >+
> >+static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
> >+{
> >+	return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
> >+}
> >+
> >+static unsigned long *alloc_bitmap(u32 bitmap_size)
> >+{
> >+	return __vmalloc(bitmap_size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
> >+			 PAGE_KERNEL);
> >+}
> >+
> >+static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
> >+					 struct btrfs_fs_info *fs_info,
> >+					 struct btrfs_block_group_cache *block_group,
> >+					 struct btrfs_path *path)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_free_space_info *info;
> >+	struct btrfs_key key, found_key;
> >+	struct extent_buffer *leaf;
> >+	unsigned long *bitmap;
> >+	char *bitmap_cursor;
> >+	u64 start, end;
> >+	u64 bitmap_range, i;
> >+	u32 bitmap_size, flags, expected_extent_count;
> >+	u32 extent_count = 0;
> >+	int done = 0, nr;
> >+	int ret;
> >+
> >+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
> >+					     block_group->sectorsize);
> >+	bitmap = alloc_bitmap(bitmap_size);
> >+	if (!bitmap)
> >+		return -ENOMEM;
> >+
> >+	start = block_group->key.objectid;
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+
> >+	key.objectid = end - 1;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	while (!done) {
> >+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+		if (ret)
> >+			goto out;
> >+
> >+		leaf = path->nodes[0];
> >+		nr = 0;
> >+		path->slots[0]++;
> >+		while (path->slots[0] > 0) {
> >+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> >+
> >+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> >+				ASSERT(found_key.objectid == block_group->key.objectid);
> >+				ASSERT(found_key.offset == block_group->key.offset);
> >+				done = 1;
> >+				break;
> >+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
> >+				u64 first, last;
> >+
> >+				ASSERT(found_key.objectid >= start);
> >+				ASSERT(found_key.objectid < end);
> >+				ASSERT(found_key.objectid + found_key.offset <= end);
> >+
> >+				first = div_u64(found_key.objectid - start,
> >+						block_group->sectorsize);
> >+				last = div_u64(found_key.objectid + found_key.offset - start,
> >+					       block_group->sectorsize);
> >+				bitmap_set(bitmap, first, last - first);
> >+
> >+				extent_count++;
> >+				nr++;
> >+				path->slots[0]--;
> >+			} else {
> >+				ASSERT(0);
> >+			}
> >+		}
> >+
> >+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> >+		if (ret)
> 
> We could have deleted stuff previously so we need to abort here as well.
> 
> >+			goto out;
> >+		btrfs_release_path(path);
> >+	}
> >+
> >+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	leaf = path->nodes[0];
> >+	flags = btrfs_free_space_flags(leaf, info);
> >+	flags |= BTRFS_FREE_SPACE_USING_BITMAPS;
> >+	btrfs_set_free_space_flags(leaf, info, flags);
> >+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
> >+	btrfs_mark_buffer_dirty(leaf);
> >+	btrfs_release_path(path);
> >+
> >+	if (extent_count != expected_extent_count) {
> >+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> >+			  block_group->key.objectid, extent_count,
> >+			  expected_extent_count);
> 
> We should also abort the transaction here since we will have already deleted
> the normal entries and thus have a corrupted fs if we are allowed to
> continue.
> 
> >+		ASSERT(0);
> >+		ret = -EIO;
> >+		goto out;
> >+	}
> >+
> >+	bitmap_cursor = (char *)bitmap;
> >+	bitmap_range = block_group->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
> >+	i = start;
> >+	while (i < end) {
> >+		unsigned long ptr;
> >+		u64 extent_size;
> >+		u32 data_size;
> >+
> >+		extent_size = min(end - i, bitmap_range);
> >+		data_size = free_space_bitmap_size(extent_size,
> >+						   block_group->sectorsize);
> >+
> >+		key.objectid = i;
> >+		key.type = BTRFS_FREE_SPACE_BITMAP_KEY;
> >+		key.offset = extent_size;
> >+
> >+		ret = btrfs_insert_empty_item(trans, root, path, &key,
> >+					      data_size);
> >+		if (ret)
> 
> Need to abort here as well.
> 
> >+			goto out;
> >+
> >+		leaf = path->nodes[0];
> >+		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> >+		write_extent_buffer(leaf, bitmap_cursor, ptr,
> >+				    data_size);
> >+		btrfs_mark_buffer_dirty(leaf);
> >+		btrfs_release_path(path);
> >+
> >+		i += extent_size;
> >+		bitmap_cursor += data_size;
> >+	}
> >+
> >+	ret = 0;
> >+out:
> 
> Maybe have the if (ret) btrfs_abort_transaction() here.
> 
> >+	vfree(bitmap);
> >+	return ret;
> >+}
> >+
> >+static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
> >+					 struct btrfs_fs_info *fs_info,
> >+					 struct btrfs_block_group_cache *block_group,
> >+					 struct btrfs_path *path)
> >+{
> 
> You need to abort in the appropriate places here as well.
> 
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_free_space_info *info;
> >+	struct btrfs_key key, found_key;
> >+	struct extent_buffer *leaf;
> >+	unsigned long *bitmap;
> >+	u64 start, end;
> >+	/* Initialize to silence GCC. */
> >+	u64 extent_start = 0;
> >+	u64 offset;
> >+	u32 bitmap_size, flags, expected_extent_count;
> >+	int prev_bit = 0, bit, bitnr;
> >+	u32 extent_count = 0;
> >+	int done = 0, nr;
> >+	int ret;
> >+
> >+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
> >+					     block_group->sectorsize);
> >+	bitmap = alloc_bitmap(bitmap_size);
> >+	if (!bitmap)
> >+		return -ENOMEM;
> >+
> >+	start = block_group->key.objectid;
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+
> >+	key.objectid = end - 1;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	while (!done) {
> >+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+		if (ret)
> >+			goto out;
> >+
> >+		leaf = path->nodes[0];
> >+		nr = 0;
> >+		path->slots[0]++;
> >+		while (path->slots[0] > 0) {
> >+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> >+
> >+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> >+				ASSERT(found_key.objectid == block_group->key.objectid);
> >+				ASSERT(found_key.offset == block_group->key.offset);
> >+				done = 1;
> >+				break;
> >+			} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
> >+				unsigned long ptr;
> >+				char *bitmap_cursor;
> >+				u32 bitmap_pos, data_size;
> >+
> >+				ASSERT(found_key.objectid >= start);
> >+				ASSERT(found_key.objectid < end);
> >+				ASSERT(found_key.objectid + found_key.offset <= end);
> >+
> >+				bitmap_pos = div_u64(found_key.objectid - start,
> >+						     block_group->sectorsize *
> >+						     BITS_PER_BYTE);
> >+				bitmap_cursor = ((char *)bitmap) + bitmap_pos;
> >+				data_size = free_space_bitmap_size(found_key.offset,
> >+								   block_group->sectorsize);
> >+
> >+				ptr = btrfs_item_ptr_offset(leaf, path->slots[0] - 1);
> >+				read_extent_buffer(leaf, bitmap_cursor, ptr,
> >+						   data_size);
> >+
> >+				nr++;
> >+				path->slots[0]--;
> >+			} else {
> >+				ASSERT(0);
> >+			}
> >+		}
> >+
> >+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> >+		if (ret)
> >+			goto out;
> >+		btrfs_release_path(path);
> >+	}
> >+
> >+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	leaf = path->nodes[0];
> >+	flags = btrfs_free_space_flags(leaf, info);
> >+	flags &= ~BTRFS_FREE_SPACE_USING_BITMAPS;
> >+	btrfs_set_free_space_flags(leaf, info, flags);
> >+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
> >+	btrfs_mark_buffer_dirty(leaf);
> >+	btrfs_release_path(path);
> >+
> >+	offset = start;
> >+	bitnr = 0;
> >+	while (offset < end) {
> >+		bit = !!test_bit(bitnr, bitmap);
> >+		if (prev_bit == 0 && bit == 1) {
> >+			extent_start = offset;
> >+		} else if (prev_bit == 1 && bit == 0) {
> >+			key.objectid = extent_start;
> >+			key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> >+			key.offset = offset - extent_start;
> >+
> >+			ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> >+			if (ret)
> >+				goto out;
> >+			btrfs_release_path(path);
> >+
> >+			extent_count++;
> >+		}
> >+		prev_bit = bit;
> >+		offset += block_group->sectorsize;
> >+		bitnr++;
> >+	}
> >+	if (prev_bit == 1) {
> >+		key.objectid = extent_start;
> >+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> >+		key.offset = end - extent_start;
> >+
> >+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> >+		if (ret)
> >+			goto out;
> >+		btrfs_release_path(path);
> >+
> >+		extent_count++;
> >+	}
> >+
> >+	if (extent_count != expected_extent_count) {
> >+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> >+			  block_group->key.objectid, extent_count,
> >+			  expected_extent_count);
> >+		ASSERT(0);
> >+		ret = -EIO;
> >+		goto out;
> >+	}
> >+
> >+	ret = 0;
> >+out:
> >+	vfree(bitmap);
> >+	return ret;
> >+}
> >+
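(As an aside, the core of the conversion above is the 0->1 / 1->0 transition scan over the bitmap. A minimal userspace sketch of just that scan, with illustrative names and a plain output array instead of B-tree insertions, is below; it is not kernel code.)

```c
#include <stddef.h>

/*
 * Userspace sketch of the transition scan in convert_free_space_to_extents():
 * each bit represents one sector, a 0->1 transition starts a free extent and
 * a 1->0 transition ends it. A run still set at the end of the bitmap is
 * emitted as a final extent. All names here are illustrative, not kernel API.
 */
static size_t extract_free_extents(const unsigned char *bitmap, size_t nbits,
				   size_t *out_start, size_t *out_len,
				   size_t max)
{
	size_t extent_start = 0, count = 0;
	int prev_bit = 0;

	for (size_t i = 0; i < nbits; i++) {
		int bit = (bitmap[i / 8] >> (i % 8)) & 1;

		if (prev_bit == 0 && bit == 1) {
			extent_start = i;		/* run begins */
		} else if (prev_bit == 1 && bit == 0) {
			if (count < max) {		/* run ends */
				out_start[count] = extent_start;
				out_len[count] = i - extent_start;
			}
			count++;
		}
		prev_bit = bit;
	}
	/* Trailing run that reaches the end of the bitmap. */
	if (prev_bit == 1) {
		if (count < max) {
			out_start[count] = extent_start;
			out_len[count] = nbits - extent_start;
		}
		count++;
	}
	return count;
}
```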
> >+static int update_free_space_extent_count(struct btrfs_trans_handle *trans,
> >+					  struct btrfs_fs_info *fs_info,
> >+					  struct btrfs_block_group_cache *block_group,
> >+					  struct btrfs_path *path,
> >+					  int new_extents)
> >+{
> >+	struct btrfs_free_space_info *info;
> >+	u32 flags;
> >+	u32 extent_count;
> >+	int ret = 0;
> >+
> >+	if (new_extents == 0)
> >+		return 0;
> >+
> >+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	flags = btrfs_free_space_flags(path->nodes[0], info);
> >+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
> >+
> >+	extent_count += new_extents;
> >+	btrfs_set_free_space_extent_count(path->nodes[0], info, extent_count);
> >+	btrfs_mark_buffer_dirty(path->nodes[0]);
> >+	btrfs_release_path(path);
> >+
> >+	if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
> >+	    extent_count > block_group->bitmap_high_thresh) {
> >+		ret = convert_free_space_to_bitmaps(trans, fs_info, block_group,
> >+						    path);
> >+	} else if ((flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
> >+		   extent_count < block_group->bitmap_low_thresh) {
> >+		ret = convert_free_space_to_extents(trans, fs_info, block_group,
> >+						    path);
> >+	}
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> >+
> >+static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
> >+			       struct btrfs_path *path, u64 offset)
> >+{
> >+	struct extent_buffer *leaf;
> >+	struct btrfs_key key;
> >+	u64 found_start, found_end;
> >+	unsigned long ptr, i;
> >+
> >+	leaf = path->nodes[0];
> >+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> >+
> >+	found_start = key.objectid;
> >+	found_end = key.objectid + key.offset;
> >+	ASSERT(offset >= found_start && offset < found_end);
> >+
> >+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> >+	i = div_u64(offset - found_start, block_group->sectorsize);
> >+	return !!extent_buffer_test_bit(leaf, ptr, i);
> >+}
> >+
> >+static void free_space_set_bits(struct btrfs_block_group_cache *block_group,
> >+				struct btrfs_path *path, u64 *start, u64 *size,
> >+				int bit)
> >+{
> >+	struct extent_buffer *leaf;
> >+	struct btrfs_key key;
> >+	u64 end = *start + *size;
> >+	u64 found_start, found_end;
> >+	unsigned long ptr, first, last;
> >+
> >+	leaf = path->nodes[0];
> >+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
> >+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> >+
> >+	found_start = key.objectid;
> >+	found_end = key.objectid + key.offset;
> >+	ASSERT(*start >= found_start && *start < found_end);
> >+	ASSERT(end > found_start);
> >+
> >+	if (end > found_end)
> >+		end = found_end;
> >+
> >+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
> >+	first = div_u64(*start - found_start, block_group->sectorsize);
> >+	last = div_u64(end - found_start, block_group->sectorsize);
> >+	if (bit)
> >+		extent_buffer_bitmap_set(leaf, ptr, first, last - first);
> >+	else
> >+		extent_buffer_bitmap_clear(leaf, ptr, first, last - first);
> >+	btrfs_mark_buffer_dirty(leaf);
> >+
> >+	*size -= end - *start;
> >+	*start = end;
> >+}
> >+
> >+/*
> >+ * We can't use btrfs_next_item() in modify_free_space_bitmap() because
> >+ * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
> >+ * tree walking in btrfs_next_leaf() anyway because we know exactly what we're
> >+ * looking for.
> >+ */
> >+static int free_space_next_bitmap(struct btrfs_trans_handle *trans,
> >+				  struct btrfs_root *root, struct btrfs_path *p)
> >+{
> >+	struct btrfs_key key;
> >+
> >+	if (p->slots[0] + 1 < btrfs_header_nritems(p->nodes[0])) {
> >+		p->slots[0]++;
> >+		return 0;
> >+	}
> >+
> >+	btrfs_item_key_to_cpu(p->nodes[0], &key, p->slots[0]);
> >+	btrfs_release_path(p);
> >+
> >+	key.objectid += key.offset;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	return btrfs_search_prev_slot(trans, root, &key, p, 0, 1);
> >+}
> >+
> >+/*
> >+ * If remove is 1, then we are removing free space, thus clearing bits in the
> >+ * bitmap. If remove is 0, then we are adding free space, thus setting bits in
> >+ * the bitmap.
> >+ */
> >+static int modify_free_space_bitmap(struct btrfs_trans_handle *trans,
> >+				    struct btrfs_fs_info *fs_info,
> >+				    struct btrfs_block_group_cache *block_group,
> >+				    struct btrfs_path *path,
> >+				    u64 start, u64 size, int remove)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key;
> >+	u64 end = start + size;
> >+	u64 cur_start, cur_size;
> >+	int prev_bit, next_bit;
> >+	int new_extents;
> >+	int ret;
> >+
> >+	/*
> >+	 * Read the bit for the block immediately before the extent of space if
> >+	 * that block is within the block group.
> >+	 */
> >+	if (start > block_group->key.objectid) {
> >+		u64 prev_block = start - block_group->sectorsize;
> >+
> >+		key.objectid = prev_block;
> >+		key.type = (u8)-1;
> >+		key.offset = (u64)-1;
> >+
> >+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
> >+		if (ret)
> >+			goto out;
> >+
> >+		prev_bit = free_space_test_bit(block_group, path, prev_block);
> >+
> >+		/* The previous block may have been in the previous bitmap. */
> >+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+		if (start >= key.objectid + key.offset) {
> >+			ret = free_space_next_bitmap(trans, root, path);
> >+			if (ret)
> >+				goto out;
> >+		}
> >+	} else {
> >+		key.objectid = start;
> >+		key.type = (u8)-1;
> >+		key.offset = (u64)-1;
> >+
> >+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
> >+		if (ret)
> >+			goto out;
> >+
> >+		prev_bit = -1;
> >+	}
> >+
> >+	/*
> >+	 * Iterate over all of the bitmaps overlapped by the extent of space,
> >+	 * clearing/setting bits as required.
> >+	 */
> >+	cur_start = start;
> >+	cur_size = size;
> >+	while (1) {
> >+		free_space_set_bits(block_group, path, &cur_start, &cur_size,
> >+				    !remove);
> >+		if (cur_size == 0)
> >+			break;
> >+		ret = free_space_next_bitmap(trans, root, path);
> >+		if (ret)
> >+			goto out;
> >+	}
> >+
> >+	/*
> >+	 * Read the bit for the block immediately after the extent of space if
> >+	 * that block is within the block group.
> >+	 */
> >+	if (end < block_group->key.objectid + block_group->key.offset) {
> >+		/* The next block may be in the next bitmap. */
> >+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+		if (end >= key.objectid + key.offset) {
> >+			ret = free_space_next_bitmap(trans, root, path);
> >+			if (ret)
> >+				goto out;
> >+		}
> >+
> >+		next_bit = free_space_test_bit(block_group, path, end);
> >+	} else {
> >+		next_bit = -1;
> >+	}
> >+
> >+	if (remove) {
> >+		new_extents = -1;
> >+		if (prev_bit == 1) {
> >+			/* Leftover on the left. */
> >+			new_extents++;
> >+		}
> >+		if (next_bit == 1) {
> >+			/* Leftover on the right. */
> >+			new_extents++;
> >+		}
> >+	} else {
> >+		new_extents = 1;
> >+		if (prev_bit == 1) {
> >+			/* Merging with neighbor on the left. */
> >+			new_extents--;
> >+		}
> >+		if (next_bit == 1) {
> >+			/* Merging with neighbor on the right. */
> >+			new_extents--;
> >+		}
> >+	}
> >+
> >+	btrfs_release_path(path);
> >+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> >+					     new_extents);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> >+
> >+static int remove_free_space_extent(struct btrfs_trans_handle *trans,
> >+				    struct btrfs_fs_info *fs_info,
> >+				    struct btrfs_block_group_cache *block_group,
> >+				    struct btrfs_path *path,
> >+				    u64 start, u64 size)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key;
> >+	u64 found_start, found_end;
> >+	u64 end = start + size;
> >+	int new_extents = -1;
> >+	int ret;
> >+
> >+	key.objectid = start;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+	if (ret)
> >+		goto out;
> >+
> >+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+	ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
> >+
> >+	found_start = key.objectid;
> >+	found_end = key.objectid + key.offset;
> >+	ASSERT(start >= found_start && end <= found_end);
> >+
> >+	/*
> >+	 * Okay, now that we've found the free space extent which contains the
> >+	 * free space that we are removing, there are four cases:
> >+	 *
> >+	 * 1. We're using the whole extent: delete the key we found and
> >+	 * decrement the free space extent count.
> >+	 * 2. We are using part of the extent starting at the beginning: delete
> >+	 * the key we found and insert a new key representing the leftover at
> >+	 * the end. There is no net change in the number of extents.
> >+	 * 3. We are using part of the extent ending at the end: delete the key
> >+	 * we found and insert a new key representing the leftover at the
> >+	 * beginning. There is no net change in the number of extents.
> >+	 * 4. We are using part of the extent in the middle: delete the key we
> >+	 * found and insert two new keys representing the leftovers on each
> >+	 * side. Where we used to have one extent, we now have two, so increment
> >+	 * the extent count. We may need to convert the block group to bitmaps
> >+	 * as a result.
> >+	 */
> >+
> >+	/* Delete the existing key (cases 1-4). */
> >+	ret = btrfs_del_item(trans, root, path);
> >+	if (ret)
> >+		goto out;
> >+
> >+	/* Add a key for leftovers at the beginning (cases 3 and 4). */
> >+	if (start > found_start) {
> >+		key.objectid = found_start;
> >+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> >+		key.offset = start - found_start;
> >+
> >+		btrfs_release_path(path);
> >+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> >+		if (ret)
> >+			goto out;
> >+		new_extents++;
> >+	}
> >+
> >+	/* Add a key for leftovers at the end (cases 2 and 4). */
> >+	if (end < found_end) {
> >+		key.objectid = end;
> >+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> >+		key.offset = found_end - end;
> >+
> >+		btrfs_release_path(path);
> >+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
> >+		if (ret)
> >+			goto out;
> >+		new_extents++;
> >+	}
> >+
> >+	btrfs_release_path(path);
> >+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> >+					     new_extents);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> 
> A sanity test would be good for this.
> 
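Agreed that a sanity test would help here. The four removal cases reduce to a pure case analysis, which is easy to unit-test in isolation; a hedged userspace sketch (struct and names are illustrative, not the kernel types) might look like:

```c
/*
 * Sketch of the case analysis in remove_free_space_extent(): removing
 * [start, end) from a containing free extent [found_start, found_end)
 * deletes the original key and inserts 0, 1, or 2 leftover keys. The net
 * change in the extent count is the return value minus one.
 */
struct extent { unsigned long long start, len; };

static int remove_cases(unsigned long long found_start,
			unsigned long long found_end,
			unsigned long long start, unsigned long long end,
			struct extent left[2])
{
	int n = 0;

	/* Leftover before the removed range (cases 3 and 4). */
	if (start > found_start) {
		left[n].start = found_start;
		left[n].len = start - found_start;
		n++;
	}
	/* Leftover after the removed range (cases 2 and 4). */
	if (end < found_end) {
		left[n].start = end;
		left[n].len = found_end - end;
		n++;
	}
	return n;	/* net extent-count change is n - 1 */
}
```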
> >+
> >+int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
> >+				struct btrfs_fs_info *fs_info,
> >+				u64 start, u64 size)
> >+{
> >+	struct btrfs_block_group_cache *block_group;
> >+	struct btrfs_free_space_info *info;
> >+	struct btrfs_path *path;
> >+	u32 flags;
> >+	int ret;
> >+
> >+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> >+		return 0;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+
> >+	block_group = btrfs_lookup_block_group(fs_info, start);
> >+	if (!block_group) {
> >+		ASSERT(0);
> >+		ret = -ENOENT;
> >+		goto out_nobg;
> >+	}
> >+
> >+	mutex_lock(&block_group->free_space_lock);
> >+
> >+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	flags = btrfs_free_space_flags(path->nodes[0], info);
> >+	btrfs_release_path(path);
> >+
> >+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> >+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
> >+					       path, start, size, 1);
> >+	} else {
> >+		ret = remove_free_space_extent(trans, fs_info, block_group,
> >+					       path, start, size);
> >+	}
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	mutex_unlock(&block_group->free_space_lock);
> >+	btrfs_put_block_group(block_group);
> >+out_nobg:
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> >+
> >+static int add_free_space_extent(struct btrfs_trans_handle *trans,
> >+				 struct btrfs_fs_info *fs_info,
> >+				 struct btrfs_block_group_cache *block_group,
> >+				 struct btrfs_path *path,
> >+				 u64 start, u64 size)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key, new_key;
> >+	u64 found_start, found_end;
> >+	u64 end = start + size;
> >+	int new_extents = 1;
> >+	int ret;
> >+
> >+	/*
> >+	 * We are adding a new extent of free space, but we need to merge
> >+	 * extents. There are four cases here:
> >+	 *
> >+	 * 1. The new extent does not have any immediate neighbors to merge
> >+	 * with: add the new key and increment the free space extent count. We
> >+	 * may need to convert the block group to bitmaps as a result.
> >+	 * 2. The new extent has an immediate neighbor before it: remove the
> >+	 * previous key and insert a new key combining both of them. There is no
> >+	 * net change in the number of extents.
> >+	 * 3. The new extent has an immediate neighbor after it: remove the next
> >+	 * key and insert a new key combining both of them. There is no net
> >+	 * change in the number of extents.
> >+	 * 4. The new extent has immediate neighbors on both sides: remove both
> >+	 * of the keys and insert a new key combining all of them. Where we used
> >+	 * to have two extents, we now have one, so decrement the extent count.
> >+	 */
> >+
> >+	new_key.objectid = start;
> >+	new_key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
> >+	new_key.offset = size;
> >+
> >+	/* Search for a neighbor on the left. */
> >+	if (start == block_group->key.objectid)
> >+		goto right;
> >+	key.objectid = start - 1;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+	if (ret)
> >+		goto out;
> >+
> >+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
> >+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
> >+		btrfs_release_path(path);
> >+		goto right;
> >+	}
> >+
> >+	found_start = key.objectid;
> >+	found_end = key.objectid + key.offset;
> >+	ASSERT(found_start >= block_group->key.objectid &&
> >+	       found_end > block_group->key.objectid);
> >+	ASSERT(found_start < start && found_end <= start);
> >+
> >+	/*
> >+	 * Delete the neighbor on the left and absorb it into the new key (cases
> >+	 * 2 and 4).
> >+	 */
> >+	if (found_end == start) {
> >+		ret = btrfs_del_item(trans, root, path);
> >+		if (ret)
> >+			goto out;
> >+		new_key.objectid = found_start;
> >+		new_key.offset += key.offset;
> >+		new_extents--;
> >+	}
> >+	btrfs_release_path(path);
> >+
> >+right:
> >+	/* Search for a neighbor on the right. */
> >+	if (end == block_group->key.objectid + block_group->key.offset)
> >+		goto insert;
> >+	key.objectid = end;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+	if (ret)
> >+		goto out;
> >+
> >+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
> >+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
> >+		btrfs_release_path(path);
> >+		goto insert;
> >+	}
> >+
> >+	found_start = key.objectid;
> >+	found_end = key.objectid + key.offset;
> >+	ASSERT(found_start >= block_group->key.objectid &&
> >+	       found_end > block_group->key.objectid);
> >+	ASSERT((found_start < start && found_end <= start) ||
> >+	       (found_start >= end && found_end > end));
> >+
> >+	/*
> >+	 * Delete the neighbor on the right and absorb it into the new key
> >+	 * (cases 3 and 4).
> >+	 */
> >+	if (found_start == end) {
> >+		ret = btrfs_del_item(trans, root, path);
> >+		if (ret)
> >+			goto out;
> >+		new_key.offset += key.offset;
> >+		new_extents--;
> >+	}
> >+	btrfs_release_path(path);
> >+
> >+insert:
> >+	/* Insert the new key (cases 1-4). */
> >+	ret = btrfs_insert_empty_item(trans, root, path, &new_key, 0);
> >+	if (ret)
> >+		goto out;
> >+
> >+	btrfs_release_path(path);
> >+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
> >+					     new_extents);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> 
> It would be good to have a sanity test for this to make sure all of your
> cases are covered and are proven in a unit test.
> 
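Same here — the four merge cases are also a pure function of the new range and its touching neighbors, so they can be covered by a small unit test. A hedged userspace sketch (illustrative types, not the kernel's):

```c
/*
 * Sketch of the neighbor merging in add_free_space_extent(): absorb a
 * touching neighbor on either side of the new range [nr->start, nr->end).
 * left/right may be NULL when there is no immediate neighbor on that side.
 * Returns the net change in the free space extent count.
 */
struct range { unsigned long long start, end; };

static int merge_free_range(struct range *nr,
			    const struct range *left,
			    const struct range *right)
{
	int new_extents = 1;	/* case 1: a brand new extent */

	if (left && left->end == nr->start) {
		nr->start = left->start;	/* cases 2 and 4 */
		new_extents--;
	}
	if (right && right->start == nr->end) {
		nr->end = right->end;		/* cases 3 and 4 */
		new_extents--;
	}
	return new_extents;
}
```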
> >+
> >+static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
> >+				    struct btrfs_fs_info *fs_info,
> >+				    struct btrfs_block_group_cache *block_group,
> >+				    struct btrfs_path *path,
> >+				    u64 start, u64 size)
> >+{
> >+	struct btrfs_free_space_info *info;
> >+	u32 flags;
> >+	int ret;
> >+
> >+	mutex_lock(&block_group->free_space_lock);
> >+
> >+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	flags = btrfs_free_space_flags(path->nodes[0], info);
> >+	btrfs_release_path(path);
> >+
> >+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> >+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
> >+					       path, start, size, 0);
> >+	} else {
> >+		ret = add_free_space_extent(trans, fs_info, block_group, path,
> >+					    start, size);
> >+	}
> >+
> >+out:
> >+	mutex_unlock(&block_group->free_space_lock);
> >+	return ret;
> >+}
> >+
> >+int add_to_free_space_tree(struct btrfs_trans_handle *trans,
> >+			   struct btrfs_fs_info *fs_info,
> >+			   u64 start, u64 size)
> >+{
> >+	struct btrfs_block_group_cache *block_group;
> >+	struct btrfs_path *path;
> >+	int ret;
> >+
> >+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> >+		return 0;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+
> >+	block_group = btrfs_lookup_block_group(fs_info, start);
> >+	if (!block_group) {
> >+		ASSERT(0);
> >+		ret = -ENOENT;
> >+		goto out_nobg;
> >+	}
> >+
> >+	ret = __add_to_free_space_tree(trans, fs_info, block_group, path, start,
> >+				       size);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	btrfs_put_block_group(block_group);
> >+out_nobg:
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> >+
> >+static int add_new_free_space_extent(struct btrfs_trans_handle *trans,
> >+				     struct btrfs_fs_info *fs_info,
> >+				     struct btrfs_block_group_cache *block_group,
> >+				     struct btrfs_path *path,
> >+				     u64 start, u64 end)
> >+{
> >+	u64 extent_start, extent_end;
> >+	int ret;
> >+
> >+	while (start < end) {
> >+		ret = find_first_extent_bit(fs_info->pinned_extents, start,
> >+					    &extent_start, &extent_end,
> >+					    EXTENT_DIRTY | EXTENT_UPTODATE,
> >+					    NULL);
> >+		if (ret)
> >+			break;
> >+
> >+		if (extent_start <= start) {
> >+			start = extent_end + 1;
> >+		} else if (extent_start > start && extent_start < end) {
> >+			ret = __add_to_free_space_tree(trans, fs_info,
> >+						       block_group, path, start,
> >+						       extent_start - start);
> >+			btrfs_release_path(path);
> >+			if (ret)
> >+				return ret;
> >+			start = extent_end + 1;
> >+		} else {
> >+			break;
> >+		}
> >+	}
> >+	if (start < end) {
> >+		ret = __add_to_free_space_tree(trans, fs_info, block_group,
> >+					       path, start, end - start);
> >+		btrfs_release_path(path);
> >+		if (ret)
> >+			return ret;
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >+/*
> >+ * Populate the free space tree by walking the extent tree, avoiding the super
> >+ * block mirrors. Operations on the extent tree that happen as a result of
> >+ * writes to the free space tree will go through the normal add/remove hooks.
> >+ */
> >+static int populate_free_space_tree(struct btrfs_trans_handle *trans,
> >+				    struct btrfs_fs_info *fs_info,
> >+				    struct btrfs_block_group_cache *block_group)
> >+{
> >+	struct btrfs_root *extent_root = fs_info->extent_root;
> >+	struct btrfs_path *path, *path2;
> >+	struct btrfs_key key;
> >+	u64 start, end;
> >+	int ret;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+	path->reada = 1;
> >+
> >+	path2 = btrfs_alloc_path();
> >+	if (!path2) {
> >+		btrfs_free_path(path);
> >+		return -ENOMEM;
> >+	}
> >+
> >+	ret = add_new_free_space_info(trans, fs_info, block_group, path2);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = exclude_super_stripes(extent_root, block_group);
> >+	if (ret)
> >+		goto out;
> >+
> >+	/*
> >+	 * Iterate through all of the extent and metadata items in this block
> >+	 * group, adding the free space between them and the free space at the
> >+	 * end. Note that EXTENT_ITEM and METADATA_ITEM are less than
> >+	 * BLOCK_GROUP_ITEM, so an extent may precede the block group that it's
> >+	 * contained in.
> >+	 */
> >+	key.objectid = block_group->key.objectid;
> >+	key.type = BTRFS_EXTENT_ITEM_KEY;
> >+	key.offset = 0;
> >+
> >+	ret = btrfs_search_slot_for_read(extent_root, &key, path, 1, 0);
> >+	if (ret < 0)
> >+		goto out;
> >+	ASSERT(ret == 0);
> >+
> >+	start = block_group->key.objectid;
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+	while (1) {
> >+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+		if (key.type == BTRFS_EXTENT_ITEM_KEY ||
> >+		    key.type == BTRFS_METADATA_ITEM_KEY) {
> >+			if (key.objectid >= end)
> >+				break;
> >+
> >+			ret = add_new_free_space_extent(trans, fs_info,
> >+							block_group, path2,
> >+							start, key.objectid);
> >+			start = key.objectid;
> >+			if (key.type == BTRFS_METADATA_ITEM_KEY)
> >+				start += fs_info->tree_root->nodesize;
> >+			else
> >+				start += key.offset;
> >+		} else if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
> >+			if (key.objectid != block_group->key.objectid)
> >+				break;
> >+		}
> >+
> >+		ret = btrfs_next_item(extent_root, path);
> >+		if (ret < 0)
> >+			goto out;
> >+		if (ret)
> >+			break;
> >+	}
> >+	ret = add_new_free_space_extent(trans, fs_info, block_group, path2,
> >+					start, end);
> >+	if (ret)
> >+		goto out;
> >+
> >+out:
> >+	free_excluded_extents(extent_root, block_group);
> >+	btrfs_free_path(path2);
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> >+
> >+int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
> >+{
> >+	struct btrfs_trans_handle *trans;
> >+	struct btrfs_root *tree_root = fs_info->tree_root;
> >+	struct btrfs_root *free_space_root;
> >+	struct btrfs_block_group_cache *block_group;
> >+	struct rb_node *node;
> >+	int ret;
> >+
> >+	trans = btrfs_start_transaction(tree_root, 0);
> >+	if (IS_ERR(trans))
> >+		return PTR_ERR(trans);
> >+
> >+	free_space_root = btrfs_create_tree(trans, fs_info,
> >+					    BTRFS_FREE_SPACE_TREE_OBJECTID);
> >+	if (IS_ERR(free_space_root)) {
> >+		ret = PTR_ERR(free_space_root);
> >+		btrfs_abort_transaction(trans, tree_root, ret);
> >+		return ret;
> >+	}
> >+	fs_info->free_space_root = free_space_root;
> >+
> >+	node = rb_first(&fs_info->block_group_cache_tree);
> >+	while (node) {
> >+		block_group = rb_entry(node, struct btrfs_block_group_cache,
> >+				       cache_node);
> >+		ret = populate_free_space_tree(trans, fs_info, block_group);
> >+		if (ret) {
> >+			btrfs_abort_transaction(trans, tree_root, ret);
> >+			return ret;
> >+		}
> >+		node = rb_next(node);
> >+	}
> >+
> >+	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
> >+
> >+	ret = btrfs_commit_transaction(trans, tree_root);
> >+	if (ret)
> >+		return ret;
> >+
> >+	return 0;
> >+}
> >+
> >+int add_block_group_free_space(struct btrfs_trans_handle *trans,
> >+			       struct btrfs_fs_info *fs_info,
> >+			       struct btrfs_block_group_cache *block_group)
> >+{
> >+	struct btrfs_path *path;
> >+	int ret;
> >+
> >+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> >+		return 0;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+
> >+	ret = add_new_free_space_info(trans, fs_info, block_group, path);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = add_new_free_space_extent(trans, fs_info, block_group, path,
> >+					block_group->key.objectid,
> >+					block_group->key.objectid +
> >+					block_group->key.offset);
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> >+out:
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> >+
> >+int remove_block_group_free_space(struct btrfs_trans_handle *trans,
> >+				  struct btrfs_fs_info *fs_info,
> >+				  struct btrfs_block_group_cache *block_group)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_path *path;
> >+	struct btrfs_key key, found_key;
> >+	struct extent_buffer *leaf;
> >+	u64 start, end;
> >+	int done = 0, nr;
> >+	int ret;
> >+
> >+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> >+		return 0;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+
> >+	start = block_group->key.objectid;
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+
> >+	key.objectid = end - 1;
> >+	key.type = (u8)-1;
> >+	key.offset = (u64)-1;
> >+
> >+	while (!done) {
> >+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
> >+		if (ret)
> >+			goto out;
> >+
> >+		leaf = path->nodes[0];
> >+		nr = 0;
> >+		path->slots[0]++;
> >+		while (path->slots[0] > 0) {
> >+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
> >+
> >+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
> >+				ASSERT(found_key.objectid == block_group->key.objectid);
> >+				ASSERT(found_key.offset == block_group->key.offset);
> >+				done = 1;
> >+				nr++;
> >+				path->slots[0]--;
> >+				break;
> >+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY ||
> >+				   found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
> >+				ASSERT(found_key.objectid >= start);
> >+				ASSERT(found_key.objectid < end);
> >+				ASSERT(found_key.objectid + found_key.offset <= end);
> >+				nr++;
> >+				path->slots[0]--;
> >+			} else {
> >+				ASSERT(0);
> >+			}
> >+		}
> >+
> >+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
> >+		if (ret)
> >+			goto out;
> >+		btrfs_release_path(path);
> >+	}
> >+
> >+	ret = 0;
> >+out:
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> >+
> >+static int load_free_space_bitmaps(struct btrfs_fs_info *fs_info,
> >+				   struct btrfs_block_group_cache *block_group,
> >+				   struct btrfs_path *path,
> >+				   u32 expected_extent_count)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key;
> >+	int prev_bit = 0, bit;
> >+	/* Initialize to silence GCC. */
> >+	u64 extent_start = 0;
> >+	u64 end, offset;
> >+	u32 extent_count = 0;
> >+	int ret;
> >+
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+
> >+	while (1) {
> >+		ret = btrfs_next_item(root, path);
> >+		if (ret < 0)
> >+			goto out;
> >+		if (ret)
> >+			break;
> >+
> >+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
> >+			break;
> >+
> >+		ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
> >+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
> >+
> >+		offset = key.objectid;
> >+		while (offset < key.objectid + key.offset) {
> >+			bit = free_space_test_bit(block_group, path, offset);
> >+			if (prev_bit == 0 && bit == 1) {
> >+				extent_start = offset;
> >+			} else if (prev_bit == 1 && bit == 0) {
> >+				add_new_free_space(block_group, fs_info,
> >+						   extent_start, offset);
> >+				extent_count++;
> >+			}
> >+			prev_bit = bit;
> >+			offset += block_group->sectorsize;
> >+		}
> >+	}
> >+	if (prev_bit == 1) {
> >+		add_new_free_space(block_group, fs_info, extent_start, end);
> >+		extent_count++;
> >+	}
> >+
> >+	if (extent_count != expected_extent_count) {
> >+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> >+			  block_group->key.objectid, extent_count,
> >+			  expected_extent_count);
> >+		ASSERT(0);
> >+		ret = -EIO;
> >+		goto out;
> >+	}
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> >+
> >+static int load_free_space_extents(struct btrfs_fs_info *fs_info,
> >+				   struct btrfs_block_group_cache *block_group,
> >+				   struct btrfs_path *path,
> >+				   u32 expected_extent_count)
> >+{
> >+	struct btrfs_root *root = fs_info->free_space_root;
> >+	struct btrfs_key key;
> >+	u64 end;
> >+	u32 extent_count = 0;
> >+	int ret;
> >+
> >+	end = block_group->key.objectid + block_group->key.offset;
> >+
> >+	while (1) {
> >+		ret = btrfs_next_item(root, path);
> >+		if (ret < 0)
> >+			goto out;
> >+		if (ret)
> >+			break;
> >+
> >+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
> >+
> >+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
> >+			break;
> >+
> >+		ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
> >+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
> >+
> >+		add_new_free_space(block_group, fs_info, key.objectid,
> >+				   key.objectid + key.offset);
> >+		extent_count++;
> >+	}
> >+
> >+	if (extent_count != expected_extent_count) {
> >+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
> >+			  block_group->key.objectid, extent_count,
> >+			  expected_extent_count);
> >+		ASSERT(0);
> >+		ret = -EIO;
> >+		goto out;
> >+	}
> >+
> >+	ret = 0;
> >+out:
> >+	return ret;
> >+}
> >+
> >+int load_free_space_tree(struct btrfs_fs_info *fs_info,
> >+			 struct btrfs_block_group_cache *block_group)
> >+{
> >+	struct btrfs_free_space_info *info;
> >+	struct btrfs_path *path;
> >+	u32 extent_count, flags;
> >+	int ret;
> >+
> >+	path = btrfs_alloc_path();
> >+	if (!path)
> >+		return -ENOMEM;
> >+
> >+	/*
> >+	 * Just like caching_thread() doesn't want to deadlock on the extent
> >+	 * tree, we don't want to deadlock on the free space tree.
> >+	 */
> >+	path->skip_locking = 1;
> >+	path->search_commit_root = 1;
> >+	path->reada = 1;
> >+
> >+	down_read(&fs_info->commit_root_sem);
> >+
> >+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
> >+	if (IS_ERR(info)) {
> >+		ret = PTR_ERR(info);
> >+		goto out;
> >+	}
> >+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
> >+	flags = btrfs_free_space_flags(path->nodes[0], info);
> >+
> >+	/*
> >+	 * We left path pointing to the free space info item, so now
> >+	 * load_free_space_foo can just iterate through the free space tree from
> >+	 * there.
> >+	 */
> >+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
> >+		ret = load_free_space_bitmaps(fs_info, block_group, path,
> >+					      extent_count);
> >+	} else {
> >+		ret = load_free_space_extents(fs_info, block_group, path,
> >+					      extent_count);
> >+	}
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = 0;
> 
> This bit isn't needed, just fall through.
> 
> >+out:
> >+	up_read(&fs_info->commit_root_sem);
> >+	btrfs_free_path(path);
> >+	return ret;
> >+}
> 
> So actually there are a lot of places in here where you need to abort the
> transaction if there is a failure.  If we can't update the free space tree
> for whatever reason, and we aren't a developer whose box immediately panics
> on the ASSERT, we need to make sure to abort so the fs stays consistent.
> The only place you don't have to do this is when loading the free space
> tree.
> Thanks,
> 
> Josef
> 

So an error returned from either add_to_free_space_tree() or
remove_from_free_space_tree() will eventually bubble up to
btrfs_run_delayed_refs() which will abort the transaction. Likewise, an
error from remove_block_group_free_space() will abort in
btrfs_remove_chunk(). It looks like there's at least one call chain
where an error from add_block_group_free_space() won't abort. For the
sake of not having to audit all of these call chains, I'll go ahead and
add the aborts closer to where they occur and add some sanity tests,
thanks.
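
The abort-at-the-failure-site pattern being agreed on here can be sketched
in a small user-space model. All names below are illustrative, not btrfs's;
the real code calls btrfs_abort_transaction() with the errno it got, but the
control flow is the same:

```c
/*
 * Hypothetical user-space model of the error handling discussed above:
 * any failure while updating the free space tree marks the transaction
 * aborted right where it happens, instead of relying on every caller up
 * the chain (e.g. btrfs_run_delayed_refs()) to do it.
 */
#include <assert.h>
#include <errno.h>

struct mock_trans {
	int aborted;		/* sticky flag, like trans->aborted */
};

/* Stand-in for a tree update that can fail with a negative errno. */
static int mock_update_tree(struct mock_trans *trans, int simulate_error)
{
	if (trans->aborted)
		return -EROFS;	/* an aborted transaction refuses new work */
	return simulate_error ? -EIO : 0;
}

/* Abort at the failure site rather than in some distant caller. */
static int add_free_space_checked(struct mock_trans *trans, int simulate_error)
{
	int ret = mock_update_tree(trans, simulate_error);

	if (ret)
		trans->aborted = 1;	/* models btrfs_abort_transaction() */
	return ret;
}
```

The point of the sketch is only the control flow: once the flag is set at
the first failure, every later update fails fast, so no call chain can
silently continue modifying an inconsistent tree.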

-- 
Omar


* Re: [PATCH 4/6] Btrfs: implement the free space B-tree
  2015-09-01 20:06     ` Omar Sandoval
@ 2015-09-01 20:08       ` Josef Bacik
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2015-09-01 20:08 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, Omar Sandoval

On 09/01/2015 04:06 PM, Omar Sandoval wrote:
> On Tue, Sep 01, 2015 at 03:44:27PM -0400, Josef Bacik wrote:
>> On 09/01/2015 03:13 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> The free space cache has turned out to be a scalability bottleneck on
>>> large, busy filesystems. When the cache for a lot of block groups needs
>>> to be written out, we can get extremely long commit times; if this
>>> happens in the critical section, things are especially bad because we
>>> block new transactions from happening.
>>>
>>> The main problem with the free space cache is that it has to be written
>>> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
>>> to store free space fixes this: updates can be done as needed and we get
>>> all of the benefits of using a B-tree: checksumming, RAID handling,
>>> well-understood behavior.
>>>
>>> With the free space tree, we get commit times that are about the same as
>>> in the no-cache case, with load times slower than with the free space
>>> cache but still much faster than with no cache. Free space is represented
>>> with extents until it becomes more space-efficient to use bitmaps,
>>> giving us similar space overhead to the free space cache.
>>>
>>> The operations on the free space tree are: adding and removing free
>>> space, handling the creation and deletion of block groups, and loading
>>> the free space for a block group. We can also create the free space tree
>>> by walking the extent tree.
>>>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>>   fs/btrfs/Makefile          |    2 +-
>>>   fs/btrfs/ctree.h           |   25 +-
>>>   fs/btrfs/extent-tree.c     |   15 +-
>>>   fs/btrfs/free-space-tree.c | 1468 ++++++++++++++++++++++++++++++++++++++++++++
>>>   fs/btrfs/free-space-tree.h |   39 ++
>>>   5 files changed, 1541 insertions(+), 8 deletions(-)
>>>   create mode 100644 fs/btrfs/free-space-tree.c
>>>   create mode 100644 fs/btrfs/free-space-tree.h
>>>
>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
>>> index 6d1d0b93b1aa..766169709146 100644
>>> --- a/fs/btrfs/Makefile
>>> +++ b/fs/btrfs/Makefile
>>> @@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>>>   	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
>>>   	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
>>>   	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>>> -	   uuid-tree.o props.o hash.o
>>> +	   uuid-tree.o props.o hash.o free-space-tree.o
>>>
>>>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>>>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 34a81a79f5b6..d49181d35f08 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1299,8 +1299,20 @@ struct btrfs_block_group_cache {
>>>   	u64 delalloc_bytes;
>>>   	u64 bytes_super;
>>>   	u64 flags;
>>> -	u64 sectorsize;
>>>   	u64 cache_generation;
>>> +	u32 sectorsize;
>>> +
>>> +	/*
>>> +	 * If the free space extent count exceeds this number, convert the block
>>> +	 * group to bitmaps.
>>> +	 */
>>> +	u32 bitmap_high_thresh;
>>> +
>>> +	/*
>>> +	 * If the free space extent count drops below this number, convert the
>>> +	 * block group back to extents.
>>> +	 */
>>> +	u32 bitmap_low_thresh;
>>>
>>>   	/*
>>>   	 * It is just used for the delayed data space allocation because
>>> @@ -1356,6 +1368,9 @@ struct btrfs_block_group_cache {
>>>   	struct list_head io_list;
>>>
>>>   	struct btrfs_io_ctl io_ctl;
>>> +
>>> +	/* Lock for free space tree operations. */
>>> +	struct mutex free_space_lock;
>>>   };
>>>
>>>   /* delayed seq elem */
>>> @@ -1407,6 +1422,7 @@ struct btrfs_fs_info {
>>>   	struct btrfs_root *csum_root;
>>>   	struct btrfs_root *quota_root;
>>>   	struct btrfs_root *uuid_root;
>>> +	struct btrfs_root *free_space_root;
>>>
>>>   	/* the log root tree is a directory of all the other log roots */
>>>   	struct btrfs_root *log_root_tree;
>>> @@ -3556,6 +3572,13 @@ void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
>>>   void check_system_chunk(struct btrfs_trans_handle *trans,
>>>   			struct btrfs_root *root,
>>>   			const u64 type);
>>> +void free_excluded_extents(struct btrfs_root *root,
>>> +			   struct btrfs_block_group_cache *cache);
>>> +int exclude_super_stripes(struct btrfs_root *root,
>>> +			  struct btrfs_block_group_cache *cache);
>>> +u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>>> +		       struct btrfs_fs_info *info, u64 start, u64 end);
>>> +
>>>   /* ctree.c */
>>>   int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
>>>   		     int level, int *slot);
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 07204bf601ed..37179a569f40 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -237,8 +237,8 @@ static int add_excluded_extent(struct btrfs_root *root,
>>>   	return 0;
>>>   }
>>>
>>> -static void free_excluded_extents(struct btrfs_root *root,
>>> -				  struct btrfs_block_group_cache *cache)
>>> +void free_excluded_extents(struct btrfs_root *root,
>>> +			   struct btrfs_block_group_cache *cache)
>>>   {
>>>   	u64 start, end;
>>>
>>> @@ -251,14 +251,16 @@ static void free_excluded_extents(struct btrfs_root *root,
>>>   			  start, end, EXTENT_UPTODATE, GFP_NOFS);
>>>   }
>>>
>>> -static int exclude_super_stripes(struct btrfs_root *root,
>>> -				 struct btrfs_block_group_cache *cache)
>>> +int exclude_super_stripes(struct btrfs_root *root,
>>> +			  struct btrfs_block_group_cache *cache)
>>>   {
>>>   	u64 bytenr;
>>>   	u64 *logical;
>>>   	int stripe_len;
>>>   	int i, nr, ret;
>>>
>>> +	cache->bytes_super = 0;
>>> +
>>>   	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
>>>   		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
>>>   		cache->bytes_super += stripe_len;
>>> @@ -337,8 +339,8 @@ static void put_caching_control(struct btrfs_caching_control *ctl)
>>>    * we need to check the pinned_extents for any extents that can't be used yet
>>>    * since their free space will be released as soon as the transaction commits.
>>>    */
>>> -static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>>> -			      struct btrfs_fs_info *info, u64 start, u64 end)
>>> +u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
>>> +		       struct btrfs_fs_info *info, u64 start, u64 end)
>>>   {
>>>   	u64 extent_start, extent_end, size, total_added = 0;
>>>   	int ret;
>>> @@ -9281,6 +9283,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>>>   	INIT_LIST_HEAD(&cache->io_list);
>>>   	btrfs_init_free_space_ctl(cache);
>>>   	atomic_set(&cache->trimming, 0);
>>> +	mutex_init(&cache->free_space_lock);
>>>
>>>   	return cache;
>>>   }
>>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
>>> new file mode 100644
>>> index 000000000000..bbb4f731f948
>>> --- /dev/null
>>> +++ b/fs/btrfs/free-space-tree.c
>>> @@ -0,0 +1,1468 @@
>>> +/*
>>> + * Copyright (C) 2015 Facebook.  All rights reserved.
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of the GNU General Public
>>> + * License v2 as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> + * General Public License for more details.
>>> + *
>>> + * You should have received a copy of the GNU General Public
>>> + * License along with this program; if not, write to the
>>> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
>>> + * Boston, MA 021110-1307, USA.
>>> + */
>>> +
>>> +#include <linux/kernel.h>
>>> +#include <linux/vmalloc.h>
>>> +#include "ctree.h"
>>> +#include "disk-io.h"
>>> +#include "locking.h"
>>> +#include "free-space-tree.h"
>>> +#include "transaction.h"
>>> +
>>> +/*
>>> + * The default size for new free space bitmap items. The last bitmap in a block
>>> + * group may be truncated, and none of the free space tree code assumes that
>>> + * existing bitmaps are this size.
>>> + */
>>> +#define BTRFS_FREE_SPACE_BITMAP_SIZE 256
>>> +#define BTRFS_FREE_SPACE_BITMAP_BITS (BTRFS_FREE_SPACE_BITMAP_SIZE * BITS_PER_BYTE)
>>> +
>>> +void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache)
>>> +{
>>> +	u32 bitmap_range;
>>> +	size_t bitmap_size;
>>> +	u64 num_bitmaps, total_bitmap_size;
>>> +
>>> +	/*
>>> +	 * We convert to bitmaps when the disk space required for using extents
>>> +	 * exceeds that required for using bitmaps.
>>> +	 */
>>> +	bitmap_range = cache->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
>>> +	num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
>>> +			      bitmap_range);
>>> +	bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
>>> +	total_bitmap_size = num_bitmaps * bitmap_size;
>>> +	cache->bitmap_high_thresh = div_u64(total_bitmap_size,
>>> +					    sizeof(struct btrfs_item));
>>> +
>>> +	/*
>>> +	 * We allow for a small buffer between the high threshold and low
>>> +	 * threshold to avoid thrashing back and forth between the two formats.
>>> +	 */
>>> +	if (cache->bitmap_high_thresh > 100)
>>> +		cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
>>> +	else
>>> +		cache->bitmap_low_thresh = 0;
>>> +}
>>> +
>>> +static int add_new_free_space_info(struct btrfs_trans_handle *trans,
>>> +				   struct btrfs_fs_info *fs_info,
>>> +				   struct btrfs_block_group_cache *block_group,
>>> +				   struct btrfs_path *path)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_free_space_info *info;
>>> +	struct btrfs_key key;
>>> +	struct extent_buffer *leaf;
>>> +	int ret;
>>> +
>>> +	key.objectid = block_group->key.objectid;
>>> +	key.type = BTRFS_FREE_SPACE_INFO_KEY;
>>> +	key.offset = block_group->key.offset;
>>> +
>>> +	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*info));
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	leaf = path->nodes[0];
>>> +	info = btrfs_item_ptr(leaf, path->slots[0],
>>> +			      struct btrfs_free_space_info);
>>> +	btrfs_set_free_space_extent_count(leaf, info, 0);
>>> +	btrfs_set_free_space_flags(leaf, info, 0);
>>> +	btrfs_mark_buffer_dirty(leaf);
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	btrfs_release_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +static struct btrfs_free_space_info *
>>> +search_free_space_info(struct btrfs_trans_handle *trans,
>>> +		       struct btrfs_fs_info *fs_info,
>>> +		       struct btrfs_block_group_cache *block_group,
>>> +		       struct btrfs_path *path, int cow)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key;
>>> +	int ret;
>>> +
>>> +	key.objectid = block_group->key.objectid;
>>> +	key.type = BTRFS_FREE_SPACE_INFO_KEY;
>>> +	key.offset = block_group->key.offset;
>>> +
>>> +	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
>>> +	if (ret < 0)
>>> +		return ERR_PTR(ret);
>>> +	if (ret != 0) {
>>> +		btrfs_warn(fs_info, "missing free space info for %llu",
>>> +			   block_group->key.objectid);
>>> +		ASSERT(0);
>>> +		return ERR_PTR(-ENOENT);
>>> +	}
>>> +
>>> +	return btrfs_item_ptr(path->nodes[0], path->slots[0],
>>> +			      struct btrfs_free_space_info);
>>> +}
>>> +
>>> +/*
>>> + * btrfs_search_slot() but we're looking for the greatest key less than the
>>> + * passed key.
>>> + */
>>> +static int btrfs_search_prev_slot(struct btrfs_trans_handle *trans,
>>> +				  struct btrfs_root *root,
>>> +				  struct btrfs_key *key, struct btrfs_path *p,
>>> +				  int ins_len, int cow)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = btrfs_search_slot(trans, root, key, p, ins_len, cow);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	if (ret == 0) {
>>> +		ASSERT(0);
>>> +		return -EIO;
>>> +	}
>>> +
>>> +	if (p->slots[0] == 0) {
>>> +		ASSERT(0);
>>> +		return -EIO;
>>> +	}
>>> +	p->slots[0]--;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
>>> +{
>>> +	return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
>>> +}
>>> +
>>> +static unsigned long *alloc_bitmap(u32 bitmap_size)
>>> +{
>>> +	return __vmalloc(bitmap_size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
>>> +			 PAGE_KERNEL);
>>> +}
>>> +
>>> +static int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
>>> +					 struct btrfs_fs_info *fs_info,
>>> +					 struct btrfs_block_group_cache *block_group,
>>> +					 struct btrfs_path *path)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_free_space_info *info;
>>> +	struct btrfs_key key, found_key;
>>> +	struct extent_buffer *leaf;
>>> +	unsigned long *bitmap;
>>> +	char *bitmap_cursor;
>>> +	u64 start, end;
>>> +	u64 bitmap_range, i;
>>> +	u32 bitmap_size, flags, expected_extent_count;
>>> +	u32 extent_count = 0;
>>> +	int done = 0, nr;
>>> +	int ret;
>>> +
>>> +	bitmap_size = free_space_bitmap_size(block_group->key.offset,
>>> +					     block_group->sectorsize);
>>> +	bitmap = alloc_bitmap(bitmap_size);
>>> +	if (!bitmap)
>>> +		return -ENOMEM;
>>> +
>>> +	start = block_group->key.objectid;
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +
>>> +	key.objectid = end - 1;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	while (!done) {
>>> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		leaf = path->nodes[0];
>>> +		nr = 0;
>>> +		path->slots[0]++;
>>> +		while (path->slots[0] > 0) {
>>> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
>>> +
>>> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
>>> +				ASSERT(found_key.objectid == block_group->key.objectid);
>>> +				ASSERT(found_key.offset == block_group->key.offset);
>>> +				done = 1;
>>> +				break;
>>> +			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
>>> +				u64 first, last;
>>> +
>>> +				ASSERT(found_key.objectid >= start);
>>> +				ASSERT(found_key.objectid < end);
>>> +				ASSERT(found_key.objectid + found_key.offset <= end);
>>> +
>>> +				first = div_u64(found_key.objectid - start,
>>> +						block_group->sectorsize);
>>> +				last = div_u64(found_key.objectid + found_key.offset - start,
>>> +					       block_group->sectorsize);
>>> +				bitmap_set(bitmap, first, last - first);
>>> +
>>> +				extent_count++;
>>> +				nr++;
>>> +				path->slots[0]--;
>>> +			} else {
>>> +				ASSERT(0);
>>> +			}
>>> +		}
>>> +
>>> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
>>> +		if (ret)
>>
>> We could have deleted stuff previously so we need to abort here as well.
>>
>>> +			goto out;
>>> +		btrfs_release_path(path);
>>> +	}
>>> +
>>> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
>>> +	if (IS_ERR(info)) {
>>> +		ret = PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	leaf = path->nodes[0];
>>> +	flags = btrfs_free_space_flags(leaf, info);
>>> +	flags |= BTRFS_FREE_SPACE_USING_BITMAPS;
>>> +	btrfs_set_free_space_flags(leaf, info, flags);
>>> +	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
>>> +	btrfs_mark_buffer_dirty(leaf);
>>> +	btrfs_release_path(path);
>>> +
>>> +	if (extent_count != expected_extent_count) {
>>> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
>>> +			  block_group->key.objectid, extent_count,
>>> +			  expected_extent_count);
>>
>> We should also abort the transaction here since we will have already deleted
>> the normal entries and thus have a corrupted fs if we are allowed to
>> continue.
>>
>>> +		ASSERT(0);
>>> +		ret = -EIO;
>>> +		goto out;
>>> +	}
>>> +
>>> +	bitmap_cursor = (char *)bitmap;
>>> +	bitmap_range = block_group->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
>>> +	i = start;
>>> +	while (i < end) {
>>> +		unsigned long ptr;
>>> +		u64 extent_size;
>>> +		u32 data_size;
>>> +
>>> +		extent_size = min(end - i, bitmap_range);
>>> +		data_size = free_space_bitmap_size(extent_size,
>>> +						   block_group->sectorsize);
>>> +
>>> +		key.objectid = i;
>>> +		key.type = BTRFS_FREE_SPACE_BITMAP_KEY;
>>> +		key.offset = extent_size;
>>> +
>>> +		ret = btrfs_insert_empty_item(trans, root, path, &key,
>>> +					      data_size);
>>> +		if (ret)
>>
>> Need to abort here as well.
>>
>>> +			goto out;
>>> +
>>> +		leaf = path->nodes[0];
>>> +		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>>> +		write_extent_buffer(leaf, bitmap_cursor, ptr,
>>> +				    data_size);
>>> +		btrfs_mark_buffer_dirty(leaf);
>>> +		btrfs_release_path(path);
>>> +
>>> +		i += extent_size;
>>> +		bitmap_cursor += data_size;
>>> +	}
>>> +
>>> +	ret = 0;
>>> +out:
>>
>> Maybe have the if (ret) btrfs_abort_transaction() here.
>>
>>> +	vfree(bitmap);
>>> +	return ret;
>>> +}
>>> +
>>> +static int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
>>> +					 struct btrfs_fs_info *fs_info,
>>> +					 struct btrfs_block_group_cache *block_group,
>>> +					 struct btrfs_path *path)
>>> +{
>>
>> You need to abort in the appropriate places here as well.
>>
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_free_space_info *info;
>>> +	struct btrfs_key key, found_key;
>>> +	struct extent_buffer *leaf;
>>> +	unsigned long *bitmap;
>>> +	u64 start, end;
>>> +	/* Initialize to silence GCC. */
>>> +	u64 extent_start = 0;
>>> +	u64 offset;
>>> +	u32 bitmap_size, flags, expected_extent_count;
>>> +	int prev_bit = 0, bit, bitnr;
>>> +	u32 extent_count = 0;
>>> +	int done = 0, nr;
>>> +	int ret;
>>> +
>>> +	bitmap_size = free_space_bitmap_size(block_group->key.offset,
>>> +					     block_group->sectorsize);
>>> +	bitmap = alloc_bitmap(bitmap_size);
>>> +	if (!bitmap)
>>> +		return -ENOMEM;
>>> +
>>> +	start = block_group->key.objectid;
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +
>>> +	key.objectid = end - 1;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	while (!done) {
>>> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		leaf = path->nodes[0];
>>> +		nr = 0;
>>> +		path->slots[0]++;
>>> +		while (path->slots[0] > 0) {
>>> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
>>> +
>>> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
>>> +				ASSERT(found_key.objectid == block_group->key.objectid);
>>> +				ASSERT(found_key.offset == block_group->key.offset);
>>> +				done = 1;
>>> +				break;
>>> +			} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
>>> +				unsigned long ptr;
>>> +				char *bitmap_cursor;
>>> +				u32 bitmap_pos, data_size;
>>> +
>>> +				ASSERT(found_key.objectid >= start);
>>> +				ASSERT(found_key.objectid < end);
>>> +				ASSERT(found_key.objectid + found_key.offset <= end);
>>> +
>>> +				bitmap_pos = div_u64(found_key.objectid - start,
>>> +						     block_group->sectorsize *
>>> +						     BITS_PER_BYTE);
>>> +				bitmap_cursor = ((char *)bitmap) + bitmap_pos;
>>> +				data_size = free_space_bitmap_size(found_key.offset,
>>> +								   block_group->sectorsize);
>>> +
>>> +				ptr = btrfs_item_ptr_offset(leaf, path->slots[0] - 1);
>>> +				read_extent_buffer(leaf, bitmap_cursor, ptr,
>>> +						   data_size);
>>> +
>>> +				nr++;
>>> +				path->slots[0]--;
>>> +			} else {
>>> +				ASSERT(0);
>>> +			}
>>> +		}
>>> +
>>> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
>>> +		if (ret)
>>> +			goto out;
>>> +		btrfs_release_path(path);
>>> +	}
>>> +
>>> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
>>> +	if (IS_ERR(info)) {
>>> +		ret = PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	leaf = path->nodes[0];
>>> +	flags = btrfs_free_space_flags(leaf, info);
>>> +	flags &= ~BTRFS_FREE_SPACE_USING_BITMAPS;
>>> +	btrfs_set_free_space_flags(leaf, info, flags);
>>> +	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
>>> +	btrfs_mark_buffer_dirty(leaf);
>>> +	btrfs_release_path(path);
>>> +
>>> +	offset = start;
>>> +	bitnr = 0;
>>> +	while (offset < end) {
>>> +		bit = !!test_bit(bitnr, bitmap);
>>> +		if (prev_bit == 0 && bit == 1) {
>>> +			extent_start = offset;
>>> +		} else if (prev_bit == 1 && bit == 0) {
>>> +			key.objectid = extent_start;
>>> +			key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
>>> +			key.offset = offset - extent_start;
>>> +
>>> +			ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
>>> +			if (ret)
>>> +				goto out;
>>> +			btrfs_release_path(path);
>>> +
>>> +			extent_count++;
>>> +		}
>>> +		prev_bit = bit;
>>> +		offset += block_group->sectorsize;
>>> +		bitnr++;
>>> +	}
>>> +	if (prev_bit == 1) {
>>> +		key.objectid = extent_start;
>>> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
>>> +		key.offset = end - extent_start;
>>> +
>>> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
>>> +		if (ret)
>>> +			goto out;
>>> +		btrfs_release_path(path);
>>> +
>>> +		extent_count++;
>>> +	}
>>> +
>>> +	if (extent_count != expected_extent_count) {
>>> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
>>> +			  block_group->key.objectid, extent_count,
>>> +			  expected_extent_count);
>>> +		ASSERT(0);
>>> +		ret = -EIO;
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	vfree(bitmap);
>>> +	return ret;
>>> +}
>>> +
>>> +static int update_free_space_extent_count(struct btrfs_trans_handle *trans,
>>> +					  struct btrfs_fs_info *fs_info,
>>> +					  struct btrfs_block_group_cache *block_group,
>>> +					  struct btrfs_path *path,
>>> +					  int new_extents)
>>> +{
>>> +	struct btrfs_free_space_info *info;
>>> +	u32 flags;
>>> +	u32 extent_count;
>>> +	int ret = 0;
>>> +
>>> +	if (new_extents == 0)
>>> +		return 0;
>>> +
>>> +	info = search_free_space_info(trans, fs_info, block_group, path, 1);
>>> +	if (IS_ERR(info)) {
>>> +		ret = PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	flags = btrfs_free_space_flags(path->nodes[0], info);
>>> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
>>> +
>>> +	extent_count += new_extents;
>>> +	btrfs_set_free_space_extent_count(path->nodes[0], info, extent_count);
>>> +	btrfs_mark_buffer_dirty(path->nodes[0]);
>>> +	btrfs_release_path(path);
>>> +
>>> +	if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
>>> +	    extent_count > block_group->bitmap_high_thresh) {
>>> +		ret = convert_free_space_to_bitmaps(trans, fs_info, block_group,
>>> +						    path);
>>> +	} else if ((flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
>>> +		   extent_count < block_group->bitmap_low_thresh) {
>>> +		ret = convert_free_space_to_extents(trans, fs_info, block_group,
>>> +						    path);
>>> +	}
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>> +static int free_space_test_bit(struct btrfs_block_group_cache *block_group,
>>> +			       struct btrfs_path *path, u64 offset)
>>> +{
>>> +	struct extent_buffer *leaf;
>>> +	struct btrfs_key key;
>>> +	u64 found_start, found_end;
>>> +	unsigned long ptr, i;
>>> +
>>> +	leaf = path->nodes[0];
>>> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>> +	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
>>> +
>>> +	found_start = key.objectid;
>>> +	found_end = key.objectid + key.offset;
>>> +	ASSERT(offset >= found_start && offset < found_end);
>>> +
>>> +	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>>> +	i = div_u64(offset - found_start, block_group->sectorsize);
>>> +	return !!extent_buffer_test_bit(leaf, ptr, i);
>>> +}
>>> +
>>> +static void free_space_set_bits(struct btrfs_block_group_cache *block_group,
>>> +				struct btrfs_path *path, u64 *start, u64 *size,
>>> +				int bit)
>>> +{
>>> +	struct extent_buffer *leaf;
>>> +	struct btrfs_key key;
>>> +	u64 end = *start + *size;
>>> +	u64 found_start, found_end;
>>> +	unsigned long ptr, first, last;
>>> +
>>> +	leaf = path->nodes[0];
>>> +	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
>>> +	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
>>> +
>>> +	found_start = key.objectid;
>>> +	found_end = key.objectid + key.offset;
>>> +	ASSERT(*start >= found_start && *start < found_end);
>>> +	ASSERT(end > found_start);
>>> +
>>> +	if (end > found_end)
>>> +		end = found_end;
>>> +
>>> +	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
>>> +	first = div_u64(*start - found_start, block_group->sectorsize);
>>> +	last = div_u64(end - found_start, block_group->sectorsize);
>>> +	if (bit)
>>> +		extent_buffer_bitmap_set(leaf, ptr, first, last - first);
>>> +	else
>>> +		extent_buffer_bitmap_clear(leaf, ptr, first, last - first);
>>> +	btrfs_mark_buffer_dirty(leaf);
>>> +
>>> +	*size -= end - *start;
>>> +	*start = end;
>>> +}
>>> +
>>> +/*
>>> + * We can't use btrfs_next_item() in modify_free_space_bitmap() because
>>> + * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
>>> + * tree walking in btrfs_next_leaf() anyway because we know exactly what we're
>>> + * looking for.
>>> + */
>>> +static int free_space_next_bitmap(struct btrfs_trans_handle *trans,
>>> +				  struct btrfs_root *root, struct btrfs_path *p)
>>> +{
>>> +	struct btrfs_key key;
>>> +
>>> +	if (p->slots[0] + 1 < btrfs_header_nritems(p->nodes[0])) {
>>> +		p->slots[0]++;
>>> +		return 0;
>>> +	}
>>> +
>>> +	btrfs_item_key_to_cpu(p->nodes[0], &key, p->slots[0]);
>>> +	btrfs_release_path(p);
>>> +
>>> +	key.objectid += key.offset;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	return btrfs_search_prev_slot(trans, root, &key, p, 0, 1);
>>> +}
>>> +
>>> +/*
>>> + * If remove is 1, then we are removing free space, thus clearing bits in the
>>> + * bitmap. If remove is 0, then we are adding free space, thus setting bits in
>>> + * the bitmap.
>>> + */
>>> +static int modify_free_space_bitmap(struct btrfs_trans_handle *trans,
>>> +				    struct btrfs_fs_info *fs_info,
>>> +				    struct btrfs_block_group_cache *block_group,
>>> +				    struct btrfs_path *path,
>>> +				    u64 start, u64 size, int remove)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key;
>>> +	u64 end = start + size;
>>> +	u64 cur_start, cur_size;
>>> +	int prev_bit, next_bit;
>>> +	int new_extents;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * Read the bit for the block immediately before the extent of space if
>>> +	 * that block is within the block group.
>>> +	 */
>>> +	if (start > block_group->key.objectid) {
>>> +		u64 prev_block = start - block_group->sectorsize;
>>> +
>>> +		key.objectid = prev_block;
>>> +		key.type = (u8)-1;
>>> +		key.offset = (u64)-1;
>>> +
>>> +		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		prev_bit = free_space_test_bit(block_group, path, prev_block);
>>> +
>>> +		/* The previous block may have been in the previous bitmap. */
>>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +		if (start >= key.objectid + key.offset) {
>>> +			ret = free_space_next_bitmap(trans, root, path);
>>> +			if (ret)
>>> +				goto out;
>>> +		}
>>> +	} else {
>>> +		key.objectid = start;
>>> +		key.type = (u8)-1;
>>> +		key.offset = (u64)-1;
>>> +
>>> +		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		prev_bit = -1;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Iterate over all of the bitmaps overlapped by the extent of space,
>>> +	 * clearing/setting bits as required.
>>> +	 */
>>> +	cur_start = start;
>>> +	cur_size = size;
>>> +	while (1) {
>>> +		free_space_set_bits(block_group, path, &cur_start, &cur_size,
>>> +				    !remove);
>>> +		if (cur_size == 0)
>>> +			break;
>>> +		ret = free_space_next_bitmap(trans, root, path);
>>> +		if (ret)
>>> +			goto out;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Read the bit for the block immediately after the extent of space if
>>> +	 * that block is within the block group.
>>> +	 */
>>> +	if (end < block_group->key.objectid + block_group->key.offset) {
>>> +		/* The next block may be in the next bitmap. */
>>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +		if (end >= key.objectid + key.offset) {
>>> +			ret = free_space_next_bitmap(trans, root, path);
>>> +			if (ret)
>>> +				goto out;
>>> +		}
>>> +
>>> +		next_bit = free_space_test_bit(block_group, path, end);
>>> +	} else {
>>> +		next_bit = -1;
>>> +	}
>>> +
>>> +	if (remove) {
>>> +		new_extents = -1;
>>> +		if (prev_bit == 1) {
>>> +			/* Leftover on the left. */
>>> +			new_extents++;
>>> +		}
>>> +		if (next_bit == 1) {
>>> +			/* Leftover on the right. */
>>> +			new_extents++;
>>> +		}
>>> +	} else {
>>> +		new_extents = 1;
>>> +		if (prev_bit == 1) {
>>> +			/* Merging with neighbor on the left. */
>>> +			new_extents--;
>>> +		}
>>> +		if (next_bit == 1) {
>>> +			/* Merging with neighbor on the right. */
>>> +			new_extents--;
>>> +		}
>>> +	}
>>> +
>>> +	btrfs_release_path(path);
>>> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
>>> +					     new_extents);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>> +static int remove_free_space_extent(struct btrfs_trans_handle *trans,
>>> +				    struct btrfs_fs_info *fs_info,
>>> +				    struct btrfs_block_group_cache *block_group,
>>> +				    struct btrfs_path *path,
>>> +				    u64 start, u64 size)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key;
>>> +	u64 found_start, found_end;
>>> +	u64 end = start + size;
>>> +	int new_extents = -1;
>>> +	int ret;
>>> +
>>> +	key.objectid = start;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +	ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
>>> +
>>> +	found_start = key.objectid;
>>> +	found_end = key.objectid + key.offset;
>>> +	ASSERT(start >= found_start && end <= found_end);
>>> +
>>> +	/*
>>> +	 * Okay, now that we've found the free space extent which contains the
>>> +	 * free space that we are removing, there are four cases:
>>> +	 *
>>> +	 * 1. We're using the whole extent: delete the key we found and
>>> +	 * decrement the free space extent count.
>>> +	 * 2. We are using part of the extent starting at the beginning: delete
>>> +	 * the key we found and insert a new key representing the leftover at
>>> +	 * the end. There is no net change in the number of extents.
>>> +	 * 3. We are using part of the extent ending at the end: delete the key
>>> +	 * we found and insert a new key representing the leftover at the
>>> +	 * beginning. There is no net change in the number of extents.
>>> +	 * 4. We are using part of the extent in the middle: delete the key we
>>> +	 * found and insert two new keys representing the leftovers on each
>>> +	 * side. Where we used to have one extent, we now have two, so increment
>>> +	 * the extent count. We may need to convert the block group to bitmaps
>>> +	 * as a result.
>>> +	 */
>>> +
>>> +	/* Delete the existing key (cases 1-4). */
>>> +	ret = btrfs_del_item(trans, root, path);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	/* Add a key for leftovers at the beginning (cases 3 and 4). */
>>> +	if (start > found_start) {
>>> +		key.objectid = found_start;
>>> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
>>> +		key.offset = start - found_start;
>>> +
>>> +		btrfs_release_path(path);
>>> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
>>> +		if (ret)
>>> +			goto out;
>>> +		new_extents++;
>>> +	}
>>> +
>>> +	/* Add a key for leftovers at the end (cases 2 and 4). */
>>> +	if (end < found_end) {
>>> +		key.objectid = end;
>>> +		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
>>> +		key.offset = found_end - end;
>>> +
>>> +		btrfs_release_path(path);
>>> +		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
>>> +		if (ret)
>>> +			goto out;
>>> +		new_extents++;
>>> +	}
>>> +
>>> +	btrfs_release_path(path);
>>> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
>>> +					     new_extents);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>
>> A sanity test would be good for this.
>>
>>> +
>>> +int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
>>> +				struct btrfs_fs_info *fs_info,
>>> +				u64 start, u64 size)
>>> +{
>>> +	struct btrfs_block_group_cache *block_group;
>>> +	struct btrfs_free_space_info *info;
>>> +	struct btrfs_path *path;
>>> +	u32 flags;
>>> +	int ret;
>>> +
>>> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
>>> +		return 0;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	block_group = btrfs_lookup_block_group(fs_info, start);
>>> +	if (!block_group) {
>>> +		ASSERT(0);
>>> +		ret = -ENOENT;
>>> +		goto out_nobg;
>>> +	}
>>> +
>>> +	mutex_lock(&block_group->free_space_lock);
>>> +
>>> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
>>> +	if (IS_ERR(info)) {
>>> +		ret = PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	flags = btrfs_free_space_flags(path->nodes[0], info);
>>> +	btrfs_release_path(path);
>>> +
>>> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
>>> +		ret = modify_free_space_bitmap(trans, fs_info, block_group,
>>> +					       path, start, size, 1);
>>> +	} else {
>>> +		ret = remove_free_space_extent(trans, fs_info, block_group,
>>> +					       path, start, size);
>>> +	}
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	mutex_unlock(&block_group->free_space_lock);
>>> +	btrfs_put_block_group(block_group);
>>> +out_nobg:
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +static int add_free_space_extent(struct btrfs_trans_handle *trans,
>>> +				 struct btrfs_fs_info *fs_info,
>>> +				 struct btrfs_block_group_cache *block_group,
>>> +				 struct btrfs_path *path,
>>> +				 u64 start, u64 size)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key, new_key;
>>> +	u64 found_start, found_end;
>>> +	u64 end = start + size;
>>> +	int new_extents = 1;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * We are adding a new extent of free space, but we need to merge
>>> +	 * extents. There are four cases here:
>>> +	 *
>>> +	 * 1. The new extent does not have any immediate neighbors to merge
>>> +	 * with: add the new key and increment the free space extent count. We
>>> +	 * may need to convert the block group to bitmaps as a result.
>>> +	 * 2. The new extent has an immediate neighbor before it: remove the
>>> +	 * previous key and insert a new key combining both of them. There is no
>>> +	 * net change in the number of extents.
>>> +	 * 3. The new extent has an immediate neighbor after it: remove the next
>>> +	 * key and insert a new key combining both of them. There is no net
>>> +	 * change in the number of extents.
>>> +	 * 4. The new extent has immediate neighbors on both sides: remove both
>>> +	 * of the keys and insert a new key combining all of them. Where we used
>>> +	 * to have two extents, we now have one, so decrement the extent count.
>>> +	 */
>>> +
>>> +	new_key.objectid = start;
>>> +	new_key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
>>> +	new_key.offset = size;
>>> +
>>> +	/* Search for a neighbor on the left. */
>>> +	if (start == block_group->key.objectid)
>>> +		goto right;
>>> +	key.objectid = start - 1;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
>>> +		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
>>> +		btrfs_release_path(path);
>>> +		goto right;
>>> +	}
>>> +
>>> +	found_start = key.objectid;
>>> +	found_end = key.objectid + key.offset;
>>> +	ASSERT(found_start >= block_group->key.objectid &&
>>> +	       found_end > block_group->key.objectid);
>>> +	ASSERT(found_start < start && found_end <= start);
>>> +
>>> +	/*
>>> +	 * Delete the neighbor on the left and absorb it into the new key (cases
>>> +	 * 2 and 4).
>>> +	 */
>>> +	if (found_end == start) {
>>> +		ret = btrfs_del_item(trans, root, path);
>>> +		if (ret)
>>> +			goto out;
>>> +		new_key.objectid = found_start;
>>> +		new_key.offset += key.offset;
>>> +		new_extents--;
>>> +	}
>>> +	btrfs_release_path(path);
>>> +
>>> +right:
>>> +	/* Search for a neighbor on the right. */
>>> +	if (end == block_group->key.objectid + block_group->key.offset)
>>> +		goto insert;
>>> +	key.objectid = end;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
>>> +		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
>>> +		btrfs_release_path(path);
>>> +		goto insert;
>>> +	}
>>> +
>>> +	found_start = key.objectid;
>>> +	found_end = key.objectid + key.offset;
>>> +	ASSERT(found_start >= block_group->key.objectid &&
>>> +	       found_end > block_group->key.objectid);
>>> +	ASSERT((found_start < start && found_end <= start) ||
>>> +	       (found_start >= end && found_end > end));
>>> +
>>> +	/*
>>> +	 * Delete the neighbor on the right and absorb it into the new key
>>> +	 * (cases 3 and 4).
>>> +	 */
>>> +	if (found_start == end) {
>>> +		ret = btrfs_del_item(trans, root, path);
>>> +		if (ret)
>>> +			goto out;
>>> +		new_key.offset += key.offset;
>>> +		new_extents--;
>>> +	}
>>> +	btrfs_release_path(path);
>>> +
>>> +insert:
>>> +	/* Insert the new key (cases 1-4). */
>>> +	ret = btrfs_insert_empty_item(trans, root, path, &new_key, 0);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	btrfs_release_path(path);
>>> +	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
>>> +					     new_extents);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>
>> It would be good to have a sanity test for this to make sure all of your
>> cases are covered and are proven in a unit test.
>>
>>> +
>>> +static int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
>>> +				    struct btrfs_fs_info *fs_info,
>>> +				    struct btrfs_block_group_cache *block_group,
>>> +				    struct btrfs_path *path,
>>> +				    u64 start, u64 size)
>>> +{
>>> +	struct btrfs_free_space_info *info;
>>> +	u32 flags;
>>> +	int ret;
>>> +
>>> +	mutex_lock(&block_group->free_space_lock);
>>> +
>>> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
>>> +	if (IS_ERR(info)) {
>>> +		return PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	flags = btrfs_free_space_flags(path->nodes[0], info);
>>> +	btrfs_release_path(path);
>>> +
>>> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
>>> +		ret = modify_free_space_bitmap(trans, fs_info, block_group,
>>> +					       path, start, size, 0);
>>> +	} else {
>>> +		ret = add_free_space_extent(trans, fs_info, block_group, path,
>>> +					    start, size);
>>> +	}
>>> +
>>> +out:
>>> +	mutex_unlock(&block_group->free_space_lock);
>>> +	return ret;
>>> +}
>>> +
>>> +int add_to_free_space_tree(struct btrfs_trans_handle *trans,
>>> +			   struct btrfs_fs_info *fs_info,
>>> +			   u64 start, u64 size)
>>> +{
>>> +	struct btrfs_block_group_cache *block_group;
>>> +	struct btrfs_path *path;
>>> +	int ret;
>>> +
>>> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
>>> +		return 0;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	block_group = btrfs_lookup_block_group(fs_info, start);
>>> +	if (!block_group) {
>>> +		ASSERT(0);
>>> +		ret = -ENOENT;
>>> +		goto out_nobg;
>>> +	}
>>> +
>>> +	ret = __add_to_free_space_tree(trans, fs_info, block_group, path, start,
>>> +				       size);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	btrfs_put_block_group(block_group);
>>> +out_nobg:
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +static int add_new_free_space_extent(struct btrfs_trans_handle *trans,
>>> +				     struct btrfs_fs_info *fs_info,
>>> +				     struct btrfs_block_group_cache *block_group,
>>> +				     struct btrfs_path *path,
>>> +				     u64 start, u64 end)
>>> +{
>>> +	u64 extent_start, extent_end;
>>> +	int ret;
>>> +
>>> +	while (start < end) {
>>> +		ret = find_first_extent_bit(fs_info->pinned_extents, start,
>>> +					    &extent_start, &extent_end,
>>> +					    EXTENT_DIRTY | EXTENT_UPTODATE,
>>> +					    NULL);
>>> +		if (ret)
>>> +			break;
>>> +
>>> +		if (extent_start <= start) {
>>> +			start = extent_end + 1;
>>> +		} else if (extent_start > start && extent_start < end) {
>>> +			ret = __add_to_free_space_tree(trans, fs_info,
>>> +						       block_group, path, start,
>>> +						       extent_start - start);
>>> +			btrfs_release_path(path);
>>> +			if (ret)
>>> +				return ret;
>>> +			start = extent_end + 1;
>>> +		} else {
>>> +			break;
>>> +		}
>>> +	}
>>> +	if (start < end) {
>>> +		ret = __add_to_free_space_tree(trans, fs_info, block_group,
>>> +					       path, start, end - start);
>>> +		btrfs_release_path(path);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * Populate the free space tree by walking the extent tree, avoiding the super
>>> + * block mirrors. Operations on the extent tree that happen as a result of
>>> + * writes to the free space tree will go through the normal add/remove hooks.
>>> + */
>>> +static int populate_free_space_tree(struct btrfs_trans_handle *trans,
>>> +				    struct btrfs_fs_info *fs_info,
>>> +				    struct btrfs_block_group_cache *block_group)
>>> +{
>>> +	struct btrfs_root *extent_root = fs_info->extent_root;
>>> +	struct btrfs_path *path, *path2;
>>> +	struct btrfs_key key;
>>> +	u64 start, end;
>>> +	int ret;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +	path->reada = 1;
>>> +
>>> +	path2 = btrfs_alloc_path();
>>> +	if (!path2) {
>>> +		btrfs_free_path(path);
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	ret = add_new_free_space_info(trans, fs_info, block_group, path2);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = exclude_super_stripes(extent_root, block_group);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * Iterate through all of the extent and metadata items in this block
>>> +	 * group, adding the free space between them and the free space at the
>>> +	 * end. Note that EXTENT_ITEM and METADATA_ITEM are less than
>>> +	 * BLOCK_GROUP_ITEM, so an extent may precede the block group that it's
>>> +	 * contained in.
>>> +	 */
>>> +	key.objectid = block_group->key.objectid;
>>> +	key.type = BTRFS_EXTENT_ITEM_KEY;
>>> +	key.offset = 0;
>>> +
>>> +	ret = btrfs_search_slot_for_read(extent_root, &key, path, 1, 0);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +	ASSERT(ret == 0);
>>> +
>>> +	start = block_group->key.objectid;
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +	while (1) {
>>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +		if (key.type == BTRFS_EXTENT_ITEM_KEY ||
>>> +		    key.type == BTRFS_METADATA_ITEM_KEY) {
>>> +			if (key.objectid >= end)
>>> +				break;
>>> +
>>> +			ret = add_new_free_space_extent(trans, fs_info,
>>> +							block_group, path2,
>>> +							start, key.objectid);
>>> +			start = key.objectid;
>>> +			if (key.type == BTRFS_METADATA_ITEM_KEY)
>>> +				start += fs_info->tree_root->nodesize;
>>> +			else
>>> +				start += key.offset;
>>> +		} else if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
>>> +			if (key.objectid != block_group->key.objectid)
>>> +				break;
>>> +		}
>>> +
>>> +		ret = btrfs_next_item(extent_root, path);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +		if (ret)
>>> +			break;
>>> +	}
>>> +	ret = add_new_free_space_extent(trans, fs_info, block_group, path2,
>>> +					start, end);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +out:
>>> +	free_excluded_extents(extent_root, block_group);
>>> +	btrfs_free_path(path2);
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
>>> +{
>>> +	struct btrfs_trans_handle *trans;
>>> +	struct btrfs_root *tree_root = fs_info->tree_root;
>>> +	struct btrfs_root *free_space_root;
>>> +	struct btrfs_block_group_cache *block_group;
>>> +	struct rb_node *node;
>>> +	int ret;
>>> +
>>> +	trans = btrfs_start_transaction(tree_root, 0);
>>> +	if (IS_ERR(trans))
>>> +		return PTR_ERR(trans);
>>> +
>>> +	free_space_root = btrfs_create_tree(trans, fs_info,
>>> +					    BTRFS_FREE_SPACE_TREE_OBJECTID);
>>> +	if (IS_ERR(free_space_root)) {
>>> +		ret = PTR_ERR(free_space_root);
>>> +		btrfs_abort_transaction(trans, tree_root, ret);
>>> +		return ret;
>>> +	}
>>> +	fs_info->free_space_root = free_space_root;
>>> +
>>> +	node = rb_first(&fs_info->block_group_cache_tree);
>>> +	while (node) {
>>> +		block_group = rb_entry(node, struct btrfs_block_group_cache,
>>> +				       cache_node);
>>> +		ret = populate_free_space_tree(trans, fs_info, block_group);
>>> +		if (ret) {
>>> +			btrfs_abort_transaction(trans, tree_root, ret);
>>> +			return ret;
>>> +		}
>>> +		node = rb_next(node);
>>> +	}
>>> +
>>> +	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
>>> +
>>> +	ret = btrfs_commit_transaction(trans, tree_root);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +int add_block_group_free_space(struct btrfs_trans_handle *trans,
>>> +			       struct btrfs_fs_info *fs_info,
>>> +			       struct btrfs_block_group_cache *block_group)
>>> +{
>>> +	struct btrfs_path *path;
>>> +	int ret;
>>> +
>>> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
>>> +		return 0;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	ret = add_new_free_space_info(trans, fs_info, block_group, path);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = add_new_free_space_extent(trans, fs_info, block_group, path,
>>> +					block_group->key.objectid,
>>> +					block_group->key.objectid +
>>> +					block_group->key.offset);
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +int remove_block_group_free_space(struct btrfs_trans_handle *trans,
>>> +				  struct btrfs_fs_info *fs_info,
>>> +				  struct btrfs_block_group_cache *block_group)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_path *path;
>>> +	struct btrfs_key key, found_key;
>>> +	struct extent_buffer *leaf;
>>> +	u64 start, end;
>>> +	int done = 0, nr;
>>> +	int ret;
>>> +
>>> +	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
>>> +		return 0;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	start = block_group->key.objectid;
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +
>>> +	key.objectid = end - 1;
>>> +	key.type = (u8)-1;
>>> +	key.offset = (u64)-1;
>>> +
>>> +	while (!done) {
>>> +		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		leaf = path->nodes[0];
>>> +		nr = 0;
>>> +		path->slots[0]++;
>>> +		while (path->slots[0] > 0) {
>>> +			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
>>> +
>>> +			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
>>> +				ASSERT(found_key.objectid == block_group->key.objectid);
>>> +				ASSERT(found_key.offset == block_group->key.offset);
>>> +				done = 1;
>>> +				nr++;
>>> +				path->slots[0]--;
>>> +				break;
>>> +			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY ||
>>> +				   found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
>>> +				ASSERT(found_key.objectid >= start);
>>> +				ASSERT(found_key.objectid < end);
>>> +				ASSERT(found_key.objectid + found_key.offset <= end);
>>> +				nr++;
>>> +				path->slots[0]--;
>>> +			} else {
>>> +				ASSERT(0);
>>> +			}
>>> +		}
>>> +
>>> +		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
>>> +		if (ret)
>>> +			goto out;
>>> +		btrfs_release_path(path);
>>> +	}
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>> +
>>> +static int load_free_space_bitmaps(struct btrfs_fs_info *fs_info,
>>> +				   struct btrfs_block_group_cache *block_group,
>>> +				   struct btrfs_path *path,
>>> +				   u32 expected_extent_count)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key;
>>> +	int prev_bit = 0, bit;
>>> +	/* Initialize to silence GCC. */
>>> +	u64 extent_start = 0;
>>> +	u64 end, offset;
>>> +	u32 extent_count = 0;
>>> +	int ret;
>>> +
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +
>>> +	while (1) {
>>> +		ret = btrfs_next_item(root, path);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +		if (ret)
>>> +			break;
>>> +
>>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
>>> +			break;
>>> +
>>> +		ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
>>> +		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
>>> +
>>> +		offset = key.objectid;
>>> +		while (offset < key.objectid + key.offset) {
>>> +			bit = free_space_test_bit(block_group, path, offset);
>>> +			if (prev_bit == 0 && bit == 1) {
>>> +				extent_start = offset;
>>> +			} else if (prev_bit == 1 && bit == 0) {
>>> +				add_new_free_space(block_group, fs_info,
>>> +						   extent_start, offset);
>>> +				extent_count++;
>>> +			}
>>> +			prev_bit = bit;
>>> +			offset += block_group->sectorsize;
>>> +		}
>>> +	}
>>> +	if (prev_bit == 1) {
>>> +		add_new_free_space(block_group, fs_info, extent_start, end);
>>> +		extent_count++;
>>> +	}
>>> +
>>> +	if (extent_count != expected_extent_count) {
>>> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
>>> +			  block_group->key.objectid, extent_count,
>>> +			  expected_extent_count);
>>> +		ASSERT(0);
>>> +		ret = -EIO;
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>> +static int load_free_space_extents(struct btrfs_fs_info *fs_info,
>>> +				   struct btrfs_block_group_cache *block_group,
>>> +				   struct btrfs_path *path,
>>> +				   u32 expected_extent_count)
>>> +{
>>> +	struct btrfs_root *root = fs_info->free_space_root;
>>> +	struct btrfs_key key;
>>> +	u64 end;
>>> +	u32 extent_count = 0;
>>> +	int ret;
>>> +
>>> +	end = block_group->key.objectid + block_group->key.offset;
>>> +
>>> +	while (1) {
>>> +		ret = btrfs_next_item(root, path);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +		if (ret)
>>> +			break;
>>> +
>>> +		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
>>> +
>>> +		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
>>> +			break;
>>> +
>>> +		ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
>>> +		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
>>> +
>>> +		add_new_free_space(block_group, fs_info, key.objectid,
>>> +				   key.objectid + key.offset);
>>> +		extent_count++;
>>> +	}
>>> +
>>> +	if (extent_count != expected_extent_count) {
>>> +		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
>>> +			  block_group->key.objectid, extent_count,
>>> +			  expected_extent_count);
>>> +		ASSERT(0);
>>> +		ret = -EIO;
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = 0;
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>> +int load_free_space_tree(struct btrfs_fs_info *fs_info,
>>> +			 struct btrfs_block_group_cache *block_group)
>>> +{
>>> +	struct btrfs_free_space_info *info;
>>> +	struct btrfs_path *path;
>>> +	u32 extent_count, flags;
>>> +	int ret;
>>> +
>>> +	path = btrfs_alloc_path();
>>> +	if (!path)
>>> +		return -ENOMEM;
>>> +
>>> +	/*
>>> +	 * Just like caching_thread() doesn't want to deadlock on the extent
>>> +	 * tree, we don't want to deadlock on the free space tree.
>>> +	 */
>>> +	path->skip_locking = 1;
>>> +	path->search_commit_root = 1;
>>> +	path->reada = 1;
>>> +
>>> +	down_read(&fs_info->commit_root_sem);
>>> +
>>> +	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
>>> +	if (IS_ERR(info)) {
>>> +		ret = PTR_ERR(info);
>>> +		goto out;
>>> +	}
>>> +	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
>>> +	flags = btrfs_free_space_flags(path->nodes[0], info);
>>> +
>>> +	/*
>>> +	 * We left path pointing to the free space info item, so now
>>> +	 * load_free_space_foo can just iterate through the free space tree from
>>> +	 * there.
>>> +	 */
>>> +	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
>>> +		ret = load_free_space_bitmaps(fs_info, block_group, path,
>>> +					      extent_count);
>>> +	} else {
>>> +		ret = load_free_space_extents(fs_info, block_group, path,
>>> +					      extent_count);
>>> +	}
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = 0;
>>
>> This bit isn't needed, just fall through.
>>
>>> +out:
>>> +	up_read(&fs_info->commit_root_sem);
>>> +	btrfs_free_path(path);
>>> +	return ret;
>>> +}
>>
>> So actually there are a lot of places in here that you need to abort the
>> transaction if there is a failure.  If we can't update the free space tree
>> for whatever reason and we aren't a developer so don't immediately panic the
>> box we need to make sure to abort so the fs stays consistent.  The only
>> place you don't have to do this is when loading the free space tree.
>> Thanks,
>>
>> Josef
>>
>
> So an error returned from either add_to_free_space_tree() or
> remove_from_free_space_tree() will eventually bubble up to
> btrfs_run_delayed_refs() which will abort the transaction. Likewise, an
> error from remove_block_group_free_space() will abort in
> btrfs_remove_chunk(). It looks like there's at least one call chain
> where an error from add_block_group_free_space() won't abort. For the
> sake of not having to audit all of these call chains, I'll go ahead and
> add the aborts closer to where they occur and add some sanity tests,

Yeah we want to have the aborts close to where they happen so we know 
exactly what went wrong, otherwise we have to go and dig down to where 
the actual failure was.  If we are relying on an upper layer to abort 
properly we could miss something or be less informed of the real 
problem.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree
  2015-09-01 19:48   ` Josef Bacik
@ 2015-09-02  4:42     ` Omar Sandoval
  2015-09-02 15:29       ` Josef Bacik
  0 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-02  4:42 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Omar Sandoval

On Tue, Sep 01, 2015 at 03:48:57PM -0400, Josef Bacik wrote:
> On 09/01/2015 03:05 PM, Omar Sandoval wrote:
> >From: Omar Sandoval <osandov@fb.com>
> >
> >The free space tree is updated in tandem with the extent tree. There are
> >only a handful of places where we need to hook in:
> >
> >1. Block group creation
> >2. Block group deletion
> >3. Delayed refs (extent creation and deletion)
> >4. Block group caching
> >
> >Signed-off-by: Omar Sandoval <osandov@fb.com>
> >---
> >  fs/btrfs/extent-tree.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 70 insertions(+), 3 deletions(-)
> >
> >diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >index 37179a569f40..3f10df3932f0 100644
> >--- a/fs/btrfs/extent-tree.c
> >+++ b/fs/btrfs/extent-tree.c
> >@@ -33,6 +33,7 @@
> >  #include "raid56.h"
> >  #include "locking.h"
> >  #include "free-space-cache.h"
> >+#include "free-space-tree.h"
> >  #include "math.h"
> >  #include "sysfs.h"
> >  #include "qgroup.h"
> >@@ -589,7 +590,41 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
> >  	cache->cached = BTRFS_CACHE_FAST;
> >  	spin_unlock(&cache->lock);
> >
> >-	if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
> >+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
> >+		if (load_cache_only) {
> >+			spin_lock(&cache->lock);
> >+			cache->caching_ctl = NULL;
> >+			cache->cached = BTRFS_CACHE_NO;
> >+			spin_unlock(&cache->lock);
> >+			wake_up(&caching_ctl->wait);
> >+		} else {
> >+			mutex_lock(&caching_ctl->mutex);
> >+			ret = load_free_space_tree(fs_info, cache);
> >+			if (ret) {
> >+				btrfs_warn(fs_info, "failed to load free space tree for %llu: %d",
> >+					   cache->key.objectid, ret);
> >+				spin_lock(&cache->lock);
> >+				cache->caching_ctl = NULL;
> >+				cache->cached = BTRFS_CACHE_ERROR;
> >+				spin_unlock(&cache->lock);
> >+				goto tree_out;
> >+			}
> >+
> >+			spin_lock(&cache->lock);
> >+			cache->caching_ctl = NULL;
> >+			cache->cached = BTRFS_CACHE_FINISHED;
> >+			cache->last_byte_to_unpin = (u64)-1;
> >+			caching_ctl->progress = (u64)-1;
> >+			spin_unlock(&cache->lock);
> >+			mutex_unlock(&caching_ctl->mutex);
> >+
> >+tree_out:
> >+			wake_up(&caching_ctl->wait);
> >+			put_caching_control(caching_ctl);
> >+			free_excluded_extents(fs_info->extent_root, cache);
> >+			return 0;
> >+		}
> >+	} else if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
> >  		mutex_lock(&caching_ctl->mutex);
> 
> So the reason we have this load_cache_only thing is because the free space
> cache could be loaded almost instantaneously since it was contiguously
> allocated.  This isn't the case with the free space tree, and although it is
> better than the no space cache way of doing things, we are still going to
> incur a bit of latency when seeking through a large free space tree.  So
> break this out and make the caching kthread either do the old style load or
> load the free space tree.  Then you can use the add free space helpers that
> will wake anybody up waiting on allocations and you incur less direct
> latency.  Thanks,
> 
> Josef

Okay, I'll do the load from caching_thread(). Do you think we're going
to need the need_resched() || rwsem_is_contended(commit_root) check and
retry for the free space tree like we do with the extent tree? It seems
like it could get complicated since we would need to worry about the
format changing underneath us.

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots
  2015-09-01 19:22 ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots Omar Sandoval
  2015-09-01 19:22   ` [PATCH 2/3] btrfs-progs: add basic awareness of the free space tree Omar Sandoval
  2015-09-01 19:22   ` [PATCH 3/3] btrfs-progs: check the free space tree in btrfsck Omar Sandoval
@ 2015-09-02 15:02   ` David Sterba
  2 siblings, 0 replies; 43+ messages in thread
From: David Sterba @ 2015-09-02 15:02 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, Omar Sandoval

On Tue, Sep 01, 2015 at 12:22:44PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>

I'll apply that one now as it's independent on the feature. Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree
  2015-09-02  4:42     ` Omar Sandoval
@ 2015-09-02 15:29       ` Josef Bacik
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Bacik @ 2015-09-02 15:29 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, Omar Sandoval

On 09/02/2015 12:42 AM, Omar Sandoval wrote:
> On Tue, Sep 01, 2015 at 03:48:57PM -0400, Josef Bacik wrote:
>> On 09/01/2015 03:05 PM, Omar Sandoval wrote:
>>> From: Omar Sandoval <osandov@fb.com>
>>>
>>> The free space tree is updated in tandem with the extent tree. There are
>>> only a handful of places where we need to hook in:
>>>
>>> 1. Block group creation
>>> 2. Block group deletion
>>> 3. Delayed refs (extent creation and deletion)
>>> 4. Block group caching
>>>
>>> Signed-off-by: Omar Sandoval <osandov@fb.com>
>>> ---
>>>   fs/btrfs/extent-tree.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
>>>   1 file changed, 70 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 37179a569f40..3f10df3932f0 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -33,6 +33,7 @@
>>>   #include "raid56.h"
>>>   #include "locking.h"
>>>   #include "free-space-cache.h"
>>> +#include "free-space-tree.h"
>>>   #include "math.h"
>>>   #include "sysfs.h"
>>>   #include "qgroup.h"
>>> @@ -589,7 +590,41 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>>>   	cache->cached = BTRFS_CACHE_FAST;
>>>   	spin_unlock(&cache->lock);
>>>
>>> -	if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
>>> +	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
>>> +		if (load_cache_only) {
>>> +			spin_lock(&cache->lock);
>>> +			cache->caching_ctl = NULL;
>>> +			cache->cached = BTRFS_CACHE_NO;
>>> +			spin_unlock(&cache->lock);
>>> +			wake_up(&caching_ctl->wait);
>>> +		} else {
>>> +			mutex_lock(&caching_ctl->mutex);
>>> +			ret = load_free_space_tree(fs_info, cache);
>>> +			if (ret) {
>>> +				btrfs_warn(fs_info, "failed to load free space tree for %llu: %d",
>>> +					   cache->key.objectid, ret);
>>> +				spin_lock(&cache->lock);
>>> +				cache->caching_ctl = NULL;
>>> +				cache->cached = BTRFS_CACHE_ERROR;
>>> +				spin_unlock(&cache->lock);
>>> +				goto tree_out;
>>> +			}
>>> +
>>> +			spin_lock(&cache->lock);
>>> +			cache->caching_ctl = NULL;
>>> +			cache->cached = BTRFS_CACHE_FINISHED;
>>> +			cache->last_byte_to_unpin = (u64)-1;
>>> +			caching_ctl->progress = (u64)-1;
>>> +			spin_unlock(&cache->lock);
>>> +			mutex_unlock(&caching_ctl->mutex);
>>> +
>>> +tree_out:
>>> +			wake_up(&caching_ctl->wait);
>>> +			put_caching_control(caching_ctl);
>>> +			free_excluded_extents(fs_info->extent_root, cache);
>>> +			return 0;
>>> +		}
>>> +	} else if (fs_info->mount_opt & BTRFS_MOUNT_SPACE_CACHE) {
>>>   		mutex_lock(&caching_ctl->mutex);
>>
>> So the reason we have this load_cache_only thing is because the free space
>> cache could be loaded almost instantaneously since it was contiguously
>> allocated.  This isn't the case with the free space tree, and although it is
>> better than the no space cache way of doing things, we are still going to
>> incur a bit of latency when seeking through a large free space tree.  So
>> break this out and make the caching kthread either do the old style load or
>> load the free space tree.  Then you can use the add free space helpers that
>> will wake anybody up waiting on allocations and you incur less direct
>> latency.  Thanks,
>>
>> Josef
>
> Okay, I'll do the load from caching_thread(). Do you think we're going
> to need the need_resched() || rwsem_is_contended(commit_root) check and
> retry for the free space tree like we do with the extent tree? It seems
> like it could get complicated since we would need to worry about the
> format changing underneath us.
>

So we make it so the format of a block group can't change while we're 
caching it; problem solved.  But I get your point: we could probably drop 
that check, since it shouldn't take that long to cache a whole block 
group, but that sort of thinking is what got us into this situation in 
the first place.  We're reading from the commit root anyway, so we have 
to re-search when we do this; if the block group has been converted to a 
bitmap in the meantime, we'll still end up at the right offset, correct?  
I think it'll be fine.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 0/9] free space B-tree
@ 2015-09-03 19:44 ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 1/9] Btrfs: add extent buffer bitmap operations Omar Sandoval
                     ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

Here's version 2 of the free space B-tree patches, addressing
Josef's review from the last round, which you can find here:
http://www.spinics.net/lists/linux-btrfs/msg46713.html

Changes from v1->v2:

- Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
- Added aborts in the free space tree code closer to the site the error
  is encountered: where we add or remove block groups, add or remove
  free space, and also when we convert formats
- Moved loading of the free space tree into caching_thread() and added a
  new patch 4 in preparation for it
- Commented a bunch of stuff in the extent buffer bitmap operations and
  refactored some of the complicated logic
- Added sanity tests for the extent buffer bitmap operations and free
  space tree (patches 2 and 6)
- Added Josef's Reviewed-by tags

Omar Sandoval (9):
  Btrfs: add extent buffer bitmap operations
  Btrfs: add extent buffer bitmap sanity tests
  Btrfs: add helpers for read-only compat bits
  Btrfs: refactor caching_thread()
  Btrfs: introduce the free space B-tree on-disk format
  Btrfs: implement the free space B-tree
  Btrfs: add free space tree sanity tests
  Btrfs: wire up the free space tree to the extent tree
  Btrfs: add free space tree mount option

 fs/btrfs/Makefile                      |    5 +-
 fs/btrfs/ctree.h                       |  107 ++-
 fs/btrfs/disk-io.c                     |   26 +
 fs/btrfs/extent-tree.c                 |  112 ++-
 fs/btrfs/extent_io.c                   |  183 +++-
 fs/btrfs/extent_io.h                   |   10 +-
 fs/btrfs/free-space-tree.c             | 1501 ++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-tree.h             |   71 ++
 fs/btrfs/super.c                       |   24 +-
 fs/btrfs/tests/btrfs-tests.c           |   52 ++
 fs/btrfs/tests/btrfs-tests.h           |   10 +
 fs/btrfs/tests/extent-io-tests.c       |  138 ++-
 fs/btrfs/tests/free-space-tests.c      |   35 +-
 fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
 fs/btrfs/tests/qgroup-tests.c          |   20 +-
 include/trace/events/btrfs.h           |    3 +-
 16 files changed, 2763 insertions(+), 104 deletions(-)
 create mode 100644 fs/btrfs/free-space-tree.c
 create mode 100644 fs/btrfs/free-space-tree.h
 create mode 100644 fs/btrfs/tests/free-space-tree-tests.c

-- 
2.5.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 1/9] Btrfs: add extent buffer bitmap operations
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 2/9] Btrfs: add extent buffer bitmap sanity tests Omar Sandoval
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

These are going to be used for the free space tree bitmap items.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/extent_io.h |   6 +++
 2 files changed, 155 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 02d05817cbdf..eae9175ff62b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5475,6 +5475,155 @@ void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
 	}
 }
 
+/*
+ * The extent buffer bitmap operations are done with byte granularity because
+ * bitmap items are not guaranteed to be aligned to a word and therefore a
+ * single word in a bitmap may straddle two pages in the extent buffer.
+ */
+#define BIT_BYTE(nr) ((nr) / BITS_PER_BYTE)
+#define BYTE_MASK ((1 << BITS_PER_BYTE) - 1)
+#define BITMAP_FIRST_BYTE_MASK(start) \
+	((BYTE_MASK << ((start) & (BITS_PER_BYTE - 1))) & BYTE_MASK)
+#define BITMAP_LAST_BYTE_MASK(nbits) \
+	(BYTE_MASK >> (-(nbits) & (BITS_PER_BYTE - 1)))
+
+/*
+ * eb_bitmap_offset() - calculate the page and offset of the byte containing the
+ * given bit number
+ * @eb: the extent buffer
+ * @start: offset of the bitmap item in the extent buffer
+ * @nr: bit number
+ * @page_index: return index of the page in the extent buffer that contains the
+ * given bit number
+ * @page_offset: return offset into the page given by page_index
+ *
+ * This helper hides the ugliness of finding the byte in an extent buffer which
+ * contains a given bit.
+ */
+static inline void eb_bitmap_offset(struct extent_buffer *eb,
+				    unsigned long start, unsigned long nr,
+				    unsigned long *page_index,
+				    size_t *page_offset)
+{
+	size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+	size_t byte_offset = BIT_BYTE(nr);
+	size_t offset;
+
+	/*
+	 * The byte we want is the offset of the extent buffer + the offset of
+	 * the bitmap item in the extent buffer + the offset of the byte in the
+	 * bitmap item.
+	 */
+	offset = start_offset + start + byte_offset;
+
+	*page_index = offset >> PAGE_CACHE_SHIFT;
+	*page_offset = offset & (PAGE_CACHE_SIZE - 1);
+}
+
+/**
+ * extent_buffer_test_bit - determine whether a bit in a bitmap item is set
+ * @eb: the extent buffer
+ * @start: offset of the bitmap item in the extent buffer
+ * @nr: bit number to test
+ */
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long nr)
+{
+	char *kaddr;
+	struct page *page;
+	unsigned long i;
+	size_t offset;
+
+	eb_bitmap_offset(eb, start, nr, &i, &offset);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+	return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
+}
+
+/**
+ * extent_buffer_bitmap_set - set an area of a bitmap
+ * @eb: the extent buffer
+ * @start: offset of the bitmap item in the extent buffer
+ * @pos: bit number of the first bit
+ * @len: number of bits to set
+ */
+void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
+			      unsigned long pos, unsigned long len)
+{
+	char *kaddr;
+	struct page *page;
+	unsigned long i;
+	size_t offset;
+	const unsigned int size = pos + len;
+	int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
+	unsigned int mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
+
+	eb_bitmap_offset(eb, start, pos, &i, &offset);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+
+	while (len >= bits_to_set) {
+		kaddr[offset] |= mask_to_set;
+		len -= bits_to_set;
+		bits_to_set = BITS_PER_BYTE;
+		mask_to_set = ~0U;
+		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
+			offset = 0;
+			page = eb->pages[++i];
+			WARN_ON(!PageUptodate(page));
+			kaddr = page_address(page);
+		}
+	}
+	if (len) {
+		mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
+		kaddr[offset] |= mask_to_set;
+	}
+}
+
+
+/**
+ * extent_buffer_bitmap_clear - clear an area of a bitmap
+ * @eb: the extent buffer
+ * @start: offset of the bitmap item in the extent buffer
+ * @pos: bit number of the first bit
+ * @len: number of bits to clear
+ */
+void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
+				unsigned long pos, unsigned long len)
+{
+	char *kaddr;
+	struct page *page;
+	unsigned long i;
+	size_t offset;
+	const unsigned int size = pos + len;
+	int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
+	unsigned int mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
+
+	eb_bitmap_offset(eb, start, pos, &i, &offset);
+	page = eb->pages[i];
+	WARN_ON(!PageUptodate(page));
+	kaddr = page_address(page);
+
+	while (len >= bits_to_clear) {
+		kaddr[offset] &= ~mask_to_clear;
+		len -= bits_to_clear;
+		bits_to_clear = BITS_PER_BYTE;
+		mask_to_clear = ~0U;
+		if (++offset >= PAGE_CACHE_SIZE && len > 0) {
+			offset = 0;
+			page = eb->pages[++i];
+			WARN_ON(!PageUptodate(page));
+			kaddr = page_address(page);
+		}
+	}
+	if (len) {
+		mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
+		kaddr[offset] &= ~mask_to_clear;
+	}
+}
+
 static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
 {
 	unsigned long distance = (src > dst) ? src - dst : dst - src;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c668f36898d3..9185a20081d7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -309,6 +309,12 @@ void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
 			   unsigned long src_offset, unsigned long len);
 void memset_extent_buffer(struct extent_buffer *eb, char c,
 			  unsigned long start, unsigned long len);
+int extent_buffer_test_bit(struct extent_buffer *eb, unsigned long start,
+			   unsigned long pos);
+void extent_buffer_bitmap_set(struct extent_buffer *eb, unsigned long start,
+			      unsigned long pos, unsigned long len);
+void extent_buffer_bitmap_clear(struct extent_buffer *eb, unsigned long start,
+				unsigned long pos, unsigned long len);
 void clear_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_dirty(struct extent_buffer *eb);
 int set_extent_buffer_uptodate(struct extent_buffer *eb);
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 2/9] Btrfs: add extent buffer bitmap sanity tests
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 1/9] Btrfs: add extent buffer bitmap operations Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 3/9] Btrfs: add helpers for read-only compat bits Omar Sandoval
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent_io.c             |  34 ++++++----
 fs/btrfs/extent_io.h             |   4 +-
 fs/btrfs/tests/extent-io-tests.c | 138 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 160 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index eae9175ff62b..f875e29e10e1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4678,24 +4678,14 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
 	return new;
 }
 
-struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-						u64 start)
+struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
+						  u64 start, unsigned long len)
 {
 	struct extent_buffer *eb;
-	unsigned long len;
 	unsigned long num_pages;
 	unsigned long i;
 
-	if (!fs_info) {
-		/*
-		 * Called only from tests that don't always have a fs_info
-		 * available, but we know that nodesize is 4096
-		 */
-		len = 4096;
-	} else {
-		len = fs_info->tree_root->nodesize;
-	}
-	num_pages = num_extent_pages(0, len);
+	num_pages = num_extent_pages(start, len);
 
 	eb = __alloc_extent_buffer(fs_info, start, len);
 	if (!eb)
@@ -4718,6 +4708,24 @@ err:
 	return NULL;
 }
 
+struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
+						u64 start)
+{
+	unsigned long len;
+
+	if (!fs_info) {
+		/*
+		 * Called only from tests that don't always have a fs_info
+		 * available, but we know that nodesize is 4096
+		 */
+		len = 4096;
+	} else {
+		len = fs_info->tree_root->nodesize;
+	}
+
+	return __alloc_dummy_extent_buffer(fs_info, start, len);
+}
+
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
 	int refs;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 9185a20081d7..9f8d7d1a7015 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -263,8 +263,10 @@ void set_page_extent_mapped(struct page *page);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 					  u64 start);
+struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
+						  u64 start, unsigned long len);
 struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
-		u64 start);
+						u64 start);
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src);
 struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 					 u64 start);
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index 9e9f2368177d..71ab575e7633 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -18,6 +18,7 @@
 
 #include <linux/pagemap.h>
 #include <linux/sched.h>
+#include <linux/slab.h>
 #include "btrfs-tests.h"
 #include "../extent_io.h"
 
@@ -76,6 +77,8 @@ static int test_find_delalloc(void)
 	u64 found;
 	int ret = -EINVAL;
 
+	test_msg("Running find delalloc tests\n");
+
 	inode = btrfs_new_test_inode();
 	if (!inode) {
 		test_msg("Failed to allocate test inode\n");
@@ -268,8 +271,139 @@ out:
 	return ret;
 }
 
+static int __test_eb_bitmaps(unsigned long *bitmap, struct extent_buffer *eb,
+			     unsigned long len)
+{
+	unsigned long i, x;
+
+	memset(bitmap, 0, len);
+	memset_extent_buffer(eb, 0, 0, len);
+	if (memcmp_extent_buffer(eb, bitmap, 0, len) != 0) {
+		test_msg("Bitmap was not zeroed\n");
+		return -EINVAL;
+	}
+
+	bitmap_set(bitmap, 0, len * BITS_PER_BYTE);
+	extent_buffer_bitmap_set(eb, 0, 0, len * BITS_PER_BYTE);
+	if (memcmp_extent_buffer(eb, bitmap, 0, len) != 0) {
+		test_msg("Setting all bits failed\n");
+		return -EINVAL;
+	}
+
+	bitmap_clear(bitmap, 0, len * BITS_PER_BYTE);
+	extent_buffer_bitmap_clear(eb, 0, 0, len * BITS_PER_BYTE);
+	if (memcmp_extent_buffer(eb, bitmap, 0, len) != 0) {
+		test_msg("Clearing all bits failed\n");
+		return -EINVAL;
+	}
+
+	bitmap_set(bitmap, (PAGE_CACHE_SIZE - sizeof(long) / 2) * BITS_PER_BYTE,
+		   sizeof(long) * BITS_PER_BYTE);
+	extent_buffer_bitmap_set(eb, PAGE_CACHE_SIZE - sizeof(long) / 2, 0,
+				 sizeof(long) * BITS_PER_BYTE);
+	if (memcmp_extent_buffer(eb, bitmap, 0, len) != 0) {
+		test_msg("Setting straddling pages failed\n");
+		return -EINVAL;
+	}
+
+	bitmap_set(bitmap, 0, len * BITS_PER_BYTE);
+	bitmap_clear(bitmap,
+		     (PAGE_CACHE_SIZE - sizeof(long) / 2) * BITS_PER_BYTE,
+		     sizeof(long) * BITS_PER_BYTE);
+	extent_buffer_bitmap_set(eb, 0, 0, len * BITS_PER_BYTE);
+	extent_buffer_bitmap_clear(eb, PAGE_CACHE_SIZE - sizeof(long) / 2, 0,
+				   sizeof(long) * BITS_PER_BYTE);
+	if (memcmp_extent_buffer(eb, bitmap, 0, len) != 0) {
+		test_msg("Clearing straddling pages failed\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Generate a wonky pseudo-random bit pattern for the sake of not using
+	 * something repetitive that could miss some hypothetical off-by-n bug.
+	 */
+	x = 0;
+	for (i = 0; i < len / sizeof(long); i++) {
+		x = (0x19660dULL * (u64)x + 0x3c6ef35fULL) & 0xffffffffUL;
+		bitmap[i] = x;
+	}
+	write_extent_buffer(eb, bitmap, 0, len);
+
+	for (i = 0; i < len * BITS_PER_BYTE; i++) {
+		int bit, bit1;
+
+		bit = !!test_bit(i, bitmap);
+		bit1 = !!extent_buffer_test_bit(eb, 0, i);
+		if (bit1 != bit) {
+			test_msg("Testing bit pattern failed\n");
+			return -EINVAL;
+		}
+
+		bit1 = !!extent_buffer_test_bit(eb, i / BITS_PER_BYTE,
+						i % BITS_PER_BYTE);
+		if (bit1 != bit) {
+			test_msg("Testing bit pattern with offset failed\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int test_eb_bitmaps(void)
+{
+	unsigned long len = PAGE_CACHE_SIZE * 4;
+	unsigned long *bitmap;
+	struct extent_buffer *eb;
+	int ret;
+
+	test_msg("Running extent buffer bitmap tests\n");
+
+	bitmap = kmalloc(len, GFP_NOFS);
+	if (!bitmap) {
+		test_msg("Couldn't allocate test bitmap\n");
+		return -ENOMEM;
+	}
+
+	eb = __alloc_dummy_extent_buffer(NULL, 0, len);
+	if (!eb) {
+		test_msg("Couldn't allocate test extent buffer\n");
+		kfree(bitmap);
+		return -ENOMEM;
+	}
+
+	ret = __test_eb_bitmaps(bitmap, eb, len);
+	if (ret)
+		goto out;
+
+	/* Do it over again with an extent buffer which isn't page-aligned. */
+	free_extent_buffer(eb);
+	eb = __alloc_dummy_extent_buffer(NULL, PAGE_CACHE_SIZE / 2, len);
+	if (!eb) {
+		test_msg("Couldn't allocate test extent buffer\n");
+		kfree(bitmap);
+		return -ENOMEM;
+	}
+
+	ret = __test_eb_bitmaps(bitmap, eb, len);
+out:
+	free_extent_buffer(eb);
+	kfree(bitmap);
+	return ret;
+}
+
 int btrfs_test_extent_io(void)
 {
-	test_msg("Running find delalloc tests\n");
-	return test_find_delalloc();
+	int ret;
+
+	test_msg("Running extent I/O tests\n");
+
+	ret = test_find_delalloc();
+	if (ret)
+		goto out;
+
+	ret = test_eb_bitmaps();
+out:
+	test_msg("Extent I/O tests finished\n");
+	return ret;
 }
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 3/9] Btrfs: add helpers for read-only compat bits
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 1/9] Btrfs: add extent buffer bitmap operations Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 2/9] Btrfs: add extent buffer bitmap sanity tests Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 4/9] Btrfs: refactor caching_thread() Omar Sandoval
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

We're finally going to add one of these for the free space tree, so
let's add the same nice helpers that we have for the incompat bits.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e14188..10388ac041b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4113,6 +4113,40 @@ static inline int __btrfs_fs_incompat(struct btrfs_fs_info *fs_info, u64 flag)
 	return !!(btrfs_super_incompat_flags(disk_super) & flag);
 }
 
+#define btrfs_set_fs_compat_ro(__fs_info, opt) \
+	__btrfs_set_fs_compat_ro((__fs_info), BTRFS_FEATURE_COMPAT_RO_##opt)
+
+static inline void __btrfs_set_fs_compat_ro(struct btrfs_fs_info *fs_info,
+					    u64 flag)
+{
+	struct btrfs_super_block *disk_super;
+	u64 features;
+
+	disk_super = fs_info->super_copy;
+	features = btrfs_super_compat_ro_flags(disk_super);
+	if (!(features & flag)) {
+		spin_lock(&fs_info->super_lock);
+		features = btrfs_super_compat_ro_flags(disk_super);
+		if (!(features & flag)) {
+			features |= flag;
+			btrfs_set_super_compat_ro_flags(disk_super, features);
+			btrfs_info(fs_info, "setting %llu ro feature flag",
+				   flag);
+		}
+		spin_unlock(&fs_info->super_lock);
+	}
+}
+
+#define btrfs_fs_compat_ro(fs_info, opt) \
+	__btrfs_fs_compat_ro((fs_info), BTRFS_FEATURE_COMPAT_RO_##opt)
+
+static inline int __btrfs_fs_compat_ro(struct btrfs_fs_info *fs_info, u64 flag)
+{
+	struct btrfs_super_block *disk_super;
+	disk_super = fs_info->super_copy;
+	return !!(btrfs_super_compat_ro_flags(disk_super) & flag);
+}
+
 /*
  * Call btrfs_abort_transaction as early as possible when an error condition is
  * detected, that way the exact line number is reported.
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 4/9] Btrfs: refactor caching_thread()
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (2 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 3/9] Btrfs: add helpers for read-only compat bits Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 5/9] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

We're also going to load the free space tree from caching_thread(), so
we should refactor some of the common code.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h       |  3 +++
 fs/btrfs/extent-tree.c | 59 ++++++++++++++++++++++++++++----------------------
 2 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 10388ac041b6..147a4df46960 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1262,6 +1262,9 @@ struct btrfs_caching_control {
 	atomic_t count;
 };
 
+/* Once caching_thread() finds this much free space, it will wake up waiters. */
+#define CACHING_CTL_WAKE_UP (1024 * 1024 * 2)
+
 struct btrfs_io_ctl {
 	void *cur, *orig;
 	struct page *page;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07204bf601ed..20b04edf1079 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -375,11 +375,10 @@ static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
 	return total_added;
 }
 
-static noinline void caching_thread(struct btrfs_work *work)
+static int load_extent_tree_free(struct btrfs_caching_control *caching_ctl)
 {
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_fs_info *fs_info;
-	struct btrfs_caching_control *caching_ctl;
 	struct btrfs_root *extent_root;
 	struct btrfs_path *path;
 	struct extent_buffer *leaf;
@@ -387,16 +386,15 @@ static noinline void caching_thread(struct btrfs_work *work)
 	u64 total_found = 0;
 	u64 last = 0;
 	u32 nritems;
-	int ret = -ENOMEM;
+	int ret;
 
-	caching_ctl = container_of(work, struct btrfs_caching_control, work);
 	block_group = caching_ctl->block_group;
 	fs_info = block_group->fs_info;
 	extent_root = fs_info->extent_root;
 
 	path = btrfs_alloc_path();
 	if (!path)
-		goto out;
+		return -ENOMEM;
 
 	last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
 
@@ -413,15 +411,11 @@ static noinline void caching_thread(struct btrfs_work *work)
 	key.objectid = last;
 	key.offset = 0;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
-again:
-	mutex_lock(&caching_ctl->mutex);
-	/* need to make sure the commit_root doesn't disappear */
-	down_read(&fs_info->commit_root_sem);
 
 next:
 	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
 	if (ret < 0)
-		goto err;
+		goto out;
 
 	leaf = path->nodes[0];
 	nritems = btrfs_header_nritems(leaf);
@@ -446,12 +440,14 @@ next:
 				up_read(&fs_info->commit_root_sem);
 				mutex_unlock(&caching_ctl->mutex);
 				cond_resched();
-				goto again;
+				mutex_lock(&caching_ctl->mutex);
+				down_read(&fs_info->commit_root_sem);
+				goto next;
 			}
 
 			ret = btrfs_next_leaf(extent_root, path);
 			if (ret < 0)
-				goto err;
+				goto out;
 			if (ret)
 				break;
 			leaf = path->nodes[0];
@@ -489,7 +485,7 @@ next:
 			else
 				last = key.objectid + key.offset;
 
-			if (total_found > (1024 * 1024 * 2)) {
+			if (total_found > CACHING_CTL_WAKE_UP) {
 				total_found = 0;
 				wake_up(&caching_ctl->wait);
 			}
@@ -503,25 +499,36 @@ next:
 					  block_group->key.offset);
 	caching_ctl->progress = (u64)-1;
 
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
+static noinline void caching_thread(struct btrfs_work *work)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_caching_control *caching_ctl;
+	int ret;
+
+	caching_ctl = container_of(work, struct btrfs_caching_control, work);
+	block_group = caching_ctl->block_group;
+	fs_info = block_group->fs_info;
+
+	mutex_lock(&caching_ctl->mutex);
+	down_read(&fs_info->commit_root_sem);
+
+	ret = load_extent_tree_free(caching_ctl);
+
 	spin_lock(&block_group->lock);
 	block_group->caching_ctl = NULL;
-	block_group->cached = BTRFS_CACHE_FINISHED;
+	block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
 	spin_unlock(&block_group->lock);
 
-err:
-	btrfs_free_path(path);
 	up_read(&fs_info->commit_root_sem);
-
-	free_excluded_extents(extent_root, block_group);
-
+	free_excluded_extents(fs_info->extent_root, block_group);
 	mutex_unlock(&caching_ctl->mutex);
-out:
-	if (ret) {
-		spin_lock(&block_group->lock);
-		block_group->caching_ctl = NULL;
-		block_group->cached = BTRFS_CACHE_ERROR;
-		spin_unlock(&block_group->lock);
-	}
+
 	wake_up(&caching_ctl->wait);
 
 	put_caching_control(caching_ctl);
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 5/9] Btrfs: introduce the free space B-tree on-disk format
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (3 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 4/9] Btrfs: refactor caching_thread() Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 6/9] Btrfs: implement the free space B-tree Omar Sandoval
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

The on-disk format for the free space tree is straightforward. Each
block group is represented in the free space tree by a free space info
item that stores accounting information: whether the free space for this
block group is stored as bitmaps or extents and how many extents of free
space exist for this block group (regardless of which format is being
used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
with no corresponding item, while bitmaps use the FREE_SPACE_BITMAP key
type with an attached bitmap item, which is just an array of bytes.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h             | 38 ++++++++++++++++++++++++++++++++++++++
 include/trace/events/btrfs.h |  3 ++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 147a4df46960..e97a923e9d44 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -96,6 +96,9 @@ struct btrfs_ordered_sum;
 /* for storing items that use the BTRFS_UUID_KEY* types */
 #define BTRFS_UUID_TREE_OBJECTID 9ULL
 
+/* tracks free space in block groups. */
+#define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -500,6 +503,8 @@ struct btrfs_super_block {
  * Compat flags that we support.  If any incompat flags are set other than the
  * ones specified below then we will fail to mount
  */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
@@ -1061,6 +1066,13 @@ struct btrfs_block_group_item {
 	__le64 flags;
 } __attribute__ ((__packed__));
 
+struct btrfs_free_space_info {
+	__le32 extent_count;
+	__le32 flags;
+} __attribute__ ((__packed__));
+
+#define BTRFS_FREE_SPACE_USING_BITMAPS (1ULL << 0)
+
 #define BTRFS_QGROUP_LEVEL_SHIFT		48
 static inline u64 btrfs_qgroup_level(u64 qgroupid)
 {
@@ -2064,6 +2076,27 @@ struct btrfs_ioctl_defrag_range_args {
  */
 #define BTRFS_BLOCK_GROUP_ITEM_KEY 192
 
+/*
+ * Every block group is represented in the free space tree by a free space info
+ * item, which stores some accounting information. It is keyed on
+ * (block_group_start, FREE_SPACE_INFO, block_group_length).
+ */
+#define BTRFS_FREE_SPACE_INFO_KEY 198
+
+/*
+ * A free space extent tracks an extent of space that is free in a block group.
+ * It is keyed on (start, FREE_SPACE_EXTENT, length).
+ */
+#define BTRFS_FREE_SPACE_EXTENT_KEY 199
+
+/*
+ * When a block group becomes very fragmented, we convert it to use bitmaps
+ * instead of extents. A free space bitmap is keyed on
+ * (start, FREE_SPACE_BITMAP, length); the corresponding item is a bitmap with
+ * (length / sectorsize) bits.
+ */
+#define BTRFS_FREE_SPACE_BITMAP_KEY 200
+
 #define BTRFS_DEV_EXTENT_KEY	204
 #define BTRFS_DEV_ITEM_KEY	216
 #define BTRFS_CHUNK_ITEM_KEY	228
@@ -2464,6 +2497,11 @@ BTRFS_SETGET_FUNCS(disk_block_group_flags,
 BTRFS_SETGET_STACK_FUNCS(block_group_flags,
 			struct btrfs_block_group_item, flags, 64);
 
+/* struct btrfs_free_space_info */
+BTRFS_SETGET_FUNCS(free_space_extent_count, struct btrfs_free_space_info,
+		   extent_count, 32);
+BTRFS_SETGET_FUNCS(free_space_flags, struct btrfs_free_space_info, flags, 32);
+
 /* struct btrfs_inode_ref */
 BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
 BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9be12f..e6289e62a2a8 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -45,7 +45,8 @@ struct btrfs_qgroup_operation;
 		{ BTRFS_TREE_LOG_OBJECTID,	"TREE_LOG"	},	\
 		{ BTRFS_QUOTA_TREE_OBJECTID,	"QUOTA_TREE"	},	\
 		{ BTRFS_TREE_RELOC_OBJECTID,	"TREE_RELOC"	},	\
-		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_RELOC"	},	\
+		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_TREE"	},	\
+		{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },	\
 		{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" })
 
 #define show_root_type(obj)						\
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 6/9] Btrfs: implement the free space B-tree
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (4 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 5/9] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 7/9] Btrfs: add free space tree sanity tests Omar Sandoval
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

The free space cache has turned out to be a scalability bottleneck on
large, busy filesystems. When the cache for a lot of block groups needs
to be written out, we can get extremely long commit times; if this
happens in the critical section, things are especially bad because we
block new transactions from happening.

The main problem with the free space cache is that it has to be written
out in its entirety and is managed in an ad hoc fashion. Using a B-tree
to store free space fixes this: updates can be done as needed, and we
get all of the usual benefits of a B-tree, namely checksumming, RAID
handling, and well-understood behavior.

With the free space tree, commit times are about the same as in the no
cache case, and load times are slower than with the free space cache but
still much faster than with no cache at all. Free space is represented
with extents until it becomes more space-efficient to use bitmaps,
giving us space overhead similar to that of the free space cache.

The operations on the free space tree are: adding and removing free
space, handling the creation and deletion of block groups, and loading
the free space for a block group. We can also create the free space tree
by walking the extent tree.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/Makefile          |    2 +-
 fs/btrfs/ctree.h           |   25 +-
 fs/btrfs/extent-tree.c     |   15 +-
 fs/btrfs/free-space-tree.c | 1501 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-tree.h |   71 +++
 5 files changed, 1606 insertions(+), 8 deletions(-)
 create mode 100644 fs/btrfs/free-space-tree.c
 create mode 100644 fs/btrfs/free-space-tree.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6d1d0b93b1aa..766169709146 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o
+	   uuid-tree.o props.o hash.o free-space-tree.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e97a923e9d44..05420991e101 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1302,8 +1302,20 @@ struct btrfs_block_group_cache {
 	u64 delalloc_bytes;
 	u64 bytes_super;
 	u64 flags;
-	u64 sectorsize;
 	u64 cache_generation;
+	u32 sectorsize;
+
+	/*
+	 * If the free space extent count exceeds this number, convert the block
+	 * group to bitmaps.
+	 */
+	u32 bitmap_high_thresh;
+
+	/*
+	 * If the free space extent count drops below this number, convert the
+	 * block group back to extents.
+	 */
+	u32 bitmap_low_thresh;
 
 	/*
 	 * It is just used for the delayed data space allocation because
@@ -1359,6 +1371,9 @@ struct btrfs_block_group_cache {
 	struct list_head io_list;
 
 	struct btrfs_io_ctl io_ctl;
+
+	/* Lock for free space tree operations. */
+	struct mutex free_space_lock;
 };
 
 /* delayed seq elem */
@@ -1410,6 +1425,7 @@ struct btrfs_fs_info {
 	struct btrfs_root *csum_root;
 	struct btrfs_root *quota_root;
 	struct btrfs_root *uuid_root;
+	struct btrfs_root *free_space_root;
 
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
@@ -3559,6 +3575,13 @@ void btrfs_end_write_no_snapshoting(struct btrfs_root *root);
 void check_system_chunk(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root,
 			const u64 type);
+void free_excluded_extents(struct btrfs_root *root,
+			   struct btrfs_block_group_cache *cache);
+int exclude_super_stripes(struct btrfs_root *root,
+			  struct btrfs_block_group_cache *cache);
+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
+		       struct btrfs_fs_info *info, u64 start, u64 end);
+
 /* ctree.c */
 int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
 		     int level, int *slot);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 20b04edf1079..418c0eca9bb4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -237,8 +237,8 @@ static int add_excluded_extent(struct btrfs_root *root,
 	return 0;
 }
 
-static void free_excluded_extents(struct btrfs_root *root,
-				  struct btrfs_block_group_cache *cache)
+void free_excluded_extents(struct btrfs_root *root,
+			   struct btrfs_block_group_cache *cache)
 {
 	u64 start, end;
 
@@ -251,14 +251,16 @@ static void free_excluded_extents(struct btrfs_root *root,
 			  start, end, EXTENT_UPTODATE, GFP_NOFS);
 }
 
-static int exclude_super_stripes(struct btrfs_root *root,
-				 struct btrfs_block_group_cache *cache)
+int exclude_super_stripes(struct btrfs_root *root,
+			  struct btrfs_block_group_cache *cache)
 {
 	u64 bytenr;
 	u64 *logical;
 	int stripe_len;
 	int i, nr, ret;
 
+	cache->bytes_super = 0;
+
 	if (cache->key.objectid < BTRFS_SUPER_INFO_OFFSET) {
 		stripe_len = BTRFS_SUPER_INFO_OFFSET - cache->key.objectid;
 		cache->bytes_super += stripe_len;
@@ -337,8 +339,8 @@ static void put_caching_control(struct btrfs_caching_control *ctl)
  * we need to check the pinned_extents for any extents that can't be used yet
  * since their free space will be released as soon as the transaction commits.
  */
-static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
-			      struct btrfs_fs_info *info, u64 start, u64 end)
+u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
+		       struct btrfs_fs_info *info, u64 start, u64 end)
 {
 	u64 extent_start, extent_end, size, total_added = 0;
 	int ret;
@@ -9288,6 +9290,7 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	INIT_LIST_HEAD(&cache->io_list);
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
+	mutex_init(&cache->free_space_lock);
 
 	return cache;
 }
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
new file mode 100644
index 000000000000..2edfdcce3980
--- /dev/null
+++ b/fs/btrfs/free-space-tree.c
@@ -0,0 +1,1501 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/vmalloc.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "locking.h"
+#include "free-space-tree.h"
+#include "transaction.h"
+
+void set_free_space_tree_thresholds(struct btrfs_block_group_cache *cache)
+{
+	u32 bitmap_range;
+	size_t bitmap_size;
+	u64 num_bitmaps, total_bitmap_size;
+
+	/*
+	 * We convert to bitmaps when the disk space required for using extents
+	 * exceeds that required for using bitmaps.
+	 */
+	bitmap_range = cache->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
+	num_bitmaps = div_u64(cache->key.offset + bitmap_range - 1,
+			      bitmap_range);
+	bitmap_size = sizeof(struct btrfs_item) + BTRFS_FREE_SPACE_BITMAP_SIZE;
+	total_bitmap_size = num_bitmaps * bitmap_size;
+	cache->bitmap_high_thresh = div_u64(total_bitmap_size,
+					    sizeof(struct btrfs_item));
+
+	/*
+	 * We allow for a small buffer between the high threshold and low
+	 * threshold to avoid thrashing back and forth between the two formats.
+	 */
+	if (cache->bitmap_high_thresh > 100)
+		cache->bitmap_low_thresh = cache->bitmap_high_thresh - 100;
+	else
+		cache->bitmap_low_thresh = 0;
+}
+
+static int add_new_free_space_info(struct btrfs_trans_handle *trans,
+				   struct btrfs_fs_info *fs_info,
+				   struct btrfs_block_group_cache *block_group,
+				   struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key;
+	struct extent_buffer *leaf;
+	int ret;
+
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
+	key.offset = block_group->key.offset;
+
+	ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*info));
+	if (ret)
+		goto out;
+
+	leaf = path->nodes[0];
+	info = btrfs_item_ptr(leaf, path->slots[0],
+			      struct btrfs_free_space_info);
+	btrfs_set_free_space_extent_count(leaf, info, 0);
+	btrfs_set_free_space_flags(leaf, info, 0);
+	btrfs_mark_buffer_dirty(leaf);
+
+	ret = 0;
+out:
+	btrfs_release_path(path);
+	return ret;
+}
+
+struct btrfs_free_space_info *
+search_free_space_info(struct btrfs_trans_handle *trans,
+		       struct btrfs_fs_info *fs_info,
+		       struct btrfs_block_group_cache *block_group,
+		       struct btrfs_path *path, int cow)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_FREE_SPACE_INFO_KEY;
+	key.offset = block_group->key.offset;
+
+	ret = btrfs_search_slot(trans, root, &key, path, 0, cow);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (ret != 0) {
+		btrfs_warn(fs_info, "missing free space info for %llu\n",
+			   block_group->key.objectid);
+		ASSERT(0);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return btrfs_item_ptr(path->nodes[0], path->slots[0],
+			      struct btrfs_free_space_info);
+}
+
+/*
+ * btrfs_search_slot() but we're looking for the greatest key less than the
+ * passed key.
+ */
+static int btrfs_search_prev_slot(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct btrfs_key *key, struct btrfs_path *p,
+				  int ins_len, int cow)
+{
+	int ret;
+
+	ret = btrfs_search_slot(trans, root, key, p, ins_len, cow);
+	if (ret < 0)
+		return ret;
+
+	if (ret == 0) {
+		ASSERT(0);
+		return -EIO;
+	}
+
+	if (p->slots[0] == 0) {
+		ASSERT(0);
+		return -EIO;
+	}
+	p->slots[0]--;
+
+	return 0;
+}
+
+static inline u32 free_space_bitmap_size(u64 size, u32 sectorsize)
+{
+	return DIV_ROUND_UP((u32)div_u64(size, sectorsize), BITS_PER_BYTE);
+}
+
+static unsigned long *alloc_bitmap(u32 bitmap_size)
+{
+	return __vmalloc(bitmap_size, GFP_NOFS | __GFP_HIGHMEM | __GFP_ZERO,
+			 PAGE_KERNEL);
+}
+
+int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	unsigned long *bitmap;
+	char *bitmap_cursor;
+	u64 start, end;
+	u64 bitmap_range, i;
+	u32 bitmap_size, flags, expected_extent_count;
+	u32 extent_count = 0;
+	int done = 0, nr;
+	int ret;
+
+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
+					     block_group->sectorsize);
+	bitmap = alloc_bitmap(bitmap_size);
+	if (!bitmap) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY) {
+				u64 first, last;
+
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+
+				first = div_u64(found_key.objectid - start,
+						block_group->sectorsize);
+				last = div_u64(found_key.objectid + found_key.offset - start,
+					       block_group->sectorsize);
+				bitmap_set(bitmap, first, last - first);
+
+				extent_count++;
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	leaf = path->nodes[0];
+	flags = btrfs_free_space_flags(leaf, info);
+	flags |= BTRFS_FREE_SPACE_USING_BITMAPS;
+	btrfs_set_free_space_flags(leaf, info, flags);
+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
+	btrfs_mark_buffer_dirty(leaf);
+	btrfs_release_path(path);
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	bitmap_cursor = (char *)bitmap;
+	bitmap_range = block_group->sectorsize * BTRFS_FREE_SPACE_BITMAP_BITS;
+	i = start;
+	while (i < end) {
+		unsigned long ptr;
+		u64 extent_size;
+		u32 data_size;
+
+		extent_size = min(end - i, bitmap_range);
+		data_size = free_space_bitmap_size(extent_size,
+						   block_group->sectorsize);
+
+		key.objectid = i;
+		key.type = BTRFS_FREE_SPACE_BITMAP_KEY;
+		key.offset = extent_size;
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key,
+					      data_size);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+		write_extent_buffer(leaf, bitmap_cursor, ptr,
+				    data_size);
+		btrfs_mark_buffer_dirty(leaf);
+		btrfs_release_path(path);
+
+		i += extent_size;
+		bitmap_cursor += data_size;
+	}
+
+	ret = 0;
+out:
+	vfree(bitmap);
+	if (ret)
+		btrfs_abort_transaction(trans, root, ret);
+	return ret;
+}
+
+int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	unsigned long *bitmap;
+	u64 start, end;
+	/* Initialize to silence GCC. */
+	u64 extent_start = 0;
+	u64 offset;
+	u32 bitmap_size, flags, expected_extent_count;
+	int prev_bit = 0, bit, bitnr;
+	u32 extent_count = 0;
+	int done = 0, nr;
+	int ret;
+
+	bitmap_size = free_space_bitmap_size(block_group->key.offset,
+					     block_group->sectorsize);
+	bitmap = alloc_bitmap(bitmap_size);
+	if (!bitmap) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+				unsigned long ptr;
+				char *bitmap_cursor;
+				u32 bitmap_pos, data_size;
+
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+
+				bitmap_pos = div_u64(found_key.objectid - start,
+						     block_group->sectorsize *
+						     BITS_PER_BYTE);
+				bitmap_cursor = ((char *)bitmap) + bitmap_pos;
+				data_size = free_space_bitmap_size(found_key.offset,
+								   block_group->sectorsize);
+
+				ptr = btrfs_item_ptr_offset(leaf, path->slots[0] - 1);
+				read_extent_buffer(leaf, bitmap_cursor, ptr,
+						   data_size);
+
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	leaf = path->nodes[0];
+	flags = btrfs_free_space_flags(leaf, info);
+	flags &= ~BTRFS_FREE_SPACE_USING_BITMAPS;
+	btrfs_set_free_space_flags(leaf, info, flags);
+	expected_extent_count = btrfs_free_space_extent_count(leaf, info);
+	btrfs_mark_buffer_dirty(leaf);
+	btrfs_release_path(path);
+
+	offset = start;
+	bitnr = 0;
+	while (offset < end) {
+		bit = !!test_bit(bitnr, bitmap);
+		if (prev_bit == 0 && bit == 1) {
+			extent_start = offset;
+		} else if (prev_bit == 1 && bit == 0) {
+			key.objectid = extent_start;
+			key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+			key.offset = offset - extent_start;
+
+			ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+			if (ret)
+				goto out;
+			btrfs_release_path(path);
+
+			extent_count++;
+		}
+		prev_bit = bit;
+		offset += block_group->sectorsize;
+		bitnr++;
+	}
+	if (prev_bit == 1) {
+		key.objectid = extent_start;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = end - extent_start;
+
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	vfree(bitmap);
+	if (ret)
+		btrfs_abort_transaction(trans, root, ret);
+	return ret;
+}
+
+static int update_free_space_extent_count(struct btrfs_trans_handle *trans,
+					  struct btrfs_fs_info *fs_info,
+					  struct btrfs_block_group_cache *block_group,
+					  struct btrfs_path *path,
+					  int new_extents)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	u32 extent_count;
+	int ret = 0;
+
+	if (new_extents == 0)
+		return 0;
+
+	info = search_free_space_info(trans, fs_info, block_group, path, 1);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+
+	extent_count += new_extents;
+	btrfs_set_free_space_extent_count(path->nodes[0], info, extent_count);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+	btrfs_release_path(path);
+
+	if (!(flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
+	    extent_count > block_group->bitmap_high_thresh) {
+		ret = convert_free_space_to_bitmaps(trans, fs_info, block_group,
+						    path);
+	} else if ((flags & BTRFS_FREE_SPACE_USING_BITMAPS) &&
+		   extent_count < block_group->bitmap_low_thresh) {
+		ret = convert_free_space_to_extents(trans, fs_info, block_group,
+						    path);
+	}
+
+out:
+	return ret;
+}
+
+int free_space_test_bit(struct btrfs_block_group_cache *block_group,
+			struct btrfs_path *path, u64 offset)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 found_start, found_end;
+	unsigned long ptr, i;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(offset >= found_start && offset < found_end);
+
+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	i = div_u64(offset - found_start, block_group->sectorsize);
+	return !!extent_buffer_test_bit(leaf, ptr, i);
+}
+
+static void free_space_set_bits(struct btrfs_block_group_cache *block_group,
+				struct btrfs_path *path, u64 *start, u64 *size,
+				int bit)
+{
+	struct extent_buffer *leaf;
+	struct btrfs_key key;
+	u64 end = *start + *size;
+	u64 found_start, found_end;
+	unsigned long ptr, first, last;
+
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(*start >= found_start && *start < found_end);
+	ASSERT(end > found_start);
+
+	if (end > found_end)
+		end = found_end;
+
+	ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+	first = div_u64(*start - found_start, block_group->sectorsize);
+	last = div_u64(end - found_start, block_group->sectorsize);
+	if (bit)
+		extent_buffer_bitmap_set(leaf, ptr, first, last - first);
+	else
+		extent_buffer_bitmap_clear(leaf, ptr, first, last - first);
+	btrfs_mark_buffer_dirty(leaf);
+
+	*size -= end - *start;
+	*start = end;
+}
+
+/*
+ * We can't use btrfs_next_item() in modify_free_space_bitmap() because
+ * btrfs_next_leaf() doesn't get the path for writing. We can forgo the fancy
+ * tree walking in btrfs_next_leaf() anyway because we know exactly what we're
+ * looking for.
+ */
+static int free_space_next_bitmap(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root, struct btrfs_path *p)
+{
+	struct btrfs_key key;
+
+	if (p->slots[0] + 1 < btrfs_header_nritems(p->nodes[0])) {
+		p->slots[0]++;
+		return 0;
+	}
+
+	btrfs_item_key_to_cpu(p->nodes[0], &key, p->slots[0]);
+	btrfs_release_path(p);
+
+	key.objectid += key.offset;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	return btrfs_search_prev_slot(trans, root, &key, p, 0, 1);
+}
+
+/*
+ * If remove is 1, then we are removing free space, thus clearing bits in the
+ * bitmap. If remove is 0, then we are adding free space, thus setting bits in
+ * the bitmap.
+ */
+static int modify_free_space_bitmap(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group,
+				    struct btrfs_path *path,
+				    u64 start, u64 size, int remove)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	u64 end = start + size;
+	u64 cur_start, cur_size;
+	int prev_bit, next_bit;
+	int new_extents;
+	int ret;
+
+	/*
+	 * Read the bit for the block immediately before the extent of space if
+	 * that block is within the block group.
+	 */
+	if (start > block_group->key.objectid) {
+		u64 prev_block = start - block_group->sectorsize;
+
+		key.objectid = prev_block;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
+		if (ret)
+			goto out;
+
+		prev_bit = free_space_test_bit(block_group, path, prev_block);
+
+		/* The previous block may have been in the previous bitmap. */
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (start >= key.objectid + key.offset) {
+			ret = free_space_next_bitmap(trans, root, path);
+			if (ret)
+				goto out;
+		}
+	} else {
+		key.objectid = start;
+		key.type = (u8)-1;
+		key.offset = (u64)-1;
+
+		ret = btrfs_search_prev_slot(trans, root, &key, path, 0, 1);
+		if (ret)
+			goto out;
+
+		prev_bit = -1;
+	}
+
+	/*
+	 * Iterate over all of the bitmaps overlapped by the extent of space,
+	 * clearing/setting bits as required.
+	 */
+	cur_start = start;
+	cur_size = size;
+	while (1) {
+		free_space_set_bits(block_group, path, &cur_start, &cur_size,
+				    !remove);
+		if (cur_size == 0)
+			break;
+		ret = free_space_next_bitmap(trans, root, path);
+		if (ret)
+			goto out;
+	}
+
+	/*
+	 * Read the bit for the block immediately after the extent of space if
+	 * that block is within the block group.
+	 */
+	if (end < block_group->key.objectid + block_group->key.offset) {
+		/* The next block may be in the next bitmap. */
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (end >= key.objectid + key.offset) {
+			ret = free_space_next_bitmap(trans, root, path);
+			if (ret)
+				goto out;
+		}
+
+		next_bit = free_space_test_bit(block_group, path, end);
+	} else {
+		next_bit = -1;
+	}
+
+	if (remove) {
+		new_extents = -1;
+		if (prev_bit == 1) {
+			/* Leftover on the left. */
+			new_extents++;
+		}
+		if (next_bit == 1) {
+			/* Leftover on the right. */
+			new_extents++;
+		}
+	} else {
+		new_extents = 1;
+		if (prev_bit == 1) {
+			/* Merging with neighbor on the left. */
+			new_extents--;
+		}
+		if (next_bit == 1) {
+			/* Merging with neighbor on the right. */
+			new_extents--;
+		}
+	}
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+
+out:
+	return ret;
+}
+
+static int remove_free_space_extent(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group,
+				    struct btrfs_path *path,
+				    u64 start, u64 size)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key;
+	u64 found_start, found_end;
+	u64 end = start + size;
+	int new_extents = -1;
+	int ret;
+
+	key.objectid = start;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(start >= found_start && end <= found_end);
+
+	/*
+	 * Okay, now that we've found the free space extent which contains the
+	 * free space that we are removing, there are four cases:
+	 *
+	 * 1. We're using the whole extent: delete the key we found and
+	 * decrement the free space extent count.
+	 * 2. We are using part of the extent starting at the beginning: delete
+	 * the key we found and insert a new key representing the leftover at
+	 * the end. There is no net change in the number of extents.
+	 * 3. We are using part of the extent ending at the end: delete the key
+	 * we found and insert a new key representing the leftover at the
+	 * beginning. There is no net change in the number of extents.
+	 * 4. We are using part of the extent in the middle: delete the key we
+	 * found and insert two new keys representing the leftovers on each
+	 * side. Where we used to have one extent, we now have two, so increment
+	 * the extent count. We may need to convert the block group to bitmaps
+	 * as a result.
+	 */
+
+	/* Delete the existing key (cases 1-4). */
+	ret = btrfs_del_item(trans, root, path);
+	if (ret)
+		goto out;
+
+	/* Add a key for leftovers at the beginning (cases 3 and 4). */
+	if (start > found_start) {
+		key.objectid = found_start;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = start - found_start;
+
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		new_extents++;
+	}
+
+	/* Add a key for leftovers at the end (cases 2 and 4). */
+	if (end < found_end) {
+		key.objectid = end;
+		key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+		key.offset = found_end - end;
+
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+		if (ret)
+			goto out;
+		new_extents++;
+	}
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+
+out:
+	return ret;
+}
+
+int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path, u64 start, u64 size)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	int ret;
+
+	mutex_lock(&block_group->free_space_lock);
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	btrfs_release_path(path);
+
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
+					       path, start, size, 1);
+	} else {
+		ret = remove_free_space_extent(trans, fs_info, block_group,
+					       path, start, size);
+	}
+
+out:
+	mutex_unlock(&block_group->free_space_lock);
+	return ret;
+}
+
+int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				struct btrfs_fs_info *fs_info,
+				u64 start, u64 size)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_path *path;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	block_group = btrfs_lookup_block_group(fs_info, start);
+	if (!block_group) {
+		ASSERT(0);
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = __remove_from_free_space_tree(trans, fs_info, block_group, path,
+					    start, size);
+
+	btrfs_put_block_group(block_group);
+out:
+	btrfs_free_path(path);
+	if (ret)
+		btrfs_abort_transaction(trans, fs_info->free_space_root, ret);
+	return ret;
+}
+
+static int add_free_space_extent(struct btrfs_trans_handle *trans,
+				 struct btrfs_fs_info *fs_info,
+				 struct btrfs_block_group_cache *block_group,
+				 struct btrfs_path *path,
+				 u64 start, u64 size)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_key key, new_key;
+	u64 found_start, found_end;
+	u64 end = start + size;
+	int new_extents = 1;
+	int ret;
+
+	/*
+	 * We are adding a new extent of free space, but we need to merge
+	 * extents. There are four cases here:
+	 *
+	 * 1. The new extent does not have any immediate neighbors to merge
+	 * with: add the new key and increment the free space extent count. We
+	 * may need to convert the block group to bitmaps as a result.
+	 * 2. The new extent has an immediate neighbor before it: remove the
+	 * previous key and insert a new key combining both of them. There is no
+	 * net change in the number of extents.
+	 * 3. The new extent has an immediate neighbor after it: remove the next
+	 * key and insert a new key combining both of them. There is no net
+	 * change in the number of extents.
+	 * 4. The new extent has immediate neighbors on both sides: remove both
+	 * of the keys and insert a new key combining all of them. Where we used
+	 * to have two extents, we now have one, so decrement the extent count.
+	 */
+
+	new_key.objectid = start;
+	new_key.type = BTRFS_FREE_SPACE_EXTENT_KEY;
+	new_key.offset = size;
+
+	/* Search for a neighbor on the left. */
+	if (start == block_group->key.objectid)
+		goto right;
+	key.objectid = start - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
+		btrfs_release_path(path);
+		goto right;
+	}
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(found_start >= block_group->key.objectid &&
+	       found_end > block_group->key.objectid);
+	ASSERT(found_start < start && found_end <= start);
+
+	/*
+	 * Delete the neighbor on the left and absorb it into the new key (cases
+	 * 2 and 4).
+	 */
+	if (found_end == start) {
+		ret = btrfs_del_item(trans, root, path);
+		if (ret)
+			goto out;
+		new_key.objectid = found_start;
+		new_key.offset += key.offset;
+		new_extents--;
+	}
+	btrfs_release_path(path);
+
+right:
+	/* Search for a neighbor on the right. */
+	if (end == block_group->key.objectid + block_group->key.offset)
+		goto insert;
+	key.objectid = end;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+	if (ret)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+	if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY) {
+		ASSERT(key.type == BTRFS_FREE_SPACE_INFO_KEY);
+		btrfs_release_path(path);
+		goto insert;
+	}
+
+	found_start = key.objectid;
+	found_end = key.objectid + key.offset;
+	ASSERT(found_start >= block_group->key.objectid &&
+	       found_end > block_group->key.objectid);
+	ASSERT((found_start < start && found_end <= start) ||
+	       (found_start >= end && found_end > end));
+
+	/*
+	 * Delete the neighbor on the right and absorb it into the new key
+	 * (cases 3 and 4).
+	 */
+	if (found_start == end) {
+		ret = btrfs_del_item(trans, root, path);
+		if (ret)
+			goto out;
+		new_key.offset += key.offset;
+		new_extents--;
+	}
+	btrfs_release_path(path);
+
+insert:
+	/* Insert the new key (cases 1-4). */
+	ret = btrfs_insert_empty_item(trans, root, path, &new_key, 0);
+	if (ret)
+		goto out;
+
+	btrfs_release_path(path);
+	ret = update_free_space_extent_count(trans, fs_info, block_group, path,
+					     new_extents);
+
+out:
+	return ret;
+}
+
+int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			     struct btrfs_fs_info *fs_info,
+			     struct btrfs_block_group_cache *block_group,
+			     struct btrfs_path *path, u64 start, u64 size)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	int ret;
+
+	mutex_lock(&block_group->free_space_lock);
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	btrfs_release_path(path);
+
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = modify_free_space_bitmap(trans, fs_info, block_group,
+					       path, start, size, 0);
+	} else {
+		ret = add_free_space_extent(trans, fs_info, block_group, path,
+					    start, size);
+	}
+
+out:
+	mutex_unlock(&block_group->free_space_lock);
+	return ret;
+}
+
+int add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   u64 start, u64 size)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_path *path;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	block_group = btrfs_lookup_block_group(fs_info, start);
+	if (!block_group) {
+		ASSERT(0);
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, block_group, path, start,
+				       size);
+
+	btrfs_put_block_group(block_group);
+out:
+	btrfs_free_path(path);
+	if (ret)
+		btrfs_abort_transaction(trans, fs_info->free_space_root, ret);
+	return ret;
+}
+
+static int add_new_free_space_extent(struct btrfs_trans_handle *trans,
+				     struct btrfs_fs_info *fs_info,
+				     struct btrfs_block_group_cache *block_group,
+				     struct btrfs_path *path,
+				     u64 start, u64 end)
+{
+	u64 extent_start, extent_end;
+	int ret;
+
+	while (start < end) {
+		ret = find_first_extent_bit(fs_info->pinned_extents, start,
+					    &extent_start, &extent_end,
+					    EXTENT_DIRTY | EXTENT_UPTODATE,
+					    NULL);
+		if (ret)
+			break;
+
+		if (extent_start <= start) {
+			start = extent_end + 1;
+		} else if (extent_start > start && extent_start < end) {
+			ret = __add_to_free_space_tree(trans, fs_info,
+						       block_group, path, start,
+						       extent_start - start);
+			btrfs_release_path(path);
+			if (ret)
+				return ret;
+			start = extent_end + 1;
+		} else {
+			break;
+		}
+	}
+	if (start < end) {
+		ret = __add_to_free_space_tree(trans, fs_info, block_group,
+					       path, start, end - start);
+		btrfs_release_path(path);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/*
+ * Populate the free space tree by walking the extent tree, avoiding the super
+ * block mirrors. Operations on the extent tree that happen as a result of
+ * writes to the free space tree will go through the normal add/remove hooks.
+ */
+static int populate_free_space_tree(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_root *extent_root = fs_info->extent_root;
+	struct btrfs_path *path, *path2;
+	struct btrfs_key key;
+	u64 start, end;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	path->reada = 1;
+
+	path2 = btrfs_alloc_path();
+	if (!path2) {
+		btrfs_free_path(path);
+		return -ENOMEM;
+	}
+
+	ret = add_new_free_space_info(trans, fs_info, block_group, path2);
+	if (ret)
+		goto out;
+
+	ret = exclude_super_stripes(extent_root, block_group);
+	if (ret)
+		goto out;
+
+	/*
+	 * Iterate through all of the extent and metadata items in this block
+	 * group, adding the free space between them and the free space at the
+	 * end. Note that EXTENT_ITEM and METADATA_ITEM are less than
+	 * BLOCK_GROUP_ITEM, so an extent may precede the block group that it's
+	 * contained in.
+	 */
+	key.objectid = block_group->key.objectid;
+	key.type = BTRFS_EXTENT_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot_for_read(extent_root, &key, path, 1, 0);
+	if (ret < 0)
+		goto out;
+	ASSERT(ret == 0);
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+	while (1) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_EXTENT_ITEM_KEY ||
+		    key.type == BTRFS_METADATA_ITEM_KEY) {
+			if (key.objectid >= end)
+				break;
+
+			ret = add_new_free_space_extent(trans, fs_info,
+							block_group, path2,
+							start, key.objectid);
+			start = key.objectid;
+			if (key.type == BTRFS_METADATA_ITEM_KEY)
+				start += fs_info->tree_root->nodesize;
+			else
+				start += key.offset;
+		} else if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+			if (key.objectid != block_group->key.objectid)
+				break;
+		}
+
+		ret = btrfs_next_item(extent_root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+	}
+	ret = add_new_free_space_extent(trans, fs_info, block_group, path2,
+					start, end);
+
+out:
+	free_excluded_extents(extent_root, block_group);
+	btrfs_free_path(path2);
+	btrfs_free_path(path);
+	return ret;
+}
+
+int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_root *free_space_root;
+	struct btrfs_block_group_cache *block_group;
+	struct rb_node *node;
+	int ret;
+
+	trans = btrfs_start_transaction(tree_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	free_space_root = btrfs_create_tree(trans, fs_info,
+					    BTRFS_FREE_SPACE_TREE_OBJECTID);
+	if (IS_ERR(free_space_root)) {
+		ret = PTR_ERR(free_space_root);
+		btrfs_abort_transaction(trans, tree_root, ret);
+		return ret;
+	}
+	fs_info->free_space_root = free_space_root;
+
+	node = rb_first(&fs_info->block_group_cache_tree);
+	while (node) {
+		block_group = rb_entry(node, struct btrfs_block_group_cache,
+				       cache_node);
+		ret = populate_free_space_tree(trans, fs_info, block_group);
+		if (ret) {
+			btrfs_abort_transaction(trans, tree_root, ret);
+			return ret;
+		}
+		node = rb_next(node);
+	}
+
+	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
+
+	ret = btrfs_commit_transaction(trans, tree_root);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+int add_block_group_free_space(struct btrfs_trans_handle *trans,
+			       struct btrfs_fs_info *fs_info,
+			       struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_path *path;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = add_new_free_space_info(trans, fs_info, block_group, path);
+	if (ret)
+		goto out;
+
+	ret = add_new_free_space_extent(trans, fs_info, block_group, path,
+					block_group->key.objectid,
+					block_group->key.objectid +
+					block_group->key.offset);
+
+out:
+	btrfs_free_path(path);
+	if (ret)
+		btrfs_abort_transaction(trans, fs_info->free_space_root, ret);
+	return ret;
+}
+
+int remove_block_group_free_space(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group)
+{
+	struct btrfs_root *root = fs_info->free_space_root;
+	struct btrfs_path *path;
+	struct btrfs_key key, found_key;
+	struct extent_buffer *leaf;
+	u64 start, end;
+	int done = 0, nr;
+	int ret;
+
+	if (!btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	start = block_group->key.objectid;
+	end = block_group->key.objectid + block_group->key.offset;
+
+	key.objectid = end - 1;
+	key.type = (u8)-1;
+	key.offset = (u64)-1;
+
+	while (!done) {
+		ret = btrfs_search_prev_slot(trans, root, &key, path, -1, 1);
+		if (ret)
+			goto out;
+
+		leaf = path->nodes[0];
+		nr = 0;
+		path->slots[0]++;
+		while (path->slots[0] > 0) {
+			btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0] - 1);
+
+			if (found_key.type == BTRFS_FREE_SPACE_INFO_KEY) {
+				ASSERT(found_key.objectid == block_group->key.objectid);
+				ASSERT(found_key.offset == block_group->key.offset);
+				done = 1;
+				nr++;
+				path->slots[0]--;
+				break;
+			} else if (found_key.type == BTRFS_FREE_SPACE_EXTENT_KEY ||
+				   found_key.type == BTRFS_FREE_SPACE_BITMAP_KEY) {
+				ASSERT(found_key.objectid >= start);
+				ASSERT(found_key.objectid < end);
+				ASSERT(found_key.objectid + found_key.offset <= end);
+				nr++;
+				path->slots[0]--;
+			} else {
+				ASSERT(0);
+			}
+		}
+
+		ret = btrfs_del_items(trans, root, path, path->slots[0], nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	if (ret)
+		btrfs_abort_transaction(trans, root, ret);
+	return ret;
+}
+
+static int load_free_space_bitmaps(struct btrfs_caching_control *caching_ctl,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_key key;
+	int prev_bit = 0, bit;
+	/* Initialize to silence GCC. */
+	u64 extent_start = 0;
+	u64 end, offset;
+	u64 total_found = 0;
+	u32 extent_count = 0;
+	int ret;
+
+	block_group = caching_ctl->block_group;
+	fs_info = block_group->fs_info;
+	root = fs_info->free_space_root;
+
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		ASSERT(key.type == BTRFS_FREE_SPACE_BITMAP_KEY);
+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
+
+		caching_ctl->progress = key.objectid;
+
+		offset = key.objectid;
+		while (offset < key.objectid + key.offset) {
+			bit = free_space_test_bit(block_group, path, offset);
+			if (prev_bit == 0 && bit == 1) {
+				extent_start = offset;
+			} else if (prev_bit == 1 && bit == 0) {
+				total_found += add_new_free_space(block_group,
+								  fs_info,
+								  extent_start,
+								  offset);
+				if (total_found > CACHING_CTL_WAKE_UP) {
+					total_found = 0;
+					wake_up(&caching_ctl->wait);
+				}
+				extent_count++;
+			}
+			prev_bit = bit;
+			offset += block_group->sectorsize;
+		}
+	}
+	if (prev_bit == 1) {
+		total_found += add_new_free_space(block_group, fs_info,
+						  extent_start, end);
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	caching_ctl->progress = (u64)-1;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static int load_free_space_extents(struct btrfs_caching_control *caching_ctl,
+				   struct btrfs_path *path,
+				   u32 expected_extent_count)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *root;
+	struct btrfs_key key;
+	u64 end;
+	u64 total_found = 0;
+	u32 extent_count = 0;
+	int ret;
+
+	block_group = caching_ctl->block_group;
+	fs_info = block_group->fs_info;
+	root = fs_info->free_space_root;
+
+	end = block_group->key.objectid + block_group->key.offset;
+
+	while (1) {
+		ret = btrfs_next_item(root, path);
+		if (ret < 0)
+			goto out;
+		if (ret)
+			break;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+		if (key.type == BTRFS_FREE_SPACE_INFO_KEY)
+			break;
+
+		ASSERT(key.type == BTRFS_FREE_SPACE_EXTENT_KEY);
+		ASSERT(key.objectid < end && key.objectid + key.offset <= end);
+
+		caching_ctl->progress = key.objectid;
+
+		total_found += add_new_free_space(block_group, fs_info,
+						  key.objectid,
+						  key.objectid + key.offset);
+		if (total_found > CACHING_CTL_WAKE_UP) {
+			total_found = 0;
+			wake_up(&caching_ctl->wait);
+		}
+		extent_count++;
+	}
+
+	if (extent_count != expected_extent_count) {
+		btrfs_err(fs_info, "incorrect extent count for %llu; counted %u, expected %u",
+			  block_group->key.objectid, extent_count,
+			  expected_extent_count);
+		ASSERT(0);
+		ret = -EIO;
+		goto out;
+	}
+
+	caching_ctl->progress = (u64)-1;
+
+	ret = 0;
+out:
+	return ret;
+}
+
+int load_free_space_tree(struct btrfs_caching_control *caching_ctl)
+{
+	struct btrfs_block_group_cache *block_group;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_free_space_info *info;
+	struct btrfs_path *path;
+	u32 extent_count, flags;
+	int ret;
+
+	block_group = caching_ctl->block_group;
+	fs_info = block_group->fs_info;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Just like caching_thread() doesn't want to deadlock on the extent
+	 * tree, we don't want to deadlock on the free space tree.
+	 */
+	path->skip_locking = 1;
+	path->search_commit_root = 1;
+	path->reada = 1;
+
+	info = search_free_space_info(NULL, fs_info, block_group, path, 0);
+	if (IS_ERR(info)) {
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+
+	/*
+	 * We left path pointing to the free space info item, so now
+	 * load_free_space_foo can just iterate through the free space tree from
+	 * there.
+	 */
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS)
+		ret = load_free_space_bitmaps(caching_ctl, path, extent_count);
+	else
+		ret = load_free_space_extents(caching_ctl, path, extent_count);
+
+out:
+	btrfs_free_path(path);
+	return ret;
+}
diff --git a/fs/btrfs/free-space-tree.h b/fs/btrfs/free-space-tree.h
new file mode 100644
index 000000000000..4d354b6f463b
--- /dev/null
+++ b/fs/btrfs/free-space-tree.h
@@ -0,0 +1,71 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_FREE_SPACE_TREE
+#define __BTRFS_FREE_SPACE_TREE
+
+/*
+ * The default size for new free space bitmap items. The last bitmap in a block
+ * group may be truncated, and none of the free space tree code assumes that
+ * existing bitmaps are this size.
+ */
+#define BTRFS_FREE_SPACE_BITMAP_SIZE 256
+#define BTRFS_FREE_SPACE_BITMAP_BITS (BTRFS_FREE_SPACE_BITMAP_SIZE * BITS_PER_BYTE)
+
+void set_free_space_tree_thresholds(struct btrfs_block_group_cache *block_group);
+int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info);
+int load_free_space_tree(struct btrfs_caching_control *caching_ctl);
+int add_block_group_free_space(struct btrfs_trans_handle *trans,
+			       struct btrfs_fs_info *fs_info,
+			       struct btrfs_block_group_cache *block_group);
+int remove_block_group_free_space(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group);
+int add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   u64 start, u64 size);
+int remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				struct btrfs_fs_info *fs_info,
+				u64 start, u64 size);
+
+/* Exposed for testing. */
+struct btrfs_free_space_info *
+search_free_space_info(struct btrfs_trans_handle *trans,
+		       struct btrfs_fs_info *fs_info,
+		       struct btrfs_block_group_cache *block_group,
+		       struct btrfs_path *path, int cow);
+int __add_to_free_space_tree(struct btrfs_trans_handle *trans,
+			     struct btrfs_fs_info *fs_info,
+			     struct btrfs_block_group_cache *block_group,
+			     struct btrfs_path *path, u64 start, u64 size);
+int __remove_from_free_space_tree(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path, u64 start, u64 size);
+int convert_free_space_to_bitmaps(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path);
+int convert_free_space_to_extents(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *block_group,
+				  struct btrfs_path *path);
+int free_space_test_bit(struct btrfs_block_group_cache *block_group,
+			struct btrfs_path *path, u64 offset);
+
+#endif
-- 
2.5.1



* [PATCH v2 7/9] Btrfs: add free space tree sanity tests
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (5 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 6/9] Btrfs: implement the free space B-tree Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

This tests the operations on the free space tree, exercising all of the
main cases for both formats. Between this and xfstests, the free
space tree should have pretty good coverage.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/Makefile                      |   3 +-
 fs/btrfs/super.c                       |   3 +
 fs/btrfs/tests/btrfs-tests.c           |  52 +++
 fs/btrfs/tests/btrfs-tests.h           |  10 +
 fs/btrfs/tests/free-space-tests.c      |  35 +-
 fs/btrfs/tests/free-space-tree-tests.c | 570 +++++++++++++++++++++++++++++++++
 fs/btrfs/tests/qgroup-tests.c          |  20 +-
 7 files changed, 645 insertions(+), 48 deletions(-)
 create mode 100644 fs/btrfs/tests/free-space-tree-tests.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 766169709146..128ce17a80b0 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,4 +16,5 @@ btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
-	tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o
+	tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o \
+	tests/free-space-tree-tests.o
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index cd7ef34d2dce..b93f127c4bc8 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2203,6 +2203,9 @@ static int btrfs_run_sanity_tests(void)
 	if (ret)
 		goto out;
 	ret = btrfs_test_qgroups();
+	if (ret)
+		goto out;
+	ret = btrfs_test_free_space_tree();
 out:
 	btrfs_destroy_test_fs();
 	return ret;
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index 9626252ee6b4..ba28cefdf9e7 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -21,6 +21,9 @@
 #include <linux/magic.h>
 #include "btrfs-tests.h"
 #include "../ctree.h"
+#include "../free-space-cache.h"
+#include "../free-space-tree.h"
+#include "../transaction.h"
 #include "../volumes.h"
 #include "../disk-io.h"
 #include "../qgroup.h"
@@ -122,6 +125,9 @@ struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(void)
 	INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
 	INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
 	INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
+	extent_io_tree_init(&fs_info->freed_extents[0], NULL);
+	extent_io_tree_init(&fs_info->freed_extents[1], NULL);
+	fs_info->pinned_extents = &fs_info->freed_extents[0];
 	return fs_info;
 }
 
@@ -169,3 +175,49 @@ void btrfs_free_dummy_root(struct btrfs_root *root)
 	kfree(root);
 }
 
+struct btrfs_block_group_cache *
+btrfs_alloc_dummy_block_group(unsigned long length)
+{
+	struct btrfs_block_group_cache *cache;
+
+	cache = kzalloc(sizeof(*cache), GFP_NOFS);
+	if (!cache)
+		return NULL;
+	cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
+					GFP_NOFS);
+	if (!cache->free_space_ctl) {
+		kfree(cache);
+		return NULL;
+	}
+
+	cache->key.objectid = 0;
+	cache->key.offset = length;
+	cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+	cache->sectorsize = 4096;
+	cache->full_stripe_len = 4096;
+
+	INIT_LIST_HEAD(&cache->list);
+	INIT_LIST_HEAD(&cache->cluster_list);
+	INIT_LIST_HEAD(&cache->bg_list);
+	btrfs_init_free_space_ctl(cache);
+	mutex_init(&cache->free_space_lock);
+
+	return cache;
+}
+
+void btrfs_free_dummy_block_group(struct btrfs_block_group_cache *cache)
+{
+	if (!cache)
+		return;
+	__btrfs_remove_free_space_cache(cache->free_space_ctl);
+	kfree(cache->free_space_ctl);
+	kfree(cache);
+}
+
+void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans)
+{
+	memset(trans, 0, sizeof(*trans));
+	trans->transid = 1;
+	INIT_LIST_HEAD(&trans->qgroup_ref_list);
+	trans->type = __TRANS_DUMMY;
+}
diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
index fd3954224480..054b8c73c951 100644
--- a/fs/btrfs/tests/btrfs-tests.h
+++ b/fs/btrfs/tests/btrfs-tests.h
@@ -24,17 +24,23 @@
 #define test_msg(fmt, ...) pr_info("BTRFS: selftest: " fmt, ##__VA_ARGS__)
 
 struct btrfs_root;
+struct btrfs_trans_handle;
 
 int btrfs_test_free_space_cache(void);
 int btrfs_test_extent_buffer_operations(void);
 int btrfs_test_extent_io(void);
 int btrfs_test_inodes(void);
 int btrfs_test_qgroups(void);
+int btrfs_test_free_space_tree(void);
 int btrfs_init_test_fs(void);
 void btrfs_destroy_test_fs(void);
 struct inode *btrfs_new_test_inode(void);
 struct btrfs_fs_info *btrfs_alloc_dummy_fs_info(void);
 void btrfs_free_dummy_root(struct btrfs_root *root);
+struct btrfs_block_group_cache *
+btrfs_alloc_dummy_block_group(unsigned long length);
+void btrfs_free_dummy_block_group(struct btrfs_block_group_cache *cache);
+void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans);
 #else
 static inline int btrfs_test_free_space_cache(void)
 {
@@ -63,6 +69,10 @@ static inline int btrfs_test_qgroups(void)
 {
 	return 0;
 }
+static inline int btrfs_test_free_space_tree(void)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/fs/btrfs/tests/free-space-tests.c b/fs/btrfs/tests/free-space-tests.c
index 2299bfde39ee..bae6c599f604 100644
--- a/fs/btrfs/tests/free-space-tests.c
+++ b/fs/btrfs/tests/free-space-tests.c
@@ -22,35 +22,6 @@
 #include "../free-space-cache.h"
 
 #define BITS_PER_BITMAP		(PAGE_CACHE_SIZE * 8)
-static struct btrfs_block_group_cache *init_test_block_group(void)
-{
-	struct btrfs_block_group_cache *cache;
-
-	cache = kzalloc(sizeof(*cache), GFP_NOFS);
-	if (!cache)
-		return NULL;
-	cache->free_space_ctl = kzalloc(sizeof(*cache->free_space_ctl),
-					GFP_NOFS);
-	if (!cache->free_space_ctl) {
-		kfree(cache);
-		return NULL;
-	}
-
-	cache->key.objectid = 0;
-	cache->key.offset = 1024 * 1024 * 1024;
-	cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
-	cache->sectorsize = 4096;
-	cache->full_stripe_len = 4096;
-
-	spin_lock_init(&cache->lock);
-	INIT_LIST_HEAD(&cache->list);
-	INIT_LIST_HEAD(&cache->cluster_list);
-	INIT_LIST_HEAD(&cache->bg_list);
-
-	btrfs_init_free_space_ctl(cache);
-
-	return cache;
-}
 
 /*
  * This test just does basic sanity checking, making sure we can add an extent
@@ -883,7 +854,7 @@ int btrfs_test_free_space_cache(void)
 
 	test_msg("Running btrfs free space cache tests\n");
 
-	cache = init_test_block_group();
+	cache = btrfs_alloc_dummy_block_group(1024 * 1024 * 1024);
 	if (!cache) {
 		test_msg("Couldn't run the tests\n");
 		return 0;
@@ -901,9 +872,7 @@ int btrfs_test_free_space_cache(void)
 
 	ret = test_steal_space_from_bitmap_to_extent(cache);
 out:
-	__btrfs_remove_free_space_cache(cache->free_space_ctl);
-	kfree(cache->free_space_ctl);
-	kfree(cache);
+	btrfs_free_dummy_block_group(cache);
 	test_msg("Free space cache tests finished\n");
 	return ret;
 }
diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
new file mode 100644
index 000000000000..a3fce6f67367
--- /dev/null
+++ b/fs/btrfs/tests/free-space-tree-tests.c
@@ -0,0 +1,570 @@
+/*
+ * Copyright (C) 2015 Facebook.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "btrfs-tests.h"
+#include "../ctree.h"
+#include "../disk-io.h"
+#include "../free-space-tree.h"
+#include "../transaction.h"
+
+struct free_space_extent {
+	u64 start, length;
+};
+
+/*
+ * The test cases align their operations to this in order to hit some of the
+ * edge cases in the bitmap code.
+ */
+#define BITMAP_RANGE (BTRFS_FREE_SPACE_BITMAP_BITS * 4096)
+
+static int __check_free_space_extents(struct btrfs_trans_handle *trans,
+				      struct btrfs_fs_info *fs_info,
+				      struct btrfs_block_group_cache *cache,
+				      struct btrfs_path *path,
+				      struct free_space_extent *extents,
+				      unsigned int num_extents)
+{
+	struct btrfs_free_space_info *info;
+	struct btrfs_key key;
+	int prev_bit = 0, bit;
+	u64 extent_start = 0, offset, end;
+	u32 flags, extent_count;
+	unsigned int i;
+	int ret;
+
+	info = search_free_space_info(trans, fs_info, cache, path, 0);
+	if (IS_ERR(info)) {
+		test_msg("Could not find free space info\n");
+		ret = PTR_ERR(info);
+		goto out;
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	extent_count = btrfs_free_space_extent_count(path->nodes[0], info);
+
+	if (extent_count != num_extents) {
+		test_msg("Extent count is wrong\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		if (path->slots[0] != 0)
+			goto invalid;
+		end = cache->key.objectid + cache->key.offset;
+		i = 0;
+		while (++path->slots[0] < btrfs_header_nritems(path->nodes[0])) {
+			btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+			if (key.type != BTRFS_FREE_SPACE_BITMAP_KEY)
+				goto invalid;
+			offset = key.objectid;
+			while (offset < key.objectid + key.offset) {
+				bit = free_space_test_bit(cache, path, offset);
+				if (prev_bit == 0 && bit == 1) {
+					extent_start = offset;
+				} else if (prev_bit == 1 && bit == 0) {
+					if (i >= num_extents ||
+					    extent_start != extents[i].start ||
+					    offset - extent_start != extents[i].length)
+						goto invalid;
+					i++;
+				}
+				prev_bit = bit;
+				offset += cache->sectorsize;
+			}
+		}
+		if (prev_bit == 1) {
+			if (i >= num_extents ||
+			    extent_start != extents[i].start ||
+			    offset - extent_start != extents[i].length)
+				goto invalid;
+			i++;
+		}
+		if (i != num_extents)
+			goto invalid;
+	} else {
+		if (btrfs_header_nritems(path->nodes[0]) != num_extents + 1 ||
+		    path->slots[0] != 0)
+			goto invalid;
+		for (i = 0; i < num_extents; i++) {
+			path->slots[0]++;
+			btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+			if (key.type != BTRFS_FREE_SPACE_EXTENT_KEY ||
+			    key.objectid != extents[i].start ||
+			    key.offset != extents[i].length)
+				goto invalid;
+		}
+	}
+
+	ret = 0;
+out:
+	btrfs_release_path(path);
+	return ret;
+invalid:
+	test_msg("Free space tree is invalid\n");
+	ret = -EINVAL;
+	goto out;
+}
+
+static int check_free_space_extents(struct btrfs_trans_handle *trans,
+				    struct btrfs_fs_info *fs_info,
+				    struct btrfs_block_group_cache *cache,
+				    struct btrfs_path *path,
+				    struct free_space_extent *extents,
+				    unsigned int num_extents)
+{
+	struct btrfs_free_space_info *info;
+	u32 flags;
+	int ret;
+
+	info = search_free_space_info(trans, fs_info, cache, path, 0);
+	if (IS_ERR(info)) {
+		test_msg("Could not find free space info\n");
+		btrfs_release_path(path);
+		return PTR_ERR(info);
+	}
+	flags = btrfs_free_space_flags(path->nodes[0], info);
+	btrfs_release_path(path);
+
+	ret = __check_free_space_extents(trans, fs_info, cache, path, extents,
+					 num_extents);
+	if (ret)
+		return ret;
+
+	/* Flip it to the other format and check that for good measure. */
+	if (flags & BTRFS_FREE_SPACE_USING_BITMAPS) {
+		ret = convert_free_space_to_extents(trans, fs_info, cache, path);
+		if (ret) {
+			test_msg("Could not convert to extents\n");
+			return ret;
+		}
+	} else {
+		ret = convert_free_space_to_bitmaps(trans, fs_info, cache, path);
+		if (ret) {
+			test_msg("Could not convert to bitmaps\n");
+			return ret;
+		}
+	}
+	return __check_free_space_extents(trans, fs_info, cache, path, extents,
+					  num_extents);
+}
+
+static int test_empty_block_group(struct btrfs_trans_handle *trans,
+				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_block_group_cache *cache,
+				  struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, cache->key.offset},
+	};
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_remove_all(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid,
+					    cache->key.offset);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_remove_beginning(struct btrfs_trans_handle *trans,
+				 struct btrfs_fs_info *fs_info,
+				 struct btrfs_block_group_cache *cache,
+				 struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid + BITMAP_RANGE,
+			cache->key.offset - BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid, BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_remove_end(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, cache->key.offset - BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid +
+					    cache->key.offset - BITMAP_RANGE,
+					    BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_remove_middle(struct btrfs_trans_handle *trans,
+			      struct btrfs_fs_info *fs_info,
+			      struct btrfs_block_group_cache *cache,
+			      struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, BITMAP_RANGE},
+		{cache->key.objectid + 2 * BITMAP_RANGE,
+			cache->key.offset - 2 * BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid + BITMAP_RANGE,
+					    BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_merge_left(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, 2 * BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid,
+					    cache->key.offset);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid, BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_merge_right(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid + BITMAP_RANGE, 2 * BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid,
+					    cache->key.offset);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + 2 * BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_merge_both(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, 3 * BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid,
+					    cache->key.offset);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid, BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + 2 * BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+static int test_merge_none(struct btrfs_trans_handle *trans,
+			   struct btrfs_fs_info *fs_info,
+			   struct btrfs_block_group_cache *cache,
+			   struct btrfs_path *path)
+{
+	struct free_space_extent extents[] = {
+		{cache->key.objectid, BITMAP_RANGE},
+		{cache->key.objectid + 2 * BITMAP_RANGE, BITMAP_RANGE},
+		{cache->key.objectid + 4 * BITMAP_RANGE, BITMAP_RANGE},
+	};
+	int ret;
+
+	ret = __remove_from_free_space_tree(trans, fs_info, cache, path,
+					    cache->key.objectid,
+					    cache->key.offset);
+	if (ret) {
+		test_msg("Could not remove free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid, BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + 4 * BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	ret = __add_to_free_space_tree(trans, fs_info, cache, path,
+				       cache->key.objectid + 2 * BITMAP_RANGE,
+				       BITMAP_RANGE);
+	if (ret) {
+		test_msg("Could not add free space\n");
+		return ret;
+	}
+
+	return check_free_space_extents(trans, fs_info, cache, path,
+					extents, ARRAY_SIZE(extents));
+}
+
+typedef int (*test_func_t)(struct btrfs_trans_handle *,
+			   struct btrfs_fs_info *,
+			   struct btrfs_block_group_cache *,
+			   struct btrfs_path *);
+
+static int run_test(test_func_t test_func, int bitmaps)
+{
+	struct btrfs_root *root = NULL;
+	struct btrfs_block_group_cache *cache = NULL;
+	struct btrfs_trans_handle trans;
+	struct btrfs_path *path = NULL;
+	int ret;
+
+	root = btrfs_alloc_dummy_root();
+	if (IS_ERR(root)) {
+		test_msg("Couldn't allocate dummy root\n");
+		ret = PTR_ERR(root);
+		goto out;
+	}
+
+	root->fs_info = btrfs_alloc_dummy_fs_info();
+	if (!root->fs_info) {
+		test_msg("Couldn't allocate dummy fs info\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	btrfs_set_super_compat_ro_flags(root->fs_info->super_copy,
+					BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE);
+	root->fs_info->free_space_root = root;
+	root->fs_info->tree_root = root;
+
+	root->node = alloc_test_extent_buffer(root->fs_info, 4096);
+	if (!root->node) {
+		test_msg("Couldn't allocate dummy buffer\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+	btrfs_set_header_level(root->node, 0);
+	btrfs_set_header_nritems(root->node, 0);
+	root->alloc_bytenr += 8192;
+
+	cache = btrfs_alloc_dummy_block_group(8 * BITMAP_RANGE);
+	if (!cache) {
+		test_msg("Couldn't allocate dummy block group cache\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+	cache->bitmap_low_thresh = 0;
+	cache->bitmap_high_thresh = (u32)-1;
+
+	btrfs_init_dummy_trans(&trans);
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		test_msg("Couldn't allocate path\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = add_block_group_free_space(&trans, root->fs_info, cache);
+	if (ret) {
+		test_msg("Could not add block group free space\n");
+		goto out;
+	}
+
+	if (bitmaps) {
+		ret = convert_free_space_to_bitmaps(&trans, root->fs_info,
+						    cache, path);
+		if (ret) {
+			test_msg("Could not convert block group to bitmaps\n");
+			goto out;
+		}
+	}
+
+	ret = test_func(&trans, root->fs_info, cache, path);
+	if (ret)
+		goto out;
+
+	ret = remove_block_group_free_space(&trans, root->fs_info, cache);
+	if (ret) {
+		test_msg("Could not remove block group free space\n");
+		goto out;
+	}
+
+	if (btrfs_header_nritems(root->node) != 0) {
+		test_msg("Free space tree has leftover items\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+	btrfs_free_dummy_block_group(cache);
+	btrfs_free_dummy_root(root);
+	return ret;
+}
+
+static int run_test_both_formats(test_func_t test_func)
+{
+	int ret;
+
+	ret = run_test(test_func, 0);
+	if (ret)
+		return ret;
+	return run_test(test_func, 1);
+}
+
+int btrfs_test_free_space_tree(void)
+{
+	test_func_t tests[] = {
+		test_empty_block_group,
+		test_remove_all,
+		test_remove_beginning,
+		test_remove_end,
+		test_remove_middle,
+		test_merge_left,
+		test_merge_right,
+		test_merge_both,
+		test_merge_none,
+	};
+	int i;
+
+	test_msg("Running free space tree tests\n");
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		int ret = run_test_both_formats(tests[i]);
+		if (ret) {
+			test_msg("%pf failed\n", tests[i]);
+			return ret;
+		}
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index 846d277b1901..8ea5d34bc5a2 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -23,14 +23,6 @@
 #include "../qgroup.h"
 #include "../backref.h"
 
-static void init_dummy_trans(struct btrfs_trans_handle *trans)
-{
-	memset(trans, 0, sizeof(*trans));
-	trans->transid = 1;
-	INIT_LIST_HEAD(&trans->qgroup_ref_list);
-	trans->type = __TRANS_DUMMY;
-}
-
 static int insert_normal_tree_ref(struct btrfs_root *root, u64 bytenr,
 				  u64 num_bytes, u64 parent, u64 root_objectid)
 {
@@ -44,7 +36,7 @@ static int insert_normal_tree_ref(struct btrfs_root *root, u64 bytenr,
 	u32 size = sizeof(*item) + sizeof(*iref) + sizeof(*block_info);
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	ins.objectid = bytenr;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
@@ -94,7 +86,7 @@ static int add_tree_ref(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
 	u64 refs;
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	key.objectid = bytenr;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -144,7 +136,7 @@ static int remove_extent_item(struct btrfs_root *root, u64 bytenr,
 	struct btrfs_path *path;
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	key.objectid = bytenr;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -178,7 +170,7 @@ static int remove_extent_ref(struct btrfs_root *root, u64 bytenr,
 	u64 refs;
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	key.objectid = bytenr;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -232,7 +224,7 @@ static int test_no_shared_qgroup(struct btrfs_root *root)
 	struct ulist *new_roots = NULL;
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	test_msg("Qgroup basic add\n");
 	ret = btrfs_create_qgroup(NULL, fs_info, 5);
@@ -326,7 +318,7 @@ static int test_multiple_refs(struct btrfs_root *root)
 	struct ulist *new_roots = NULL;
 	int ret;
 
-	init_dummy_trans(&trans);
+	btrfs_init_dummy_trans(&trans);
 
 	test_msg("Qgroup multiple refs test\n");
 
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (6 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 7/9] Btrfs: add free space tree sanity tests Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-04  5:56     ` Omar Sandoval
  2015-09-03 19:44   ` [PATCH v2 9/9] Btrfs: add free space tree mount option Omar Sandoval
                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

The free space tree is updated in tandem with the extent tree. There are
only a handful of places where we need to hook in:

1. Block group creation
2. Block group deletion
3. Delayed refs (extent creation and deletion)
4. Block group caching
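
The invariant behind hooks 1-3 can be sketched in isolation: every change to
allocated extents is mirrored by the inverse change to tracked free space, so
the two views never drift apart. This is a toy model with hypothetical names,
not the kernel helpers (the real add_to_free_space_tree() and
remove_from_free_space_tree() operate on the B-tree, of course):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the tandem bookkeeping: allocation removes bytes from
 * the free space accounting, freeing adds them back, and a new block
 * group starts fully free.  All names here are illustrative. */
struct toy_block_group {
	uint64_t size;		/* total bytes in the block group */
	uint64_t free_bytes;	/* bytes currently tracked as free */
};

static void toy_add_block_group(struct toy_block_group *bg, uint64_t size)
{
	bg->size = size;
	bg->free_bytes = size;	/* hook 1: new group is entirely free */
}

static void toy_alloc_extent(struct toy_block_group *bg, uint64_t bytes)
{
	bg->free_bytes -= bytes;	/* mirrors remove_from_free_space_tree() */
}

static void toy_free_extent(struct toy_block_group *bg, uint64_t bytes)
{
	bg->free_bytes += bytes;	/* mirrors add_to_free_space_tree() */
}
```

The patch below does exactly this pairing, but transactionally: the free space
tree updates sit next to the update_block_group() calls so both are covered by
the same transaction abort paths.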

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/extent-tree.c | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 418c0eca9bb4..1c007e858787 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@
 #include "raid56.h"
 #include "locking.h"
 #include "free-space-cache.h"
+#include "free-space-tree.h"
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
@@ -520,7 +521,10 @@ static noinline void caching_thread(struct btrfs_work *work)
 	mutex_lock(&caching_ctl->mutex);
 	down_read(&fs_info->commit_root_sem);
 
-	ret = load_extent_tree_free(caching_ctl);
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+		ret = load_free_space_tree(caching_ctl);
+	else
+		ret = load_extent_tree_free(caching_ctl);
 
 	spin_lock(&block_group->lock);
 	block_group->caching_ctl = NULL;
@@ -626,8 +630,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 		}
 	} else {
 		/*
-		 * We are not going to do the fast caching, set cached to the
-		 * appropriate value and wakeup any waiters.
+		 * We're either using the free space tree or no caching at all.
+		 * Set cached to the appropriate value and wakeup any waiters.
 		 */
 		spin_lock(&cache->lock);
 		if (load_cache_only) {
@@ -6385,6 +6389,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 			}
 		}
 
+		ret = add_to_free_space_tree(trans, root->fs_info, bytenr,
+					     num_bytes);
+		if (ret) {
+			btrfs_abort_transaction(trans, extent_root, ret);
+			goto out;
+		}
+
 		ret = update_block_group(trans, root, bytenr, num_bytes, 0);
 		if (ret) {
 			btrfs_abort_transaction(trans, extent_root, ret);
@@ -7328,6 +7339,11 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 	btrfs_free_path(path);
 
+	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
+					  ins->offset);
+	if (ret)
+		return ret;
+
 	ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
 	if (ret) { /* -ENOENT, logic error */
 		btrfs_err(fs_info, "update block group failed for %llu %llu",
@@ -7409,6 +7425,11 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_free_path(path);
 
+	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
+					  num_bytes);
+	if (ret)
+		return ret;
+
 	ret = update_block_group(trans, root, ins->objectid, root->nodesize,
 				 1);
 	if (ret) { /* -ENOENT, logic error */
@@ -9279,6 +9300,8 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
 	cache->full_stripe_len = btrfs_full_stripe_len(root,
 					       &root->fs_info->mapping_tree,
 					       start);
+	set_free_space_tree_thresholds(cache);
+
 	atomic_set(&cache->count, 1);
 	spin_lock_init(&cache->lock);
 	init_rwsem(&cache->data_rwsem);
@@ -9542,6 +9565,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 	add_new_free_space(cache, root->fs_info, chunk_offset,
 			   chunk_offset + size);
 
+	ret = add_block_group_free_space(trans, root->fs_info, cache);
+	if (ret) {
+		btrfs_remove_free_space_cache(cache);
+		btrfs_put_block_group(cache);
+		return ret;
+	}
+
 	free_excluded_extents(root, cache);
 
 	/*
@@ -9885,6 +9915,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	unlock_chunks(root);
 
+	ret = remove_block_group_free_space(trans, root->fs_info, block_group);
+	if (ret)
+		goto out;
+
 	btrfs_put_block_group(block_group);
 	btrfs_put_block_group(block_group);
 
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 9/9] Btrfs: add free space tree mount option
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (7 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
@ 2015-09-03 19:44   ` Omar Sandoval
  2015-09-09 12:00     ` David Sterba
  2015-09-04  1:29   ` [PATCH v2 0/9] free space B-tree Zhao Lei
  2015-09-11  1:21   ` Qu Wenruo
  10 siblings, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-03 19:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Omar Sandoval

Now we can finally hook up everything so we can actually use the free
space tree. On the first mount with the free_space_tree mount option, the free
space tree will be created and the FREE_SPACE_TREE read-only compat bit
will be set. Any time the filesystem is mounted from then on, we will
use the free space tree.

Having both the free space cache and the free space tree enabled is
nonsense, so we don't allow that to happen. Since mkfs sets the
superblock cache generation to -1, this means that the filesystem will
have to be mounted with nospace_cache,free_space_tree to create the free
space tree on the first mount. Once the FREE_SPACE_TREE bit is set, the
cache generation is ignored when mounting. This is all a little more
complicated than would be ideal, but at some point we can presumably
make the free space tree the default and stop setting the cache
generation in mkfs.
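
The precedence described above reduces to a small pure function. This is an
illustrative stand-in for the default-option selection in
btrfs_parse_options(), with made-up flag names, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define OPT_SPACE_CACHE		(1u << 0)
#define OPT_FREE_SPACE_TREE	(1u << 1)

/* Hypothetical sketch of the mount default: the FREE_SPACE_TREE
 * read-only compat bit takes precedence; otherwise a nonzero cache
 * generation (mkfs sets it to -1) selects the free space cache. */
static uint32_t default_space_opt(int compat_ro_fst, int64_t cache_gen)
{
	if (compat_ro_fst)
		return OPT_FREE_SPACE_TREE;
	if (cache_gen)
		return OPT_SPACE_CACHE;
	return 0;
}
```

This is why a freshly made filesystem (cache generation -1, compat bit clear)
defaults to the space cache, and the first mount has to pass
nospace_cache,free_space_tree explicitly.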

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ctree.h   |  7 ++++++-
 fs/btrfs/disk-io.c | 26 ++++++++++++++++++++++++++
 fs/btrfs/super.c   | 21 +++++++++++++++++++--
 3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 05420991e101..3524fe065b72 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,7 +531,10 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
-#define BTRFS_FEATURE_COMPAT_RO_SUPP		0ULL
+
+#define BTRFS_FEATURE_COMPAT_RO_SUPP			\
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
 
@@ -2203,6 +2206,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_RESCAN_UUID_TREE	(1 << 23)
+#define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 24)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(8192)
@@ -3746,6 +3750,7 @@ static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->csum_root);
 	kfree(fs_info->quota_root);
 	kfree(fs_info->uuid_root);
+	kfree(fs_info->free_space_root);
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	security_free_mnt_opts(&fs_info->security_opts);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f556c3732c2c..e88674c594da 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "locking.h"
 #include "tree-log.h"
 #include "free-space-cache.h"
+#include "free-space-tree.h"
 #include "inode-map.h"
 #include "check-integrity.h"
 #include "rcu-string.h"
@@ -1641,6 +1642,9 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_UUID_TREE_OBJECTID)
 		return fs_info->uuid_root ? fs_info->uuid_root :
 					    ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
+		return fs_info->free_space_root ? fs_info->free_space_root :
+						  ERR_PTR(-ENOENT);
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
@@ -2138,6 +2142,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
 	free_root_extent_buffers(info->uuid_root);
 	if (chunk_root)
 		free_root_extent_buffers(info->chunk_root);
+	free_root_extent_buffers(info->free_space_root);
 }
 
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info)
@@ -2439,6 +2444,15 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
 		fs_info->uuid_root = root;
 	}
 
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
+		location.objectid = BTRFS_FREE_SPACE_TREE_OBJECTID;
+		root = btrfs_read_tree_root(tree_root, &location);
+		if (IS_ERR(root))
+			return PTR_ERR(root);
+		set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+		fs_info->free_space_root = root;
+	}
+
 	return 0;
 }
 
@@ -3063,6 +3077,18 @@ retry_root_backup:
 
 	btrfs_qgroup_rescan_resume(fs_info);
 
+	if (btrfs_test_opt(tree_root, FREE_SPACE_TREE) &&
+	    !btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE)) {
+		pr_info("BTRFS: creating free space tree\n");
+		ret = btrfs_create_free_space_tree(fs_info);
+		if (ret) {
+			pr_warn("BTRFS: failed to create free space tree %d\n",
+				ret);
+			close_ctree(tree_root);
+			return ret;
+		}
+	}
+
 	if (!fs_info->uuid_root) {
 		pr_info("BTRFS: creating UUID tree\n");
 		ret = btrfs_create_uuid_tree(fs_info);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b93f127c4bc8..d7705e4ed119 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -319,7 +319,7 @@ enum {
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-	Opt_datasum, Opt_treelog, Opt_noinode_cache,
+	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_free_space_tree,
 	Opt_err,
 };
 
@@ -372,6 +372,7 @@ static match_table_t tokens = {
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_commit_interval, "commit=%d"},
+	{Opt_free_space_tree, "free_space_tree"},
 	{Opt_err, NULL},
 };
 
@@ -392,7 +393,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 	bool compress_force = false;
 
 	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
-	if (cache_gen)
+	if (btrfs_fs_compat_ro(root->fs_info, FREE_SPACE_TREE))
+		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
+	else if (cache_gen)
 		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
 
 	if (!options)
@@ -738,6 +741,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 			}
 			break;
+		case Opt_free_space_tree:
+			btrfs_set_and_info(root, FREE_SPACE_TREE,
+					   "enabling free space tree");
+			break;
 		case Opt_err:
 			btrfs_info(root->fs_info, "unrecognized mount option '%s'", p);
 			ret = -EINVAL;
@@ -747,8 +754,16 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		}
 	}
 out:
+	if (btrfs_test_opt(root, SPACE_CACHE) &&
+	    btrfs_test_opt(root, FREE_SPACE_TREE)) {
+		btrfs_err(root->fs_info,
+			  "cannot use both free space cache and free space tree");
+		ret = -EINVAL;
+	}
 	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
 		btrfs_info(root->fs_info, "disk space caching is enabled");
+	if (!ret && btrfs_test_opt(root, FREE_SPACE_TREE))
+		btrfs_info(root->fs_info, "using free space tree");
 	kfree(orig);
 	return ret;
 }
@@ -1152,6 +1167,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",discard");
 	if (!(root->fs_info->sb->s_flags & MS_POSIXACL))
 		seq_puts(seq, ",noacl");
+	if (btrfs_test_opt(root, FREE_SPACE_TREE))
+		seq_puts(seq, ",free_space_tree");
 	if (btrfs_test_opt(root, SPACE_CACHE))
 		seq_puts(seq, ",space_cache");
 	else
-- 
2.5.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* RE: [PATCH v2 0/9] free space B-tree
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (8 preceding siblings ...)
  2015-09-03 19:44   ` [PATCH v2 9/9] Btrfs: add free space tree mount option Omar Sandoval
@ 2015-09-04  1:29   ` Zhao Lei
  2015-09-04  5:43     ` Omar Sandoval
  2015-09-11  1:21   ` Qu Wenruo
  10 siblings, 1 reply; 43+ messages in thread
From: Zhao Lei @ 2015-09-04  1:29 UTC (permalink / raw)
  To: 'Omar Sandoval', linux-btrfs

Hi, Omar Sandoval

[PATCH 7/9] has the following compiler warning:
 fs/btrfs/tests/free-space-tree-tests.c: In function '__check_free_space_extents':   
 fs/btrfs/tests/free-space-tree-tests.c:45: warning: 'offset' may be used uninitialized in this function   

It is just a compiler warning and will not happen in practice, but
could you fix it to keep the output clean?
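
For reference, this class of warning comes from patterns like the one below,
where a variable is assigned on one branch and the compiler cannot always
prove that every use is reached only after the assignment. This is a
contrived sketch of the warning pattern, not the actual test code:

```c
#include <assert.h>

/* Contrived -Wmaybe-uninitialized example: 'offset' is only assigned
 * when have_prev is set, and the guarded use is safe in practice, but
 * the compiler may not be able to prove the two conditions match. */
static int demo(int have_prev)
{
	int offset;

	if (have_prev)
		offset = 42;
	/* use guarded by the same condition as the assignment */
	return have_prev ? offset : 0;
}
```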

Thanks
Zhaolei

* From: linux-btrfs-owner@vger.kernel.org
> [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Omar Sandoval
> Sent: Friday, September 04, 2015 3:44 AM
> To: linux-btrfs@vger.kernel.org
> Cc: Omar Sandoval <osandov@osandov.com>
> Subject: [PATCH v2 0/9] free space B-tree
> 
> Here's version 2 of the free space B-tree patches, addressing Josef's review
> from the last round, which you can find here:
> http://www.spinics.net/lists/linux-btrfs/msg46713.html
> 
> Changes from v1->v2:
> 
> - Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
> - Added aborts in the free space tree code closer to the site the error
>   is encountered: where we add or remove block groups, add or remove
>   free space, and also when we convert formats
> - Moved loading of the free space tree into caching_thread() and added a
>   new patch 4 in preparation for it
> - Commented a bunch of stuff in the extent buffer bitmap operations and
>   refactored some of the complicated logic
> - Added sanity tests for the extent buffer bitmap operations and free
>   space tree (patches 2 and 6)
> - Added Josef's Reviewed-by tags
> 
> Omar Sandoval (9):
>   Btrfs: add extent buffer bitmap operations
>   Btrfs: add extent buffer bitmap sanity tests
>   Btrfs: add helpers for read-only compat bits
>   Btrfs: refactor caching_thread()
>   Btrfs: introduce the free space B-tree on-disk format
>   Btrfs: implement the free space B-tree
>   Btrfs: add free space tree sanity tests
>   Btrfs: wire up the free space tree to the extent tree
>   Btrfs: add free space tree mount option
> 
>  fs/btrfs/Makefile                      |    5 +-
>  fs/btrfs/ctree.h                       |  107 ++-
>  fs/btrfs/disk-io.c                     |   26 +
>  fs/btrfs/extent-tree.c                 |  112 ++-
>  fs/btrfs/extent_io.c                   |  183 +++-
>  fs/btrfs/extent_io.h                   |   10 +-
>  fs/btrfs/free-space-tree.c             | 1501
> ++++++++++++++++++++++++++++++++
>  fs/btrfs/free-space-tree.h             |   71 ++
>  fs/btrfs/super.c                       |   24 +-
>  fs/btrfs/tests/btrfs-tests.c           |   52 ++
>  fs/btrfs/tests/btrfs-tests.h           |   10 +
>  fs/btrfs/tests/extent-io-tests.c       |  138 ++-
>  fs/btrfs/tests/free-space-tests.c      |   35 +-
>  fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
>  fs/btrfs/tests/qgroup-tests.c          |   20 +-
>  include/trace/events/btrfs.h           |    3 +-
>  16 files changed, 2763 insertions(+), 104 deletions(-)  create mode 100644
> fs/btrfs/free-space-tree.c  create mode 100644 fs/btrfs/free-space-tree.h
> create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
> 
> --
> 2.5.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-04  1:29   ` [PATCH v2 0/9] free space B-tree Zhao Lei
@ 2015-09-04  5:43     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-04  5:43 UTC (permalink / raw)
  To: Zhao Lei; +Cc: linux-btrfs

On Fri, Sep 04, 2015 at 09:29:45AM +0800, Zhao Lei wrote:
> Hi, Omar Sandoval
> 
> [PATCH 7/9] has the following compiler warning:
>  fs/btrfs/tests/free-space-tree-tests.c: In function '__check_free_space_extents':   
>  fs/btrfs/tests/free-space-tree-tests.c:45: warning: 'offset' may be used uninitialized in this function   
> 
> It is just a compiler warning and will not happen in practice, but
> could you fix it to keep the output clean?
> 
> Thanks
> Zhaolei

Thanks, Zhaolei. I actually meant to use "end" where I used "offset" on
line 94; that should get rid of the warning and be more correct. I'll
fix it and send it out in v3.

Omar

> 
> * From: linux-btrfs-owner@vger.kernel.org
> > [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Omar Sandoval
> > Sent: Friday, September 04, 2015 3:44 AM
> > To: linux-btrfs@vger.kernel.org
> > Cc: Omar Sandoval <osandov@osandov.com>
> > Subject: [PATCH v2 0/9] free space B-tree
> > 
> > Here's version 2 of the free space B-tree patches, addressing Josef's review
> > from the last round, which you can find here:
> > http://www.spinics.net/lists/linux-btrfs/msg46713.html
> > 
> > Changes from v1->v2:
> > 
> > - Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
> > - Added aborts in the free space tree code closer to the site the error
> >   is encountered: where we add or remove block groups, add or remove
> >   free space, and also when we convert formats
> > - Moved loading of the free space tree into caching_thread() and added a
> >   new patch 4 in preparation for it
> > - Commented a bunch of stuff in the extent buffer bitmap operations and
> >   refactored some of the complicated logic
> > - Added sanity tests for the extent buffer bitmap operations and free
> >   space tree (patches 2 and 6)
> > - Added Josef's Reviewed-by tags
> > 
> > Omar Sandoval (9):
> >   Btrfs: add extent buffer bitmap operations
> >   Btrfs: add extent buffer bitmap sanity tests
> >   Btrfs: add helpers for read-only compat bits
> >   Btrfs: refactor caching_thread()
> >   Btrfs: introduce the free space B-tree on-disk format
> >   Btrfs: implement the free space B-tree
> >   Btrfs: add free space tree sanity tests
> >   Btrfs: wire up the free space tree to the extent tree
> >   Btrfs: add free space tree mount option
> > 
> >  fs/btrfs/Makefile                      |    5 +-
> >  fs/btrfs/ctree.h                       |  107 ++-
> >  fs/btrfs/disk-io.c                     |   26 +
> >  fs/btrfs/extent-tree.c                 |  112 ++-
> >  fs/btrfs/extent_io.c                   |  183 +++-
> >  fs/btrfs/extent_io.h                   |   10 +-
> >  fs/btrfs/free-space-tree.c             | 1501
> > ++++++++++++++++++++++++++++++++
> >  fs/btrfs/free-space-tree.h             |   71 ++
> >  fs/btrfs/super.c                       |   24 +-
> >  fs/btrfs/tests/btrfs-tests.c           |   52 ++
> >  fs/btrfs/tests/btrfs-tests.h           |   10 +
> >  fs/btrfs/tests/extent-io-tests.c       |  138 ++-
> >  fs/btrfs/tests/free-space-tests.c      |   35 +-
> >  fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
> >  fs/btrfs/tests/qgroup-tests.c          |   20 +-
> >  include/trace/events/btrfs.h           |    3 +-
> >  16 files changed, 2763 insertions(+), 104 deletions(-)  create mode 100644
> > fs/btrfs/free-space-tree.c  create mode 100644 fs/btrfs/free-space-tree.h
> > create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
> > 
> > --
> > 2.5.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body
> > of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> 

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree
  2015-09-03 19:44   ` [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
@ 2015-09-04  5:56     ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-04  5:56 UTC (permalink / raw)
  To: linux-btrfs

On Thu, Sep 03, 2015 at 12:44:26PM -0700, Omar Sandoval wrote:
> The free space tree is updated in tandem with the extent tree. There are
> only a handful of places where we need to hook in:
> 
> 1. Block group creation
> 2. Block group deletion
> 3. Delayed refs (extent creation and deletion)
> 4. Block group caching
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/extent-tree.c | 40 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 418c0eca9bb4..1c007e858787 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -33,6 +33,7 @@
>  #include "raid56.h"
>  #include "locking.h"
>  #include "free-space-cache.h"
> +#include "free-space-tree.h"
>  #include "math.h"
>  #include "sysfs.h"
>  #include "qgroup.h"
> @@ -520,7 +521,10 @@ static noinline void caching_thread(struct btrfs_work *work)
>  	mutex_lock(&caching_ctl->mutex);
>  	down_read(&fs_info->commit_root_sem);
>  
> -	ret = load_extent_tree_free(caching_ctl);
> +	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
> +		ret = load_free_space_tree(caching_ctl);
> +	else
> +		ret = load_extent_tree_free(caching_ctl);
>  
>  	spin_lock(&block_group->lock);
>  	block_group->caching_ctl = NULL;
> @@ -626,8 +630,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>  		}
>  	} else {
>  		/*
> -		 * We are not going to do the fast caching, set cached to the
> -		 * appropriate value and wakeup any waiters.
> +		 * We're either using the free space tree or no caching at all.
> +		 * Set cached to the appropriate value and wakeup any waiters.
>  		 */
>  		spin_lock(&cache->lock);
>  		if (load_cache_only) {
> @@ -6385,6 +6389,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  			}
>  		}
>  
> +		ret = add_to_free_space_tree(trans, root->fs_info, bytenr,
> +					     num_bytes);
> +		if (ret) {
> +			btrfs_abort_transaction(trans, extent_root, ret);
> +			goto out;
> +		}
> +
>  		ret = update_block_group(trans, root, bytenr, num_bytes, 0);
>  		if (ret) {
>  			btrfs_abort_transaction(trans, extent_root, ret);
> @@ -7328,6 +7339,11 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>  	btrfs_mark_buffer_dirty(path->nodes[0]);
>  	btrfs_free_path(path);
>  
> +	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
> +					  ins->offset);
> +	if (ret)
> +		return ret;
> +
>  	ret = update_block_group(trans, root, ins->objectid, ins->offset, 1);
>  	if (ret) { /* -ENOENT, logic error */
>  		btrfs_err(fs_info, "update block group failed for %llu %llu",
> @@ -7409,6 +7425,11 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
>  	btrfs_mark_buffer_dirty(leaf);
>  	btrfs_free_path(path);
>  
> +	ret = remove_from_free_space_tree(trans, fs_info, ins->objectid,
> +					  num_bytes);
> +	if (ret)
> +		return ret;
> +
>  	ret = update_block_group(trans, root, ins->objectid, root->nodesize,
>  				 1);
>  	if (ret) { /* -ENOENT, logic error */
> @@ -9279,6 +9300,8 @@ btrfs_create_block_group_cache(struct btrfs_root *root, u64 start, u64 size)
>  	cache->full_stripe_len = btrfs_full_stripe_len(root,
>  					       &root->fs_info->mapping_tree,
>  					       start);
> +	set_free_space_tree_thresholds(cache);
> +
>  	atomic_set(&cache->count, 1);
>  	spin_lock_init(&cache->lock);
>  	init_rwsem(&cache->data_rwsem);
> @@ -9542,6 +9565,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
>  	add_new_free_space(cache, root->fs_info, chunk_offset,
>  			   chunk_offset + size);
>  
> +	ret = add_block_group_free_space(trans, root->fs_info, cache);
> +	if (ret) {
> +		btrfs_remove_free_space_cache(cache);
> +		btrfs_put_block_group(cache);
> +		return ret;
> +	}
> +

Crap, so this definitely isn't the right place to do this. If we end up
allocating a new block group while modifying the free space tree, we'll
call through here and deadlock on the free space tree. Instead, I think
I'll have to delay this until either the first time we attempt to modify
the free space tree for a block group or in
btrfs_create_pending_block_groups(), whichever happens first.
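To make the deferral concrete, here is a toy model in plain Python (not kernel code; all names are illustrative) of the idea: a new block group is only recorded in the free space tree the first time that tree is modified for it, or when pending block groups are flushed, whichever happens first, so block group creation never re-enters the tree code.

```python
class BlockGroup:
    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.needs_insert = True   # set at creation, cleared lazily

class FreeSpaceTree:
    def __init__(self):
        self.free = {}             # bg start -> list of (start, length)

    def _insert_block_group(self, bg):
        # In the kernel this is the tree insertion that could recurse
        # into allocation; here it's just a dict insert.
        self.free[bg.start] = [(bg.start, bg.size)]
        bg.needs_insert = False

    def _ensure(self, bg):
        # Deferred insertion: done on first modification, outside the
        # block group allocation path.
        if bg.needs_insert:
            self._insert_block_group(bg)

    def remove_free_space(self, bg, start, length):
        self._ensure(bg)
        new = []
        for s, l in self.free[bg.start]:
            if s <= start and start + length <= s + l:
                if s < start:
                    new.append((s, start - s))
                if start + length < s + l:
                    new.append((start + length, s + l - start - length))
            else:
                new.append((s, l))
        self.free[bg.start] = new

    def flush_pending(self, groups):
        # Analogue of catching stragglers in
        # btrfs_create_pending_block_groups().
        for bg in groups:
            self._ensure(bg)

bg = BlockGroup(0, 1024)
tree = FreeSpaceTree()
tree.remove_free_space(bg, 0, 128)   # first touch triggers the insert
assert tree.free[0] == [(128, 896)]
```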

>  	free_excluded_extents(root, cache);
>  
>  	/*
> @@ -9885,6 +9915,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
>  
>  	unlock_chunks(root);
>  
> +	ret = remove_block_group_free_space(trans, root->fs_info, block_group);
> +	if (ret)
> +		goto out;
> +
>  	btrfs_put_block_group(block_group);
>  	btrfs_put_block_group(block_group);
>  
> -- 
> 2.5.1
> 

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 9/9] Btrfs: add free space tree mount option
  2015-09-03 19:44   ` [PATCH v2 9/9] Btrfs: add free space tree mount option Omar Sandoval
@ 2015-09-09 12:00     ` David Sterba
  2015-09-11  0:52       ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: David Sterba @ 2015-09-09 12:00 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs

On Thu, Sep 03, 2015 at 12:44:27PM -0700, Omar Sandoval wrote:
> Now we can finally hook up everything so we can actually use free space
> tree. On the first mount with the free_space_tree mount option, the free
> space tree will be created and the FREE_SPACE_TREE read-only compat bit
> will be set. Any time the filesystem is mounted from then on, we will
> use the free space tree.
> 
> Having both the free space cache and free space trees enabled is
> nonsense, so we don't allow that to happen. Since mkfs sets the
> superblock cache generation to -1, this means that the filesystem will
> have to be mounted with nospace_cache,free_space_tree to create the free
> space trees on first mount. Once the FREE_SPACE_TREE bit is set, the
> cache generation is ignored when mounting. This is all a little more
> complicated than would be ideal, but at some point we can presumably
> make the free space tree the default and stop setting the cache
> generation in mkfs.

I have objections against introducing another option to do something
with the space cache. As you write, it does not make sense to have both
'space_cache' and 'free_space_tree' enabled, and I agree. The B-tree
approach is an "implementation detail", an improved version of space
caching.

Because of that I propose to do the following:

* use space_cache mount option, and add a value denoting the used
  implementation, eg. space_cache=btree or space_cache=v2 etc

* keep space_cache for backward compatibility for the current
  implementation

* clear_cache should reset state for both

* nospace_cache prevents using any of the two versions of space cache
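A small Python sketch of how the proposed options could map to internal state (this is my reading of the proposal above, hypothetical semantics, not merged btrfs behavior):

```python
def parse_space_cache_opts(opts):
    cfg = {"cache": "v1", "clear": False}   # space_cache=v1 is today's default
    for opt in opts:
        if opt in ("space_cache", "space_cache=v1"):
            cfg["cache"] = "v1"             # backward compatible spelling
        elif opt in ("space_cache=v2", "space_cache=btree"):
            cfg["cache"] = "v2"             # the new B-tree implementation
        elif opt == "nospace_cache":
            cfg["cache"] = None             # neither version is used
        elif opt == "clear_cache":
            cfg["clear"] = True             # resets state for both versions
    return cfg

assert parse_space_cache_opts(["space_cache=v2"]) == {"cache": "v2", "clear": False}
assert parse_space_cache_opts(["nospace_cache", "clear_cache"]) == {"cache": None, "clear": True}
```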

On the mkfs side, we can add a new incompat feature to the -O option
that will set the incompat bit in the superblock. Mounting such a
filesystem would use the v2 cache automatically.

I'd like to see the B-tree space cache become the default in the future;
until then it'll be a mkfs-time or mount-time option.

For backward compatibility, mounting a free space v2 filesystem on an
older kernel can be done with the support of userspace tools: reset the
cache generation (as if clear_cache was used), drop all the
free-space-tree structures and unset the incompat bit. I think this kind
of fallback is desirable.


Other than that, I like the series and the improvements it's supposed to
bring.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 9/9] Btrfs: add free space tree mount option
  2015-09-09 12:00     ` David Sterba
@ 2015-09-11  0:52       ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-11  0:52 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On Wed, Sep 09, 2015 at 02:00:23PM +0200, David Sterba wrote:
> On Thu, Sep 03, 2015 at 12:44:27PM -0700, Omar Sandoval wrote:
> > Now we can finally hook up everything so we can actually use free space
> > tree. On the first mount with the free_space_tree mount option, the free
> > space tree will be created and the FREE_SPACE_TREE read-only compat bit
> > will be set. Any time the filesystem is mounted from then on, we will
> > use the free space tree.
> > 
> > Having both the free space cache and free space trees enabled is
> > nonsense, so we don't allow that to happen. Since mkfs sets the
> > superblock cache generation to -1, this means that the filesystem will
> > have to be mounted with nospace_cache,free_space_tree to create the free
> > space trees on first mount. Once the FREE_SPACE_TREE bit is set, the
> > cache generation is ignored when mounting. This is all a little more
> > complicated than would be ideal, but at some point we can presumably
> > make the free space tree the default and stop setting the cache
> > generation in mkfs.
> 
> I have objections against introducing another options to do something
> with space cache. As you write, it does not make sens to have
> 'space_cache' and 'free_space_tree' enabled, and I agree. The b-tree
> approach is an "implementation detail", an improved version of space
> caching.
> 
> Because of that I propose to do the following:
> 
> * use space_cache mount option, and add a value denoting the used
>   implementation, eg. space_cache=btree or space_cache=v2 etc
> 
> * keep space_cache for backward compatibility for the current
>   implementaion
> 
> * clear_cache should reset state for both
> 
> * nospace_cache prevents using any of the two versions of space cache

Okay, I like the idea of calling this space_cache=v2 and allowing
clear_cache to clear the free space tree just in case. However, the free
space tree doesn't use a cache generation like the old free space cache,
so once it's created, we can never ignore it. For nospace_cache, the
best we could do would be to fail the mount (unless clear_cache is also
set). The other option would be to add something like the cache
generation for the free space tree, but I'd rather not do that, since
fixing an out-of-date free space tree is a little more involved than
with the old cache (at that point, we might as well clear the tree and
rebuild it from scratch). What do you think? Is failing on
nospace_cache okay with you?
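As a sketch, the mount-time decision I have in mind looks roughly like this (an assumption about intended behavior, not merged code): once the free space tree exists there is no cache generation to invalidate it, so it can only be used or cleared, never silently ignored.

```python
def mount_decision(fst_exists, nospace_cache, clear_cache):
    # fst_exists: the FREE_SPACE_TREE read-only compat bit is set.
    if fst_exists and nospace_cache:
        # Can't ignore an existing tree; either clear it or refuse.
        return "clear tree" if clear_cache else "fail mount"
    if fst_exists:
        return "use free space tree"
    return "no cache" if nospace_cache else "use v1 cache"

assert mount_decision(True, True, False) == "fail mount"
assert mount_decision(True, True, True) == "clear tree"
assert mount_decision(True, False, False) == "use free space tree"
```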

Thanks.

> On the mkfs side, we can add new incompat feature to the -O option that
> will set the incompat bit to the superblock. Mounting such filesystem
> would use the v2 cache automatically.
> 
> I'd like to see the b-tree space cache default in the future, until then
> it'll be mkfs-time option or mount-time option.
> 
> For backward compatibility, mounting a free space v2 filesystem on older
> kernel can be done with support of userspace tools: reset the cache
> generation (as if clear_cache was used), drop all the free-space-tree
> structures and unset the incompat bit. I think this kind of fallback is
> desirable.
> 
> 
> Other than that, I like the series and the improvements it's supposed to
> bring.

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
                     ` (9 preceding siblings ...)
  2015-09-04  1:29   ` [PATCH v2 0/9] free space B-tree Zhao Lei
@ 2015-09-11  1:21   ` Qu Wenruo
  2015-09-11  3:48     ` Omar Sandoval
  2015-09-22 14:41     ` David Sterba
  10 siblings, 2 replies; 43+ messages in thread
From: Qu Wenruo @ 2015-09-11  1:21 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs

Hi Omar,

Thanks for your patchset.
Quite a nice one, and debug-tree can give better output on the space
cache. With the current implementation, the space cache is nearly a
black box in debug-tree output.

And the current on-disk format is not easy to understand. (In fact, the
space cache is stored in the tree root as a NODATACOW inode, which is
quite weird.)

Also, it should provide a good base for reworking the inode cache in
future development.


But I'm still a little concerned about the performance.

One problem with using a B-tree is that we now need to use
btrfs_search_slot() to do modifications, which means level-based tree
locking and COW.
Personally speaking, I'd like to blame that for the slow metadata
performance of btrfs.
(Yeah, personal experience, I may be wrong again.)

So with the new implementation, every space cache operation will cause
tree locking and COW, unlike the old weird structure, which is updated
in a NODATACOW fashion.

I hope I'm wrong about it (and it seems I'm always wrong about all these
assumption-based performance things).

Thanks,
Qu


Omar Sandoval wrote on 2015/09/03 12:44 -0700:
> Here's version 2 of the free space B-tree patches, addressing
> Josef's review from the last round, which you can find here:
> http://www.spinics.net/lists/linux-btrfs/msg46713.html
>
> Changes from v1->v2:
>
> - Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
> - Added aborts in the free space tree code closer to the site the error
>    is encountered: where we add or remove block groups, add or remove
>    free space, and also when we convert formats
> - Moved loading of the free space tree into caching_thread() and added a
>    new patch 4 in preparation for it
> - Commented a bunch of stuff in the extent buffer bitmap operations and
>    refactored some of the complicated logic
> - Added sanity tests for the extent buffer bitmap operations and free
>    space tree (patches 2 and 6)
> - Added Josef's Reviewed-by tags
>
> Omar Sandoval (9):
>    Btrfs: add extent buffer bitmap operations
>    Btrfs: add extent buffer bitmap sanity tests
>    Btrfs: add helpers for read-only compat bits
>    Btrfs: refactor caching_thread()
>    Btrfs: introduce the free space B-tree on-disk format
>    Btrfs: implement the free space B-tree
>    Btrfs: add free space tree sanity tests
>    Btrfs: wire up the free space tree to the extent tree
>    Btrfs: add free space tree mount option
>
>   fs/btrfs/Makefile                      |    5 +-
>   fs/btrfs/ctree.h                       |  107 ++-
>   fs/btrfs/disk-io.c                     |   26 +
>   fs/btrfs/extent-tree.c                 |  112 ++-
>   fs/btrfs/extent_io.c                   |  183 +++-
>   fs/btrfs/extent_io.h                   |   10 +-
>   fs/btrfs/free-space-tree.c             | 1501 ++++++++++++++++++++++++++++++++
>   fs/btrfs/free-space-tree.h             |   71 ++
>   fs/btrfs/super.c                       |   24 +-
>   fs/btrfs/tests/btrfs-tests.c           |   52 ++
>   fs/btrfs/tests/btrfs-tests.h           |   10 +
>   fs/btrfs/tests/extent-io-tests.c       |  138 ++-
>   fs/btrfs/tests/free-space-tests.c      |   35 +-
>   fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
>   fs/btrfs/tests/qgroup-tests.c          |   20 +-
>   include/trace/events/btrfs.h           |    3 +-
>   16 files changed, 2763 insertions(+), 104 deletions(-)
>   create mode 100644 fs/btrfs/free-space-tree.c
>   create mode 100644 fs/btrfs/free-space-tree.h
>   create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-11  1:21   ` Qu Wenruo
@ 2015-09-11  3:48     ` Omar Sandoval
  2015-09-11  3:58       ` Qu Wenruo
  2015-09-22 14:41     ` David Sterba
  1 sibling, 1 reply; 43+ messages in thread
From: Omar Sandoval @ 2015-09-11  3:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Sep 11, 2015 at 09:21:13AM +0800, Qu Wenruo wrote:
> Hi Omar,
> 
> Thanks for your patchset.
> Quite a nice one, and debug-tree can give better output on space cache.
> With current implement, space cache is near a black box in debug-tree
> output.
> 
> And current on disk format is not quite easy to understand.(In fact, space
> cache is restored in tree root, as a NODATACOW inode, quite wired)
> 
> Also, it should provide a quite good base for rework inode cache for future
> development.
> 
> 
> But I'm still a little concerned about the performance.
> 
> One of the problem using b-tree is, now we need to use btrfs_search_slot()
> to do modification, that means we will do level-based tree lock and COW.
> Personally speaking, I'd like to blame that for the slow metadata
> performance of btrfs.
> (Yeah personal experience, may be wrong again)
> 
> So with the new implement every space cache operation will causing tree lock
> and cow.
> Unlike the old wired structure, which is done in a NODATACOW fashion.
> 
> Hopes I'm wrong about it (and it seems I'm always wrong about all these
> assumption based performance thing).
> 
> Thanks,
> Qu

Hey, Qu,

So the thing about the free space tree is that the B-tree is only
modified while running delayed refs, so we only incur any overhead
during a transaction commit. The numbers I got showed that the overhead
was better than the old free space cache and not much more than not
using the cache at all. Now that I think about it, though, I only
profiled it under heavy load; it'd probably be a good idea to get some
numbers for more typical workloads, but I don't currently have access to
any reasonable hardware.
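A toy model (illustrative only, not kernel code) of why the cost shows up only at commit: allocations and frees just queue delayed refs, and the free-space structure is touched once, in a batch, when the refs run.

```python
class FreeSpace:
    def __init__(self):
        self.extents = {}              # start -> length (exact extents only)

    def add(self, start, length):
        self.extents[start] = length

    def remove(self, start, length):
        # Toy simplification: only whole, previously-freed extents
        # can be reallocated here.
        assert self.extents.pop(start) == length

class Transaction:
    def __init__(self, free_space):
        self.free_space = free_space
        self.delayed = []              # queued (op, start, length) refs

    def alloc(self, start, length):
        self.delayed.append(("alloc", start, length))

    def free(self, start, length):
        self.delayed.append(("free", start, length))

    def commit(self):
        # All "B-tree" updates happen here, batched per commit.
        for op, start, length in self.delayed:
            if op == "free":
                self.free_space.add(start, length)
            else:
                self.free_space.remove(start, length)
        self.delayed.clear()

fs = FreeSpace()
txn = Transaction(fs)
txn.free(4096, 4096)
assert fs.extents == {}               # nothing touched before commit
txn.commit()
assert fs.extents == {4096: 4096}
```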

Thanks,
Omar

> Omar Sandoval wrote on 2015/09/03 12:44 -0700:
> >Here's version 2 of the the free space B-tree patches, addressing
> >Josef's review from the last round, which you can find here:
> >http://www.spinics.net/lists/linux-btrfs/msg46713.html
> >
> >Changes from v1->v2:
> >
> >- Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
> >- Added aborts in the free space tree code closer to the site the error
> >   is encountered: where we add or remove block groups, add or remove
> >   free space, and also when we convert formats
> >- Moved loading of the free space tree into caching_thread() and added a
> >   new patch 4 in preparation for it
> >- Commented a bunch of stuff in the extent buffer bitmap operations and
> >   refactored some of the complicated logic
> >- Added sanity tests for the extent buffer bitmap operations and free
> >   space tree (patches 2 and 6)
> >- Added Josef's Reviewed-by tags
> >
> >Omar Sandoval (9):
> >   Btrfs: add extent buffer bitmap operations
> >   Btrfs: add extent buffer bitmap sanity tests
> >   Btrfs: add helpers for read-only compat bits
> >   Btrfs: refactor caching_thread()
> >   Btrfs: introduce the free space B-tree on-disk format
> >   Btrfs: implement the free space B-tree
> >   Btrfs: add free space tree sanity tests
> >   Btrfs: wire up the free space tree to the extent tree
> >   Btrfs: add free space tree mount option
> >
> >  fs/btrfs/Makefile                      |    5 +-
> >  fs/btrfs/ctree.h                       |  107 ++-
> >  fs/btrfs/disk-io.c                     |   26 +
> >  fs/btrfs/extent-tree.c                 |  112 ++-
> >  fs/btrfs/extent_io.c                   |  183 +++-
> >  fs/btrfs/extent_io.h                   |   10 +-
> >  fs/btrfs/free-space-tree.c             | 1501 ++++++++++++++++++++++++++++++++
> >  fs/btrfs/free-space-tree.h             |   71 ++
> >  fs/btrfs/super.c                       |   24 +-
> >  fs/btrfs/tests/btrfs-tests.c           |   52 ++
> >  fs/btrfs/tests/btrfs-tests.h           |   10 +
> >  fs/btrfs/tests/extent-io-tests.c       |  138 ++-
> >  fs/btrfs/tests/free-space-tests.c      |   35 +-
> >  fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
> >  fs/btrfs/tests/qgroup-tests.c          |   20 +-
> >  include/trace/events/btrfs.h           |    3 +-
> >  16 files changed, 2763 insertions(+), 104 deletions(-)
> >  create mode 100644 fs/btrfs/free-space-tree.c
> >  create mode 100644 fs/btrfs/free-space-tree.h
> >  create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
> >

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-11  3:48     ` Omar Sandoval
@ 2015-09-11  3:58       ` Qu Wenruo
  2015-09-11  4:15         ` Omar Sandoval
  0 siblings, 1 reply; 43+ messages in thread
From: Qu Wenruo @ 2015-09-11  3:58 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs



Omar Sandoval wrote on 2015/09/10 20:48 -0700:
> On Fri, Sep 11, 2015 at 09:21:13AM +0800, Qu Wenruo wrote:
>> Hi Omar,
>>
>> Thanks for your patchset.
>> Quite a nice one, and debug-tree can give better output on space cache.
>> With current implement, space cache is near a black box in debug-tree
>> output.
>>
>> And current on disk format is not quite easy to understand.(In fact, space
>> cache is restored in tree root, as a NODATACOW inode, quite wired)
>>
>> Also, it should provide a quite good base for rework inode cache for future
>> development.
>>
>>
>> But I'm still a little concerned about the performance.
>>
>> One of the problem using b-tree is, now we need to use btrfs_search_slot()
>> to do modification, that means we will do level-based tree lock and COW.
>> Personally speaking, I'd like to blame that for the slow metadata
>> performance of btrfs.
>> (Yeah personal experience, may be wrong again)
>>
>> So with the new implement every space cache operation will causing tree lock
>> and cow.
>> Unlike the old wired structure, which is done in a NODATACOW fashion.
>>
>> Hopes I'm wrong about it (and it seems I'm always wrong about all these
>> assumption based performance thing).
>>
>> Thanks,
>> Qu
>
> Hey, Qu,
>
> So the thing about the free space tree is that the B-tree is only
> modified while running delayed refs, so we only incur any overhead
> during a transaction commit. The numbers I got showed that the overhead
> was better than the old free space cache and not too much more than not
> using the cache. Now that I think about it, I only profiled it under
> heavy load, though, it'd probably be a good idea to get some numbers for
> more typical workloads, but I don't currently have access to any
> reasonable hardware.
>
> Thanks,
> Omar

Great, if its performance is better than the old one under heavy load,
then I'm completely OK with it.

Nice job!

BTW, don't forget to add btrfs-debug-tree and fsck support for the new
implementation. I can't wait to see these merged now.

Thanks,
Qu
>
>> Omar Sandoval wrote on 2015/09/03 12:44 -0700:
>>> Here's version 2 of the the free space B-tree patches, addressing
>>> Josef's review from the last round, which you can find here:
>>> http://www.spinics.net/lists/linux-btrfs/msg46713.html
>>>
>>> Changes from v1->v2:
>>>
>>> - Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
>>> - Added aborts in the free space tree code closer to the site the error
>>>    is encountered: where we add or remove block groups, add or remove
>>>    free space, and also when we convert formats
>>> - Moved loading of the free space tree into caching_thread() and added a
>>>    new patch 4 in preparation for it
>>> - Commented a bunch of stuff in the extent buffer bitmap operations and
>>>    refactored some of the complicated logic
>>> - Added sanity tests for the extent buffer bitmap operations and free
>>>    space tree (patches 2 and 6)
>>> - Added Josef's Reviewed-by tags
>>>
>>> Omar Sandoval (9):
>>>    Btrfs: add extent buffer bitmap operations
>>>    Btrfs: add extent buffer bitmap sanity tests
>>>    Btrfs: add helpers for read-only compat bits
>>>    Btrfs: refactor caching_thread()
>>>    Btrfs: introduce the free space B-tree on-disk format
>>>    Btrfs: implement the free space B-tree
>>>    Btrfs: add free space tree sanity tests
>>>    Btrfs: wire up the free space tree to the extent tree
>>>    Btrfs: add free space tree mount option
>>>
>>>   fs/btrfs/Makefile                      |    5 +-
>>>   fs/btrfs/ctree.h                       |  107 ++-
>>>   fs/btrfs/disk-io.c                     |   26 +
>>>   fs/btrfs/extent-tree.c                 |  112 ++-
>>>   fs/btrfs/extent_io.c                   |  183 +++-
>>>   fs/btrfs/extent_io.h                   |   10 +-
>>>   fs/btrfs/free-space-tree.c             | 1501 ++++++++++++++++++++++++++++++++
>>>   fs/btrfs/free-space-tree.h             |   71 ++
>>>   fs/btrfs/super.c                       |   24 +-
>>>   fs/btrfs/tests/btrfs-tests.c           |   52 ++
>>>   fs/btrfs/tests/btrfs-tests.h           |   10 +
>>>   fs/btrfs/tests/extent-io-tests.c       |  138 ++-
>>>   fs/btrfs/tests/free-space-tests.c      |   35 +-
>>>   fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
>>>   fs/btrfs/tests/qgroup-tests.c          |   20 +-
>>>   include/trace/events/btrfs.h           |    3 +-
>>>   16 files changed, 2763 insertions(+), 104 deletions(-)
>>>   create mode 100644 fs/btrfs/free-space-tree.c
>>>   create mode 100644 fs/btrfs/free-space-tree.h
>>>   create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
>>>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-11  3:58       ` Qu Wenruo
@ 2015-09-11  4:15         ` Omar Sandoval
  0 siblings, 0 replies; 43+ messages in thread
From: Omar Sandoval @ 2015-09-11  4:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Sep 11, 2015 at 11:58:13AM +0800, Qu Wenruo wrote:
> 
> 
> Omar Sandoval wrote on 2015/09/10 20:48 -0700:
> >On Fri, Sep 11, 2015 at 09:21:13AM +0800, Qu Wenruo wrote:
> >>Hi Omar,
> >>
> >>Thanks for your patchset.
> >>Quite a nice one, and debug-tree can give better output on space cache.
> >>With current implement, space cache is near a black box in debug-tree
> >>output.
> >>
> >>And current on disk format is not quite easy to understand.(In fact, space
> >>cache is restored in tree root, as a NODATACOW inode, quite wired)
> >>
> >>Also, it should provide a quite good base for rework inode cache for future
> >>development.
> >>
> >>
> >>But I'm still a little concerned about the performance.
> >>
> >>One of the problem using b-tree is, now we need to use btrfs_search_slot()
> >>to do modification, that means we will do level-based tree lock and COW.
> >>Personally speaking, I'd like to blame that for the slow metadata
> >>performance of btrfs.
> >>(Yeah personal experience, may be wrong again)
> >>
> >>So with the new implement every space cache operation will causing tree lock
> >>and cow.
> >>Unlike the old wired structure, which is done in a NODATACOW fashion.
> >>
> >>Hopes I'm wrong about it (and it seems I'm always wrong about all these
> >>assumption based performance thing).
> >>
> >>Thanks,
> >>Qu
> >
> >Hey, Qu,
> >
> >So the thing about the free space tree is that the B-tree is only
> >modified while running delayed refs, so we only incur any overhead
> >during a transaction commit. The numbers I got showed that the overhead
> >was better than the old free space cache and not too much more than not
> >using the cache. Now that I think about it, I only profiled it under
> >heavy load, though, it'd probably be a good idea to get some numbers for
> >more typical workloads, but I don't currently have access to any
> >reasonable hardware.
> >
> >Thanks,
> >Omar
> 
> Great, if its performance is better than old one under heavy load, then I'm
> completely OK with it.
> 
> Nice job!

Thanks! The v1 post has specific numbers if you want to take a look:
http://www.spinics.net/lists/linux-btrfs/msg46713.html.

> BTW, don't forget to add btrfs-debug-tree and fsck support for the new
> implement. I can't even wait to see these one merged now.

Yup, the btrfs-progs patches include both :) The only caveat is that
there's no visibility into the bitmap items from btrfs-debug-tree, but
that wouldn't be too hard to add.
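For what such output could look like, here's a sketch of decoding a free-space bitmap item into extents. It assumes one bit per sectorsize block and little-endian bit order (bit 0 is the LSB of byte 0), which is my understanding of the on-disk bitmaps; treat the details as hedged.

```python
def bitmap_to_extents(offset, bitmap, sectorsize=4096):
    """Decode a free-space bitmap into (start, length) free extents."""
    extents = []
    start = None
    nbits = len(bitmap) * 8
    for i in range(nbits + 1):                 # +1 flushes a trailing run
        bit = (bitmap[i // 8] >> (i % 8)) & 1 if i < nbits else 0
        if bit and start is None:
            start = offset + i * sectorsize    # a run of free space begins
        elif not bit and start is not None:
            extents.append((start, offset + i * sectorsize - start))
            start = None
    return extents

# 0x0F: the first four sectors are free
assert bitmap_to_extents(0, bytes([0x0F])) == [(0, 4 * 4096)]
```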

> Thanks,
> Qu
> >
> >>Omar Sandoval wrote on 2015/09/03 12:44 -0700:
> >>>Here's version 2 of the the free space B-tree patches, addressing
> >>>Josef's review from the last round, which you can find here:
> >>>http://www.spinics.net/lists/linux-btrfs/msg46713.html
> >>>
> >>>Changes from v1->v2:
> >>>
> >>>- Cleaned up a bunch of unnecessary instances of "if (ret) goto out; ret = 0"
> >>>- Added aborts in the free space tree code closer to the site the error
> >>>   is encountered: where we add or remove block groups, add or remove
> >>>   free space, and also when we convert formats
> >>>- Moved loading of the free space tree into caching_thread() and added a
> >>>   new patch 4 in preparation for it
> >>>- Commented a bunch of stuff in the extent buffer bitmap operations and
> >>>   refactored some of the complicated logic
> >>>- Added sanity tests for the extent buffer bitmap operations and free
> >>>   space tree (patches 2 and 6)
> >>>- Added Josef's Reviewed-by tags
> >>>
> >>>Omar Sandoval (9):
> >>>   Btrfs: add extent buffer bitmap operations
> >>>   Btrfs: add extent buffer bitmap sanity tests
> >>>   Btrfs: add helpers for read-only compat bits
> >>>   Btrfs: refactor caching_thread()
> >>>   Btrfs: introduce the free space B-tree on-disk format
> >>>   Btrfs: implement the free space B-tree
> >>>   Btrfs: add free space tree sanity tests
> >>>   Btrfs: wire up the free space tree to the extent tree
> >>>   Btrfs: add free space tree mount option
> >>>
> >>>  fs/btrfs/Makefile                      |    5 +-
> >>>  fs/btrfs/ctree.h                       |  107 ++-
> >>>  fs/btrfs/disk-io.c                     |   26 +
> >>>  fs/btrfs/extent-tree.c                 |  112 ++-
> >>>  fs/btrfs/extent_io.c                   |  183 +++-
> >>>  fs/btrfs/extent_io.h                   |   10 +-
> >>>  fs/btrfs/free-space-tree.c             | 1501 ++++++++++++++++++++++++++++++++
> >>>  fs/btrfs/free-space-tree.h             |   71 ++
> >>>  fs/btrfs/super.c                       |   24 +-
> >>>  fs/btrfs/tests/btrfs-tests.c           |   52 ++
> >>>  fs/btrfs/tests/btrfs-tests.h           |   10 +
> >>>  fs/btrfs/tests/extent-io-tests.c       |  138 ++-
> >>>  fs/btrfs/tests/free-space-tests.c      |   35 +-
> >>>  fs/btrfs/tests/free-space-tree-tests.c |  570 ++++++++++++
> >>>  fs/btrfs/tests/qgroup-tests.c          |   20 +-
> >>>  include/trace/events/btrfs.h           |    3 +-
> >>>  16 files changed, 2763 insertions(+), 104 deletions(-)
> >>>  create mode 100644 fs/btrfs/free-space-tree.c
> >>>  create mode 100644 fs/btrfs/free-space-tree.h
> >>>  create mode 100644 fs/btrfs/tests/free-space-tree-tests.c
> >>>
> >

-- 
Omar

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/9] free space B-tree
  2015-09-11  1:21   ` Qu Wenruo
  2015-09-11  3:48     ` Omar Sandoval
@ 2015-09-22 14:41     ` David Sterba
  1 sibling, 0 replies; 43+ messages in thread
From: David Sterba @ 2015-09-22 14:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Omar Sandoval, linux-btrfs

On Fri, Sep 11, 2015 at 09:21:13AM +0800, Qu Wenruo wrote:
> Also, it should provide a quite good base for rework inode cache for 
> future development.

You mean what's now under the 'inode_cache' mount option? It builds on
the free space cache infrastructure, so it would be natural to use it as
well. But the use case for inode_cache is still uncommon.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-09-22 14:42 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-01 19:01 [PATCH 0/6] free space B-tree Omar Sandoval
2015-09-01 19:01 ` [PATCH 1/6] Btrfs: add extent buffer bitmap operations Omar Sandoval
2015-09-01 19:25   ` Josef Bacik
2015-09-01 19:37     ` Omar Sandoval
2015-09-01 19:01 ` [PATCH 2/6] Btrfs: add helpers for read-only compat bits Omar Sandoval
2015-09-01 19:26   ` Josef Bacik
2015-09-01 19:01 ` [PATCH 3/6] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
2015-09-01 19:28   ` Josef Bacik
2015-09-01 19:05 ` [PATCH 5/6] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
2015-09-01 19:48   ` Josef Bacik
2015-09-02  4:42     ` Omar Sandoval
2015-09-02 15:29       ` Josef Bacik
2015-09-01 19:05 ` [PATCH 6/6] Btrfs: add free space tree mount option Omar Sandoval
2015-09-01 19:49   ` Josef Bacik
2015-09-01 19:13 ` [PATCH 4/6] Btrfs: implement the free space B-tree Omar Sandoval
2015-09-01 19:44   ` Josef Bacik
2015-09-01 20:06     ` Omar Sandoval
2015-09-01 20:08       ` Josef Bacik
2015-09-01 19:17 ` [PATCH 0/6] " Omar Sandoval
2015-09-01 19:22 ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots Omar Sandoval
2015-09-01 19:22   ` [PATCH 2/3] btrfs-progs: add basic awareness of the free space tree Omar Sandoval
2015-09-01 19:22   ` [PATCH 3/3] btrfs-progs: check the free space tree in btrfsck Omar Sandoval
2015-09-02 15:02   ` [PATCH 1/3] btrfs-progs: use calloc instead of malloc+memset for tree roots David Sterba
2015-09-03 19:44 ` [PATCH v2 0/9] free space B-tree Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 1/9] Btrfs: add extent buffer bitmap operations Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 2/9] Btrfs: add extent buffer bitmap sanity tests Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 3/9] Btrfs: add helpers for read-only compat bits Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 4/9] Btrfs: refactor caching_thread() Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 5/9] Btrfs: introduce the free space B-tree on-disk format Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 6/9] Btrfs: implement the free space B-tree Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 7/9] Btrfs: add free space tree sanity tests Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 8/9] Btrfs: wire up the free space tree to the extent tree Omar Sandoval
2015-09-04  5:56     ` Omar Sandoval
2015-09-03 19:44   ` [PATCH v2 9/9] Btrfs: add free space tree mount option Omar Sandoval
2015-09-09 12:00     ` David Sterba
2015-09-11  0:52       ` Omar Sandoval
2015-09-04  1:29   ` [PATCH v2 0/9] free space B-tree Zhao Lei
2015-09-04  5:43     ` Omar Sandoval
2015-09-11  1:21   ` Qu Wenruo
2015-09-11  3:48     ` Omar Sandoval
2015-09-11  3:58       ` Qu Wenruo
2015-09-11  4:15         ` Omar Sandoval
2015-09-22 14:41     ` David Sterba
