linux-btrfs.vger.kernel.org archive mirror
* [RFC PATCH 00/17] btrfs: implementation of priority aware allocator
@ 2018-11-28  3:11 Su Yue
  2018-11-28  3:11 ` [RFC PATCH 01/17] btrfs: priority alloc: prepare " Su Yue
                   ` (17 more replies)
  0 siblings, 18 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

This patchset can be fetched from the repo:
https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
The patchset 'btrfs: Refactor find_free_extent()' does a nice job of
simplifying find_free_extent(), so this patchset depends on that refactor.
The base is the commit in kdave/misc-next:

commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
Author: Nikolay Borisov <nborisov@suse.com>
Date:   Tue Nov 6 16:40:20 2018 +0200

    btrfs: Always try all copies when reading extent buffers


This patchset introduces a new mount option named 'priority_alloc=%s',
where %s currently supports "usage" and "off". The mount option changes
how find_free_extent() searches block groups.
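
For example, to enable the usage-based policy (the device and mount
point below are placeholders):

  mount -o priority_alloc=usage /dev/sdx /mnt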

Previously, block groups were stored in lists in btrfs_space_info,
ordered by start position. When find_free_extent() is called without a
hint, block groups are searched one by one.

Design of the priority aware allocator:
Every block group has its own priority. We split priorities into many
levels, and block groups are placed in different trees according to
their priorities. Those trees are sorted by level and stored in the
space_info. When find_free_extent() is called, block groups in a higher
priority level are searched before those in a lower level, so a block
group with a higher priority is more likely to be used.
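
As an illustration, the search order reduces to something like the
sketch below. The structures here are made up for the example only; the
real patches use rb-trees of struct btrfs_priority_tree protected by
rw_semaphores:

  #include <stddef.h>

  /* Illustrative sketch only, not the patch code. */
  struct bg {
          unsigned long long free;        /* free bytes in the group */
  };

  struct level {
          struct bg *groups;              /* block groups at this level */
          int nr;
  };

  static struct bg *pick(struct level *levels, int max_level,
                         unsigned long long needed)
  {
          int l, i;

          /* Higher level means higher priority (more used), scan down. */
          for (l = max_level; l >= 0; l--)
                  for (i = 0; i < levels[l].nr; i++)
                          if (levels[l].groups[i].free >= needed)
                                  return &levels[l].groups[i];

          return NULL;    /* caller falls back to allocating a new chunk */
  }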

Pros:
1) Reduce the frequency of balance.
   Block groups with a higher usage rate will be used preferentially
   for allocating extents, and empty block groups whose pinned bytes
   are non-zero can then be freed.[1]

2) The priority of an empty block group with non-zero pinned bytes
   will be set to the lowest level.

3) Support zoned block devices.[2]
   For metadata allocation, block groups in conventional zones will be
   used as much as possible regardless of usage rate.
   This will be done in the future.

Cons:
1) An expected performance regression.
   The degree of the decline is unknown for now.
   The user can disable block group priority to get full performance.

TESTS:

If usage is used as the priority (the only available option), empty
block groups are much harder to reuse.

About block group usage:
  Disk: 4 x 1T HDD gathered in LVM.

  A script created and deleted files randomly in a loop.
  The number of files created was double the number deleted.

  Result with the default mount options:
  https://i.loli.net/2018/11/28/5bfdfdf08c760.png

  Result with the priority aware allocator (usage):
  https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png

  The X axis is total disk usage, the Y axis is average block group
  usage.

  Due to fragmentation of extents, the difference is not obvious,
  only about a 1% improvement.

Performance regression:
   I have run sysbench on our machine with an SSD in multiple
   combinations; no obvious regression was found.
   However, in theory, the new allocator may cost more time in some
   cases.

[1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
[2] https://lkml.org/lkml/2018/8/16/174

---
Due to constraints of time and hardware, the use case is not
demonstrated convincingly enough, and some of the code is dirty but I
could not find another way. So I send the series as an RFC.
Any comments and suggestions are welcome.

Su Yue (17):
  btrfs: priority alloc: prepare of priority aware allocator
  btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
  btrfs: priority alloc: introduce compute_block_group_priority/usage
  btrfs: priority alloc: add functions to create/remove priority trees
  btrfs: priority alloc: introduce functions to add block group to
    priority tree
  btrfs: priority alloc: introduce three macros to mark block group
    status
  btrfs: priority alloc: add functions to remove block group from
    priority tree
  btrfs: priority alloc: add btrfs_update_block_group_priority()
  btrfs: priority alloc: call create/remove_priority_trees in space_info
  btrfs: priority alloc: call add_block_group_priority while reading or
    making block group
  btrfs: priority alloc: remove block group from priority tree while
    removing block group
  btrfs: priority alloc: introduce find_free_extent_search()
  btrfs: priority alloc: modify find_free_extent() to fit priority
    allocator
  btrfs: priority alloc: introduce btrfs_set_bg_updating and call
    btrfs_update_block_group_priority
  btrfs: priority alloc: write bg->priority_groups_sem while waiting
    reservation
  btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
    race in btrfs_delete_unused_bgs()
  btrfs: add mount option "priority_alloc=%s"

 fs/btrfs/ctree.h            |  28 ++
 fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
 fs/btrfs/free-space-cache.c |   3 +
 fs/btrfs/super.c            |  18 +
 fs/btrfs/transaction.c      |   1 +
 5 files changed, 681 insertions(+), 41 deletions(-)

-- 
2.19.1




^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 01/17] btrfs: priority alloc: prepare of priority aware allocator
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  8:24   ` Nikolay Borisov
  2018-11-28  3:11 ` [RFC PATCH 02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE Su Yue
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

To implement the priority aware allocator, this patch:
Introduces struct btrfs_priority_tree, which contains block groups
of the same level.
Adds the member priority to struct btrfs_block_group_cache, and a
pointer to the priority tree where the block group is located.

Adds the member priority_trees to struct btrfs_space_info to represent
the priority trees of the different raid types.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e62824cae00a..5c4651d8a524 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -437,6 +437,8 @@ struct btrfs_space_info {
 	struct rw_semaphore groups_sem;
 	/* for block groups in our same type */
 	struct list_head block_groups[BTRFS_NR_RAID_TYPES];
+	/* for priority trees in our same type */
+	struct rb_root priority_trees[BTRFS_NR_RAID_TYPES];
 	wait_queue_head_t wait;
 
 	struct kobject kobj;
@@ -558,6 +560,21 @@ struct btrfs_full_stripe_locks_tree {
 	struct mutex lock;
 };
 
+/*
+ * Tree to record all block_groups in same priority level.
+ * Only used in priority aware allocator.
+ */
+struct btrfs_priority_tree {
+	/* protected by groups_sem */
+	struct rb_root block_groups;
+	struct rw_semaphore groups_sem;
+
+	/* for priority trees of different levels in the same index */
+	struct rb_node node;
+
+	int level;
+};
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -571,6 +588,8 @@ struct btrfs_block_group_cache {
 	u64 flags;
 	u64 cache_generation;
 
+	/* It's used only when priority aware allocator is enabled. */
+	long priority;
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
@@ -616,6 +635,9 @@ struct btrfs_block_group_cache {
 	/* for block groups in the same raid type */
 	struct list_head list;
 
+	/* for block groups in the same priority level */
+	struct rb_node node;
+
 	/* usage count */
 	atomic_t count;
 
@@ -670,6 +692,8 @@ struct btrfs_block_group_cache {
 
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+	struct btrfs_priority_tree *priority_tree;
 };
 
 /* delayed seq elem */
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
  2018-11-28  3:11 ` [RFC PATCH 01/17] btrfs: priority alloc: prepare " Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage Su Yue
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

For now the implementation of the priority allocator only supports the
usage option. Add BTRFS_MOUNT_PRIORITY_USAGE for further commits.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5c4651d8a524..4c56baf9f7cf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1406,6 +1406,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE	(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY		(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY		(1 << 28)
+#define BTRFS_MOUNT_PRIORITY_USAGE      (1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL	(30)
 #define BTRFS_DEFAULT_MAX_INLINE	(2048)
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
  2018-11-28  3:11 ` [RFC PATCH 01/17] btrfs: priority alloc: prepare " Su Yue
  2018-11-28  3:11 ` [RFC PATCH 02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  8:56   ` Nikolay Borisov
  2018-11-28  3:11 ` [RFC PATCH 04/17] btrfs: priority alloc: add functions to create/remove priority trees Su Yue
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Introduce compute_block_group_usage() and compute_block_group_priority(),
and call the latter in btrfs_make_block_group() and
btrfs_read_block_groups().

compute_priority_level() uses ilog2(free) to compute the priority level.
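
As a worked example (assuming usage is a percentage, as computed by
compute_block_group_usage()): for usage == 73, free = 100 - 73 = 27 and
ilog2(27) = 4, so level = ilog2(100) + 1 - 4 = 7 - 4 = 3. A full block
group (free == 0) gets the highest level 7, while a completely empty
one (free == 100) gets level 0, which is reserved for unused block
groups.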

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..0f4c5b1e0bcc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10091,6 +10091,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -10224,6 +10225,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 
 		link_block_group(cache);
 
+		cache->priority = compute_block_group_priority(cache);
+
 		set_avail_alloc_bits(info, cache->flags);
 		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
 			inc_block_group_ro(cache, 1);
@@ -10373,6 +10376,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 
 	link_block_group(cache);
 
+	cache->priority = compute_block_group_priority(cache);
+
 	list_add_tail(&cache->bg_list, &trans->new_bgs);
 
 	set_avail_alloc_bits(fs_info, type);
@@ -11190,3 +11195,58 @@ void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
 	}
 	spin_unlock(&fs_info->unused_bgs_lock);
 }
+
+enum btrfs_priority_shift {
+	PRIORITY_USAGE_SHIFT = 0
+};
+
+static inline u8
+compute_block_group_usage(struct btrfs_block_group_cache *cache)
+{
+	u64 num_bytes;
+	u8 usage;
+
+	num_bytes = cache->reserved + cache->bytes_super +
+		btrfs_block_group_used(&cache->item);
+
+	usage = div_u64(num_bytes, div_factor_fine(cache->key.offset, 1));
+
+	return usage;
+}
+
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg)
+{
+	u8 usage;
+	long priority = 0;
+
+	if (btrfs_test_opt(bg->fs_info, PRIORITY_USAGE)) {
+		usage = compute_block_group_usage(bg);
+		priority |= usage << PRIORITY_USAGE_SHIFT;
+	}
+
+	return priority;
+}
+
+static int compute_priority_level(struct btrfs_fs_info *fs_info,
+				  long priority)
+{
+	int level = 0;
+
+	if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
+		u8 free;
+
+		WARN_ON(priority < 0);
+		free = 100 - (priority >> PRIORITY_USAGE_SHIFT);
+
+		if (free == 0)
+			level = 0;
+		else if (free == 100)
+			level = ilog2(free) + 1;
+		else
+			level = ilog2(free);
+		/* log2(1) == 0, leave level 0 for unused block_groups */
+		level = ilog2(100) + 1 - level;
+	}
+
+	return level;
+}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 04/17] btrfs: priority alloc: add functions to create/remove priority trees
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (2 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 05/17] btrfs: priority alloc: introduce functions to add block group to priority tree Su Yue
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Introduce create_priority_trees() to create priority trees in
space_info.
Introduce remove_priority_trees() to remove priority trees in
space_info.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 94 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0f4c5b1e0bcc..787a68b5bdcb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11250,3 +11250,97 @@ static int compute_priority_level(struct btrfs_fs_info *fs_info,
 
 	return level;
 }
+
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info)
+{
+	if (btrfs_test_opt(fs_info, PRIORITY_USAGE))
+		return true;
+	return false;
+}
+
+static void init_priority_tree(struct btrfs_priority_tree *pt, int level)
+{
+	pt->block_groups = RB_ROOT;
+	init_rwsem(&pt->groups_sem);
+	pt->level = level;
+}
+
+static void remove_priority_trees(struct btrfs_fs_info *fs_info,
+				  struct btrfs_space_info *space_info)
+{
+	struct rb_root *root;
+	struct btrfs_priority_tree *pt, *next_pt;
+	int i;
+
+	if (!is_priority_alloc_enabled(fs_info))
+		return;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		root = &space_info->priority_trees[i];
+		rbtree_postorder_for_each_entry_safe(pt, next_pt, root, node) {
+			kfree(pt);
+		}
+		space_info->priority_trees[i] = RB_ROOT;
+	}
+}
+
+static int insert_priority_tree(struct rb_root *root,
+				struct btrfs_priority_tree *pt)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct btrfs_priority_tree *tmp;
+
+	while (*p) {
+		parent = *p;
+		tmp = rb_entry(parent, struct btrfs_priority_tree, node);
+		if (pt->level > tmp->level)
+			p = &(*p)->rb_left;
+		else if (pt->level < tmp->level)
+			p = &(*p)->rb_right;
+		else
+			return -EEXIST;
+	}
+
+	rb_link_node(&pt->node, parent, p);
+	rb_insert_color(&pt->node, root);
+	return 0;
+}
+
+static int create_priority_trees(struct btrfs_fs_info *fs_info,
+				 struct btrfs_space_info *space_info)
+{
+	struct rb_root *root;
+	struct btrfs_priority_tree *pt;
+	int ret = 0;
+	int i, level, max_level;
+	u64 priority;
+
+	if (!is_priority_alloc_enabled(fs_info))
+		return 0;
+
+	if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
+		priority = (u8)100 << PRIORITY_USAGE_SHIFT;
+		max_level = compute_priority_level(fs_info, priority);
+	}
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		root = &space_info->priority_trees[i];
+
+		for (level = 0; level <= max_level; level++) {
+			pt = kzalloc(sizeof(*pt), GFP_NOFS);
+			if (!pt) {
+				ret = -ENOMEM;
+				break;
+			}
+			init_priority_tree(pt, level);
+			ret = insert_priority_tree(root, pt);
+			if (ret)
+				break;
+		}
+	}
+
+	if (ret)
+		remove_priority_trees(fs_info, space_info);
+	return ret;
+}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 05/17] btrfs: priority alloc: introduce functions to add block group to priority tree
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (3 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 04/17] btrfs: priority alloc: add functions to create/remove priority trees Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 06/17] btrfs: priority alloc: introduce three macros to mark block group status Su Yue
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Introduce compute_priority_level() to compute the priority level
according to the priority; for now it just divides by
PRIORITY_USAGE_FACTOR.

Introduce add_block_group_priority() to add block groups to the
priority tree.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 787a68b5bdcb..d63078930a1e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11344,3 +11344,79 @@ static int create_priority_trees(struct btrfs_fs_info *fs_info,
 		remove_priority_trees(fs_info, space_info);
 	return ret;
 }
+
+static int link_block_group_priority(struct btrfs_priority_tree *tree,
+				     struct btrfs_block_group_cache *cache)
+{
+	struct rb_node **p = &tree->block_groups.rb_node;
+	struct rb_node *parent = NULL;
+	struct btrfs_block_group_cache *tmp;
+
+	while (*p) {
+		parent = *p;
+		tmp = rb_entry(parent, struct btrfs_block_group_cache, node);
+		if (cache->key.objectid < tmp->key.objectid)
+			p = &(*p)->rb_left;
+		else if (cache->key.objectid > tmp->key.objectid)
+			p = &(*p)->rb_right;
+		else
+			return -EEXIST;
+	}
+
+	rb_link_node(&cache->node, parent, p);
+	rb_insert_color(&cache->node, &tree->block_groups);
+	return 0;
+}
+
+static struct btrfs_priority_tree *
+search_priority_tree(struct rb_root *root, int level)
+{
+	struct rb_node *node = root->rb_node;
+	struct btrfs_priority_tree *pt;
+
+	while (node) {
+		pt = rb_entry(node, struct btrfs_priority_tree, node);
+
+		if (level > pt->level)
+			node = node->rb_left;
+		else if (level < pt->level)
+			node = node->rb_right;
+		else
+			return pt;
+	}
+
+	return NULL;
+}
+
+static void add_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_priority_tree *pt;
+	int index = btrfs_bg_flags_to_raid_index(cache->flags);
+	int level, max_level, min_level;
+	int ret;
+
+	if (!is_priority_alloc_enabled(cache->fs_info))
+		return;
+
+	if (btrfs_test_opt(cache->fs_info, PRIORITY_USAGE)) {
+		max_level = compute_priority_level(cache->fs_info,
+					   (u8)100 << PRIORITY_USAGE_SHIFT);
+		min_level = 0;
+	}
+
+	level = compute_priority_level(cache->fs_info, cache->priority);
+	if (level > max_level || level < min_level) {
+		WARN_ON(level);
+		return;
+	}
+
+	pt = search_priority_tree(&cache->space_info->priority_trees[index],
+				  level);
+	BUG_ON(pt == NULL);
+
+	down_write(&pt->groups_sem);
+	cache->priority_tree = pt;
+	ret = link_block_group_priority(pt, cache);
+	up_write(&pt->groups_sem);
+	BUG_ON(ret);
+}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 06/17] btrfs: priority alloc: introduce three macros to mark block group status
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (4 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 05/17] btrfs: priority alloc: introduce functions to add block group to priority tree Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 07/17] btrfs: priority alloc: add functions to remove block group from priority tree Su Yue
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

In the original design, the order of space_info::block_groups changes
only when block groups are added or removed.

In the priority aware allocator, block groups may be moved from one
priority tree to another. That operation usually looks like:
1)  lock block_group
    down_write first_tree
    down_write second_tree
    do unlink
    ....
    do link
    up_write second_tree
    up_write first_tree
    unlock block_group
However, the expected order in find_free_extent() is
2)  down_read tree
    lock block_group
    ...
    unlock block_group
    up_read tree

Obviously, the lock orders of operations 1 and 2 are contrary to each
other and will cause a deadlock.

The ugly workaround is to introduce special status values to mark the
block group state.
Then:

1) After priority changed:
    lock block_group
    mark block_group to be updated
    unlock block_group

    down_write first_tree
    down_write second_tree

    lock block_group
    check block group should be moved
    do unlink
    ....
    do link
    unlock block_group
    up_write second_tree
    up_write first_tree

2) find_free_extent():
    down_read tree
    lock block_group
    check the block group is not in special status
    ...
    unlock block_group
    up_read tree

This patch introduces three macros to represent a block group that is
being removed / needs to be updated / is busy.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d63078930a1e..5bae757786dc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -31,6 +31,15 @@
 
 #undef SCRAMBLE_DELAYED_REFS
 
+/* The block group is used by find_free_extent() */
+#define PRIORITY_BG_BUSY -1
+
+/* The block group is being removed or has been removed */
+#define PRIORITY_BG_DELETED -2
+
+/* The block group's priority needs to be updated */
+#define PRIORITY_BG_UPDATING -3
+
 /*
  * control flags for do_chunk_alloc's force field
  * CHUNK_ALLOC_NO_FORCE means to only allocate a chunk
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 07/17] btrfs: priority alloc: add functions to remove block group from priority tree
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (5 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 06/17] btrfs: priority alloc: introduce three macros to mark block group status Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 08/17] btrfs: priority alloc: add btrfs_update_block_group_priority() Su Yue
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Introduce btrfs_remove_block_group_priority() to remove a block group
from its priority tree.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5bae757786dc..b559c9a9afc6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11429,3 +11429,40 @@ static void add_block_group_priority(struct btrfs_block_group_cache *cache)
 	up_write(&pt->groups_sem);
 	BUG_ON(ret);
 }
+
+static void unlink_block_group_priority(struct btrfs_priority_tree *pt,
+					struct btrfs_block_group_cache *cache)
+{
+	rb_erase(&cache->node, &pt->block_groups);
+	RB_CLEAR_NODE(&cache->node);
+}
+
+void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_priority_tree *pt;
+
+	if (!is_priority_alloc_enabled(cache->fs_info))
+		return;
+
+	spin_lock(&cache->lock);
+	if (cache->priority_tree == NULL) {
+		spin_unlock(&cache->lock);
+		return;
+	}
+
+	pt = cache->priority_tree;
+	cache->priority = PRIORITY_BG_DELETED;
+	spin_unlock(&cache->lock);
+
+	down_write(&pt->groups_sem);
+	spin_lock(&cache->lock);
+
+	if (cache->priority_tree == NULL)
+		goto out;
+
+	unlink_block_group_priority(pt, cache);
+	cache->priority_tree = NULL;
+out:
+	spin_unlock(&cache->lock);
+	up_write(&pt->groups_sem);
+}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 08/17] btrfs: priority alloc: add btrfs_update_block_group_priority()
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (6 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 07/17] btrfs: priority alloc: add functions to remove block group from priority tree Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 09/17] btrfs: priority alloc: call create/remove_priority_trees in space_info Su Yue
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Introduce btrfs_update_block_group_priority() to update
btrfs_block_group_cache::priority. It will move the block group from
the old tree to the new one if needed.
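
The two trees' semaphores are taken in a fixed order, higher level
first. A minimal sketch of that ordering (lock_trees_for_move() is a
hypothetical helper for illustration, not part of the patch):

  static void lock_trees_for_move(struct btrfs_priority_tree *old_tree,
                                  struct btrfs_priority_tree *new_tree)
  {
          struct btrfs_priority_tree *first, *second;

          /*
           * Always lock the higher-level tree first so that concurrent
           * movers share one global lock order and cannot ABBA-deadlock.
           */
          if (old_tree->level > new_tree->level) {
                  first = old_tree;
                  second = new_tree;
          } else {
                  first = new_tree;
                  second = old_tree;
          }

          down_write(&first->groups_sem);
          down_write(&second->groups_sem);
          /* ... unlink from old_tree, relink into new_tree ... */
          up_write(&second->groups_sem);
          up_write(&first->groups_sem);
  }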

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b559c9a9afc6..5ea1f2e40701 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11466,3 +11466,79 @@ void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache)
 	spin_unlock(&cache->lock);
 	up_write(&pt->groups_sem);
 }
+
+static inline struct btrfs_priority_tree *
+find_priority_tree(struct btrfs_space_info *sinfo, u64 flags, int level)
+{
+	int index = btrfs_bg_flags_to_raid_index(flags);
+
+	return search_priority_tree(&sinfo->priority_trees[index], level);
+}
+
+void btrfs_update_block_group_priority(struct btrfs_block_group_cache *cache)
+{
+	struct btrfs_priority_tree *old_tree, *new_tree;
+	struct rw_semaphore *front_sem, *back_sem;
+	long priority;
+	int old_level, new_level;
+
+	if (!is_priority_alloc_enabled(cache->fs_info))
+		return;
+
+	spin_lock(&cache->lock);
+	if (cache->priority != PRIORITY_BG_UPDATING) {
+		spin_unlock(&cache->lock);
+		return;
+	}
+
+	old_level = cache->priority_tree->level;
+	priority = compute_block_group_priority(cache);
+	new_level = compute_priority_level(cache->fs_info, priority);
+
+	/* no need to move the block group */
+	if (old_level == new_level) {
+		cache->priority = priority;
+		spin_unlock(&cache->lock);
+		return;
+	}
+	spin_unlock(&cache->lock);
+
+	old_tree = cache->priority_tree;
+	new_tree = find_priority_tree(cache->space_info, cache->flags,
+				      new_level);
+
+	if (!old_tree || !new_tree) {
+		btrfs_err(cache->fs_info,
+"can't found priority tree %d for block_group %llu old level %d new priority %ld",
+			  old_tree ? new_level : old_level,
+			  cache->key.objectid, old_level, priority);
+		BUG();
+	}
+
+	if (old_level > new_level) {
+		front_sem = &old_tree->groups_sem;
+		back_sem = &new_tree->groups_sem;
+	} else {
+		front_sem = &new_tree->groups_sem;
+		back_sem = &old_tree->groups_sem;
+	}
+
+	down_write(front_sem);
+	down_write(back_sem);
+	spin_lock(&cache->lock);
+
+	/* the block group was removed or is being removed */
+	if (cache->priority != PRIORITY_BG_UPDATING ||
+	    priority != compute_block_group_priority(cache) ||
+	    old_tree != cache->priority_tree)
+		goto out;
+
+	unlink_block_group_priority(old_tree, cache);
+	cache->priority = priority;
+	link_block_group_priority(new_tree, cache);
+	cache->priority_tree = new_tree;
+out:
+	spin_unlock(&cache->lock);
+	up_write(front_sem);
+	up_write(back_sem);
+}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 09/17] btrfs: priority alloc: call create/remove_priority_trees in space_info
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (7 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 08/17] btrfs: priority alloc: add btrfs_update_block_group_priority() Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 10/17] btrfs: priority alloc: call add_block_group_priority while reading or making block group Su Yue
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Call create_priority_trees() in create_space_info().
Call remove_priority_trees() before freeing the space_info.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5ea1f2e40701..2dec02782df1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3940,6 +3940,10 @@ static const char *alloc_name(u64 flags)
 	};
 }
 
+static int create_priority_trees(struct btrfs_fs_info *fs_info,
+				 struct btrfs_space_info *space_info);
+static void remove_priority_trees(struct btrfs_fs_info *fs_info,
+				  struct btrfs_space_info *space_info);
 static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 {
 
@@ -3958,8 +3962,17 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 		return ret;
 	}
 
-	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
 		INIT_LIST_HEAD(&space_info->block_groups[i]);
+		space_info->priority_trees[i] = RB_ROOT;
+	}
+
+	ret = create_priority_trees(info, space_info);
+	if (ret) {
+		kfree(space_info);
+		return ret;
+	}
+
 	init_rwsem(&space_info->groups_sem);
 	spin_lock_init(&space_info->lock);
 	space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
@@ -3974,6 +3987,7 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
 				    alloc_name(space_info->flags));
 	if (ret) {
 		percpu_counter_destroy(&space_info->total_bytes_pinned);
+		remove_priority_trees(info, space_info);
 		kfree(space_info);
 		return ret;
 	}
@@ -9928,6 +9942,8 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 			    space_info->bytes_may_use > 0))
 			dump_space_info(info, space_info, 0, 0);
 		list_del(&space_info->list);
+		remove_priority_trees(info, space_info);
+
 		for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
 			struct kobject *kobj;
 			kobj = space_info->block_group_kobjs[i];
@@ -11282,9 +11298,6 @@ static void remove_priority_trees(struct btrfs_fs_info *fs_info,
 	struct btrfs_priority_tree *pt, *next_pt;
 	int i;
 
-	if (!is_priority_alloc_enabled(fs_info))
-		return;
-
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
 		root = &space_info->priority_trees[i];
 		rbtree_postorder_for_each_entry_safe(pt, next_pt, root, node) {
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 10/17] btrfs: priority alloc: call add_block_group_priority while reading or making block group
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (8 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 09/17] btrfs: priority alloc: call create/remove_priority_trees in space_info Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 11/17] btrfs: priority alloc: remove block group from priority tree while removing " Su Yue
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Add the block group to the priority tree in btrfs_read_block_groups()
and btrfs_make_block_group().

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2dec02782df1..fc40901b4772 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10117,6 +10117,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 }
 
 static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
+static void add_block_group_priority(struct btrfs_block_group_cache *cache);
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -10251,6 +10252,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		link_block_group(cache);
 
 		cache->priority = compute_block_group_priority(cache);
+		add_block_group_priority(cache);
 
 		set_avail_alloc_bits(info, cache->flags);
 		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
@@ -10402,6 +10404,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	link_block_group(cache);
 
 	cache->priority = compute_block_group_priority(cache);
+	add_block_group_priority(cache);
 
 	list_add_tail(&cache->bg_list, &trans->new_bgs);
 
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 11/17] btrfs: priority alloc: remove block group from priority tree while removing block group
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (9 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 10/17] btrfs: priority alloc: call add_block_group_priority while reading or making block group Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 12/17] btrfs: priority alloc: introduce find_free_extent_search() Su Yue
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Export btrfs_remove_block_group_priority() in the header ctree.h.
Call btrfs_remove_block_group_priority() while deleting
transaction->deleted_bgs, and in btrfs_free_block_groups() and
btrfs_remove_block_group().

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       | 1 +
 fs/btrfs/extent-tree.c | 3 +++
 fs/btrfs/transaction.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c56baf9f7cf..091b878e326c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2752,6 +2752,7 @@ u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info);
 u64 btrfs_metadata_alloc_profile(struct btrfs_fs_info *fs_info);
 u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
+void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache);
 
 enum btrfs_reserve_flush_enum {
 	/* If we are in the transaction, we can't flush anything.*/
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc40901b4772..74955f79fcce 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9896,6 +9896,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		list_del(&block_group->list);
 		up_write(&block_group->space_info->groups_sem);
 
+		btrfs_remove_block_group_priority(block_group);
 		/*
 		 * We haven't cached this block group, which means we could
 		 * possibly have excluded extents on this block group.
@@ -10571,6 +10572,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 		clear_avail_alloc_bits(fs_info, block_group->flags);
 	}
 	up_write(&block_group->space_info->groups_sem);
+	btrfs_remove_block_group_priority(block_group);
+
 	if (kobj) {
 		kobject_del(kobj);
 		kobject_put(kobj);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f92c0a88c4ad..74234de9304a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -72,6 +72,7 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 						 struct btrfs_block_group_cache,
 						 bg_list);
 			list_del_init(&cache->bg_list);
+			btrfs_remove_block_group_priority(cache);
 			btrfs_put_block_group_trimming(cache);
 			btrfs_put_block_group(cache);
 		}
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 12/17] btrfs: priority alloc: introduce find_free_extent_search()
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (10 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 11/17] btrfs: priority alloc: remove block group from priority tree while removing " Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 13/17] btrfs: priority alloc: modify find_free_extent() to fit priority allocator Su Yue
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Originally, find_free_extent() just searched the block groups in
space_info one by one.

In the priority aware allocator, we search block groups in a higher
priority tree before those in a lower priority tree.

This helper unifies the above two ways for further use.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 77 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 74955f79fcce..5484256169dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7331,6 +7331,83 @@ struct find_free_extent_ctl {
 };
 
 
+/*
+ * Helper function for find_free_extent().
+ *
+ * Return the next block_group_cache to use.
+ * Return NULL when there is no more block group to use.
+ *
+ * If use_priority is false: return the next of @block_group in
+ * space_info->block_groups[index], or, if @block_group is NULL, return the
+ * first block group in the list.
+ *
+ * If use_priority is true: if *pt_ret is NULL, return the first block group
+ * in the highest priority level. If *pt_ret is NOT NULL, return the next of
+ * @block_group, or, if @block_group is NULL, return the first available
+ * block group located in a priority level <= (*pt_ret)->level.
+ * The tree will be down_read() if the return value is not NULL.
+ *
+ */
+static struct btrfs_block_group_cache *
+find_free_extent_search(struct btrfs_space_info *space_info, int index,
+		struct btrfs_block_group_cache *block_group, bool use_priority,
+		struct btrfs_priority_tree **pt_ret)
+{
+
+	struct list_head *lentry, *head;
+	struct rb_root *root;
+	struct rb_node *node, *bg_node;
+	struct btrfs_priority_tree *pt;
+	bool read;
+
+	if (!use_priority) {
+		head = &space_info->block_groups[index];
+		if (!block_group)
+			lentry = head->next;
+		else
+			lentry = block_group->list.next;
+		if (lentry == head)
+			return NULL;
+		return list_entry(lentry, struct btrfs_block_group_cache,
+				  list);
+	}
+
+	root = &space_info->priority_trees[index];
+	pt = pt_ret ? *pt_ret : NULL;
+
+	if (!pt) {
+		node = rb_first(root);
+		read = true;
+	} else {
+		node = &pt->node;
+		read = false;
+
+		if (block_group)
+			bg_node = rb_next(&block_group->node);
+		else
+			bg_node = rb_first(&pt->block_groups);
+	}
+
+	for (; node; node = rb_next(node)) {
+		pt = rb_entry(node, struct btrfs_priority_tree, node);
+		if (read) {
+			down_read(&pt->groups_sem);
+			bg_node = rb_first(&pt->block_groups);
+		}
+		if (bg_node) {
+			*pt_ret = pt;
+			return rb_entry(bg_node,
+					struct btrfs_block_group_cache, node);
+		}
+
+		up_read(&pt->groups_sem);
+		read = true;
+	}
+
+	return NULL;
+}
 /*
  * Helper function for find_free_extent().
  *
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 13/17] btrfs: priority alloc: modify find_free_extent() to fit priority allocator
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (11 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 12/17] btrfs: priority alloc: introduce find_free_extent_search() Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 14/17] btrfs: priority alloc: introduce btrfs_set_bg_updating and call btrfs_update_block_group_priority Su Yue
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Add the member priority_tree to find_free_extent_ctl to represent the
priority tree currently in use.

Modify find_free_extent() to use find_free_extent_search(), so it works
both with the default mount options and with the priority aware
allocator.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 114 ++++++++++++++++++++++++++++++++---------
 1 file changed, 91 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5484256169dd..4c76677a54a9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7328,6 +7328,8 @@ struct find_free_extent_ctl {
 
 	/* Found result */
 	u64 found_offset;
+
+	struct btrfs_priority_tree *priority_tree;
 };
 
 
@@ -7692,6 +7694,10 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 	return -ENOSPC;
 }
 
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info);
+
+static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
@@ -7729,6 +7735,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	struct btrfs_space_info *space_info;
 	bool use_cluster = true;
 	bool full_search = false;
+	bool use_priority = is_priority_alloc_enabled(fs_info);
 
 	WARN_ON(num_bytes < fs_info->sectorsize);
 
@@ -7744,6 +7751,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	ffe_ctl.have_caching_bg = false;
 	ffe_ctl.orig_have_caching_bg = false;
 	ffe_ctl.found_offset = 0;
+	ffe_ctl.priority_tree = NULL;
 
 	ins->type = BTRFS_EXTENT_ITEM_KEY;
 	ins->objectid = 0;
@@ -7813,40 +7821,82 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		 */
 		if (block_group && block_group_bits(block_group, flags) &&
 		    block_group->cached != BTRFS_CACHE_NO) {
-			down_read(&space_info->groups_sem);
-			if (list_empty(&block_group->list) ||
-			    block_group->ro) {
-				/*
-				 * someone is removing this block group,
-				 * we can't jump into the have_block_group
-				 * target because our list pointers are not
-				 * valid
-				 */
-				btrfs_put_block_group(block_group);
-				up_read(&space_info->groups_sem);
+			if (use_priority) {
+				spin_lock(&block_group->lock);
+				if (block_group->priority ==
+						    PRIORITY_BG_DELETED ||
+				    block_group->priority ==
+						    PRIORITY_BG_BUSY ||
+				    block_group->ro) {
+					spin_unlock(&block_group->lock);
+					btrfs_put_block_group(block_group);
+					goto search;
+				}
+
+				block_group->priority = PRIORITY_BG_BUSY;
+				spin_unlock(&block_group->lock);
+				ffe_ctl.priority_tree =
+					block_group->priority_tree;
+				down_read(&ffe_ctl.priority_tree->groups_sem);
 			} else {
-				ffe_ctl.index = btrfs_bg_flags_to_raid_index(
-						block_group->flags);
-				btrfs_lock_block_group(block_group, delalloc);
-				goto have_block_group;
+				down_read(&space_info->groups_sem);
+				if (list_empty(&block_group->list) ||
+				    block_group->ro) {
+					/*
+					 * someone is removing this block
+					 * group, we can't jump into the
+					 * have_block_group target because our
+					 * list pointers are not valid
+					 */
+					btrfs_put_block_group(block_group);
+					up_read(&space_info->groups_sem);
+					goto search;
+				}
 			}
+			ffe_ctl.index = btrfs_bg_flags_to_raid_index(
+				block_group->flags);
+			btrfs_lock_block_group(block_group, delalloc);
+			goto have_block_group;
 		} else if (block_group) {
 			btrfs_put_block_group(block_group);
 		}
 	}
 search:
+	ffe_ctl.priority_tree = NULL;
+	block_group = NULL;
 	ffe_ctl.have_caching_bg = false;
 	if (ffe_ctl.index == btrfs_bg_flags_to_raid_index(flags) ||
 	    ffe_ctl.index == 0)
 		full_search = true;
-	down_read(&space_info->groups_sem);
-	list_for_each_entry(block_group,
-			    &space_info->block_groups[ffe_ctl.index], list) {
+	if (!use_priority)
+		down_read(&space_info->groups_sem);
+	while (1) {
+		block_group = find_free_extent_search(space_info,
+				      ffe_ctl.index, block_group, use_priority,
+				      &ffe_ctl.priority_tree);
+
+		if (block_group == NULL)
+			break;
 		/* If the block group is read-only, we can skip it entirely. */
 		if (unlikely(block_group->ro))
 			continue;
 
 		btrfs_grab_block_group(block_group, delalloc);
+
+		if (use_priority) {
+			spin_lock(&block_group->lock);
+			/* the block group is unavailable now */
+			if (block_group->priority == PRIORITY_BG_DELETED ||
+			    block_group->priority == PRIORITY_BG_UPDATING ||
+			    block_group->priority == PRIORITY_BG_BUSY) {
+				spin_unlock(&block_group->lock);
+				goto loop;
+			}
+
+			block_group->priority = PRIORITY_BG_BUSY;
+			spin_unlock(&block_group->lock);
+		}
+
 		ffe_ctl.search_start = block_group->key.objectid;
 
 		/*
@@ -7945,9 +7995,20 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 
 		trace_btrfs_reserve_extent(block_group, ffe_ctl.search_start,
 					   num_bytes);
-		btrfs_release_block_group(block_group, delalloc);
-		break;
+		goto out;
 loop:
+		if (use_priority) {
+			long priority;
+
+			spin_lock(&block_group->lock);
+			if (block_group->priority == PRIORITY_BG_BUSY) {
+				priority = compute_block_group_priority(
+					block_group);
+				block_group->priority = priority;
+			}
+			spin_unlock(&block_group->lock);
+		}
+
 		ffe_ctl.retry_clustered = false;
 		ffe_ctl.retry_unclustered = false;
 		BUG_ON(btrfs_bg_flags_to_raid_index(block_group->flags) !=
@@ -7955,8 +8016,16 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		btrfs_release_block_group(block_group, delalloc);
 		cond_resched();
 	}
-	up_read(&space_info->groups_sem);
-
+	if (!use_priority)
+		up_read(&space_info->groups_sem);
+out:
+	if (ins->objectid) {
+		if (use_priority)
+			up_read(&ffe_ctl.priority_tree->groups_sem);
+		else
+			up_read(&space_info->groups_sem);
+		btrfs_release_block_group(block_group, delalloc);
+	}
 	ret = find_free_extent_update_loop(fs_info, last_ptr, ins, &ffe_ctl,
 					   full_search, use_cluster);
 	if (ret > 0)
@@ -10194,7 +10263,6 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
-static long compute_block_group_priority(struct btrfs_block_group_cache *bg);
 static void add_block_group_priority(struct btrfs_block_group_cache *cache);
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 14/17] btrfs: priority alloc: introduce btrfs_set_bg_updating and call btrfs_update_block_group_priority
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (12 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 13/17] btrfs: priority alloc: modify find_free_extent() to fit priority allocator Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 15/17] btrfs: priority alloc: write bg->priority_groups_sem while waiting reservation Su Yue
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

For usage as the priority, the variables of block groups we care about
are reserved, bytes_super and btrfs_block_group_used(&cache->item).

This patch calls btrfs_set_bg_priority_updating() in locations where the
above three variables change, to mark block groups that need updating,
and then calls btrfs_update_block_group_priority() to update the
priority tree if needed.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h            |  2 ++
 fs/btrfs/extent-tree.c      | 40 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/free-space-cache.c |  3 +++
 3 files changed, 45 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 091b878e326c..f1ab0310da08 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2753,6 +2753,8 @@ u64 btrfs_metadata_alloc_profile(struct btrfs_fs_info *fs_info);
 u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
 void btrfs_remove_block_group_priority(struct btrfs_block_group_cache *cache);
+void btrfs_set_bg_priority_updating(struct btrfs_block_group_cache *cache);
+void btrfs_update_block_group_priority(struct btrfs_block_group_cache *cache);
 
 enum btrfs_reserve_flush_enum {
 	/* If we are in the transaction, we can't flush anything.*/
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c76677a54a9..f530a4344368 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6183,6 +6183,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			cache->space_info->bytes_reserved -= num_bytes;
 			cache->space_info->bytes_used += num_bytes;
 			cache->space_info->disk_used += num_bytes * factor;
+			btrfs_set_bg_priority_updating(cache);
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 		} else {
@@ -6192,6 +6193,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			update_bytes_pinned(cache->space_info, num_bytes);
 			cache->space_info->bytes_used -= num_bytes;
 			cache->space_info->disk_used -= num_bytes * factor;
+			btrfs_set_bg_priority_updating(cache);
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 
@@ -6205,6 +6207,7 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 					 bytenr, bytenr + num_bytes - 1,
 					 GFP_NOFS | __GFP_NOFAIL);
 		}
+		btrfs_update_block_group_priority(cache);
 
 		spin_lock(&trans->transaction->dirty_bgs_lock);
 		if (list_empty(&cache->dirty_list)) {
@@ -6264,6 +6267,7 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
 	if (reserved) {
 		cache->reserved -= num_bytes;
 		cache->space_info->bytes_reserved -= num_bytes;
+		btrfs_set_bg_priority_updating(cache);
 	}
 	spin_unlock(&cache->lock);
 	spin_unlock(&cache->space_info->lock);
@@ -6274,6 +6278,8 @@ static int pin_down_extent(struct btrfs_fs_info *fs_info,
 		    num_bytes, BTRFS_TOTAL_BYTES_PINNED_BATCH);
 	set_extent_dirty(fs_info->pinned_extents, bytenr,
 			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+
+	btrfs_update_block_group_priority(cache);
 	return 0;
 }
 
@@ -6472,6 +6478,12 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
 		update_bytes_may_use(space_info, -ram_bytes);
 		if (delalloc)
 			cache->delalloc_bytes += num_bytes;
+		/*
+		 * Since this is called from find_free_extent(),
+		 * btrfs_update_block_group_priority() must be called by the
+		 * caller to avoid a deadlock.
+		 */
+		btrfs_set_bg_priority_updating(cache);
 	}
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
@@ -6502,11 +6514,14 @@ static void btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache,
 	cache->reserved -= num_bytes;
 	space_info->bytes_reserved -= num_bytes;
 	space_info->max_extent_size = 0;
+	btrfs_set_bg_priority_updating(cache);
 
 	if (delalloc)
 		cache->delalloc_bytes -= num_bytes;
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+
+	btrfs_update_block_group_priority(cache);
 }
 void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info)
 {
@@ -8025,6 +8040,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		else
 			up_read(&space_info->groups_sem);
 		btrfs_release_block_group(block_group, delalloc);
+		btrfs_update_block_group_priority(block_group);
 	}
 	ret = find_free_extent_update_loop(fs_info, last_ptr, ins, &ffe_ctl,
 					   full_search, use_cluster);
@@ -8434,9 +8450,12 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 	spin_lock(&block_group->lock);
 	space_info->bytes_reserved += ins->offset;
 	block_group->reserved += ins->offset;
+	btrfs_set_bg_priority_updating(block_group);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 
+	btrfs_update_block_group_priority(block_group);
+
 	ret = alloc_reserved_file_extent(trans, 0, root_objectid, 0, owner,
 					 offset, ins, 1);
 	btrfs_put_block_group(block_group);
@@ -11706,3 +11725,24 @@ void btrfs_update_block_group_priority(struct btrfs_block_group_cache *cache)
 	up_write(front_sem);
 	up_write(back_sem);
 }
+
+/* Caller must hold cache->lock */
+void
+btrfs_set_bg_priority_updating(struct btrfs_block_group_cache *cache)
+{
+	long priority;
+	int new_level;
+
+	if (!is_priority_alloc_enabled(cache->fs_info))
+		return;
+	if (cache->priority == PRIORITY_BG_DELETED)
+		return;
+
+	priority = compute_block_group_priority(cache);
+	new_level = compute_priority_level(cache->fs_info, priority);
+
+	if (cache->priority_tree->level != new_level)
+		priority = PRIORITY_BG_UPDATING;
+
+	cache->priority = priority;
+}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 74aa552f4793..ff28a26c9104 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3149,6 +3149,7 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 		block_group->reserved += reserved_bytes;
 		space_info->bytes_reserved += reserved_bytes;
 		update = 1;
+		btrfs_set_bg_priority_updating(block_group);
 	}
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
@@ -3169,10 +3170,12 @@ static int do_trimming(struct btrfs_block_group_cache *block_group,
 			space_info->bytes_readonly += reserved_bytes;
 		block_group->reserved -= reserved_bytes;
 		space_info->bytes_reserved -= reserved_bytes;
+		btrfs_set_bg_priority_updating(block_group);
 		spin_unlock(&space_info->lock);
 		spin_unlock(&block_group->lock);
 	}
 
+	btrfs_update_block_group_priority(block_group);
 	return ret;
 }
 
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 15/17] btrfs: priority alloc: write bg->priority_groups_sem while waiting reservation
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (13 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 14/17] btrfs: priority alloc: introduce btrfs_set_bg_updating and call btrfs_update_block_group_priority Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 16/17] btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid race in btrfs_delete_unused_bgs() Su Yue
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

When the priority aware allocator is in use, we should down/up_write()
bg->priority_tree->groups_sem instead of space_info->groups_sem.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f530a4344368..6627bbe56ad5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6425,6 +6425,8 @@ void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 	btrfs_put_block_group(bg);
 }
 
+static inline bool
+is_priority_alloc_enabled(struct btrfs_fs_info *fs_info);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 {
 	struct btrfs_space_info *space_info = bg->space_info;
@@ -6433,7 +6435,11 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 
 	if (!(bg->flags & BTRFS_BLOCK_GROUP_DATA))
 		return;
-
+	if (is_priority_alloc_enabled(bg->fs_info)) {
+		down_write(&bg->priority_tree->groups_sem);
+		up_write(&bg->priority_tree->groups_sem);
+		goto wait;
+	}
 	/*
 	 * Our block group is read only but before we set it to read only,
 	 * some task might have had allocated an extent from it already, but it
@@ -6446,7 +6452,7 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg)
 	 */
 	down_write(&space_info->groups_sem);
 	up_write(&space_info->groups_sem);
-
+wait:
 	wait_var_event(&bg->reservations, !atomic_read(&bg->reservations));
 }
 
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 16/17] btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid race in btrfs_delete_unused_bgs()
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (14 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 15/17] btrfs: priority alloc: write bg->priority_groups_sem while waiting reservation Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  3:11 ` [RFC PATCH 17/17] btrfs: add mount option "priority_alloc=%s" Su Yue
  2018-11-28  4:04 ` [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Qu Wenruo
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

If the priority aware allocator is used, bg->priority_tree->groups_sem
should be taken for writing instead of space_info->groups_sem.
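
Condensed, the locking choice made here looks like this (a simplified
sketch; the actual diff below also re-checks the block group under
block_group->lock and marks it PRIORITY_BG_BUSY in the priority path):

	struct rw_semaphore *sem;

	if (is_priority_alloc_enabled(fs_info))
		/* allocators walk the priority trees */
		sem = &block_group->priority_tree->groups_sem;
	else
		/* allocators walk the per-space_info lists */
		sem = &space_info->groups_sem;

	down_write(sem);	/* exclude allocators during the ro flip */
	ret = inc_block_group_ro(block_group, 0);
	up_write(sem);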

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 60 +++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6627bbe56ad5..5c9536609621 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10944,6 +10944,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	struct btrfs_block_group_cache *block_group;
 	struct btrfs_space_info *space_info;
 	struct btrfs_trans_handle *trans;
+	bool use_priority = is_priority_alloc_enabled(fs_info);
 	int ret = 0;
 
 	if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
@@ -10953,6 +10954,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	while (!list_empty(&fs_info->unused_bgs)) {
 		u64 start, end;
 		int trimming;
+		struct btrfs_priority_tree *pt;
 
 		block_group = list_first_entry(&fs_info->unused_bgs,
 					       struct btrfs_block_group_cache,
@@ -10969,29 +10971,55 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 
 		mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
-		/* Don't want to race with allocators so take the groups_sem */
-		down_write(&space_info->groups_sem);
-		spin_lock(&block_group->lock);
-		if (block_group->reserved || block_group->pinned ||
-		    btrfs_block_group_used(&block_group->item) ||
-		    block_group->ro ||
-		    list_is_singular(&block_group->list)) {
+		if (use_priority) {
+			spin_lock(&block_group->lock);
+			if (block_group->reserved || block_group->pinned ||
+			    btrfs_block_group_used(&block_group->item) ||
+			    block_group->ro ||
+			    block_group->priority != 0) {
+				trace_btrfs_skip_unused_block_group(
+					block_group);
+				spin_unlock(&block_group->lock);
+				goto next;
+			}
+			block_group->priority = PRIORITY_BG_BUSY;
+			spin_unlock(&block_group->lock);
+			pt = block_group->priority_tree;
+			down_write(&pt->groups_sem);
+		} else {
 			/*
-			 * We want to bail if we made new allocations or have
-			 * outstanding allocations in this block group.  We do
-			 * the ro check in case balance is currently acting on
-			 * this block group.
+			 * Don't want to race with allocators so take the
+			 * groups_sem
 			 */
-			trace_btrfs_skip_unused_block_group(block_group);
+			down_write(&space_info->groups_sem);
+			spin_lock(&block_group->lock);
+			if (block_group->reserved || block_group->pinned ||
+			    btrfs_block_group_used(&block_group->item) ||
+			    block_group->ro ||
+			    list_is_singular(&block_group->list)) {
+				/*
+				 * We want to bail if we made new allocations
+				 * or have outstanding allocations in this
+				 * block group.  We do the ro check in case
+				 * balance is currently acting on this block
+				 * group.
+				 */
+				trace_btrfs_skip_unused_block_group(
+					block_group);
+				spin_unlock(&block_group->lock);
+				up_write(&space_info->groups_sem);
+				goto next;
+			}
 			spin_unlock(&block_group->lock);
-			up_write(&space_info->groups_sem);
-			goto next;
 		}
-		spin_unlock(&block_group->lock);
 
 		/* We don't want to force the issue, only flip if it's ok. */
 		ret = inc_block_group_ro(block_group, 0);
-		up_write(&space_info->groups_sem);
+		if (use_priority)
+			up_write(&pt->groups_sem);
+		else
+			up_write(&space_info->groups_sem);
+
 		if (ret < 0) {
 			ret = 0;
 			goto next;
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 17/17] btrfs: add mount option "priority_alloc=%s"
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (15 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 16/17] btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid race in btrfs_delete_unused_bgs() Su Yue
@ 2018-11-28  3:11 ` Su Yue
  2018-11-28  4:04 ` [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Qu Wenruo
  17 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  3:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: suy.fnst

Add mount option "priority_alloc=%s"; for now %s only supports "usage"
and "off". The latter is meant for remount.
"priority_alloc=usage" activates the priority aware allocator.
This mount option changes the way find_free_extent() searches
block groups and may cost more time.
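
For reference, usage would look like this (device and mount point are
placeholders):

	mount -o priority_alloc=usage /dev/sdX /mnt
	mount -o remount,priority_alloc=off /mnt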

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
---
 fs/btrfs/super.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index cbc9d0d2c12d..4a6ccd4c29fd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
 	Opt_treelog, Opt_notreelog,
 	Opt_usebackuproot,
 	Opt_user_subvol_rm_allowed,
+	Opt_priority_allocator,
 
 	/* Deprecated options */
 	Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
 	{Opt_notreelog, "notreelog"},
 	{Opt_usebackuproot, "usebackuproot"},
 	{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+	{Opt_priority_allocator, "priority_alloc=%s"},
 
 	/* Deprecated options */
 	{Opt_alloc_start, "alloc_start=%s"},
@@ -765,6 +767,18 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		case Opt_skip_balance:
 			btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
 			break;
+		case Opt_priority_allocator:
+			if (strcmp(args[0].from, "usage") == 0) {
+				btrfs_set_and_info(info, PRIORITY_USAGE,
+				   "using priority usage-aware allocator");
+			} else if (strcmp(args[0].from, "off") == 0) {
+				btrfs_clear_and_info(info, PRIORITY_USAGE,
+				   "priority aware allocator disabled");
+			} else {
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 		case Opt_check_integrity_including_extent_data:
 			btrfs_info(info,
@@ -1337,6 +1351,10 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",inode_cache");
 	if (btrfs_test_opt(info, SKIP_BALANCE))
 		seq_puts(seq, ",skip_balance");
+	if (btrfs_test_opt(info, PRIORITY_USAGE))
+		seq_puts(seq, ",priority_alloc=usage");
+	else
+		seq_puts(seq, ",priority_alloc=off");
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	if (btrfs_test_opt(info, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA))
 		seq_puts(seq, ",check_int_data");
-- 
2.19.1




^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 00/17] btrfs: implementation of priority aware allocator
  2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
                   ` (16 preceding siblings ...)
  2018-11-28  3:11 ` [RFC PATCH 17/17] btrfs: add mount option "priority_alloc=%s" Su Yue
@ 2018-11-28  4:04 ` Qu Wenruo
  2018-12-02  5:28   ` Su Yue
  17 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2018-11-28  4:04 UTC (permalink / raw)
  To: Su Yue, linux-btrfs



On 2018/11/28 11:11 AM, Su Yue wrote:
> This patchset can be fetched from repo:
> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
> Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
> to simplify find_free_extent(). This patchset dependents on the refactor.
> The base is the commit in kdave/misc-next:
> 
> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
> Author: Nikolay Borisov <nborisov@suse.com>
> Date:   Tue Nov 6 16:40:20 2018 +0200
> 
>     btrfs: Always try all copies when reading extent buffers
> 
> 
> This patchset introduces a new mount option named 'priority_alloc=%s',
> %s is supported to be "usage" and "off" now. The mount option changes
> the way find_free_extent() how to search block groups.
> 
> Previously, block groups are stored in list of btrfs_space_info
> by start position. When call find_free_extent() if no hint,
> block_groups are searched one by one.
> 
> Design of priority aware allocator:
> Block group has its own priority. We split priorities to many levels,
> block groups are split to different trees according priorities.
> And those trees are sorted by their levels and stored in space_info.
> Once find_free_extent() is called, try to search block groups in higher
> priority level then lower level. Then a block group with higher
> priority is more likely to be used.
> 
> Pros:
> 1) Reduce the frequency of balance.
>    The block group with a higher usage rate will be used preferentially
>    for allocating extents. Free the empty block groups with pinned bytes
>    as non-zero.[1]
> 
> 2) The priority of empty block group with pinned bytes as non-zero
>    will be set as the lowest.
>    
> 3) Support zoned block device.[2]
>    For metadata allocation, the block group in conventional zones
>    will be used as much as possible regardless of usage rate.
>    Will do it in future.

Personally I'm a big fan of the priority aware extent allocator.

So nice job!

>    
> Cons:
> 1) Expectable performance regression.
>    The degree of the decline is temporarily unknown.
>    The user can disable block group priority to get the full performance.
> 
> TESTS:
> 
> If use usage as priority(the only available option), empty block group
> is much harder to be reused.
> 
> About block group usage:
>   Disk: 4 x 1T HDD gathered in LVM.
> 
>   Run script to create files and delete files randomly in loop.
>   The num of files to create are double than to delete.
> 
>   Default mount option result:
>   https://i.loli.net/2018/11/28/5bfdfdf08c760.png
> 
>   Priority aware allocator(usage) result:
>   https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
> 
>   X coordinate means total disk usage, Y coordinate means avg block
>   group usage.
> 
>   Due to fragmentation of extents, the different are not obvious,
>   only about 1% improvement....

I think you're using the wrong indicator to show the difference.

The real indicator should not be overall block group usage, but:
1) Number of block groups
2) Usage distribution of the block groups

If the number of block groups isn't much different, then we should go
check the distribution.
E.g. all bgs at 97% usage is not as good as mostly 100% bgs plus
several near-10% bgs.

And we should check the usage distribution between metadata and data bgs.
For data bgs, we could hit fragmentation problems, while for meta bgs
all extents are the same size, so the allocator may do better for
metadata.

Thus we could do better for the test result.
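
For what it's worth, indicator 2) is cheap to compute once per-bg
(used, size) pairs are dumped. A minimal userspace sketch (all names
and sample values below are made up for illustration):

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical (used, size) byte counts for a few block
		 * groups; a real run would feed in one pair per bg. */
		unsigned long long used[] = { 1060000000ULL, 990000000ULL,
					      120000000ULL };
		unsigned long long size[] = { 1073741824ULL, 1073741824ULL,
					      1073741824ULL };
		int nr = sizeof(used) / sizeof(used[0]);
		int buckets[11] = { 0 };  /* ten 10% bands + exactly-full */
		int i;

		for (i = 0; i < nr; i++)
			buckets[used[i] * 10 / size[i]]++;

		for (i = 0; i < 10; i++)
			printf("%2d%%-%2d%%: %d bgs\n", i * 10,
			       i * 10 + 9, buckets[i]);
		printf("   100%%: %d bgs\n", buckets[10]);
		return 0;
	}

Comparing the two histograms directly would show whether the priority
allocator really shifts groups toward the full end.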

>   						   
> Performance regression:
>    I have ran sysbench on our machine with SSD in multi combinations,
>    no obvious regression found.
>    However in theory, the new allocator may cost more time in some
>    cases.

Isn't that good news? :)

> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
> [2] https://lkml.org/lkml/2018/8/16/174
> 
> ---
> Due to some reasons includes time and hardware, the use-case is not
> outstanding enough.

As discussed offline, another cause would be data extent fragmentation.
E.g. we have a lot of small 4K holes but the request is a big 128M.
In that case btrfs_reserve_extent() could still trigger a new data chunk
rather than return the 4K holes found.

Thanks,
Qu

> And some codes are dirty but I can't found another
> way. So I named it as RFC.
>      Any comments and suggestions are welcome.
>      
> Su Yue (17):
>   btrfs: priority alloc: prepare of priority aware allocator
>   btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>   btrfs: priority alloc: introduce compute_block_group_priority/usage
>   btrfs: priority alloc: add functions to create/remove priority trees
>   btrfs: priority alloc: introduce functions to add block group to
>     priority tree
>   btrfs: priority alloc: introduce three macros to mark block group
>     status
>   btrfs: priority alloc: add functions to remove block group from
>     priority tree
>   btrfs: priority alloc: add btrfs_update_block_group_priority()
>   btrfs: priority alloc: call create/remove_priority_trees in space_info
>   btrfs: priority alloc: call add_block_group_priority while reading or
>     making block group
>   btrfs: priority alloc: remove block group from priority tree while
>     removing block group
>   btrfs: priority alloc: introduce find_free_extent_search()
>   btrfs: priority alloc: modify find_free_extent() to fit priority
>     allocator
>   btrfs: priority alloc: introduce btrfs_set_bg_updating and call
>     btrfs_update_block_group_prioriy
>   btrfs: priority alloc: write bg->priority_groups_sem while waiting
>     reservation
>   btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
>     race in btrfs_delete_unused_bgs()
>   btrfs: add mount option "priority_alloc=%s"
> 
>  fs/btrfs/ctree.h            |  28 ++
>  fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
>  fs/btrfs/free-space-cache.c |   3 +
>  fs/btrfs/super.c            |  18 +
>  fs/btrfs/transaction.c      |   1 +
>  5 files changed, 681 insertions(+), 41 deletions(-)
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 01/17] btrfs: priority alloc: prepare of priority aware allocator
  2018-11-28  3:11 ` [RFC PATCH 01/17] btrfs: priority alloc: prepare " Su Yue
@ 2018-11-28  8:24   ` Nikolay Borisov
  2018-11-28  9:24     ` Su Yue
  0 siblings, 1 reply; 23+ messages in thread
From: Nikolay Borisov @ 2018-11-28  8:24 UTC (permalink / raw)
  To: Su Yue, linux-btrfs



On 28.11.18 at 5:11, Su Yue wrote:
> To implement priority aware allocator, this patch:
> Introduces struct btrfs_priority_tree which contains block groups
> in same level.
> Adds member priority to struct btrfs_block_group_cache and pointer
> points to the priority tree it's located.
> 
> Adds member priority_trees to struct btrfs_space_info to represents
> priority trees in different raid types.
> 
> Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e62824cae00a..5c4651d8a524 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -437,6 +437,8 @@ struct btrfs_space_info {
>  	struct rw_semaphore groups_sem;
>  	/* for block groups in our same type */
>  	struct list_head block_groups[BTRFS_NR_RAID_TYPES];
> +	/* for priority trees in our same type */
> +	struct rb_root priority_trees[BTRFS_NR_RAID_TYPES];
>  	wait_queue_head_t wait;
>  
>  	struct kobject kobj;
> @@ -558,6 +560,21 @@ struct btrfs_full_stripe_locks_tree {
>  	struct mutex lock;
>  };
>  
> +/*
> + * Tree to record all block_groups in same priority level.
> + * Only used in priority aware allocator.
> + */
> +struct btrfs_priority_tree {
> +	/* protected by groups_sem */
> +	struct rb_root block_groups;
> +	struct rw_semaphore groups_sem;
> +
> +	/* for different level priority trees in same index*/
> +	struct rb_node node;
> +
> +	int level;

Do you ever expect the level to be a negative number? If not then use
u8/u32 depending on the range of levels you expect.

> +};
> +
>  struct btrfs_block_group_cache {
>  	struct btrfs_key key;
>  	struct btrfs_block_group_item item;
> @@ -571,6 +588,8 @@ struct btrfs_block_group_cache {
>  	u64 flags;
>  	u64 cache_generation;
>  
> +	/* It's used only when priority aware allocator is enabled. */
> +	long priority;

What's the range of priorities you are expecting? Wouldn't a u8 be
sufficient? That gives us 256 priorities.

>  	/*
>  	 * If the free space extent count exceeds this number, convert the block
>  	 * group to bitmaps.
> @@ -616,6 +635,9 @@ struct btrfs_block_group_cache {
>  	/* for block groups in the same raid type */
>  	struct list_head list;
>  
> +	/* for block groups in the same priority level */
> +	struct rb_node node;
> +
>  	/* usage count */
>  	atomic_t count;
>  
> @@ -670,6 +692,8 @@ struct btrfs_block_group_cache {
>  
>  	/* Record locked full stripes for RAID5/6 block group */
>  	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> +	struct btrfs_priority_tree *priority_tree;
>  };
>  
>  /* delayed seq elem */
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage
  2018-11-28  3:11 ` [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage Su Yue
@ 2018-11-28  8:56   ` Nikolay Borisov
  0 siblings, 0 replies; 23+ messages in thread
From: Nikolay Borisov @ 2018-11-28  8:56 UTC (permalink / raw)
  To: Su Yue, linux-btrfs



On 28.11.18 at 5:11, Su Yue wrote:
> Introduce compute_block_group_priority() and compute_block_group_usage().
> And call the latter in btrfs_make_block_group() and
> btrfs_read_block_groups().
> 
> compute_priority_level uses ilog2(free) to compute the priority level.
> 
> Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c | 60 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 60 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..0f4c5b1e0bcc 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10091,6 +10091,7 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>  	return ret;
>  }
>  
> +static long compute_block_group_priority(struct btrfs_block_group_cache *bg);

That is ugly; just put the function above the first place where it is
going to be used, and don't introduce forward declarations for static
functions.

>  int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  {
>  	struct btrfs_path *path;
> @@ -10224,6 +10225,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  
>  		link_block_group(cache);
>  
> +		cache->priority = compute_block_group_priority(cache);
> +
>  		set_avail_alloc_bits(info, cache->flags);
>  		if (btrfs_chunk_readonly(info, cache->key.objectid)) {
>  			inc_block_group_ro(cache, 1);
> @@ -10373,6 +10376,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
>  
>  	link_block_group(cache);
>  
> +	cache->priority = compute_block_group_priority(cache);
> +
>  	list_add_tail(&cache->bg_list, &trans->new_bgs);
>  
>  	set_avail_alloc_bits(fs_info, type);
> @@ -11190,3 +11195,58 @@ void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
>  	}
>  	spin_unlock(&fs_info->unused_bgs_lock);
>  }
> +
> +enum btrfs_priority_shift {
> +	PRIORITY_USAGE_SHIFT = 0
> +};
> +
> +static inline u8
> +compute_block_group_usage(struct btrfs_block_group_cache *cache)
> +{
> +	u64 num_bytes;
> +	u8 usage;
> +
> +	num_bytes = cache->reserved + cache->bytes_super +
> +		btrfs_block_group_used(&cache->item);
> +
> +	usage = div_u64(num_bytes, div_factor_fine(cache->key.offset, 1));

Mention somewhere (either as a function description or in the patch
description) that you use the % used.

> +
> +	return usage;
> +}
> +
> +static long compute_block_group_priority(struct btrfs_block_group_cache *bg)
> +{
> +	u8 usage;
> +	long priority = 0;
> +
> +	if (btrfs_test_opt(bg->fs_info, PRIORITY_USAGE)) {
> +		usage = compute_block_group_usage(bg);
> +		priority |= usage << PRIORITY_USAGE_SHIFT;
> +	}

Why is priority a signed type and not unsigned? I assume priority can
never be negative. I briefly looked at the other patches and most of the
time the argument passed is indeed an unsigned value.

> +
> +	return priority;
> +}
> +
> +static int compute_priority_level(struct btrfs_fs_info *fs_info,
> +				  long priority)
> +{
> +	int level = 0;
> +
> +	if (btrfs_test_opt(fs_info, PRIORITY_USAGE)) {
> +		u8 free;
> +
> +		WARN_ON(priority < 0);

I think this WARN_ON is redundant provided that the high-level
interfaces are sane and don't allow negative value to trickle down.

> +		free = 100 - (priority >> PRIORITY_USAGE_SHIFT);
> +
> +		if (free == 0)
> +			level = 0;
> +		else if (free == 100)
> +			level = ilog2(free) + 1;
> +		else
> +			level = ilog2(free);
> +		/* log2(1) == 0, leave level 0 for unused block_groups */
> +		level = ilog2(100) + 1 - level;
> +	}
> +
> +	return level;
> +}
> 
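
For concreteness, working the quoted compute_priority_level() through a
few values (priority here is just the usage percentage, and ilog2(100)
is 6, so ilog2(100) + 1 = 7): usage 100% gives free = 0 and ends at
level 7 - 0 = 7; usage 99% gives free = 1, ilog2(1) = 0, again level 7;
usage 50% gives free = 50, ilog2(50) = 5, level 2; usage 0% takes the
free == 100 branch and lands at level 0, the level the comment reserves
for unused block groups. Fuller groups therefore sort into higher
levels and are searched first.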

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 01/17] btrfs: priority alloc: prepare of priority aware allocator
  2018-11-28  8:24   ` Nikolay Borisov
@ 2018-11-28  9:24     ` Su Yue
  0 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-11-28  9:24 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 11/28/18 4:24 PM, Nikolay Borisov wrote:
> 
> 
> On 28.11.18 at 5:11, Su Yue wrote:
>> To implement priority aware allocator, this patch:
>> Introduces struct btrfs_priority_tree which contains block groups
>> in same level.
>> Adds member priority to struct btrfs_block_group_cache and pointer
>> points to the priority tree it's located.
>>
>> Adds member priority_trees to struct btrfs_space_info to represents
>> priority trees in different raid types.
>>
>> Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
>> ---
>>   fs/btrfs/ctree.h | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index e62824cae00a..5c4651d8a524 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -437,6 +437,8 @@ struct btrfs_space_info {
>>   	struct rw_semaphore groups_sem;
>>   	/* for block groups in our same type */
>>   	struct list_head block_groups[BTRFS_NR_RAID_TYPES];
>> +	/* for priority trees in our same type */
>> +	struct rb_root priority_trees[BTRFS_NR_RAID_TYPES];
>>   	wait_queue_head_t wait;
>>   
>>   	struct kobject kobj;
>> @@ -558,6 +560,21 @@ struct btrfs_full_stripe_locks_tree {
>>   	struct mutex lock;
>>   };
>>   
>> +/*
>> + * Tree to record all block_groups in same priority level.
>> + * Only used in priority aware allocator.
>> + */
>> +struct btrfs_priority_tree {
>> +	/* protected by groups_sem */
>> +	struct rb_root block_groups;
>> +	struct rw_semaphore groups_sem;
>> +
>> +	/* for different level priority trees in same index*/
>> +	struct rb_node node;
>> +
>> +	int level;
> 
> Do you ever expect the level to be a negative number? If not then use
> u8/u32 depending on the range of levels you expect.
> 

Indeed, level is not expected to be negative; u8 is more appropriate.

>> +};
>> +
>>   struct btrfs_block_group_cache {
>>   	struct btrfs_key key;
>>   	struct btrfs_block_group_item item;
>> @@ -571,6 +588,8 @@ struct btrfs_block_group_cache {
>>   	u64 flags;
>>   	u64 cache_generation;
>>   
>> +	/* It's used only when priority aware allocator is enabled. */
>> +	long priority;
> 
> What's the range of priorities you are expecting? Wouldn't a u8 be
> sufficient? That gives us 256 priorities.
> 

The 6th patch introduces three special priorities. That's what I called
dirty code.

Thanks,
Su

>>   	/*
>>   	 * If the free space extent count exceeds this number, convert the block
>>   	 * group to bitmaps.
>> @@ -616,6 +635,9 @@ struct btrfs_block_group_cache {
>>   	/* for block groups in the same raid type */
>>   	struct list_head list;
>>   
>> +	/* for block groups in the same priority level */
>> +	struct rb_node node;
>> +
>>   	/* usage count */
>>   	atomic_t count;
>>   
>> @@ -670,6 +692,8 @@ struct btrfs_block_group_cache {
>>   
>>   	/* Record locked full stripes for RAID5/6 block group */
>>   	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>> +
>> +	struct btrfs_priority_tree *priority_tree;
>>   };
>>   
>>   /* delayed seq elem */
>>
> 
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 00/17] btrfs: implementation of priority aware allocator
  2018-11-28  4:04 ` [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Qu Wenruo
@ 2018-12-02  5:28   ` Su Yue
  0 siblings, 0 replies; 23+ messages in thread
From: Su Yue @ 2018-12-02  5:28 UTC (permalink / raw)
  To: Qu Wenruo, Su Yue, linux-btrfs



On 2018/11/28 12:04 PM, Qu Wenruo wrote:
> 
> 
> On 2018/11/28 11:11 AM, Su Yue wrote:
>> This patchset can be fetched from repo:
>> https://github.com/Damenly/btrfs-devel/commits/priority_aware_allocator.
>> Since patchset 'btrfs: Refactor find_free_extent()' does a nice work
>> to simplify find_free_extent(). This patchset dependents on the refactor.
>> The base is the commit in kdave/misc-next:
>>
>> commit fcaaa1dfa81f2f87ad88cbe0ab86a07f9f76073c (kdave/misc-next)
>> Author: Nikolay Borisov <nborisov@suse.com>
>> Date:   Tue Nov 6 16:40:20 2018 +0200
>>
>>      btrfs: Always try all copies when reading extent buffers
>>
>>
>> This patchset introduces a new mount option named 'priority_alloc=%s',
>> %s is supported to be "usage" and "off" now. The mount option changes
>> the way find_free_extent() how to search block groups.
>>
>> Previously, block groups are stored in list of btrfs_space_info
>> by start position. When call find_free_extent() if no hint,
>> block_groups are searched one by one.
>>
>> Design of priority aware allocator:
>> Block group has its own priority. We split priorities to many levels,
>> block groups are split to different trees according priorities.
>> And those trees are sorted by their levels and stored in space_info.
>> Once find_free_extent() is called, try to search block groups in higher
>> priority level then lower level. Then a block group with higher
>> priority is more likely to be used.
>>
>> Pros:
>> 1) Reduce the frequency of balance.
>>     The block group with a higher usage rate will be used preferentially
>>     for allocating extents. Free the empty block groups with pinned bytes
>>     as non-zero.[1]
>>
>> 2) The priority of empty block group with pinned bytes as non-zero
>>     will be set as the lowest.
>>     
>> 3) Support zoned block device.[2]
>>     For metadata allocation, the block group in conventional zones
>>     will be used as much as possible regardless of usage rate.
>>     Will do it in future.
> 
> Personally I'm a big fan of the priority aware extent allocator.
> 
> So nice job!
>

Thanks for the offline help.

>>     
>> Cons:
>> 1) Expectable performance regression.
>>     The degree of the decline is temporarily unknown.
>>     The user can disable block group priority to get the full performance.
>>
>> TESTS:
>>
>> If use usage as priority(the only available option), empty block group
>> is much harder to be reused.
>>
>> About block group usage:
>>    Disk: 4 x 1T HDD gathered in LVM.
>>
>>    Run script to create files and delete files randomly in loop.
>>    The num of files to create are double than to delete.
>>
>>    Default mount option result:
>>    https://i.loli.net/2018/11/28/5bfdfdf08c760.png
>>
>>    Priority aware allocator(usage) result:
>>    https://i.loli.net/2018/11/28/5bfdfdf0c1b11.png
>>
>>    X coordinate means total disk usage, Y coordinate means avg block
>>    group usage.
>>
>>    Due to fragmentation of extents, the different are not obvious,
>>    only about 1% improvement....
> 
> I think you're using the wrong indicator to show the difference.
> 
> The real indicator should not be overall block group usage, but:
> 1) Number of block groups
> 2) Usage distribution of the block groups
> 
> If the number of block groups isn't much different, then we should go
> check the distribution.
> E.g. all bgs at 97% usage is not as good as mostly 100% bgs plus
> several near-10% bgs.
>

Took some time to write scripts to summarize the results: the average,
sampled while total disk usage grows from 0% to 100%, of the percentage
of block groups above each usage threshold.

For block groups whose usage >= 98%: default 31.09%, priority_alloc 46.73%
For block groups whose usage >= 95%: default 57.69%, priority_alloc 64.24%
For block groups whose usage >= 90%: default 79.87%, priority_alloc 80.2%

So this patchset does improve block group usage.


> And we should check the usage distribution between metadata and data bgs.
> For data bgs, we could hit fragmentation problems, while for meta bgs
> all extents are the same size, so the allocator may do better for
> metadata.
> 
> Thus we could do better for the test result.
> 
>>    						
>> Performance regression:
>>     I have ran sysbench on our machine with SSD in multi combinations,
>>     no obvious regression found.
>>     However in theory, the new allocator may cost more time in some
>>     cases.
> 
> Isn't that good news? :)
> 

Yeah.
>>
>> [1] https://www.spinics.net/lists/linux-btrfs/msg79508.html
>> [2] https://lkml.org/lkml/2018/8/16/174
>>
>> ---
>> Due to some reasons includes time and hardware, the use-case is not
>> outstanding enough.
> 
> As discussed offline, another cause would be data extent fragmentation.
> E.g. we have a lot of small 4K holes but the request is a big 128M.
> In that case btrfs_reserve_extent() could still trigger a new data chunk
> rather than return the 4K holes found.
> 

IMO, that is a separate issue. Addressing it in another patchset is
preferred.

Thanks,
Su

> Thanks,
> Qu
> 
>> And some codes are dirty but I can't found another
>> way. So I named it as RFC.
>>       Any comments and suggestions are welcome.
>>       
>> Su Yue (17):
>>    btrfs: priority alloc: prepare of priority aware allocator
>>    btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE
>>    btrfs: priority alloc: introduce compute_block_group_priority/usage
>>    btrfs: priority alloc: add functions to create/remove priority trees
>>    btrfs: priority alloc: introduce functions to add block group to
>>      priority tree
>>    btrfs: priority alloc: introduce three macros to mark block group
>>      status
>>    btrfs: priority alloc: add functions to remove block group from
>>      priority tree
>>    btrfs: priority alloc: add btrfs_update_block_group_priority()
>>    btrfs: priority alloc: call create/remove_priority_trees in space_info
>>    btrfs: priority alloc: call add_block_group_priority while reading or
>>      making block group
>>    btrfs: priority alloc: remove block group from priority tree while
>>      removing block group
>>    btrfs: priority alloc: introduce find_free_extent_search()
>>    btrfs: priority alloc: modify find_free_extent() to fit priority
>>      allocator
>>    btrfs: priority alloc: introduce btrfs_set_bg_updating and call
>>      btrfs_update_block_group_prioriy
>>    btrfs: priority alloc: write bg->priority_groups_sem while waiting
>>      reservation
>>    btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid
>>      race in btrfs_delete_unused_bgs()
>>    btrfs: add mount option "priority_alloc=%s"
>>
>>   fs/btrfs/ctree.h            |  28 ++
>>   fs/btrfs/extent-tree.c      | 672 +++++++++++++++++++++++++++++++++---
>>   fs/btrfs/free-space-cache.c |   3 +
>>   fs/btrfs/super.c            |  18 +
>>   fs/btrfs/transaction.c      |   1 +
>>   5 files changed, 681 insertions(+), 41 deletions(-)
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-12-02  5:28 UTC | newest]

Thread overview: 23+ messages
2018-11-28  3:11 [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Su Yue
2018-11-28  3:11 ` [RFC PATCH 01/17] btrfs: priority alloc: prepare " Su Yue
2018-11-28  8:24   ` Nikolay Borisov
2018-11-28  9:24     ` Su Yue
2018-11-28  3:11 ` [RFC PATCH 02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE Su Yue
2018-11-28  3:11 ` [RFC PATCH 03/17] btrfs: priority alloc: introduce compute_block_group_priority/usage Su Yue
2018-11-28  8:56   ` Nikolay Borisov
2018-11-28  3:11 ` [RFC PATCH 04/17] btrfs: priority alloc: add functions to create/remove priority trees Su Yue
2018-11-28  3:11 ` [RFC PATCH 05/17] btrfs: priority alloc: introduce functions to add block group to priority tree Su Yue
2018-11-28  3:11 ` [RFC PATCH 06/17] btrfs: priority alloc: introduce three macros to mark block group status Su Yue
2018-11-28  3:11 ` [RFC PATCH 07/17] btrfs: priority alloc: add functions to remove block group from priority tree Su Yue
2018-11-28  3:11 ` [RFC PATCH 08/17] btrfs: priority alloc: add btrfs_update_block_group_priority() Su Yue
2018-11-28  3:11 ` [RFC PATCH 09/17] btrfs: priority alloc: call create/remove_priority_trees in space_info Su Yue
2018-11-28  3:11 ` [RFC PATCH 10/17] btrfs: priority alloc: call add_block_group_priority while reading or making block group Su Yue
2018-11-28  3:11 ` [RFC PATCH 11/17] btrfs: priority alloc: remove block group from priority tree while removing " Su Yue
2018-11-28  3:11 ` [RFC PATCH 12/17] btrfs: priority alloc: introduce find_free_extent_search() Su Yue
2018-11-28  3:11 ` [RFC PATCH 13/17] btrfs: priority alloc: modify find_free_extent() to fit priority allocator Su Yue
2018-11-28  3:11 ` [RFC PATCH 14/17] btrfs: priority alloc: introduce btrfs_set_bg_updating and call btrfs_update_block_group_prioriy Su Yue
2018-11-28  3:11 ` [RFC PATCH 15/17] btrfs: priority alloc: write bg->priority_groups_sem while waiting reservation Su Yue
2018-11-28  3:11 ` [RFC PATCH 16/17] btrfs: priority alloc: write bg->priority_tree->groups_sem to avoid race in btrfs_delete_unused_bgs() Su Yue
2018-11-28  3:11 ` [RFC PATCH 17/17] btrfs: add mount option "priority_alloc=%s" Su Yue
2018-11-28  4:04 ` [RFC PATCH 00/17] btrfs: implementation of priority aware allocator Qu Wenruo
2018-12-02  5:28   ` Su Yue
