* [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
@ 2017-03-19 17:18 Alex Lyakas
  2017-03-19 17:40 ` Josef Bacik
  2017-03-21 22:40 ` Liu Bo
  0 siblings, 2 replies; 4+ messages in thread
From: Alex Lyakas @ 2017-03-19 17:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: jbacik, dsterba

We have commit_root_sem, a read-write semaphore that protects the commit
roots. But it is also used to protect the list of caching block groups.
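
The two kinds of users look roughly like the sketch below (simplified,
with hypothetical function names; not the actual kernel code):

    /* Caching worker: holds commit_root_sem shared for a potentially long scan. */
    static void caching_scan_sketch(struct btrfs_fs_info *fs_info)
    {
            down_read(&fs_info->commit_root_sem);
            /* ... btrfs_search_slot() over the extent tree, possibly reading
             * many tree blocks from disk ... */
            up_read(&fs_info->commit_root_sem);
    }

    /* Allocator path: takes the semaphore exclusive only to add one list entry. */
    static void add_caching_ctl_sketch(struct btrfs_fs_info *fs_info,
                                       struct btrfs_caching_control *ctl)
    {
            down_write(&fs_info->commit_root_sem);
            list_add_tail(&ctl->list, &fs_info->caching_block_groups);
            up_write(&fs_info->commit_root_sem);
    }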

As a result, while doing "slow" caching, the following issue is seen:

Some of the caching threads are scanning the extent tree with commit_root_sem
acquired in shared mode, with a stack like:
[<ffffffffc0ad27b2>] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[<ffffffffc0a9fbe7>] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
[<ffffffffc0aa1550>] read_tree_block+0x40/0x70 [btrfs]
[<ffffffffc0a7aa7c>] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
[<ffffffffc0a7ce86>] btrfs_search_slot+0x3c6/0xb10 [btrfs]
[<ffffffffc0a975b9>] caching_thread+0x1b9/0x820 [btrfs]
[<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs]
[<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs]

IO requests that want to allocate space are waiting in cache_block_group()
to acquire the commit_root_sem in exclusive mode. But they only want to add
the caching control structure to the list of caching block-groups:
[<ffffffff817136c9>] schedule+0x29/0x70
[<ffffffff81716085>] rwsem_down_write_failed+0x145/0x320
[<ffffffff813a1ae3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffc0a86d0b>] cache_block_group+0x25b/0x450 [btrfs]
[<ffffffffc0a94d36>] find_free_extent+0xd16/0xdb0 [btrfs]
[<ffffffffc0a94e7f>] btrfs_reserve_extent+0xaf/0x160 [btrfs]

Other caching threads want to continue their scanning, and for that they
are waiting to acquire commit_root_sem in shared mode. But since there are
IO threads that want the exclusive lock, the caching threads are unable
to continue the scanning, because (I presume) rw_semaphore guarantees some
fairness:
[<ffffffff817136c9>] schedule+0x29/0x70
[<ffffffff81715ee5>] rwsem_down_read_failed+0xc5/0x120
[<ffffffff813a1ab4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffc0a975a1>] caching_thread+0x1a1/0x820 [btrfs]
[<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs]
[<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs]
[<ffffffff8108bd56>] process_one_work+0x146/0x410

This slows down IO significantly, especially when there are many block groups
that need to be scanned for free space. In some cases it takes minutes until a
single IO thread is able to allocate free space.

I don't see a deadlock here, because the caching threads that were able to
acquire the commit_root_sem will call rwsem_is_contended() and should give up
the semaphore, so that IO threads are able to acquire it in exclusive mode.
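
(For reference, the back-off in caching_thread() looks roughly like the sketch
below; this is approximate and the exact code differs between kernel versions.)

    if (need_resched() ||
        rwsem_is_contended(&fs_info->commit_root_sem)) {
            caching_ctl->progress = last;        /* remember where the scan stopped */
            btrfs_release_path(path);
            up_read(&fs_info->commit_root_sem);  /* let the waiting writer in */
            mutex_unlock(&caching_ctl->mutex);
            cond_resched();
            goto again;                          /* re-take the locks and resume */
    }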

However, introducing a separate mutex that protects only the list of caching
block groups makes things move forward much faster.

This patch is based on kernel 3.18. Unfortunately, I am not able to submit a
patch based on one of the latest kernels, because here btrfs is part of a
larger system, and upgrading the kernel is a significant effort. Hence I am
marking the patch as RFC. Hopefully, this patch still has some value to the
community.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42d11e7..74feacb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
     struct list_head trans_list;
     struct list_head dead_roots;
     struct list_head caching_block_groups;
+    /* protects the above list */
+    struct mutex caching_block_groups_mutex;

     spinlock_t delayed_iput_lock;
     struct list_head delayed_iputs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5177954..130ec58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
     INIT_LIST_HEAD(&fs_info->delayed_iputs);
     INIT_LIST_HEAD(&fs_info->delalloc_roots);
     INIT_LIST_HEAD(&fs_info->caching_block_groups);
+    mutex_init(&fs_info->caching_block_groups_mutex);
     spin_lock_init(&fs_info->delalloc_root_lock);
     spin_lock_init(&fs_info->trans_lock);
     spin_lock_init(&fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a067065..906fb08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
         return 0;
     }

-    down_write(&fs_info->commit_root_sem);
+    mutex_lock(&fs_info->caching_block_groups_mutex);
     atomic_inc(&caching_ctl->count);
     list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-    up_write(&fs_info->commit_root_sem);
+    mutex_unlock(&fs_info->caching_block_groups_mutex);

     btrfs_get_block_group(cache);

@@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

     down_write(&fs_info->commit_root_sem);

+    mutex_lock(&fs_info->caching_block_groups_mutex);
     list_for_each_entry_safe(caching_ctl, next,
                  &fs_info->caching_block_groups, list) {
         cache = caching_ctl->block_group;
@@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
             cache->last_byte_to_unpin = caching_ctl->progress;
         }
     }
+    mutex_unlock(&fs_info->caching_block_groups_mutex);

     if (fs_info->pinned_extents == &fs_info->freed_extents[0])
         fs_info->pinned_extents = &fs_info->freed_extents[1];
@@ -8849,14 +8851,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
     struct btrfs_caching_control *caching_ctl;
     struct rb_node *n;

-    down_write(&info->commit_root_sem);
+    mutex_lock(&info->caching_block_groups_mutex);
     while (!list_empty(&info->caching_block_groups)) {
         caching_ctl = list_entry(info->caching_block_groups.next,
                      struct btrfs_caching_control, list);
         list_del(&caching_ctl->list);
         put_caching_control(caching_ctl);
     }
-    up_write(&info->commit_root_sem);
+    mutex_unlock(&info->caching_block_groups_mutex);

     spin_lock(&info->unused_bgs_lock);
     while (!list_empty(&info->unused_bgs)) {





* Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
  2017-03-19 17:18 [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list Alex Lyakas
@ 2017-03-19 17:40 ` Josef Bacik
  2017-03-21 22:40 ` Liu Bo
  1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2017-03-19 17:40 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs, dsterba

This sounds reasonable to me; I'll look at it more when I'm on the ground and can look at the code and see for sure. Thanks,

Josef

Sent from my iPhone


* Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
  2017-03-19 17:18 [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list Alex Lyakas
  2017-03-19 17:40 ` Josef Bacik
@ 2017-03-21 22:40 ` Liu Bo
  2017-05-13 18:45   ` Alex Lyakas
  1 sibling, 1 reply; 4+ messages in thread
From: Liu Bo @ 2017-03-21 22:40 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs, jbacik, dsterba

On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
> We have a commit_root_sem, which is a read-write semaphore that protects the
> commit roots.
> But it is also used to protect the list of caching block groups.
> 
> As a result, while doing "slow" caching, the following issue is seen:
> 
> Some of the caching threads are scanning the extent tree with commit_root_sem
> acquired in shared mode, with stack like:
> [<ffffffffc0ad27b2>] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
> [<ffffffffc0a9fbe7>] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
> [<ffffffffc0aa1550>] read_tree_block+0x40/0x70 [btrfs]
> [<ffffffffc0a7aa7c>] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
> [<ffffffffc0a7ce86>] btrfs_search_slot+0x3c6/0xb10 [btrfs]
> [<ffffffffc0a975b9>] caching_thread+0x1b9/0x820 [btrfs]
> [<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs]
> [<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs]
> 
> IO requests that want to allocate space are waiting in cache_block_group()
> to acquire the commit_root_sem in exclusive mode. But they only want to add
> the caching control structure to the list of caching block-groups:
> [<ffffffff817136c9>] schedule+0x29/0x70
> [<ffffffff81716085>] rwsem_down_write_failed+0x145/0x320
> [<ffffffff813a1ae3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffffc0a86d0b>] cache_block_group+0x25b/0x450 [btrfs]
> [<ffffffffc0a94d36>] find_free_extent+0xd16/0xdb0 [btrfs]
> [<ffffffffc0a94e7f>] btrfs_reserve_extent+0xaf/0x160 [btrfs]
> 
> Other caching threads want to continue their scanning, and for that they
> are waiting to acquire commit_root_sem in shared mode. But since there are
> IO threads that want the exclusive lock, the caching threads are unable
> to continue the scanning, because (I presume) rw_semaphore guarantees some fairness:
> [<ffffffff817136c9>] schedule+0x29/0x70
> [<ffffffff81715ee5>] rwsem_down_read_failed+0xc5/0x120
> [<ffffffff813a1ab4>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffffc0a975a1>] caching_thread+0x1a1/0x820 [btrfs]
> [<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs]
> [<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs]
> [<ffffffff8108bd56>] process_one_work+0x146/0x410
> 
> This causes slowness of the IO, especially when there are many block groups
> that need to be scanned for free space. In some cases it takes minutes
> until a single IO thread is able to allocate free space.
> 
> I don't see a deadlock here, because the caching threads that were able to
> acquire the commit_root_sem will call rwsem_is_contended() and should give up
> the semaphore, so that IO threads are able to acquire it in exclusive mode.
> 
> However, introducing a separate mutex that protects only the list of caching
> block groups makes things move forward much faster.
>

The problem did exist and the patch looks good to me.

> This patch is based on kernel 3.18. Unfortunately, I am not able to submit a
> patch based on one of the latest kernels, because here btrfs is part of the
> larger system, and upgrading the kernel is a significant effort.
> Hence marking the patch as RFC.
> Hopefully, this patch still has some value to the community.
> 
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 42d11e7..74feacb 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>     struct list_head trans_list;
>     struct list_head dead_roots;
>     struct list_head caching_block_groups;
> +    /* protects the above list */
> +    struct mutex caching_block_groups_mutex;
> 
>     spinlock_t delayed_iput_lock;
>     struct list_head delayed_iputs;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5177954..130ec58 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>     INIT_LIST_HEAD(&fs_info->delayed_iputs);
>     INIT_LIST_HEAD(&fs_info->delalloc_roots);
>     INIT_LIST_HEAD(&fs_info->caching_block_groups);
> +    mutex_init(&fs_info->caching_block_groups_mutex);
>     spin_lock_init(&fs_info->delalloc_root_lock);
>     spin_lock_init(&fs_info->trans_lock);
>     spin_lock_init(&fs_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a067065..906fb08 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>         return 0;
>     }
> 
> -    down_write(&fs_info->commit_root_sem);
> +    mutex_lock(&fs_info->caching_block_groups_mutex);
>     atomic_inc(&caching_ctl->count);
>     list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
> -    up_write(&fs_info->commit_root_sem);
> +    mutex_unlock(&fs_info->caching_block_groups_mutex);
> 
>     btrfs_get_block_group(cache);
> 
> @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
> 
>     down_write(&fs_info->commit_root_sem);
>

With the new mutex, it's not necessary to take commit_root_sem here
because a) pinned_extents cannot be modified outside of a transaction,
and b) when we are in btrfs_prepare_extent_commit(), we are already at
the very end of committing a transaction.

caching_ctl should have at least one reference and
caching_ctl->progress is supposed to be protected by
caching_ctl->mutex.
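
(As a rough illustration of the above, and not part of the posted patch, the
loop could then look something like this, with only the new mutex held and
caching_ctl->mutex taken around the ->progress read:)

    mutex_lock(&fs_info->caching_block_groups_mutex);
    list_for_each_entry_safe(caching_ctl, next,
                             &fs_info->caching_block_groups, list) {
            cache = caching_ctl->block_group;
            if (block_group_cache_done(cache)) {
                    cache->last_byte_to_unpin = (u64)-1;
                    list_del_init(&caching_ctl->list);
                    put_caching_control(caching_ctl);
            } else {
                    mutex_lock(&caching_ctl->mutex);
                    cache->last_byte_to_unpin = caching_ctl->progress;
                    mutex_unlock(&caching_ctl->mutex);
            }
    }
    mutex_unlock(&fs_info->caching_block_groups_mutex);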

Thanks,

-liubo

> +    mutex_lock(&fs_info->caching_block_groups_mutex);
>     list_for_each_entry_safe(caching_ctl, next,
>                  &fs_info->caching_block_groups, list) {
>         cache = caching_ctl->block_group;
> @@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
>             cache->last_byte_to_unpin = caching_ctl->progress;
>         }
>     }
> +    mutex_unlock(&fs_info->caching_block_groups_mutex);
> 
>     if (fs_info->pinned_extents == &fs_info->freed_extents[0])
>         fs_info->pinned_extents = &fs_info->freed_extents[1];
> @@ -8849,14 +8851,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>     struct btrfs_caching_control *caching_ctl;
>     struct rb_node *n;
> 
> -    down_write(&info->commit_root_sem);
> +    mutex_lock(&info->caching_block_groups_mutex);
>     while (!list_empty(&info->caching_block_groups)) {
>         caching_ctl = list_entry(info->caching_block_groups.next,
>                      struct btrfs_caching_control, list);
>         list_del(&caching_ctl->list);
>         put_caching_control(caching_ctl);
>     }
> -    up_write(&info->commit_root_sem);
> +    mutex_unlock(&info->caching_block_groups_mutex);
> 
>     spin_lock(&info->unused_bgs_lock);
>     while (!list_empty(&info->unused_bgs)) {
> 
> 
> 


* Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
  2017-03-21 22:40 ` Liu Bo
@ 2017-05-13 18:45   ` Alex Lyakas
  0 siblings, 0 replies; 4+ messages in thread
From: Alex Lyakas @ 2017-05-13 18:45 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs, Josef Bacik, dsterba

Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> [...]
>
> The problem did exist and the patch looks good to me.
>
>> [...]
>
> With the new mutex, it's not necessary to take commit_root_sem here
> because a) pinned_extents cannot be modified outside of a transaction,
> and b) when we are in btrfs_prepare_extent_commit(), we are already at
> the very end of committing a transaction.
>
> caching_ctl should have at least one reference and
> caching_ctl->progress is supposed to be protected by
> caching_ctl->mutex.
>
> Thanks,
>
> -liubo
Thank you for the review.

Alex.




