All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
@ 2018-06-15  1:35 Qu Wenruo
  2018-06-20  5:13 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Qu Wenruo @ 2018-06-15  1:35 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Under certain KVM load and LTP tests, we are possible to hit the
following calltrace if quota is enabled:
------
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
------------[ cut here ]------------
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40
Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
---[ end trace f079fb809e7a862b ]---
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
------

[CAUSE]
It's caused by race with block group auto removal like the following
case:
- There is a meta block group X, which has only one tree block
  The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
  The tree block get CoWed, so the block group X is empty, and marked as
  unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
  Which will call btrfs_deleted_unused_bgs() to remove unused block
  groups.
  So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
  Quota needs to get the original reference of the extent, which will
  reads tree blocks of commit root of 257.
  Then since the chunk map get removed, above warning get triggered.

[FIX]
Just teach btrfs_delete_unused_bgs() to skip block group who still has
pinned bytes.

However there is a minor side effect, since currently we only queue
empty blocks at update_block_group(), and such empty block group with
pinned bytes won't go through update_block_group() again, such block
group won't be removed, until it get new extent allocated and removed.

But please note that, there are more problems related to extent
allocator with block group auto removal.

Even a block group is marked unused, extent allocator can still allocate
new extents from unused block group.
Thus delaying block group to next transaction won't work.
(Extents get allocated in current transaction, and removed again in next
transaction).

So the root fix need to co-operate with extent allocator.
But current fix should be enough for this particular bug.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
changelog:
v2:
  Commit message update, to better indicate how pinned byte is used in
  btrfs and why it's related to quota.
v3:
  Commit message update, further explaining the bug with an example.
  And added the side effect of the fix, and possible further fix.
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f190023386a9..7d14c4ca8232 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		/* Don't want to race with allocators so take the groups_sem */
 		down_write(&space_info->groups_sem);
 		spin_lock(&block_group->lock);
-		if (block_group->reserved ||
+		if (block_group->reserved || block_group->pinned ||
 		    btrfs_block_group_used(&block_group->item) ||
 		    block_group->ro ||
 		    list_is_singular(&block_group->list)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-15  1:35 [PATCH v3] btrfs: Don't remove block group still has pinned down bytes Qu Wenruo
@ 2018-06-20  5:13 ` Qu Wenruo
  2018-06-20  5:25 ` Nikolay Borisov
  2018-06-20  9:13 ` Filipe Manana
  2 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2018-06-20  5:13 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 5772 bytes --]

Gentle ping.

Since it's causing LTP tests failure, I'd like to backport it asap.

Thanks,
Qu

On 2018年06月15日 09:35, Qu Wenruo wrote:
> [BUG]
> Under certain KVM load and LTP tests, we are possible to hit the
> following calltrace if quota is enabled:
> ------
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
> RIP: 0010:blk_status_to_errno+0x1a/0x30
> Call Trace:
>  submit_extent_page+0x191/0x270 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  __do_readpage+0x2d2/0x810 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  __extent_read_full_page+0xe7/0x100 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>  read_tree_block+0x31/0x60 [btrfs]
>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>  ? kmem_cache_alloc+0x1a8/0x510
>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>  ? pick_next_task_fair+0x2cd/0x530
>  ? __switch_to+0x92/0x4b0
>  btrfs_worker_helper+0x81/0x300 [btrfs]
>  process_one_work+0x1da/0x3f0
>  worker_thread+0x2b/0x3f0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0x11a/0x130
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x35/0x40
> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
> ---[ end trace f079fb809e7a862b ]---
> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
> BTRFS info (device vda2): forced readonly
> BTRFS error (device vda2): pending csums is 2887680
> ------
> 
> [CAUSE]
> It's caused by race with block group auto removal like the following
> case:
> - There is a meta block group X, which has only one tree block
>   The tree block belongs to fs tree 257.
> - In current transaction, some operation modified fs tree 257
>   The tree block get CoWed, so the block group X is empty, and marked as
>   unused, queued to be deleted.
> - Some workload (like fsync) wakes up cleaner_kthread()
>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>   groups.
>   So block group X along its chunk map get removed.
> - Some delalloc work finished for fs tree 257
>   Quota needs to get the original reference of the extent, which will
>   reads tree blocks of commit root of 257.
>   Then since the chunk map get removed, above warning get triggered.
> 
> [FIX]
> Just teach btrfs_delete_unused_bgs() to skip block group who still has
> pinned bytes.
> 
> However there is a minor side effect, since currently we only queue
> empty blocks at update_block_group(), and such empty block group with
> pinned bytes won't go through update_block_group() again, such block
> group won't be removed, until it get new extent allocated and removed.
> 
> But please note that, there are more problems related to extent
> allocator with block group auto removal.
> 
> Even a block group is marked unused, extent allocator can still allocate
> new extents from unused block group.
> Thus delaying block group to next transaction won't work.
> (Extents get allocated in current transaction, and removed again in next
> transaction).
> 
> So the root fix need to co-operate with extent allocator.
> But current fix should be enough for this particular bug.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> changelog:
> v2:
>   Commit message update, to better indicate how pinned byte is used in
>   btrfs and why it's related to quota.
> v3:
>   Commit message update, further explaining the bug with an example.
>   And added the side effect of the fix, and possible further fix.
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f190023386a9..7d14c4ca8232 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		/* Don't want to race with allocators so take the groups_sem */
>  		down_write(&space_info->groups_sem);
>  		spin_lock(&block_group->lock);
> -		if (block_group->reserved ||
> +		if (block_group->reserved || block_group->pinned ||
>  		    btrfs_block_group_used(&block_group->item) ||
>  		    block_group->ro ||
>  		    list_is_singular(&block_group->list)) {
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-15  1:35 [PATCH v3] btrfs: Don't remove block group still has pinned down bytes Qu Wenruo
  2018-06-20  5:13 ` Qu Wenruo
@ 2018-06-20  5:25 ` Nikolay Borisov
  2018-06-20  5:34   ` Qu Wenruo
  2018-06-20  9:13 ` Filipe Manana
  2 siblings, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2018-06-20  5:25 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 15.06.2018 04:35, Qu Wenruo wrote:
> [BUG]
> Under certain KVM load and LTP tests, we are possible to hit the
> following calltrace if quota is enabled:
> ------
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
> RIP: 0010:blk_status_to_errno+0x1a/0x30
> Call Trace:
>  submit_extent_page+0x191/0x270 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  __do_readpage+0x2d2/0x810 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  __extent_read_full_page+0xe7/0x100 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>  read_tree_block+0x31/0x60 [btrfs]
>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>  ? kmem_cache_alloc+0x1a8/0x510
>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>  ? pick_next_task_fair+0x2cd/0x530
>  ? __switch_to+0x92/0x4b0
>  btrfs_worker_helper+0x81/0x300 [btrfs]
>  process_one_work+0x1da/0x3f0
>  worker_thread+0x2b/0x3f0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0x11a/0x130
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x35/0x40
> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
> ---[ end trace f079fb809e7a862b ]---
> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
> BTRFS info (device vda2): forced readonly
> BTRFS error (device vda2): pending csums is 2887680
> ------
> 
> [CAUSE]
> It's caused by race with block group auto removal like the following
> case:
> - There is a meta block group X, which has only one tree block
>   The tree block belongs to fs tree 257.
> - In current transaction, some operation modified fs tree 257
>   The tree block get CoWed, so the block group X is empty, and marked as
>   unused, queued to be deleted.
> - Some workload (like fsync) wakes up cleaner_kthread()
>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>   groups.
>   So block group X along its chunk map get removed.
> - Some delalloc work finished for fs tree 257
>   Quota needs to get the original reference of the extent, which will
>   reads tree blocks of commit root of 257.
>   Then since the chunk map get removed, above warning get triggered.
> 
> [FIX]
> Just teach btrfs_delete_unused_bgs() to skip block group who still has
> pinned bytes.
> 
> However there is a minor side effect, since currently we only queue
> empty blocks at update_block_group(), and such empty block group with
> pinned bytes won't go through update_block_group() again, such block
> group won't be removed, until it get new extent allocated and removed.
> 
> But please note that, there are more problems related to extent
> allocator with block group auto removal.
> 
> Even a block group is marked unused, extent allocator can still allocate
> new extents from unused block group.
> Thus delaying block group to next transaction won't work.
> (Extents get allocated in current transaction, and removed again in next
> transaction).

Isn't this easily solvable by making the allocator check whether the
block group it has chosen is part of the unused_bgs_list ?

> 
> So the root fix need to co-operate with extent allocator.
> But current fix should be enough for this particular bug.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

LGTM,

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

> ---
> changelog:
> v2:
>   Commit message update, to better indicate how pinned byte is used in
>   btrfs and why it's related to quota.
> v3:
>   Commit message update, further explaining the bug with an example.
>   And added the side effect of the fix, and possible further fix.
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f190023386a9..7d14c4ca8232 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  		/* Don't want to race with allocators so take the groups_sem */
>  		down_write(&space_info->groups_sem);
>  		spin_lock(&block_group->lock);
> -		if (block_group->reserved ||
> +		if (block_group->reserved || block_group->pinned ||
>  		    btrfs_block_group_used(&block_group->item) ||
>  		    block_group->ro ||
>  		    list_is_singular(&block_group->list)) {
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20  5:25 ` Nikolay Borisov
@ 2018-06-20  5:34   ` Qu Wenruo
  2018-06-20  6:38     ` Nikolay Borisov
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2018-06-20  5:34 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 2018年06月20日 13:25, Nikolay Borisov wrote:
> 
> 
> On 15.06.2018 04:35, Qu Wenruo wrote:
>> [BUG]
>> Under certain KVM load and LTP tests, we are possible to hit the
>> following calltrace if quota is enabled:
>> ------
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>> Call Trace:
>>  submit_extent_page+0x191/0x270 [btrfs]
>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>  __do_readpage+0x2d2/0x810 [btrfs]
>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>  read_tree_block+0x31/0x60 [btrfs]
>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>  ? kmem_cache_alloc+0x1a8/0x510
>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>  ? pick_next_task_fair+0x2cd/0x530
>>  ? __switch_to+0x92/0x4b0
>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>  process_one_work+0x1da/0x3f0
>>  worker_thread+0x2b/0x3f0
>>  ? process_one_work+0x3f0/0x3f0
>>  kthread+0x11a/0x130
>>  ? kthread_create_on_node+0x40/0x40
>>  ret_from_fork+0x35/0x40
>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>> ---[ end trace f079fb809e7a862b ]---
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>> BTRFS info (device vda2): forced readonly
>> BTRFS error (device vda2): pending csums is 2887680
>> ------
>>
>> [CAUSE]
>> It's caused by race with block group auto removal like the following
>> case:
>> - There is a meta block group X, which has only one tree block
>>   The tree block belongs to fs tree 257.
>> - In current transaction, some operation modified fs tree 257
>>   The tree block get CoWed, so the block group X is empty, and marked as
>>   unused, queued to be deleted.
>> - Some workload (like fsync) wakes up cleaner_kthread()
>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>   groups.
>>   So block group X along its chunk map get removed.
>> - Some delalloc work finished for fs tree 257
>>   Quota needs to get the original reference of the extent, which will
>>   reads tree blocks of commit root of 257.
>>   Then since the chunk map get removed, above warning get triggered.
>>
>> [FIX]
>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>> pinned bytes.
>>
>> However there is a minor side effect, since currently we only queue
>> empty blocks at update_block_group(), and such empty block group with
>> pinned bytes won't go through update_block_group() again, such block
>> group won't be removed, until it get new extent allocated and removed.
>>
>> But please note that, there are more problems related to extent
>> allocator with block group auto removal.
>>
>> Even a block group is marked unused, extent allocator can still allocate
>> new extents from unused block group.
>> Thus delaying block group to next transaction won't work.
>> (Extents get allocated in current transaction, and removed again in next
>> transaction).
> 
> Isn't this easily solvable by making the allocator check whether the
> block group it has chosen is part of the unused_bgs_list ?

That's also my initial idea, however bg_cache->bg_list can also be used
in trans->new_bgs.

Another factor is, even we check the block group of an extent in
find_free_extent() and skip the allocation, we can lead to more frequent
false ENOSPC.

The correct way to do it is to allow find_free_extent() to skip block
group which is in unused_bgs_list (maybe needs extra bit to indicate
this), if we have other block group to use.
Or we still need to use that unused block group, and remove it from
unused_bgs_list to avoid ENOSPC and unneeded new block group creation.

(BTW, there is a bug that find_free_extent() checked block_group->list
wrongly)

That's the main reason sending out such small fix, as the perfect fix
may be much complex than my expectation.

Thanks,
Qu

> 
>>
>> So the root fix need to co-operate with extent allocator.
>> But current fix should be enough for this particular bug.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> LGTM,
> 
> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
> 
>> ---
>> changelog:
>> v2:
>>   Commit message update, to better indicate how pinned byte is used in
>>   btrfs and why it's related to quota.
>> v3:
>>   Commit message update, further explaining the bug with an example.
>>   And added the side effect of the fix, and possible further fix.
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index f190023386a9..7d14c4ca8232 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>  		/* Don't want to race with allocators so take the groups_sem */
>>  		down_write(&space_info->groups_sem);
>>  		spin_lock(&block_group->lock);
>> -		if (block_group->reserved ||
>> +		if (block_group->reserved || block_group->pinned ||
>>  		    btrfs_block_group_used(&block_group->item) ||
>>  		    block_group->ro ||
>>  		    list_is_singular(&block_group->list)) {
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20  5:34   ` Qu Wenruo
@ 2018-06-20  6:38     ` Nikolay Borisov
  0 siblings, 0 replies; 11+ messages in thread
From: Nikolay Borisov @ 2018-06-20  6:38 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 20.06.2018 08:34, Qu Wenruo wrote:
> 
> 
> On 2018年06月20日 13:25, Nikolay Borisov wrote:
>>
>>
>> On 15.06.2018 04:35, Qu Wenruo wrote:
>>> [BUG]
>>> Under certain KVM load and LTP tests, we are possible to hit the
>>> following calltrace if quota is enabled:
>>> ------
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>> Call Trace:
>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>  read_tree_block+0x31/0x60 [btrfs]
>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>  ? pick_next_task_fair+0x2cd/0x530
>>>  ? __switch_to+0x92/0x4b0
>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>  process_one_work+0x1da/0x3f0
>>>  worker_thread+0x2b/0x3f0
>>>  ? process_one_work+0x3f0/0x3f0
>>>  kthread+0x11a/0x130
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x35/0x40
>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>> ---[ end trace f079fb809e7a862b ]---
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>>> BTRFS info (device vda2): forced readonly
>>> BTRFS error (device vda2): pending csums is 2887680
>>> ------
>>>
>>> [CAUSE]
>>> It's caused by race with block group auto removal like the following
>>> case:
>>> - There is a meta block group X, which has only one tree block
>>>   The tree block belongs to fs tree 257.
>>> - In current transaction, some operation modified fs tree 257
>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>   unused, queued to be deleted.
>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>   groups.
>>>   So block group X along its chunk map get removed.
>>> - Some delalloc work finished for fs tree 257
>>>   Quota needs to get the original reference of the extent, which will
>>>   reads tree blocks of commit root of 257.
>>>   Then since the chunk map get removed, above warning get triggered.
>>>
>>> [FIX]
>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>> pinned bytes.
>>>
>>> However there is a minor side effect, since currently we only queue
>>> empty blocks at update_block_group(), and such empty block group with
>>> pinned bytes won't go through update_block_group() again, such block
>>> group won't be removed, until it get new extent allocated and removed.
>>>
>>> But please note that, there are more problems related to extent
>>> allocator with block group auto removal.
>>>
>>> Even a block group is marked unused, extent allocator can still allocate
>>> new extents from unused block group.
>>> Thus delaying block group to next transaction won't work.
>>> (Extents get allocated in current transaction, and removed again in next
>>> transaction).
>>
>> Isn't this easily solvable by making the allocator check whether the
>> block group it has chosen is part of the unused_bgs_list ?
> 
> That's also my initial idea, however bg_cache->bg_list can also be used
> in trans->new_bgs.
> 
> Another factor is, even we check the block group of an extent in
> find_free_extent() and skip the allocation, we can lead to more frequent
> false ENOSPC.
> 
> The correct way to do it is to allow find_free_extent() to skip block
> group which is in unused_bgs_list (maybe needs extra bit to indicate
> this), if we have other block group to use.
> Or we still need to use that unused block group, and remove it from
> unused_bgs_list to avoid ENOSPC and unneeded new block group creation.
> 
> (BTW, there is a bug that find_free_extent() checked block_group->list
> wrongly)
> 
> That's the main reason sending out such small fix, as the perfect fix
> may be much complex than my expectation.

Well, if we have somewhat major deficiencies in the allocator then I'd
say in the long run those need to be fixed - properly. AFAIR we've
discussed the fact that the bg_list  is somewhat abused since it could
be linked to one of 2 lists, AFAIR it was in the context of some locking
you did around block groups. I guess the take away is that the block
group management + allocator are in need of some overhauling rather than
piling hacks on top of hacks just to make it work another day.


> 
> Thanks,
> Qu
> 
>>
>>>
>>> So the root fix need to co-operate with extent allocator.
>>> But current fix should be enough for this particular bug.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> LGTM,
>>
>> Reviewed-by: Nikolay Borisov <nborisov@suse.com>
>>
>>> ---
>>> changelog:
>>> v2:
>>>   Commit message update, to better indicate how pinned byte is used in
>>>   btrfs and why it's related to quota.
>>> v3:
>>>   Commit message update, further explaining the bug with an example.
>>>   And added the side effect of the fix, and possible further fix.
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index f190023386a9..7d14c4ca8232 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>  		/* Don't want to race with allocators so take the groups_sem */
>>>  		down_write(&space_info->groups_sem);
>>>  		spin_lock(&block_group->lock);
>>> -		if (block_group->reserved ||
>>> +		if (block_group->reserved || block_group->pinned ||
>>>  		    btrfs_block_group_used(&block_group->item) ||
>>>  		    block_group->ro ||
>>>  		    list_is_singular(&block_group->list)) {
>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-15  1:35 [PATCH v3] btrfs: Don't remove block group still has pinned down bytes Qu Wenruo
  2018-06-20  5:13 ` Qu Wenruo
  2018-06-20  5:25 ` Nikolay Borisov
@ 2018-06-20  9:13 ` Filipe Manana
  2018-06-20  9:22   ` Qu Wenruo
  2 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2018-06-20  9:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
> [BUG]
> Under certain KVM load and LTP tests, we are possible to hit the
> following calltrace if quota is enabled:
> ------
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
> RIP: 0010:blk_status_to_errno+0x1a/0x30
> Call Trace:
>  submit_extent_page+0x191/0x270 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  __do_readpage+0x2d2/0x810 [btrfs]
>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  __extent_read_full_page+0xe7/0x100 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>  read_tree_block+0x31/0x60 [btrfs]
>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>  ? kmem_cache_alloc+0x1a8/0x510
>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>  ? pick_next_task_fair+0x2cd/0x530
>  ? __switch_to+0x92/0x4b0
>  btrfs_worker_helper+0x81/0x300 [btrfs]
>  process_one_work+0x1da/0x3f0
>  worker_thread+0x2b/0x3f0
>  ? process_one_work+0x3f0/0x3f0
>  kthread+0x11a/0x130
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x35/0x40
> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
> ---[ end trace f079fb809e7a862b ]---
> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
> BTRFS info (device vda2): forced readonly
> BTRFS error (device vda2): pending csums is 2887680
> ------
>
> [CAUSE]
> It's caused by race with block group auto removal like the following
> case:
> - There is a meta block group X, which has only one tree block
>   The tree block belongs to fs tree 257.
> - In current transaction, some operation modified fs tree 257
>   The tree block get CoWed, so the block group X is empty, and marked as
>   unused, queued to be deleted.
> - Some workload (like fsync) wakes up cleaner_kthread()
>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>   groups.
>   So block group X along its chunk map get removed.
> - Some delalloc work finished for fs tree 257
>   Quota needs to get the original reference of the extent, which will
>   reads tree blocks of commit root of 257.
>   Then since the chunk map get removed, above warning get triggered.
>
> [FIX]
> Just teach btrfs_delete_unused_bgs() to skip block group who still has
> pinned bytes.
>
> However there is a minor side effect, since currently we only queue
> empty blocks at update_block_group(), and such empty block group with
> pinned bytes won't go through update_block_group() again, such block
> group won't be removed, until it get new extent allocated and removed.

So that can be fixed in a separate patch, to add it back to the list
of block groups to be deleted once everything is unpinned and passes
all other necessary criteria.

>
> But please note that, there are more problems related to extent
> allocator with block group auto removal.

The above isn't a problem of the allocator itself but rather in the
way we manage COW, commit roots and unpinning.

>
> Even a block group is marked unused, extent allocator can still allocate
> new extents from unused block group.

Why is that a problem?
It's ok (with some good benefits), as long as the cleaner thread (or
any thing that attempts to delete block groups in the unused list),
doesn't delete it.

> Thus delaying block group to next transaction won't work.
> (Extents get allocated in current transaction, and removed again in next
> transaction).
>
> So the root fix need to co-operate with extent allocator.

What do you mean by co-operation with the extent allocator? I don't
think the problem is there.


> But current fix should be enough for this particular bug.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> changelog:
> v2:
>   Commit message update, to better indicate how pinned byte is used in
>   btrfs and why it's related to quota.
> v3:
>   Commit message update, further explaining the bug with an example.
>   And added the side effect of the fix, and possible further fix.
> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f190023386a9..7d14c4ca8232 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                 /* Don't want to race with allocators so take the groups_sem */
>                 down_write(&space_info->groups_sem);
>                 spin_lock(&block_group->lock);
> -               if (block_group->reserved ||
> +               if (block_group->reserved || block_group->pinned ||
>                     btrfs_block_group_used(&block_group->item) ||
>                     block_group->ro ||
>                     list_is_singular(&block_group->list)) {
> --
> 2.17.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20  9:13 ` Filipe Manana
@ 2018-06-20  9:22   ` Qu Wenruo
  2018-06-20  9:33     ` Filipe Manana
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2018-06-20  9:22 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 7830 bytes --]



On 2018年06月20日 17:13, Filipe Manana wrote:
> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
>> [BUG]
>> Under certain KVM load and LTP tests, we are possible to hit the
>> following calltrace if quota is enabled:
>> ------
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>> Call Trace:
>>  submit_extent_page+0x191/0x270 [btrfs]
>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>  __do_readpage+0x2d2/0x810 [btrfs]
>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>  read_tree_block+0x31/0x60 [btrfs]
>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>  ? kmem_cache_alloc+0x1a8/0x510
>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>  ? pick_next_task_fair+0x2cd/0x530
>>  ? __switch_to+0x92/0x4b0
>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>  process_one_work+0x1da/0x3f0
>>  worker_thread+0x2b/0x3f0
>>  ? process_one_work+0x3f0/0x3f0
>>  kthread+0x11a/0x130
>>  ? kthread_create_on_node+0x40/0x40
>>  ret_from_fork+0x35/0x40
>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>> ---[ end trace f079fb809e7a862b ]---
>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>> BTRFS info (device vda2): forced readonly
>> BTRFS error (device vda2): pending csums is 2887680
>> ------
>>
>> [CAUSE]
>> It's caused by race with block group auto removal like the following
>> case:
>> - There is a meta block group X, which has only one tree block
>>   The tree block belongs to fs tree 257.
>> - In current transaction, some operation modified fs tree 257
>>   The tree block get CoWed, so the block group X is empty, and marked as
>>   unused, queued to be deleted.
>> - Some workload (like fsync) wakes up cleaner_kthread()
>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>   groups.
>>   So block group X along its chunk map get removed.
>> - Some delalloc work finished for fs tree 257
>>   Quota needs to get the original reference of the extent, which will
>>   reads tree blocks of commit root of 257.
>>   Then since the chunk map get removed, above warning get triggered.
>>
>> [FIX]
>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>> pinned bytes.
>>
>> However there is a minor side effect, since currently we only queue
>> empty blocks at update_block_group(), and such empty block group with
>> pinned bytes won't go through update_block_group() again, such block
>> group won't be removed, until it get new extent allocated and removed.
> 
> So that can be fixed in a separate patch, to add it back to the list
> of block groups to be deleted once everything is unpinned and passes
> all other necessary criteria.

That's the plan.
Although still something more need to be considered.

> 
>>
>> But please note that, there are more problems related to extent
>> allocator with block group auto removal.
> 
> The above isn't a problem of the allocator itself but rather in the
> way we manage COW, commit roots and unpinning.
> 
>>
>> Even a block group is marked unused, extent allocator can still allocate
>> new extents from unused block group.
> 
> Why is that a problem?
> It's ok (with some good benefits), as long as the cleaner thread (or
> any thing that attempts to delete block groups in the unused list),
> doesn't delete it.

It in fact could cause problem under certain case, e.g. btrfs needs to
allocate new data chunks, while all existing metadata are pretty sparse
with a lot of holes.
In that case, an empty block group which can be deleted will help.

Although balance should be the ultimate fix, if btrfs can be more
aggressive to avoid using empty block groups, it would definitely help.

> 
>> Thus delaying block group to next transaction won't work.
>> (Extents get allocated in current transaction, and removed again in next
>> transaction).
>>
>> So the root fix need to co-operate with extent allocator.
> 
> What do you mean by co-operation with the extent allocator? I don't
> think the problem is there.

The bug itself is not related to extent allocator.

It's my later planned fix related to extent allocator.
It's about how aggressive we should try to claim empty block group.
Current behavior is pretty passive, mostly by luck to delete unused
block group.
My purposed fix is to claim empty block groups more aggressive.
If we find an empty block group (even with pinned bytes), we'll try our
best not to allocate from it, so in next transaction (with its pinned
bytes removed) we will definitely claim the space.

Thanks,
Qu

> 
> 
>> But current fix should be enough for this particular bug.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> changelog:
>> v2:
>>   Commit message update, to better indicate how pinned byte is used in
>>   btrfs and why it's related to quota.
>> v3:
>>   Commit message update, further explaining the bug with an example.
>>   And added the side effect of the fix, and possible further fix.
>> ---
>>  fs/btrfs/extent-tree.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index f190023386a9..7d14c4ca8232 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>                 /* Don't want to race with allocators so take the groups_sem */
>>                 down_write(&space_info->groups_sem);
>>                 spin_lock(&block_group->lock);
>> -               if (block_group->reserved ||
>> +               if (block_group->reserved || block_group->pinned ||
>>                     btrfs_block_group_used(&block_group->item) ||
>>                     block_group->ro ||
>>                     list_is_singular(&block_group->list)) {
>> --
>> 2.17.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20  9:22   ` Qu Wenruo
@ 2018-06-20  9:33     ` Filipe Manana
  2018-06-20 11:03       ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2018-06-20  9:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo <wqu@suse.de> wrote:
>
>
> On 2018年06月20日 17:13, Filipe Manana wrote:
>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
>>> [BUG]
>>> Under certain KVM load and LTP tests, we are possible to hit the
>>> following calltrace if quota is enabled:
>>> ------
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>> Call Trace:
>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>  read_tree_block+0x31/0x60 [btrfs]
>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>  ? pick_next_task_fair+0x2cd/0x530
>>>  ? __switch_to+0x92/0x4b0
>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>  process_one_work+0x1da/0x3f0
>>>  worker_thread+0x2b/0x3f0
>>>  ? process_one_work+0x3f0/0x3f0
>>>  kthread+0x11a/0x130
>>>  ? kthread_create_on_node+0x40/0x40
>>>  ret_from_fork+0x35/0x40
>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>> ---[ end trace f079fb809e7a862b ]---
>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>>> BTRFS info (device vda2): forced readonly
>>> BTRFS error (device vda2): pending csums is 2887680
>>> ------
>>>
>>> [CAUSE]
>>> It's caused by race with block group auto removal like the following
>>> case:
>>> - There is a meta block group X, which has only one tree block
>>>   The tree block belongs to fs tree 257.
>>> - In current transaction, some operation modified fs tree 257
>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>   unused, queued to be deleted.
>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>   groups.
>>>   So block group X along its chunk map get removed.
>>> - Some delalloc work finished for fs tree 257
>>>   Quota needs to get the original reference of the extent, which will
>>>   reads tree blocks of commit root of 257.
>>>   Then since the chunk map get removed, above warning get triggered.
>>>
>>> [FIX]
>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>> pinned bytes.
>>>
>>> However there is a minor side effect, since currently we only queue
>>> empty blocks at update_block_group(), and such empty block group with
>>> pinned bytes won't go through update_block_group() again, such block
>>> group won't be removed, until it get new extent allocated and removed.
>>
>> So that can be fixed in a separate patch, to add it back to the list
>> of block groups to be deleted once everything is unpinned and passes
>> all other necessary criteria.
>
> That's the plan.
> Although still something more need to be considered.
>
>>
>>>
>>> But please note that, there are more problems related to extent
>>> allocator with block group auto removal.
>>
>> The above isn't a problem of the allocator itself but rather in the
>> way we manage COW, commit roots and unpinning.
>>
>>>
>>> Even a block group is marked unused, extent allocator can still allocate
>>> new extents from unused block group.
>>
>> Why is that a problem?
>> It's ok (with some good benefits), as long as the cleaner thread (or
>> any thing that attempts to delete block groups in the unused list),
>> doesn't delete it.
>
> It in fact could cause problem under certain case, e.g. btrfs needs to
> allocate new data chunks, while all existing metadata are pretty sparse
> with a lot of holes.
> In that case, an empty block group which can be deleted will help.

That doesn't answer my question.
Let me rephrase it - If we need to allocate an extent and we allocate
one from an empty block group, why is that a problem?

>
> Although balance should be the ultimate fix, if btrfs can be more
> aggressive to avoid using empty block groups, it would definitely help.
>
>>
>>> Thus delaying block group to next transaction won't work.
>>> (Extents get allocated in current transaction, and removed again in next
>>> transaction).
>>>
>>> So the root fix need to co-operate with extent allocator.
>>
>> What do you mean by co-operation with the extent allocator? I don't
>> think the problem is there.
>
> The bug itself is not related to extent allocator.
>
> It's my later planned fix related to extent allocator.
> It's about how aggressive we should try to claim empty block group.
> Current behavior is pretty passive, mostly by luck to delete unused
> block group.
> My purposed fix is to claim empty block groups more aggressive.
> If we find an empty block group (even with pinned bytes), we'll try our
> best not to allocate from it, so in next transaction (with its pinned
> bytes removed) we will definitely claim the space.

Still doesn't answer my question. Why avoid allocation from an empty
block group?
By skipping it, you may end forcing allocation of a new block group
which mail fail due to lack of unused space.
What is your goal?

It would make sense to me to make it prefer to use non-empty block
groups first and then only if none is available, to use an empty one.
But this is different from what you are saying, since you seem
explicit about never using an empty one.

>
> Thanks,
> Qu
>
>>
>>
>>> But current fix should be enough for this particular bug.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> changelog:
>>> v2:
>>>   Commit message update, to better indicate how pinned byte is used in
>>>   btrfs and why it's related to quota.
>>> v3:
>>>   Commit message update, further explaining the bug with an example.
>>>   And added the side effect of the fix, and possible further fix.
>>> ---
>>>  fs/btrfs/extent-tree.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index f190023386a9..7d14c4ca8232 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>                 /* Don't want to race with allocators so take the groups_sem */
>>>                 down_write(&space_info->groups_sem);
>>>                 spin_lock(&block_group->lock);
>>> -               if (block_group->reserved ||
>>> +               if (block_group->reserved || block_group->pinned ||
>>>                     btrfs_block_group_used(&block_group->item) ||
>>>                     block_group->ro ||
>>>                     list_is_singular(&block_group->list)) {
>>> --
>>> 2.17.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20  9:33     ` Filipe Manana
@ 2018-06-20 11:03       ` Qu Wenruo
  2018-06-21 16:34         ` Filipe Manana
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2018-06-20 11:03 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 9752 bytes --]



On 2018年06月20日 17:33, Filipe Manana wrote:
> On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo <wqu@suse.de> wrote:
>>
>>
>> On 2018年06月20日 17:13, Filipe Manana wrote:
>>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
>>>> [BUG]
>>>> Under certain KVM load and LTP tests, we are possible to hit the
>>>> following calltrace if quota is enabled:
>>>> ------
>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>>> Call Trace:
>>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>>  read_tree_block+0x31/0x60 [btrfs]
>>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>>  ? pick_next_task_fair+0x2cd/0x530
>>>>  ? __switch_to+0x92/0x4b0
>>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>>  process_one_work+0x1da/0x3f0
>>>>  worker_thread+0x2b/0x3f0
>>>>  ? process_one_work+0x3f0/0x3f0
>>>>  kthread+0x11a/0x130
>>>>  ? kthread_create_on_node+0x40/0x40
>>>>  ret_from_fork+0x35/0x40
>>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>>> ---[ end trace f079fb809e7a862b ]---
>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>>>> BTRFS info (device vda2): forced readonly
>>>> BTRFS error (device vda2): pending csums is 2887680
>>>> ------
>>>>
>>>> [CAUSE]
>>>> It's caused by race with block group auto removal like the following
>>>> case:
>>>> - There is a meta block group X, which has only one tree block
>>>>   The tree block belongs to fs tree 257.
>>>> - In current transaction, some operation modified fs tree 257
>>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>>   unused, queued to be deleted.
>>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>>   groups.
>>>>   So block group X along its chunk map get removed.
>>>> - Some delalloc work finished for fs tree 257
>>>>   Quota needs to get the original reference of the extent, which will
>>>>   reads tree blocks of commit root of 257.
>>>>   Then since the chunk map get removed, above warning get triggered.
>>>>
>>>> [FIX]
>>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>>> pinned bytes.
>>>>
>>>> However there is a minor side effect, since currently we only queue
>>>> empty blocks at update_block_group(), and such empty block group with
>>>> pinned bytes won't go through update_block_group() again, such block
>>>> group won't be removed, until it get new extent allocated and removed.
>>>
>>> So that can be fixed in a separate patch, to add it back to the list
>>> of block groups to be deleted once everything is unpinned and passes
>>> all other necessary criteria.
>>
>> That's the plan.
>> Although still something more need to be considered.
>>
>>>
>>>>
>>>> But please note that, there are more problems related to extent
>>>> allocator with block group auto removal.
>>>
>>> The above isn't a problem of the allocator itself but rather in the
>>> way we manage COW, commit roots and unpinning.
>>>
>>>>
>>>> Even a block group is marked unused, extent allocator can still allocate
>>>> new extents from unused block group.
>>>
>>> Why is that a problem?
>>> It's ok (with some good benefits), as long as the cleaner thread (or
>>> any thing that attempts to delete block groups in the unused list),
>>> doesn't delete it.
>>
>> It in fact could cause problem under certain case, e.g. btrfs needs to
>> allocate new data chunks, while all existing metadata are pretty sparse
>> with a lot of holes.
>> In that case, an empty block group which can be deleted will help.
> 
> That doesn't answer my question.
> Let me rephrase it - If we need to allocate an extent and we allocate
> one from an empty block group, why is that a problem?
> 
>>
>> Although balance should be the ultimate fix, if btrfs can be more
>> aggressive to avoid using empty block groups, it would definitely help.
>>
>>>
>>>> Thus delaying block group to next transaction won't work.
>>>> (Extents get allocated in current transaction, and removed again in next
>>>> transaction).
>>>>
>>>> So the root fix need to co-operate with extent allocator.
>>>
>>> What do you mean by co-operation with the extent allocator? I don't
>>> think the problem is there.
>>
>> The bug itself is not related to extent allocator.
>>
>> It's my later planned fix related to extent allocator.
>> It's about how aggressive we should try to claim empty block group.
>> Current behavior is pretty passive, mostly by luck to delete unused
>> block group.
>> My purposed fix is to claim empty block groups more aggressive.
>> If we find an empty block group (even with pinned bytes), we'll try our
>> best not to allocate from it, so in next transaction (with its pinned
>> bytes removed) we will definitely claim the space.
> 
> Still doesn't answer my question. Why avoid allocation from an empty
> block group?

I have already answered your question: to ensure (or try to) this empty
block group will be deleted.

> By skipping it, you may end forcing allocation of a new block group
> which mail fail due to lack of unused space.
> What is your goal?

To make block group removal more aggressive.
For forced block group allocation, we can detect such case and fallback
to the empty block group.

> 
> It would make sense to me to make it prefer to use non-empty block
> groups first and then only if none is available, to use an empty one.

Exactly.

This in fact introduced priority for block groups.
Empty block groups have the lowest priority.
And we can enhance it furthermore, by prefer near-full block groups
more, so we can reduce the need to do routine balance.

> But this is different from what you are saying, since you seem
> explicit about never using an empty one.

I mean, if we have other block group with enough space, we could ban
allocating from empty block group.
If we're forced to allocate new block group, then fallback to reuse
empty one, and remove it from unused_bgs_list.

Thanks,
Qu



> 
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>>> But current fix should be enough for this particular bug.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>> changelog:
>>>> v2:
>>>>   Commit message update, to better indicate how pinned byte is used in
>>>>   btrfs and why it's related to quota.
>>>> v3:
>>>>   Commit message update, further explaining the bug with an example.
>>>>   And added the side effect of the fix, and possible further fix.
>>>> ---
>>>>  fs/btrfs/extent-tree.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index f190023386a9..7d14c4ca8232 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>>                 /* Don't want to race with allocators so take the groups_sem */
>>>>                 down_write(&space_info->groups_sem);
>>>>                 spin_lock(&block_group->lock);
>>>> -               if (block_group->reserved ||
>>>> +               if (block_group->reserved || block_group->pinned ||
>>>>                     btrfs_block_group_used(&block_group->item) ||
>>>>                     block_group->ro ||
>>>>                     list_is_singular(&block_group->list)) {
>>>> --
>>>> 2.17.1
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-20 11:03       ` Qu Wenruo
@ 2018-06-21 16:34         ` Filipe Manana
  2018-06-22  0:54           ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Filipe Manana @ 2018-06-21 16:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jun 20, 2018 at 12:03 PM, Qu Wenruo <wqu@suse.de> wrote:
>
>
> On 2018年06月20日 17:33, Filipe Manana wrote:
>> On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo <wqu@suse.de> wrote:
>>>
>>>
>>> On 2018年06月20日 17:13, Filipe Manana wrote:
>>>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
>>>>> [BUG]
>>>>> Under certain KVM load and LTP tests, we are possible to hit the
>>>>> following calltrace if quota is enabled:
>>>>> ------
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>>> ------------[ cut here ]------------
>>>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>>>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>>>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>>>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>>>> Call Trace:
>>>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>>>  read_tree_block+0x31/0x60 [btrfs]
>>>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>>>  ? pick_next_task_fair+0x2cd/0x530
>>>>>  ? __switch_to+0x92/0x4b0
>>>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>>>  process_one_work+0x1da/0x3f0
>>>>>  worker_thread+0x2b/0x3f0
>>>>>  ? process_one_work+0x3f0/0x3f0
>>>>>  kthread+0x11a/0x130
>>>>>  ? kthread_create_on_node+0x40/0x40
>>>>>  ret_from_fork+0x35/0x40
>>>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>>>> ---[ end trace f079fb809e7a862b ]---
>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>>>>> BTRFS info (device vda2): forced readonly
>>>>> BTRFS error (device vda2): pending csums is 2887680
>>>>> ------
>>>>>
>>>>> [CAUSE]
>>>>> It's caused by race with block group auto removal like the following
>>>>> case:
>>>>> - There is a meta block group X, which has only one tree block
>>>>>   The tree block belongs to fs tree 257.
>>>>> - In current transaction, some operation modified fs tree 257
>>>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>>>   unused, queued to be deleted.
>>>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>>>   groups.
>>>>>   So block group X along its chunk map get removed.
>>>>> - Some delalloc work finished for fs tree 257
>>>>>   Quota needs to get the original reference of the extent, which will
>>>>>   reads tree blocks of commit root of 257.
>>>>>   Then since the chunk map get removed, above warning get triggered.
>>>>>
>>>>> [FIX]
>>>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>>>> pinned bytes.
>>>>>
>>>>> However there is a minor side effect, since currently we only queue
>>>>> empty blocks at update_block_group(), and such empty block group with
>>>>> pinned bytes won't go through update_block_group() again, such block
>>>>> group won't be removed, until it get new extent allocated and removed.
>>>>
>>>> So that can be fixed in a separate patch, to add it back to the list
>>>> of block groups to be deleted once everything is unpinned and passes
>>>> all other necessary criteria.
>>>
>>> That's the plan.
>>> Although still something more need to be considered.
>>>
>>>>
>>>>>
>>>>> But please note that, there are more problems related to extent
>>>>> allocator with block group auto removal.
>>>>
>>>> The above isn't a problem of the allocator itself but rather in the
>>>> way we manage COW, commit roots and unpinning.
>>>>
>>>>>
>>>>> Even a block group is marked unused, extent allocator can still allocate
>>>>> new extents from unused block group.
>>>>
>>>> Why is that a problem?
>>>> It's ok (with some good benefits), as long as the cleaner thread (or
>>>> any thing that attempts to delete block groups in the unused list),
>>>> doesn't delete it.
>>>
>>> It in fact could cause problem under certain case, e.g. btrfs needs to
>>> allocate new data chunks, while all existing metadata are pretty sparse
>>> with a lot of holes.
>>> In that case, an empty block group which can be deleted will help.
>>
>> That doesn't answer my question.
>> Let me rephrase it - If we need to allocate an extent and we allocate
>> one from an empty block group, why is that a problem?
>>
>>>
>>> Although balance should be the ultimate fix, if btrfs can be more
>>> aggressive to avoid using empty block groups, it would definitely help.
>>>
>>>>
>>>>> Thus delaying block group to next transaction won't work.
>>>>> (Extents get allocated in current transaction, and removed again in next
>>>>> transaction).
>>>>>
>>>>> So the root fix need to co-operate with extent allocator.
>>>>
>>>> What do you mean by co-operation with the extent allocator? I don't
>>>> think the problem is there.
>>>
>>> The bug itself is not related to extent allocator.
>>>
>>> It's my later planned fix related to extent allocator.
>>> It's about how aggressive we should try to claim empty block group.
>>> Current behavior is pretty passive, mostly by luck to delete unused
>>> block group.
>>> My purposed fix is to claim empty block groups more aggressive.
>>> If we find an empty block group (even with pinned bytes), we'll try our
>>> best not to allocate from it, so in next transaction (with its pinned
>>> bytes removed) we will definitely claim the space.
>>
>> Still doesn't answer my question. Why avoid allocation from an empty
>> block group?
>
> I have already answered your question: to ensure (or try to) this empty
> block group will be deleted.
>
>> By skipping it, you may end forcing allocation of a new block group
>> which mail fail due to lack of unused space.
>> What is your goal?
>
> To make block group removal more aggressive.
> For forced block group allocation, we can detect such case and fallback
> to the empty block group.
>
>>
>> It would make sense to me to make it prefer to use non-empty block
>> groups first and then only if none is available, to use an empty one.
>
> Exactly.
>
> This in fact introduced priority for block groups.
> Empty block groups have the lowest priority.
> And we can enhance it furthermore, by prefer near-full block groups
> more, so we can reduce the need to do routine balance.
>
>> But this is different from what you are saying, since you seem
>> explicit about never using an empty one.
>
> I mean, if we have other block group with enough space, we could ban
> allocating from empty block group.
> If we're forced to allocate new block group, then fallback to reuse
> empty one, and remove it from unused_bgs_list.

So I would just remove the following paragraphs from the change log:

"But please note that, there are more problems related to extent
allocator with block group auto removal.

Even a block group is marked unused, extent allocator can still allocate
new extents from unused block group.
Thus delaying block group to next transaction won't work.
(Extents get allocated in current transaction, and removed again in next
transaction).

So the root fix need to co-operate with extent allocator.
But current fix should be enough for this particular bug."

Because they are confusing and unrelated to the problem of block
groups being removed while they are still needed (some commit root
references it).
Regardless of the allocator picking or not empty block groups, the
problem is solely elsewhere on the decision of deleting a block group.
So those 2 paragraphs, related to optimizations you want to do
regarding allocation and space management, are totally unrelated to
bug and solution for it.

thanks

>
> Thanks,
> Qu
>
>
>
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>
>>>>> But current fix should be enough for this particular bug.
>>>>>
>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>> ---
>>>>> changelog:
>>>>> v2:
>>>>>   Commit message update, to better indicate how pinned byte is used in
>>>>>   btrfs and why it's related to quota.
>>>>> v3:
>>>>>   Commit message update, further explaining the bug with an example.
>>>>>   And added the side effect of the fix, and possible further fix.
>>>>> ---
>>>>>  fs/btrfs/extent-tree.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>> index f190023386a9..7d14c4ca8232 100644
>>>>> --- a/fs/btrfs/extent-tree.c
>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>>>                 /* Don't want to race with allocators so take the groups_sem */
>>>>>                 down_write(&space_info->groups_sem);
>>>>>                 spin_lock(&block_group->lock);
>>>>> -               if (block_group->reserved ||
>>>>> +               if (block_group->reserved || block_group->pinned ||
>>>>>                     btrfs_block_group_used(&block_group->item) ||
>>>>>                     block_group->ro ||
>>>>>                     list_is_singular(&block_group->list)) {
>>>>> --
>>>>> 2.17.1
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3] btrfs: Don't remove block group still has pinned down bytes
  2018-06-21 16:34         ` Filipe Manana
@ 2018-06-22  0:54           ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2018-06-22  0:54 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 11516 bytes --]



On 2018年06月22日 00:34, Filipe Manana wrote:
> On Wed, Jun 20, 2018 at 12:03 PM, Qu Wenruo <wqu@suse.de> wrote:
>>
>>
>> On 2018年06月20日 17:33, Filipe Manana wrote:
>>> On Wed, Jun 20, 2018 at 10:22 AM, Qu Wenruo <wqu@suse.de> wrote:
>>>>
>>>>
>>>> On 2018年06月20日 17:13, Filipe Manana wrote:
>>>>> On Fri, Jun 15, 2018 at 2:35 AM, Qu Wenruo <wqu@suse.com> wrote:
>>>>>> [BUG]
>>>>>> Under certain KVM load and LTP tests, we are possible to hit the
>>>>>> following calltrace if quota is enabled:
>>>>>> ------
>>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
>>>>>> ------------[ cut here ]------------
>>>>>> WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
>>>>>> CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>>>>> Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
>>>>>> task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
>>>>>> RIP: 0010:blk_status_to_errno+0x1a/0x30
>>>>>> Call Trace:
>>>>>>  submit_extent_page+0x191/0x270 [btrfs]
>>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>>  __do_readpage+0x2d2/0x810 [btrfs]
>>>>>>  ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
>>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>>  __extent_read_full_page+0xe7/0x100 [btrfs]
>>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>>  read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
>>>>>>  ? run_one_async_done+0xc0/0xc0 [btrfs]
>>>>>>  btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
>>>>>>  read_tree_block+0x31/0x60 [btrfs]
>>>>>>  read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
>>>>>>  btrfs_search_slot+0x46b/0xa00 [btrfs]
>>>>>>  ? kmem_cache_alloc+0x1a8/0x510
>>>>>>  ? btrfs_get_token_32+0x5b/0x120 [btrfs]
>>>>>>  find_parent_nodes+0x11d/0xeb0 [btrfs]
>>>>>>  ? leaf_space_used+0xb8/0xd0 [btrfs]
>>>>>>  ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
>>>>>>  ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>>  btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
>>>>>>  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>>>>>  btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
>>>>>>  btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
>>>>>>  btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
>>>>>>  insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
>>>>>>  btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
>>>>>>  ? pick_next_task_fair+0x2cd/0x530
>>>>>>  ? __switch_to+0x92/0x4b0
>>>>>>  btrfs_worker_helper+0x81/0x300 [btrfs]
>>>>>>  process_one_work+0x1da/0x3f0
>>>>>>  worker_thread+0x2b/0x3f0
>>>>>>  ? process_one_work+0x3f0/0x3f0
>>>>>>  kthread+0x11a/0x130
>>>>>>  ? kthread_create_on_node+0x40/0x40
>>>>>>  ret_from_fork+0x35/0x40
>>>>>> Code: 00 00 5b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 40 80 ff 0c 40 0f b6 c7 77 0b 48 c1 e0 04 8b 80 00 bf c8 bd c3 <0f> 0b b8 fb ff ff ff c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
>>>>>> ---[ end trace f079fb809e7a862b ]---
>>>>>> BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
>>>>>> BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
>>>>>> BTRFS info (device vda2): forced readonly
>>>>>> BTRFS error (device vda2): pending csums is 2887680
>>>>>> ------
>>>>>>
>>>>>> [CAUSE]
>>>>>> It's caused by race with block group auto removal like the following
>>>>>> case:
>>>>>> - There is a meta block group X, which has only one tree block
>>>>>>   The tree block belongs to fs tree 257.
>>>>>> - In current transaction, some operation modified fs tree 257
>>>>>>   The tree block get CoWed, so the block group X is empty, and marked as
>>>>>>   unused, queued to be deleted.
>>>>>> - Some workload (like fsync) wakes up cleaner_kthread()
>>>>>>   Which will call btrfs_deleted_unused_bgs() to remove unused block
>>>>>>   groups.
>>>>>>   So block group X along its chunk map get removed.
>>>>>> - Some delalloc work finished for fs tree 257
>>>>>>   Quota needs to get the original reference of the extent, which will
>>>>>>   reads tree blocks of commit root of 257.
>>>>>>   Then since the chunk map get removed, above warning get triggered.
>>>>>>
>>>>>> [FIX]
>>>>>> Just teach btrfs_delete_unused_bgs() to skip block group who still has
>>>>>> pinned bytes.
>>>>>>
>>>>>> However there is a minor side effect, since currently we only queue
>>>>>> empty blocks at update_block_group(), and such empty block group with
>>>>>> pinned bytes won't go through update_block_group() again, such block
>>>>>> group won't be removed, until it get new extent allocated and removed.
>>>>>
>>>>> So that can be fixed in a separate patch, to add it back to the list
>>>>> of block groups to be deleted once everything is unpinned and passes
>>>>> all other necessary criteria.
>>>>
>>>> That's the plan.
>>>> Although still something more need to be considered.
>>>>
>>>>>
>>>>>>
>>>>>> But please note that, there are more problems related to extent
>>>>>> allocator with block group auto removal.
>>>>>
>>>>> The above isn't a problem of the allocator itself but rather in the
>>>>> way we manage COW, commit roots and unpinning.
>>>>>
>>>>>>
>>>>>> Even a block group is marked unused, extent allocator can still allocate
>>>>>> new extents from unused block group.
>>>>>
>>>>> Why is that a problem?
>>>>> It's ok (with some good benefits), as long as the cleaner thread (or
>>>>> any thing that attempts to delete block groups in the unused list),
>>>>> doesn't delete it.
>>>>
>>>> It in fact could cause problem under certain case, e.g. btrfs needs to
>>>> allocate new data chunks, while all existing metadata are pretty sparse
>>>> with a lot of holes.
>>>> In that case, an empty block group which can be deleted will help.
>>>
>>> That doesn't answer my question.
>>> Let me rephrase it - If we need to allocate an extent and we allocate
>>> one from an empty block group, why is that a problem?
>>>
>>>>
>>>> Although balance should be the ultimate fix, if btrfs can be more
>>>> aggressive to avoid using empty block groups, it would definitely help.
>>>>
>>>>>
>>>>>> Thus delaying block group to next transaction won't work.
>>>>>> (Extents get allocated in current transaction, and removed again in next
>>>>>> transaction).
>>>>>>
>>>>>> So the root fix need to co-operate with extent allocator.
>>>>>
>>>>> What do you mean by co-operation with the extent allocator? I don't
>>>>> think the problem is there.
>>>>
>>>> The bug itself is not related to extent allocator.
>>>>
>>>> It's my later planned fix related to extent allocator.
>>>> It's about how aggressive we should try to claim empty block group.
>>>> Current behavior is pretty passive, mostly by luck to delete unused
>>>> block group.
>>>> My purposed fix is to claim empty block groups more aggressive.
>>>> If we find an empty block group (even with pinned bytes), we'll try our
>>>> best not to allocate from it, so in next transaction (with its pinned
>>>> bytes removed) we will definitely claim the space.
>>>
>>> Still doesn't answer my question. Why avoid allocation from an empty
>>> block group?
>>
>> I have already answered your question: to ensure (or try to) this empty
>> block group will be deleted.
>>
>>> By skipping it, you may end forcing allocation of a new block group
>>> which mail fail due to lack of unused space.
>>> What is your goal?
>>
>> To make block group removal more aggressive.
>> For forced block group allocation, we can detect such case and fallback
>> to the empty block group.
>>
>>>
>>> It would make sense to me to make it prefer to use non-empty block
>>> groups first and then only if none is available, to use an empty one.
>>
>> Exactly.
>>
>> This in fact introduced priority for block groups.
>> Empty block groups have the lowest priority.
>> And we can enhance it furthermore, by prefer near-full block groups
>> more, so we can reduce the need to do routine balance.
>>
>>> But this is different from what you are saying, since you seem
>>> explicit about never using an empty one.
>>
>> I mean, if we have other block group with enough space, we could ban
>> allocating from empty block group.
>> If we're forced to allocate new block group, then fallback to reuse
>> empty one, and remove it from unused_bgs_list.
> 
> So I would just remove the following paragraphs from the change log:
> 
> "But please note that, there are more problems related to extent
> allocator with block group auto removal.
> 
> Even a block group is marked unused, extent allocator can still allocate
> new extents from unused block group.
> Thus delaying block group to next transaction won't work.
> (Extents get allocated in current transaction, and removed again in next
> transaction).
> 
> So the root fix need to co-operate with extent allocator.
> But current fix should be enough for this particular bug."
> 
> Because they are confusing and unrelated to the problem of block
> groups being removed while they are still needed (some commit root
> references it).
> Regardless of the allocator picking or not empty block groups, the
> problem is solely elsewhere on the decision of deleting a block group.
> So those 2 paragraphs, related to optimizations you want to do
> regarding allocation and space management, are totally unrelated to
> bug and solution for it.

Makes sense, I'll remove unrelated comment.

Thanks,
Qu

> 
> thanks
> 
>>
>> Thanks,
>> Qu
>>
>>
>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>
>>>>>> But current fix should be enough for this particular bug.
>>>>>>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> changelog:
>>>>>> v2:
>>>>>>   Commit message update, to better indicate how pinned byte is used in
>>>>>>   btrfs and why it's related to quota.
>>>>>> v3:
>>>>>>   Commit message update, further explaining the bug with an example.
>>>>>>   And added the side effect of the fix, and possible further fix.
>>>>>> ---
>>>>>>  fs/btrfs/extent-tree.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>>> index f190023386a9..7d14c4ca8232 100644
>>>>>> --- a/fs/btrfs/extent-tree.c
>>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>>> @@ -10675,7 +10675,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>>>>>>                 /* Don't want to race with allocators so take the groups_sem */
>>>>>>                 down_write(&space_info->groups_sem);
>>>>>>                 spin_lock(&block_group->lock);
>>>>>> -               if (block_group->reserved ||
>>>>>> +               if (block_group->reserved || block_group->pinned ||
>>>>>>                     btrfs_block_group_used(&block_group->item) ||
>>>>>>                     block_group->ro ||
>>>>>>                     list_is_singular(&block_group->list)) {
>>>>>> --
>>>>>> 2.17.1
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-06-22  0:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-15  1:35 [PATCH v3] btrfs: Don't remove block group still has pinned down bytes Qu Wenruo
2018-06-20  5:13 ` Qu Wenruo
2018-06-20  5:25 ` Nikolay Borisov
2018-06-20  5:34   ` Qu Wenruo
2018-06-20  6:38     ` Nikolay Borisov
2018-06-20  9:13 ` Filipe Manana
2018-06-20  9:22   ` Qu Wenruo
2018-06-20  9:33     ` Filipe Manana
2018-06-20 11:03       ` Qu Wenruo
2018-06-21 16:34         ` Filipe Manana
2018-06-22  0:54           ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.