linux-kernel.vger.kernel.org archive mirror
* [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
@ 2021-07-15 10:34 Desmond Cheong Zhi Xi
  2021-07-15 11:23 ` Anand Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Desmond Cheong Zhi Xi @ 2021-07-15 10:34 UTC (permalink / raw)
  To: clm, josef, dsterba
  Cc: Desmond Cheong Zhi Xi, anand.jain, linux-btrfs, linux-kernel,
	skhan, gregkh, linux-kernel-mentees, syzbot+a70e2ad0879f160b9217

Syzbot reports a warning in close_fs_devices that happens because
fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
on each device.

This happens when a writeable device is removed in
__btrfs_free_extra_devids, but the rw device count is not decremented
accordingly. So when close_fs_devices is called, the removed device is
still counted and we get an off by 1 error.

Here is one call trace that was observed:
  btrfs_mount_root():
    btrfs_scan_one_device():
      device_list_add();   <---------------- device added
    btrfs_open_devices():
      open_fs_devices():
        btrfs_open_one_device();   <-------- rw device count ++
    btrfs_fill_super():
      open_ctree():
        btrfs_free_extra_devids():
	  __btrfs_free_extra_devids();  <--- device removed
	  fail_tree_roots:
	    btrfs_close_devices():
	      close_fs_devices();   <------- rw device count off by 1
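The mismatch in the trace above can be sketched with a minimal model (hypothetical names, not the actual btrfs structures): the open path increments the counter for a writeable device, so the cleanup path must decrement it, or close_fs_devices() is left with a nonzero count.

```c
#include <assert.h>

/* Minimal model of the rw_devices accounting (hypothetical names). */
struct fs_devices_model {
	int num_devices;
	int rw_devices;
};

/* Like btrfs_open_one_device(): a writeable device bumps rw_devices. */
static void model_open_one_device(struct fs_devices_model *fs, int writeable)
{
	fs->num_devices++;
	if (writeable)
		fs->rw_devices++;
}

/* __btrfs_free_extra_devids() before the fix: the device is removed
 * but rw_devices is left untouched. */
static void model_free_extra_buggy(struct fs_devices_model *fs, int writeable)
{
	(void)writeable;	/* decrement forgotten */
	fs->num_devices--;
}

/* After the fix: the decrement mirrors the increment at open time. */
static void model_free_extra_fixed(struct fs_devices_model *fs, int writeable)
{
	if (writeable)
		fs->rw_devices--;
	fs->num_devices--;
}

/* close_fs_devices(): returns the leftover count; the kernel WARNs
 * when it is not zero after closing every device. */
static int model_close_fs_devices(const struct fs_devices_model *fs)
{
	return fs->rw_devices;
}
```

With the buggy cleanup the model is left with rw_devices == 1 at close time (the off-by-one that trips the warning); with the fixed cleanup it returns to 0.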

Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
---
 fs/btrfs/volumes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 807502cd6510..916c25371658 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
 		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
 			list_del_init(&device->dev_alloc_list);
 			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+			fs_devices->rw_devices--;
 		}
 		list_del_init(&device->dev_list);
 		fs_devices->num_devices--;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-15 10:34 [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids Desmond Cheong Zhi Xi
@ 2021-07-15 11:23 ` Anand Jain
  2021-07-15 11:55 ` Nikolay Borisov
  2021-07-21 17:59 ` David Sterba
  2 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2021-07-15 11:23 UTC (permalink / raw)
  To: Desmond Cheong Zhi Xi, clm, josef, dsterba
  Cc: linux-btrfs, linux-kernel, skhan, gregkh, linux-kernel-mentees,
	syzbot+a70e2ad0879f160b9217



On 15/07/2021 18:34, Desmond Cheong Zhi Xi wrote:
> Syzbot reports a warning in close_fs_devices that happens because
> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
> on each device.
> 
> This happens when a writeable device is removed in
> __btrfs_free_extra_devids, but the rw device count is not decremented
> accordingly. So when close_fs_devices is called, the removed device is
> still counted and we get an off by 1 error.
> 
> Here is one call trace that was observed:
>    btrfs_mount_root():
>      btrfs_scan_one_device():
>        device_list_add();   <---------------- device added
>      btrfs_open_devices():
>        open_fs_devices():
>          btrfs_open_one_device();   <-------- rw device count ++
>      btrfs_fill_super():
>        open_ctree():
>          btrfs_free_extra_devids():
> 	  __btrfs_free_extra_devids();  <--- device removed
> 	  fail_tree_roots:
> 	    btrfs_close_devices():
> 	      close_fs_devices();   <------- rw device count off by 1
> 
> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
> ---
>   fs/btrfs/volumes.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 807502cd6510..916c25371658 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>   			list_del_init(&device->dev_alloc_list);
>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> +			fs_devices->rw_devices--;
>   		}
>   		list_del_init(&device->dev_list);
>   		fs_devices->num_devices--;
> 

Thanks for the fix.

Looks good.

Reviewed-by: Anand Jain <anand.jain@oracle.com>


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-15 10:34 [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids Desmond Cheong Zhi Xi
  2021-07-15 11:23 ` Anand Jain
@ 2021-07-15 11:55 ` Nikolay Borisov
  2021-07-15 13:11   ` Desmond Cheong Zhi Xi
  2021-07-21 17:59 ` David Sterba
  2 siblings, 1 reply; 10+ messages in thread
From: Nikolay Borisov @ 2021-07-15 11:55 UTC (permalink / raw)
  To: Desmond Cheong Zhi Xi, clm, josef, dsterba
  Cc: anand.jain, linux-btrfs, linux-kernel, skhan, gregkh,
	linux-kernel-mentees, syzbot+a70e2ad0879f160b9217



On 15.07.21 г. 13:34, Desmond Cheong Zhi Xi wrote:
> Syzbot reports a warning in close_fs_devices that happens because
> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
> on each device.
> 
> This happens when a writeable device is removed in
> __btrfs_free_extra_devids, but the rw device count is not decremented
> accordingly. So when close_fs_devices is called, the removed device is
> still counted and we get an off by 1 error.
> 
> Here is one call trace that was observed:
>   btrfs_mount_root():
>     btrfs_scan_one_device():
>       device_list_add();   <---------------- device added
>     btrfs_open_devices():
>       open_fs_devices():
>         btrfs_open_one_device();   <-------- rw device count ++
>     btrfs_fill_super():
>       open_ctree():
>         btrfs_free_extra_devids():
> 	  __btrfs_free_extra_devids();  <--- device removed
> 	  fail_tree_roots:
> 	    btrfs_close_devices():
> 	      close_fs_devices();   <------- rw device count off by 1
> 
> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>

Is there a reliable reproducer from syzbot? Can this be turned into an
xfstest?

> ---
>  fs/btrfs/volumes.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 807502cd6510..916c25371658 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>  		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>  			list_del_init(&device->dev_alloc_list);
>  			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> +			fs_devices->rw_devices--;
>  		}
>  		list_del_init(&device->dev_list);
>  		fs_devices->num_devices--;
> 


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-15 11:55 ` Nikolay Borisov
@ 2021-07-15 13:11   ` Desmond Cheong Zhi Xi
  2021-07-21 13:34     ` David Sterba
  0 siblings, 1 reply; 10+ messages in thread
From: Desmond Cheong Zhi Xi @ 2021-07-15 13:11 UTC (permalink / raw)
  To: Nikolay Borisov, clm, josef, dsterba
  Cc: anand.jain, linux-btrfs, linux-kernel, skhan, gregkh,
	linux-kernel-mentees, syzbot+a70e2ad0879f160b9217

On 15/7/21 7:55 pm, Nikolay Borisov wrote:
> 
> 
> On 15.07.21 г. 13:34, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a warning in close_fs_devices that happens because
>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>> on each device.
>>
>> This happens when a writeable device is removed in
>> __btrfs_free_extra_devids, but the rw device count is not decremented
>> accordingly. So when close_fs_devices is called, the removed device is
>> still counted and we get an off by 1 error.
>>
>> Here is one call trace that was observed:
>>    btrfs_mount_root():
>>      btrfs_scan_one_device():
>>        device_list_add();   <---------------- device added
>>      btrfs_open_devices():
>>        open_fs_devices():
>>          btrfs_open_one_device();   <-------- rw device count ++
>>      btrfs_fill_super():
>>        open_ctree():
>>          btrfs_free_extra_devids():
>> 	  __btrfs_free_extra_devids();  <--- device removed
>> 	  fail_tree_roots:
>> 	    btrfs_close_devices():
>> 	      close_fs_devices();   <------- rw device count off by 1
>>
>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
>> Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
>> Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
>> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
> 
> Is there a reliable reproducer from syzbot? Can this be turned into an
> xfstest?
> 

Syzbot has some reliable reproducers here:
https://syzkaller.appspot.com/bug?id=113d9a01cbe0af3e291633ba7a7a3e983b86c3c0

It seems to construct two images in memory, then mount them. I'm not 
sure whether that's amenable to being converted into an xfstest?

>> ---
>>   fs/btrfs/volumes.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 807502cd6510..916c25371658 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>   			list_del_init(&device->dev_alloc_list);
>>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>> +			fs_devices->rw_devices--;
>>   		}
>>   		list_del_init(&device->dev_list);
>>   		fs_devices->num_devices--;
>>



* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-15 13:11   ` Desmond Cheong Zhi Xi
@ 2021-07-21 13:34     ` David Sterba
  0 siblings, 0 replies; 10+ messages in thread
From: David Sterba @ 2021-07-21 13:34 UTC (permalink / raw)
  To: Desmond Cheong Zhi Xi
  Cc: Nikolay Borisov, clm, josef, dsterba, anand.jain, linux-btrfs,
	linux-kernel, skhan, gregkh, linux-kernel-mentees,
	syzbot+a70e2ad0879f160b9217

On Thu, Jul 15, 2021 at 09:11:43PM +0800, Desmond Cheong Zhi Xi wrote:
> On 15/7/21 7:55 pm, Nikolay Borisov wrote:
> > 
> > 
> > On 15.07.21 г. 13:34, Desmond Cheong Zhi Xi wrote:
> >> Syzbot reports a warning in close_fs_devices that happens because
> >> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
> >> on each device.
> >>
> >> This happens when a writeable device is removed in
> >> __btrfs_free_extra_devids, but the rw device count is not decremented
> >> accordingly. So when close_fs_devices is called, the removed device is
> >> still counted and we get an off by 1 error.
> >>
> >> Here is one call trace that was observed:
> >>    btrfs_mount_root():
> >>      btrfs_scan_one_device():
> >>        device_list_add();   <---------------- device added
> >>      btrfs_open_devices():
> >>        open_fs_devices():
> >>          btrfs_open_one_device();   <-------- rw device count ++
> >>      btrfs_fill_super():
> >>        open_ctree():
> >>          btrfs_free_extra_devids():
> >> 	  __btrfs_free_extra_devids();  <--- device removed
> >> 	  fail_tree_roots:
> >> 	    btrfs_close_devices():
> >> 	      close_fs_devices();   <------- rw device count off by 1
> >>
> >> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> >> Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> >> Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
> >> Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
> > 
> > Is there a reliable reproducer from syzbot? Can this be turned into an
> > xfstest?
> > 
> 
> Syzbot has some reliable reproducers here:
> https://syzkaller.appspot.com/bug?id=113d9a01cbe0af3e291633ba7a7a3e983b86c3c0
> 
> Seems like it constructs two images in-memory then mounts them. I'm not 
> sure if that's amenable to be converted into an xfstest?

It would need to be an image from the time the warning is reproduced;
I'm not sure how much timing matters as well. But IIRC adding raw test
images to fstests was not welcome, so it would have to be a reproducer,
and given that the syzkaller source is not human-readable, I'm not sure
it would be welcome either.

Maybe there's some middle ground where the image is created by mkfs and
filled with data, and then the mount loop is started from a shell. But
that means untangling the C reproducer.


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-15 10:34 [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids Desmond Cheong Zhi Xi
  2021-07-15 11:23 ` Anand Jain
  2021-07-15 11:55 ` Nikolay Borisov
@ 2021-07-21 17:59 ` David Sterba
  2021-07-25  6:19   ` Desmond Cheong Zhi Xi
  2021-07-25 13:49   ` Anand Jain
  2 siblings, 2 replies; 10+ messages in thread
From: David Sterba @ 2021-07-21 17:59 UTC (permalink / raw)
  To: Desmond Cheong Zhi Xi
  Cc: clm, josef, dsterba, anand.jain, linux-btrfs, linux-kernel,
	skhan, gregkh, linux-kernel-mentees, syzbot+a70e2ad0879f160b9217

On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
> Syzbot reports a warning in close_fs_devices that happens because
> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
> on each device.
> 
> This happens when a writeable device is removed in
> __btrfs_free_extra_devids, but the rw device count is not decremented
> accordingly. So when close_fs_devices is called, the removed device is
> still counted and we get an off by 1 error.
> 
> Here is one call trace that was observed:
>   btrfs_mount_root():
>     btrfs_scan_one_device():
>       device_list_add();   <---------------- device added
>     btrfs_open_devices():
>       open_fs_devices():
>         btrfs_open_one_device();   <-------- rw device count ++
>     btrfs_fill_super():
>       open_ctree():
>         btrfs_free_extra_devids():
> 	  __btrfs_free_extra_devids();  <--- device removed
> 	  fail_tree_roots:
> 	    btrfs_close_devices():
> 	      close_fs_devices();   <------- rw device count off by 1
> 
> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")

What this patch did in the last hunk was the rw_devices decrement, but
conditional:

@@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
                if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
                        list_del_init(&device->dev_alloc_list);
                        clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
-                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
-                                     &device->dev_state))
-                               fs_devices->rw_devices--;
                }
                list_del_init(&device->dev_list);
                fs_devices->num_devices--;
---


> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>  		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>  			list_del_init(&device->dev_alloc_list);
>  			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> +			fs_devices->rw_devices--;
>  		}
>  		list_del_init(&device->dev_list);
>  		fs_devices->num_devices--;

So should it be reinstated in its original form? The rest of
cf89af146b7e handles an unexpected device replace item during mount.

Adding the decrement is correct, but right now I'm not sure about the
corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
The state machine of the device bits and counters is not trivial, so
fixing it one way or the other could lead to further syzbot reports if
we don't understand the issue.


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-21 17:59 ` David Sterba
@ 2021-07-25  6:19   ` Desmond Cheong Zhi Xi
  2021-07-26 17:52     ` David Sterba
  2021-07-25 13:49   ` Anand Jain
  1 sibling, 1 reply; 10+ messages in thread
From: Desmond Cheong Zhi Xi @ 2021-07-25  6:19 UTC (permalink / raw)
  To: dsterba, clm, josef, dsterba, anand.jain, linux-btrfs,
	linux-kernel, skhan, gregkh, linux-kernel-mentees,
	syzbot+a70e2ad0879f160b9217

On 22/7/21 1:59 am, David Sterba wrote:
> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a warning in close_fs_devices that happens because
>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>> on each device.
>>
>> This happens when a writeable device is removed in
>> __btrfs_free_extra_devids, but the rw device count is not decremented
>> accordingly. So when close_fs_devices is called, the removed device is
>> still counted and we get an off by 1 error.
>>
>> Here is one call trace that was observed:
>>    btrfs_mount_root():
>>      btrfs_scan_one_device():
>>        device_list_add();   <---------------- device added
>>      btrfs_open_devices():
>>        open_fs_devices():
>>          btrfs_open_one_device();   <-------- rw device count ++
>>      btrfs_fill_super():
>>        open_ctree():
>>          btrfs_free_extra_devids():
>> 	  __btrfs_free_extra_devids();  <--- device removed
>> 	  fail_tree_roots:
>> 	    btrfs_close_devices():
>> 	      close_fs_devices();   <------- rw device count off by 1
>>
>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> 
> What this patch did in the last hunk was the rw_devices decrement, but
> conditional:
> 
> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>                  if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>                          list_del_init(&device->dev_alloc_list);
>                          clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> -                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> -                                     &device->dev_state))
> -                               fs_devices->rw_devices--;
>                  }
>                  list_del_init(&device->dev_list);
>                  fs_devices->num_devices--;
> ---
> 
> 
>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>   			list_del_init(&device->dev_alloc_list);
>>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>> +			fs_devices->rw_devices--;
>>   		}
>>   		list_del_init(&device->dev_list);
>>   		fs_devices->num_devices--;
> 
> So should it be reinstated in the original form? The rest of
> cf89af146b7e handles unexpected device replace item during mount.
> 
> Adding the decrement is correct, but right now I'm not sure about the
> > corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
> The state machine of the device bits and counters is not trivial so
> fixing it one way or the other could lead to further syzbot reports if
> we don't understand the issue.
> 

Hi David,

Thanks for raising this issue. I took a closer look and I think we don't 
have to reinstate the original form because it's a historical artifact.

The short version of the story is that going by the intention of 
__btrfs_free_extra_devids, we skip removing the replace target device. 
Hence, by the time we've reached the decrement in question, the device 
is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit 
should not be set.

But we should also try to understand the original intention of the code. 
The check in question was first introduced in commit 8dabb7420f01 
("Btrfs: change core code of btrfs to support the device replace 
operations"):
> @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
>                 if (device->writeable) {
>                         list_del_init(&device->dev_alloc_list);
>                         device->writeable = 0;
> -                       fs_devices->rw_devices--;
> +                       if (!device->is_tgtdev_for_dev_replace)
> +                               fs_devices->rw_devices--;
>                 }
>                 list_del_init(&device->dev_list);
>                 fs_devices->num_devices--;

If we take a trip back in time to this commit we see that 
btrfs_dev_replace_finishing added the target device to the alloc list 
without incrementing the rw_devices count. So this check was likely 
originally meant to prevent under-counting of rw_devices.

However, the situation has changed, following various fixes to 
rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss 
match after seed replace") added an increment to rw_devices when 
replacing a seed device with a writable one in btrfs_dev_replace_finishing:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1b2fda..fb0a7fa2f70c 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> +       if (src_device->fs_devices->seeding)
> +               fs_info->fs_devices->rw_devices++;
>  
>         /* replace the sysfs entry */
>         btrfs_kobj_rm_device(fs_info, src_device);

This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic 
of source device removing more clear") that simply decremented 
rw_devices in btrfs_rm_dev_replace_srcdev if the replaced device was 
writable. This meant that the rw_devices count could be incremented in 
btrfs_dev_replace_finishing without any checks:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index e9cbbdb72978..6f662b34ba0e 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> -       if (src_device->fs_devices->seeding)
> -               fs_info->fs_devices->rw_devices++;
> +       fs_info->fs_devices->rw_devices++;
>  
>         /* replace the sysfs entry */
>         btrfs_kobj_rm_device(fs_info, src_device);

Thus, given the current state of the code base, the original check is 
now incorrect, because we want to decrement rw_devices as long as the 
device is being removed from the alloc list.

To further convince ourselves of this, we can take a closer look at the 
relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the 
BTRFS_DEV_STATE_REPLACE_TGT bit for devices.

BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
- btrfs_init_dev_replace_tgtdev
- btrfs_init_dev_replace

In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is 
set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.

In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set 
for the target device found with devid BTRFS_DEV_REPLACE_DEVID.

From both cases, we see that the BTRFS_DEV_STATE_REPLACE_TGT bit is set 
only for the device with devid BTRFS_DEV_REPLACE_DEVID.

It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID, 
then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.

With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't 
have replace item with target device"), we skip removing the device in 
__btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
> -               if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
> -                       /*
> -                        * In the first step, keep the device which has
> -                        * the correct fsid and the devid that is used
> -                        * for the dev_replace procedure.
> -                        * In the second step, the dev_replace state is
> -                        * read from the device tree and it is known
> -                        * whether the procedure is really active or
> -                        * not, which means whether this device is
> -                        * used or whether it should be removed.
> -                        */
> -                       if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> -                                                 &device->dev_state)) {
> -                               continue;
> -                       }
> -               }
> +               /*
> +                * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
> +                * in btrfs_init_dev_replace() so just continue.
> +                */
> +               if (device->devid == BTRFS_DEV_REPLACE_DEVID)
> +                       continue;

Given the discussion above, after we fail the check for device->devid == 
BTRFS_DEV_REPLACE_DEVID, all devices from that point are not the replace 
target device, and do not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.

So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before 
decrementing rw_devices is not just incorrect at this point, it's also 
redundant.
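To make the argument concrete, here is a sketch (hypothetical names, modeled on the loop described above) of the control flow after cf89af146b7e: once the devid check `continue`s past BTRFS_DEV_REPLACE_DEVID, no device that reaches the decrement can carry the TGT bit, so the old conditional guard could never change the outcome.

```c
#include <assert.h>
#include <stdbool.h>

/* Models BTRFS_DEV_REPLACE_DEVID, which is ((u64)-1) in the kernel. */
#define MODEL_REPLACE_DEVID ((unsigned long long)-1)

struct device_model {
	unsigned long long devid;
	bool writeable;		/* BTRFS_DEV_STATE_WRITEABLE */
	bool replace_tgt;	/* BTRFS_DEV_STATE_REPLACE_TGT */
};

/* Sketch of one iteration of __btrfs_free_extra_devids() after
 * cf89af146b7e; returns the change applied to rw_devices. */
static int model_free_extra_one(const struct device_model *dev)
{
	/* The replace target is skipped up front... */
	if (dev->devid == MODEL_REPLACE_DEVID)
		return 0;

	/* ...so any device reaching this point cannot have the TGT bit
	 * (it is only ever set on BTRFS_DEV_REPLACE_DEVID), making the
	 * old !replace_tgt guard around the decrement redundant. */
	if (dev->writeable) {
		assert(!dev->replace_tgt);
		return -1;
	}
	return 0;
}
```

The invariant encoded by the assert is exactly the one argued for above: past the devid check, decrementing unconditionally for writeable devices is safe.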

Of course, I would hate to introduce a hard-to-find bug with a bad 
analysis, so any thoughts on this would be appreciated.

Best wishes,
Desmond


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-21 17:59 ` David Sterba
  2021-07-25  6:19   ` Desmond Cheong Zhi Xi
@ 2021-07-25 13:49   ` Anand Jain
  1 sibling, 0 replies; 10+ messages in thread
From: Anand Jain @ 2021-07-25 13:49 UTC (permalink / raw)
  To: dsterba, dsterba
  Cc: Desmond Cheong Zhi Xi, clm, josef, linux-btrfs, linux-kernel,
	skhan, gregkh, linux-kernel-mentees, syzbot+a70e2ad0879f160b9217



On 22/07/2021 01:59, David Sterba wrote:
> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a warning in close_fs_devices that happens because
>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>> on each device.
>>
>> This happens when a writeable device is removed in
>> __btrfs_free_extra_devids, but the rw device count is not decremented
>> accordingly. So when close_fs_devices is called, the removed device is
>> still counted and we get an off by 1 error.
>>
>> Here is one call trace that was observed:
>>    btrfs_mount_root():
>>      btrfs_scan_one_device():
>>        device_list_add();   <---------------- device added
>>      btrfs_open_devices():
>>        open_fs_devices():
>>          btrfs_open_one_device();   <-------- rw device count ++
>>      btrfs_fill_super():
>>        open_ctree():
>>          btrfs_free_extra_devids():
>> 	  __btrfs_free_extra_devids();  <--- device removed
>> 	  fail_tree_roots:
>> 	    btrfs_close_devices():
>> 	      close_fs_devices();   <------- rw device count off by 1
>>
>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> 
> What this patch did in the last hunk was the rw_devices decrement, but
> conditional:
> 
> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>                  if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>                          list_del_init(&device->dev_alloc_list);
>                          clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);



> -                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> -                                     &device->dev_state))

This condition was wrong.
The first version of this patch, available here [1], has the details of
why, as shown below:

[1]
https://patchwork.kernel.org/project/linux-btrfs/patch/b3a0a629df98bd044a1fd5c4964f381ff6e7aa05.1600777827.git.anand.jain@oracle.com/#23640775

----
rw_devices is incremented in btrfs_open_one_device() for all write-able
devices except for devid == BTRFS_DEV_REPLACE_DEVID.
But while we clean up the extra devices in __btrfs_free_extra_devids()
we relied on the BTRFS_DEV_STATE_REPLACE_TGT flag, which isn't set because
there is no replace item. So rw_devices went below zero.
----
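In other words, a hypothetical sketch of the pre-fix underflow described here: the replace target devid is never counted at open time, yet the old conditional still allowed the decrement whenever the TGT flag happened to be clear (as it is when no replace item exists).

```c
#include <assert.h>

/* Models BTRFS_DEV_REPLACE_DEVID, ((u64)-1) in the kernel. */
#define MODEL_REPLACE_DEVID ((unsigned long long)-1)

/* btrfs_open_one_device() skips the counter for the replace target. */
static int model_open_increment(unsigned long long devid, int writeable)
{
	return (writeable && devid != MODEL_REPLACE_DEVID) ? 1 : 0;
}

/* Old __btrfs_free_extra_devids(): decremented unless the TGT flag
 * was set -- but with no replace item the flag is clear. */
static int model_old_free_decrement(int writeable, int replace_tgt_flag)
{
	return (writeable && !replace_tgt_flag) ? -1 : 0;
}
```

Summing the two deltas for a writeable target device with no replace item gives the below-zero count Anand describes.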


> -                               fs_devices->rw_devices--;
>                  }
>                  list_del_init(&device->dev_list);
>                  fs_devices->num_devices--;
> ---
> 
> 
>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>   			list_del_init(&device->dev_alloc_list);
>>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>> +			fs_devices->rw_devices--;
>>   		}
>>   		list_del_init(&device->dev_list);
>>   		fs_devices->num_devices--;
> 
> So should it be reinstated in the original form?

No. The reason is the same as above.
Only the rw_devices decrement line has to be restored.

> The rest of
> cf89af146b7e handles unexpected device replace item during mount.

> Adding the decrement is correct, but right now I'm not sure about the
> corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.

BTRFS_DEV_STATE_REPLACE_TGT is set (on BTRFS_DEV_REPLACE_DEVID) in two 
cases: when replace is called through the ioctl, or during mount upon 
finding a replace-device item.

> The state machine of the device bits and counters is not trivial so
> fixing it one way or the other could lead to further syzbot reports if
> we don't understand the issue.

I agree. Also, it would be a good idea to convert this syzbot test into 
an xfstests case.

Thanks, Anand


* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-25  6:19   ` Desmond Cheong Zhi Xi
@ 2021-07-26 17:52     ` David Sterba
  2021-07-26 23:07       ` Desmond Cheong Zhi Xi
  0 siblings, 1 reply; 10+ messages in thread
From: David Sterba @ 2021-07-26 17:52 UTC (permalink / raw)
  To: Desmond Cheong Zhi Xi
  Cc: dsterba, clm, josef, dsterba, anand.jain, linux-btrfs,
	linux-kernel, skhan, gregkh, linux-kernel-mentees,
	syzbot+a70e2ad0879f160b9217

On Sun, Jul 25, 2021 at 02:19:52PM +0800, Desmond Cheong Zhi Xi wrote:
> On 22/7/21 1:59 am, David Sterba wrote:
> > On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
> >> Syzbot reports a warning in close_fs_devices that happens because
> >> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
> >> on each device.
> >>
> >> This happens when a writeable device is removed in
> >> __btrfs_free_extra_devids, but the rw device count is not decremented
> >> accordingly. So when close_fs_devices is called, the removed device is
> >> still counted and we get an off by 1 error.
> >>
> >> Here is one call trace that was observed:
> >>    btrfs_mount_root():
> >>      btrfs_scan_one_device():
> >>        device_list_add();   <---------------- device added
> >>      btrfs_open_devices():
> >>        open_fs_devices():
> >>          btrfs_open_one_device();   <-------- rw device count ++
> >>      btrfs_fill_super():
> >>        open_ctree():
> >>          btrfs_free_extra_devids():
> >> 	  __btrfs_free_extra_devids();  <--- device removed
> >> 	  fail_tree_roots:
> >> 	    btrfs_close_devices():
> >> 	      close_fs_devices();   <------- rw device count off by 1
> >>
> >> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> > 
> > What this patch did in the last hunk was remove the rw_devices
> > decrement, which had been conditional:
> > 
> > @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
> >                  if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
> >                          list_del_init(&device->dev_alloc_list);
> >                          clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> > -                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> > -                                     &device->dev_state))
> > -                               fs_devices->rw_devices--;
> >                  }
> >                  list_del_init(&device->dev_list);
> >                  fs_devices->num_devices--;
> > ---
> > 
> > 
> >> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
> >>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
> >>   			list_del_init(&device->dev_alloc_list);
> >>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> >> +			fs_devices->rw_devices--;
> >>   		}
> >>   		list_del_init(&device->dev_list);
> >>   		fs_devices->num_devices--;
> > 
> > So should it be reinstated in the original form? The rest of
> > cf89af146b7e handles unexpected device replace item during mount.
> > 
> > Adding the decrement is correct, but right now I'm not sure about the
> > corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
> > The state machine of the device bits and counters is not trivial so
> > fixing it one way or the other could lead to further syzbot reports if
> > we don't understand the issue.
> > 
> 
> Hi David,
> 
> Thanks for raising this issue. I took a closer look and I think we don't 
> have to reinstate the original form because it's a historical artifact.
> 
> The short version of the story is that going by the intention of 
> __btrfs_free_extra_devids, we skip removing the replace target device. 
> Hence, by the time we've reached the decrement in question, the device 
> is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit 
> should not be set.
> 
> But we should also try to understand the original intention of the code. 
> The check in question was first introduced in commit 8dabb7420f01 
> ("Btrfs: change core code of btrfs to support the device replace 
> operations"):
> > @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
> >                 if (device->writeable) {
> >                         list_del_init(&device->dev_alloc_list);
> >                         device->writeable = 0;
> > -                       fs_devices->rw_devices--;
> > +                       if (!device->is_tgtdev_for_dev_replace)
> > +                               fs_devices->rw_devices--;
> >                 }
> >                 list_del_init(&device->dev_list);
> >                 fs_devices->num_devices--;
> 
> If we take a trip back in time to this commit we see that 
> btrfs_dev_replace_finishing added the target device to the alloc list 
> without incrementing the rw_devices count. So this check was likely 
> originally meant to prevent under-counting of rw_devices.
> 
> However, the situation has changed, following various fixes to 
> rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss 
> match after seed replace") added an increment to rw_devices when 
> replacing a seed device with a writable one in btrfs_dev_replace_finishing:
> > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> > index eea26e1b2fda..fb0a7fa2f70c 100644
> > --- a/fs/btrfs/dev-replace.c
> > +++ b/fs/btrfs/dev-replace.c
> > @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> >         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
> >                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
> >         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> > +       if (src_device->fs_devices->seeding)
> > +               fs_info->fs_devices->rw_devices++;
> >  
> >         /* replace the sysfs entry */
> >         btrfs_kobj_rm_device(fs_info, src_device);
> 
> This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic 
> of source device removing more clear") that simply decremented 
> rw_devices in btrfs_rm_dev_replace_srcdev if the replaced device was 
> writable. This meant that the rw_devices count could be incremented in 
> btrfs_dev_replace_finishing without any checks:
> > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> > index e9cbbdb72978..6f662b34ba0e 100644
> > --- a/fs/btrfs/dev-replace.c
> > +++ b/fs/btrfs/dev-replace.c
> > @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> >         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
> >                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
> >         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> > -       if (src_device->fs_devices->seeding)
> > -               fs_info->fs_devices->rw_devices++;
> > +       fs_info->fs_devices->rw_devices++;
> >  
> >         /* replace the sysfs entry */
> >         btrfs_kobj_rm_device(fs_info, src_device);
> 
> Thus, given the current state of the code base, the original check is 
> now incorrect, because we want to decrement rw_devices whenever a 
> device is removed from the alloc list.
> 
> To further convince ourselves of this, we can take a closer look at the 
> relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the 
> BTRFS_DEV_STATE_REPLACE_TGT bit for devices.
> 
> BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
> - btrfs_init_dev_replace_tgtdev
> - btrfs_init_dev_replace
> 
> In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is 
> set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.
> 
> In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set 
> for the target device found with devid BTRFS_DEV_REPLACE_DEVID.
> 
> From both cases, we see that the BTRFS_DEV_STATE_REPLACE_TGT bit is set 
> only for the device with devid BTRFS_DEV_REPLACE_DEVID.
> 
> It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID, 
> then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.
> 
> With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't 
> have replace item with target device"), we skip removing the device in 
> __btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
> > -               if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
> > -                       /*
> > -                        * In the first step, keep the device which has
> > -                        * the correct fsid and the devid that is used
> > -                        * for the dev_replace procedure.
> > -                        * In the second step, the dev_replace state is
> > -                        * read from the device tree and it is known
> > -                        * whether the procedure is really active or
> > -                        * not, which means whether this device is
> > -                        * used or whether it should be removed.
> > -                        */
> > -                       if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> > -                                                 &device->dev_state)) {
> > -                               continue;
> > -                       }
> > -               }
> > +               /*
> > +                * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
> > +                * in btrfs_init_dev_replace() so just continue.
> > +                */
> > +               if (device->devid == BTRFS_DEV_REPLACE_DEVID)
> > +                       continue;
> 
> Given the discussion above, after we fail the check for device->devid == 
> BTRFS_DEV_REPLACE_DEVID, all devices from that point are not the replace 
> target device, and do not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.
> 
> So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before 
> decrementing rw_devices is not just incorrect at this point, it's also 
> redundant.

Could you please write some condensed version of the above and resend?
The original changelog says what happens and how; the analysis here
is the actual explanation and I'd like to have that recorded. Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
  2021-07-26 17:52     ` David Sterba
@ 2021-07-26 23:07       ` Desmond Cheong Zhi Xi
  0 siblings, 0 replies; 10+ messages in thread
From: Desmond Cheong Zhi Xi @ 2021-07-26 23:07 UTC (permalink / raw)
  To: dsterba, clm, josef, dsterba, anand.jain, linux-btrfs,
	linux-kernel, skhan, gregkh, linux-kernel-mentees,
	syzbot+a70e2ad0879f160b9217

On 27/7/21 1:52 am, David Sterba wrote:
> On Sun, Jul 25, 2021 at 02:19:52PM +0800, Desmond Cheong Zhi Xi wrote:
>> On 22/7/21 1:59 am, David Sterba wrote:
>>> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>>>> Syzbot reports a warning in close_fs_devices that happens because
>>>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>>>> on each device.
>>>>
>>>> This happens when a writeable device is removed in
>>>> __btrfs_free_extra_devids, but the rw device count is not decremented
>>>> accordingly. So when close_fs_devices is called, the removed device is
>>>> still counted and we get an off by 1 error.
>>>>
>>>> Here is one call trace that was observed:
>>>>     btrfs_mount_root():
>>>>       btrfs_scan_one_device():
>>>>         device_list_add();   <---------------- device added
>>>>       btrfs_open_devices():
>>>>         open_fs_devices():
>>>>           btrfs_open_one_device();   <-------- rw device count ++
>>>>       btrfs_fill_super():
>>>>         open_ctree():
>>>>           btrfs_free_extra_devids():
>>>> 	  __btrfs_free_extra_devids();  <--- device removed
>>>> 	  fail_tree_roots:
>>>> 	    btrfs_close_devices():
>>>> 	      close_fs_devices();   <------- rw device count off by 1
>>>>
>>>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
>>>
>>> What this patch did in the last hunk was remove the rw_devices
>>> decrement, which had been conditional:
>>>
>>> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>>                   if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>>                           list_del_init(&device->dev_alloc_list);
>>>                           clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>>> -                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
>>> -                                     &device->dev_state))
>>> -                               fs_devices->rw_devices--;
>>>                   }
>>>                   list_del_init(&device->dev_list);
>>>                   fs_devices->num_devices--;
>>> ---
>>>
>>>
>>>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>>>    		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>>>    			list_del_init(&device->dev_alloc_list);
>>>>    			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>>>> +			fs_devices->rw_devices--;
>>>>    		}
>>>>    		list_del_init(&device->dev_list);
>>>>    		fs_devices->num_devices--;
>>>
>>> So should it be reinstated in the original form? The rest of
>>> cf89af146b7e handles unexpected device replace item during mount.
>>>
>>> Adding the decrement is correct, but right now I'm not sure about the
>>> corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
>>> The state machine of the device bits and counters is not trivial so
>>> fixing it one way or the other could lead to further syzbot reports if
>>> we don't understand the issue.
>>>
>>
>> Hi David,
>>
>> Thanks for raising this issue. I took a closer look and I think we don't
>> have to reinstate the original form because it's a historical artifact.
>>
>> The short version of the story is that going by the intention of
>> __btrfs_free_extra_devids, we skip removing the replace target device.
>> Hence, by the time we've reached the decrement in question, the device
>> is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit
>> should not be set.
>>
>> But we should also try to understand the original intention of the code.
>> The check in question was first introduced in commit 8dabb7420f01
>> ("Btrfs: change core code of btrfs to support the device replace
>> operations"):
>>> @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
>>>                  if (device->writeable) {
>>>                          list_del_init(&device->dev_alloc_list);
>>>                          device->writeable = 0;
>>> -                       fs_devices->rw_devices--;
>>> +                       if (!device->is_tgtdev_for_dev_replace)
>>> +                               fs_devices->rw_devices--;
>>>                  }
>>>                  list_del_init(&device->dev_list);
>>>                  fs_devices->num_devices--;
>>
>> If we take a trip back in time to this commit we see that
>> btrfs_dev_replace_finishing added the target device to the alloc list
>> without incrementing the rw_devices count. So this check was likely
>> originally meant to prevent under-counting of rw_devices.
>>
>> However, the situation has changed, following various fixes to
>> rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss
>> match after seed replace") added an increment to rw_devices when
>> replacing a seed device with a writable one in btrfs_dev_replace_finishing:
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index eea26e1b2fda..fb0a7fa2f70c 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>          if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>>>                  fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>>>          list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
>>> +       if (src_device->fs_devices->seeding)
>>> +               fs_info->fs_devices->rw_devices++;
>>>   
>>>          /* replace the sysfs entry */
>>>          btrfs_kobj_rm_device(fs_info, src_device);
>>
>> This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic
>> of source device removing more clear") that simply decremented
>> rw_devices in btrfs_rm_dev_replace_srcdev if the replaced device was
>> writable. This meant that the rw_devices count could be incremented in
>> btrfs_dev_replace_finishing without any checks:
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index e9cbbdb72978..6f662b34ba0e 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>          if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>>>                  fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>>>          list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
>>> -       if (src_device->fs_devices->seeding)
>>> -               fs_info->fs_devices->rw_devices++;
>>> +       fs_info->fs_devices->rw_devices++;
>>>   
>>>          /* replace the sysfs entry */
>>>          btrfs_kobj_rm_device(fs_info, src_device);
>>
>> Thus, given the current state of the code base, the original check is
>> now incorrect, because we want to decrement rw_devices whenever a
>> device is removed from the alloc list.
>>
>> To further convince ourselves of this, we can take a closer look at the
>> relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the
>> BTRFS_DEV_STATE_REPLACE_TGT bit for devices.
>>
>> BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
>> - btrfs_init_dev_replace_tgtdev
>> - btrfs_init_dev_replace
>>
>> In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is
>> set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set
>> for the target device found with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> From both cases, we see that the BTRFS_DEV_STATE_REPLACE_TGT bit is set
>> only for the device with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID,
>> then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.
>>
>> With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't
>> have replace item with target device"), we skip removing the device in
>> __btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
>>> -               if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
>>> -                       /*
>>> -                        * In the first step, keep the device which has
>>> -                        * the correct fsid and the devid that is used
>>> -                        * for the dev_replace procedure.
>>> -                        * In the second step, the dev_replace state is
>>> -                        * read from the device tree and it is known
>>> -                        * whether the procedure is really active or
>>> -                        * not, which means whether this device is
>>> -                        * used or whether it should be removed.
>>> -                        */
>>> -                       if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
>>> -                                                 &device->dev_state)) {
>>> -                               continue;
>>> -                       }
>>> -               }
>>> +               /*
>>> +                * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
>>> +                * in btrfs_init_dev_replace() so just continue.
>>> +                */
>>> +               if (device->devid == BTRFS_DEV_REPLACE_DEVID)
>>> +                       continue;
>>
>> Given the discussion above, after we fail the check for device->devid ==
>> BTRFS_DEV_REPLACE_DEVID, all devices from that point are not the replace
>> target device, and do not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.
>>
>> So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before
>> decrementing rw_devices is not just incorrect at this point, it's also
>> redundant.
> 
> Could you please write some condensed version of the above and resend?
> The original changelog says what happens and how; the analysis here
> is the actual explanation and I'd like to have that recorded. Thanks.
> 

Sure thing, I'll prepare a v2 with an updated commit message. Thanks for 
the feedback, David.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-07-26 23:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-15 10:34 [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids Desmond Cheong Zhi Xi
2021-07-15 11:23 ` Anand Jain
2021-07-15 11:55 ` Nikolay Borisov
2021-07-15 13:11   ` Desmond Cheong Zhi Xi
2021-07-21 13:34     ` David Sterba
2021-07-21 17:59 ` David Sterba
2021-07-25  6:19   ` Desmond Cheong Zhi Xi
2021-07-26 17:52     ` David Sterba
2021-07-26 23:07       ` Desmond Cheong Zhi Xi
2021-07-25 13:49   ` Anand Jain
