* Device Delete Stuck
@ 2020-03-29 14:13 Jason Clara
  2020-03-29 16:40 ` Steven Fosdick
  2020-03-29 18:55 ` Zygo Blaxell
  0 siblings, 2 replies; 5+ messages in thread
From: Jason Clara @ 2020-03-29 14:13 UTC (permalink / raw)
  To: linux-btrfs

I posted previously about how attempting a device delete would cause my whole system to hang.  I seem to have gotten past that issue.

As it turns out, even though all the SCRUBs finished without any errors, I still had a problem with some files.  By forcing a read of every single file I was able to detect the bad files in DMESG.  I'm not sure, though, why SCRUB didn't detect this.
BTRFS warning (device sdd1): csum failed root 5 ino 14654354 off 163852288 csum 0
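
Something along these lines will force a read of every file and surface the bad ones in the kernel log (the /mnt/pool1 mount point is just a placeholder):

    # read every file; checksum failures show up in dmesg
    find /mnt/pool1 -type f -exec cat {} + > /dev/null
    dmesg | grep 'csum failed'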


But now when I attempt to delete a device from the array it seems to get stuck.  Normally it will show in the log that it has found some extents and then another message saying they were relocated.

But for the last few days it has just been repeating the same found value and never relocating anything, and the usage of the device doesn’t change at all.

This line has now been repeating for more than 24 hours, and the previous attempt was similar.
[Sun Mar 29 09:59:50 2020] BTRFS info (device sdd1): found 133 extents

Prior to this run I had tried with an earlier kernel (5.5.10) and had the same results.  It starts out finding and relocating extents, but then stops relocating and just repeats the 'found' messages.  So I upgraded my kernel to see if that would help, and it has not.
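
For reference, the delete was started and monitored roughly like this (/dev/sdb1 is a guess based on devid 2 showing size 0.00B below; the mount point is a placeholder):

    # start the remove; it blocks until all data is relocated off the device
    btrfs device delete /dev/sdb1 /mnt/pool1
    # in another shell, watch relocation progress and per-device usage
    dmesg -w | grep -i btrfs
    btrfs device usage /mnt/pool1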

System Info
Ubuntu 18.04
btrfs-progs v5.4.1
Linux FileServer 5.5.13-050513-generic #202003251631 SMP Wed Mar 25 16:35:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

DEVICE USAGE
/dev/sdd1, ID: 1
   Device size:             2.73TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:            888.43GiB
   Unallocated:             1.00MiB

/dev/sdb1, ID: 2
   Device size:             2.73TiB
   Device slack:            2.73TiB
   Data,RAID6:            188.67GiB
   Data,RAID6:            508.82GiB
   Data,RAID6:              2.00GiB
   Unallocated:          -699.50GiB

/dev/sdc1, ID: 3
   Device size:             2.73TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:            888.43GiB
   Unallocated:             1.00MiB

/dev/sdi1, ID: 5
   Device size:             2.73TiB
   Device slack:            1.36TiB
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.18TiB
   Unallocated:             1.00MiB

/dev/sdh1, ID: 6
   Device size:             4.55TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:              1.23TiB
   Data,RAID6:            888.43GiB
   Data,RAID6:              2.00GiB
   Metadata,RAID1:          2.00GiB
   Unallocated:           601.01GiB

/dev/sda1, ID: 7
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:              1.23TiB
   Data,RAID6:            888.43GiB
   Data,RAID6:              2.00GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:             3.32TiB

/dev/sdf1, ID: 8
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:              1.23TiB
   Data,RAID6:            888.43GiB
   Data,RAID6:              2.00GiB
   Metadata,RAID1:          8.00GiB
   Unallocated:             3.31TiB

/dev/sdj1, ID: 9
   Device size:             7.28TiB
   Device slack:              0.00B
   Data,RAID6:            188.67GiB
   Data,RAID6:              1.68TiB
   Data,RAID6:              1.23TiB
   Data,RAID6:            888.43GiB
   Data,RAID6:              2.00GiB
   Metadata,RAID1:          8.00GiB
   System,RAID1:           32.00MiB
   Unallocated:             3.31TiB


FI USAGE
WARNING: RAID56 detected, not implemented
Overall:
    Device size:		  33.20TiB
    Device allocated:		  20.06GiB
    Device unallocated:		  33.18TiB
    Device missing:		     0.00B
    Used:			  19.38GiB
    Free (estimated):		     0.00B	(min: 8.00EiB)
    Data ratio:			      0.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,RAID6: Size:15.42TiB, Used:15.18TiB (98.44%)
   /dev/sdd1	   2.73TiB
   /dev/sdb1	 699.50GiB
   /dev/sdc1	   2.73TiB
   /dev/sdi1	   1.36TiB
   /dev/sdh1	   3.96TiB
   /dev/sda1	   3.96TiB
   /dev/sdf1	   3.96TiB
   /dev/sdj1	   3.96TiB

Metadata,RAID1: Size:10.00GiB, Used:9.69GiB (96.90%)
   /dev/sdh1	   2.00GiB
   /dev/sda1	   2.00GiB
   /dev/sdf1	   8.00GiB
   /dev/sdj1	   8.00GiB

System,RAID1: Size:32.00MiB, Used:1.19MiB (3.71%)
   /dev/sda1	  32.00MiB
   /dev/sdj1	  32.00MiB

Unallocated:
   /dev/sdd1	   1.00MiB
   /dev/sdb1	-699.50GiB
   /dev/sdc1	   1.00MiB
   /dev/sdi1	   1.00MiB
   /dev/sdh1	 601.01GiB
   /dev/sda1	   3.32TiB
   /dev/sdf1	   3.31TiB
   /dev/sdj1	   3.31TiB


FI SHOW
Label: 'Pool1'  uuid: 99935e27-4922-4efa-bf76-5787536dd71f
	Total devices 8 FS bytes used 15.19TiB
	devid    1 size 2.73TiB used 2.73TiB path /dev/sdd1
	devid    2 size 0.00B used 699.50GiB path /dev/sdb1
	devid    3 size 2.73TiB used 2.73TiB path /dev/sdc1
	devid    5 size 1.36TiB used 1.36TiB path /dev/sdi1
	devid    6 size 4.55TiB used 3.96TiB path /dev/sdh1
	devid    7 size 7.28TiB used 3.96TiB path /dev/sda1
	devid    8 size 7.28TiB used 3.97TiB path /dev/sdf1
	devid    9 size 7.28TiB used 3.97TiB path /dev/sdj1

FI DF
Data, RAID6: total=15.42TiB, used=15.18TiB
System, RAID1: total=32.00MiB, used=1.19MiB
Metadata, RAID1: total=10.00GiB, used=9.69GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


* Re: Device Delete Stuck
  2020-03-29 14:13 Device Delete Stuck Jason Clara
@ 2020-03-29 16:40 ` Steven Fosdick
  2020-03-29 18:18   ` Jason Clara
  2020-03-29 18:55 ` Zygo Blaxell
  1 sibling, 1 reply; 5+ messages in thread
From: Steven Fosdick @ 2020-03-29 16:40 UTC (permalink / raw)
  To: Jason Clara; +Cc: Btrfs BTRFS

Jason,

I am not a btrfs developer, but I had the same problem as you.  In my
case the problem went away when I used the mount option to clear the
free space cache.  From my own experience, whatever is going wrong
that causes the checksum errors also corrupts this cache, but that does
no long-term harm: once it is cleared on mount, it gets rebuilt.
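
A sketch of what that looks like, assuming the filesystem is mounted at /mnt/pool1 (the mount point is a placeholder; the UUID is the one from 'fi show' earlier in the thread):

    # one-off mount with clear_cache to discard and rebuild the free space cache
    mount -o clear_cache /dev/sdd1 /mnt/pool1

    # or add it to /etc/fstab so it takes effect on the next boot
    UUID=99935e27-4922-4efa-bf76-5787536dd71f  /mnt/pool1  btrfs  defaults,clear_cache  0  0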

Steve.


* Re: Device Delete Stuck
  2020-03-29 16:40 ` Steven Fosdick
@ 2020-03-29 18:18   ` Jason Clara
  0 siblings, 0 replies; 5+ messages in thread
From: Jason Clara @ 2020-03-29 18:18 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Btrfs BTRFS

Thanks for the suggestion.  I added clear_cache to my fstab, rebooted, and waited about 20-30 minutes to make sure everything had settled down.
I did see this in my log, so it appears to have worked: "BTRFS info (device sdd1): force clearing of disk cache"

I attempted the delete again, and it did remove more data, but it looks like it is stuck again.

Here is the dmesg output from when I started the delete.  The last line, "found 3 extents", has been repeating for the last 20 or so minutes.

[Sun Mar 29 13:42:06 2020] BTRFS info (device sdd1): relocating block group 145441210499072 flags data|raid6
[Sun Mar 29 13:43:17 2020] BTRFS info (device sdd1): found 3010 extents
[Sun Mar 29 13:43:23 2020] BTRFS info (device sdd1): found 3010 extents
[Sun Mar 29 13:43:25 2020] BTRFS info (device sdd1): relocating block group 145437989273600 flags data|raid6
[Sun Mar 29 13:44:14 2020] BTRFS info (device sdd1): found 972 extents
[Sun Mar 29 13:44:21 2020] BTRFS info (device sdd1): found 950 extents
[Sun Mar 29 13:44:31 2020] BTRFS info (device sdd1): relocating block group 120428453429248 flags data|raid6
[Sun Mar 29 13:45:23 2020] BTRFS info (device sdd1): found 3884 extents
[Sun Mar 29 13:45:49 2020] BTRFS info (device sdd1): found 3883 extents
[Sun Mar 29 13:46:14 2020] BTRFS info (device sdd1): relocating block group 132181611511808 flags data|raid6
[Sun Mar 29 13:46:19 2020] BTRFS info (device sdd1): found 60 extents
[Sun Mar 29 13:46:21 2020] BTRFS info (device sdd1): found 60 extents
[Sun Mar 29 13:46:23 2020] BTRFS info (device sdd1): relocating block group 132153520160768 flags data|raid6
[Sun Mar 29 13:46:33 2020] BTRFS info (device sdd1): found 42 extents
[Sun Mar 29 13:46:35 2020] BTRFS info (device sdd1): found 42 extents
[Sun Mar 29 13:46:37 2020] BTRFS info (device sdd1): relocating block group 120433822138368 flags data|raid6
[Sun Mar 29 13:47:37 2020] BTRFS info (device sdd1): found 3831 extents
[Sun Mar 29 13:47:59 2020] BTRFS info (device sdd1): found 3831 extents
[Sun Mar 29 13:48:15 2020] BTRFS info (device sdd1): relocating block group 132175346270208 flags data|raid6
[Sun Mar 29 13:48:19 2020] BTRFS info (device sdd1): found 29 extents
[Sun Mar 29 13:48:21 2020] BTRFS info (device sdd1): found 29 extents
[Sun Mar 29 13:48:23 2020] BTRFS info (device sdd1): found 29 extents
[Sun Mar 29 13:48:25 2020] BTRFS info (device sdd1): relocating block group 120439190847488 flags data|raid6
[Sun Mar 29 13:49:12 2020] BTRFS info (device sdd1): relocating block group 132182843588608 flags data|raid6
[Sun Mar 29 13:49:16 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:17 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:18 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:18 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:19 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:19 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:20 2020] BTRFS info (device sdd1): found 3 extents
[Sun Mar 29 13:49:20 2020] BTRFS info (device sdd1): found 3 extents


Updated FI USAGE
WARNING: RAID56 detected, not implemented
Overall:
    Device size:		  33.20TiB
    Device allocated:		  20.06GiB
    Device unallocated:		  33.18TiB
    Device missing:		     0.00B
    Used:			  19.38GiB
    Free (estimated):		     0.00B	(min: 8.00EiB)
    Data ratio:			      0.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 144.00KiB)

Data,RAID6: Size:15.42TiB, Used:15.18TiB (98.47%)
   /dev/sdd1	   2.73TiB
   /dev/sdb1	 695.21GiB
   /dev/sdc1	   2.73TiB
   /dev/sdi1	   1.36TiB
   /dev/sdh1	   3.96TiB
   /dev/sda1	   3.96TiB
   /dev/sdf1	   3.96TiB
   /dev/sdj1	   3.96TiB

Metadata,RAID1: Size:10.00GiB, Used:9.69GiB (96.89%)
   /dev/sdh1	   2.00GiB
   /dev/sda1	   2.00GiB
   /dev/sdf1	   8.00GiB
   /dev/sdj1	   8.00GiB

System,RAID1: Size:32.00MiB, Used:1.19MiB (3.71%)
   /dev/sda1	  32.00MiB
   /dev/sdj1	  32.00MiB

Unallocated:
   /dev/sdd1	   1.00MiB
   /dev/sdb1	-695.21GiB
   /dev/sdc1	   1.00MiB
   /dev/sdi1	   1.00MiB
   /dev/sdh1	 601.01GiB
   /dev/sda1	   3.32TiB
   /dev/sdf1	   3.31TiB
   /dev/sdj1	   3.31TiB



* Re: Device Delete Stuck
  2020-03-29 14:13 Device Delete Stuck Jason Clara
  2020-03-29 16:40 ` Steven Fosdick
@ 2020-03-29 18:55 ` Zygo Blaxell
  2020-03-29 19:24   ` Jason Clara
  1 sibling, 1 reply; 5+ messages in thread
From: Zygo Blaxell @ 2020-03-29 18:55 UTC (permalink / raw)
  To: Jason Clara; +Cc: linux-btrfs

On Sun, Mar 29, 2020 at 10:13:05AM -0400, Jason Clara wrote:
> I posted previously about how attempting a device delete would cause
> my whole system to hang.  I seem to have gotten past that issue.
>
> As it turns out, even though all the SCRUBs finished without any
> errors, I still had a problem with some files.  By forcing a read of
> every single file I was able to detect the bad files in DMESG.  I'm
> not sure, though, why SCRUB didn't detect this.  BTRFS warning (device
> sdd1): csum failed root 5 ino 14654354 off 163852288 csum 0

That sounds like it could be the raid5/6 bug I reported

	https://www.spinics.net/lists/linux-btrfs/msg94594.html

To trigger that bug you need pre-existing corruption on the disk.

You can work around it as follows (a rough command sketch follows the steps):

	1.  Read every file, e.g. 'find -type f -exec cat {} + >/dev/null'
	This avoids dmesg ratelimiting which will hide some errors.

	2.  If there are read errors in step 1, remove any files that
	failed to read.

	3.  Run full scrub to fix parity or inject new errors.

	4.  Repeat until there are no errors at step 1.
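
A rough shell sketch of that loop, assuming the pool is mounted at /mnt/pool1 (the mount point and the inode-to-path lookup are only illustrative):

	MNT=/mnt/pool1
	# step 1: force a read of every file; csum failures land in dmesg
	find "$MNT" -type f -exec cat {} + > /dev/null
	# map the failing inodes from dmesg back to file paths
	dmesg | grep 'csum failed' | grep -o 'ino [0-9]*' | awk '{print $2}' | sort -u |
		while read ino; do btrfs inspect-internal inode-resolve "$ino" "$MNT"; done
	# step 2: delete (or restore from backup) the files that failed to read
	# step 3: scrub the whole filesystem so parity gets repaired
	btrfs scrub start -Bd "$MNT"
	# step 4: repeat from step 1 until the read pass produces no new csum errors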

The bug will introduce new errors in a small fraction (<0.1%) of corrupted
raid stripes as you do this.  Each pass through the loop will remove
existing errors, but may add a few more new errors at the same time.
The rate of removal is much faster than the rate of addition, so the
loop will eventually terminate at zero errors.  You'll be able to use
the filesystem normally again after that.

This bug is not a regression--there has not been a kernel release with
working btrfs raid5/6 yet.  All releases from 4.15 to 5.5.3 fail my test
case, and versions before 4.15 have worse bugs.  At the moment, btrfs
raid5/6 should only be used by developers who intend to test, debug,
and fix btrfs raid5/6.

> But now when I attempt to delete a device from the array it seems to
> get stuck.  Normally it will show in the log that it has found some
> extents and then another message saying they were relocated.
>
> But for the last few days it has just been repeating the same found
> value and never relocating anything, and the usage of the device
> doesn’t change at all.
>
> This line has now been repeating for more than 24 hours, and the
> previous attempt was similar.  [Sun Mar 29 09:59:50 2020] BTRFS info
> (device sdd1): found 133 extents

Kernels starting with 5.1 have a known regression where block group
relocation gets stuck in loops.  Everything in the block group gets
relocated except for shared data backref items, then the relocation can't
seem to move those and no further progress is made.  This has not been
fixed yet.

> Prior to this run I had tried with an earlier kernel (5.5.10) and had
> the same results.  It starts out finding and relocating extents, but
> then stops relocating and just repeats the 'found' messages.  So I
> upgraded my kernel to see if that would help, and it has not.

Use kernel 4.19 for device deletes or other big relocation operations.
(5.0 and 4.20 are OK too, but 4.19 is still maintained and has fixes
for non-btrfs issues).

> System Info
> Ubuntu 18.04
> btrfs-progs v5.4.1
> Linux FileServer 5.5.13-050513-generic #202003251631 SMP Wed Mar 25 16:35:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
> 
> DEVICE USAGE
> /dev/sdd1, ID: 1
>    Device size:             2.73TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:            888.43GiB
>    Unallocated:             1.00MiB
> 
> /dev/sdb1, ID: 2
>    Device size:             2.73TiB
>    Device slack:            2.73TiB
>    Data,RAID6:            188.67GiB
>    Data,RAID6:            508.82GiB
>    Data,RAID6:              2.00GiB
>    Unallocated:          -699.50GiB
> 
> /dev/sdc1, ID: 3
>    Device size:             2.73TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:            888.43GiB
>    Unallocated:             1.00MiB
> 
> /dev/sdi1, ID: 5
>    Device size:             2.73TiB
>    Device slack:            1.36TiB
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.18TiB
>    Unallocated:             1.00MiB
> 
> /dev/sdh1, ID: 6
>    Device size:             4.55TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:              1.23TiB
>    Data,RAID6:            888.43GiB
>    Data,RAID6:              2.00GiB
>    Metadata,RAID1:          2.00GiB
>    Unallocated:           601.01GiB
> 
> /dev/sda1, ID: 7
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:              1.23TiB
>    Data,RAID6:            888.43GiB
>    Data,RAID6:              2.00GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             3.32TiB
> 
> /dev/sdf1, ID: 8
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:              1.23TiB
>    Data,RAID6:            888.43GiB
>    Data,RAID6:              2.00GiB
>    Metadata,RAID1:          8.00GiB
>    Unallocated:             3.31TiB
> 
> /dev/sdj1, ID: 9
>    Device size:             7.28TiB
>    Device slack:              0.00B
>    Data,RAID6:            188.67GiB
>    Data,RAID6:              1.68TiB
>    Data,RAID6:              1.23TiB
>    Data,RAID6:            888.43GiB
>    Data,RAID6:              2.00GiB
>    Metadata,RAID1:          8.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:             3.31TiB
> 
> 
> FI USAGE
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:		  33.20TiB
>     Device allocated:		  20.06GiB
>     Device unallocated:		  33.18TiB
>     Device missing:		     0.00B
>     Used:			  19.38GiB
>     Free (estimated):		     0.00B	(min: 8.00EiB)
>     Data ratio:			      0.00
>     Metadata ratio:		      2.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,RAID6: Size:15.42TiB, Used:15.18TiB (98.44%)
>    /dev/sdd1	   2.73TiB
>    /dev/sdb1	 699.50GiB
>    /dev/sdc1	   2.73TiB
>    /dev/sdi1	   1.36TiB
>    /dev/sdh1	   3.96TiB
>    /dev/sda1	   3.96TiB
>    /dev/sdf1	   3.96TiB
>    /dev/sdj1	   3.96TiB
> 
> Metadata,RAID1: Size:10.00GiB, Used:9.69GiB (96.90%)
>    /dev/sdh1	   2.00GiB
>    /dev/sda1	   2.00GiB
>    /dev/sdf1	   8.00GiB
>    /dev/sdj1	   8.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:1.19MiB (3.71%)
>    /dev/sda1	  32.00MiB
>    /dev/sdj1	  32.00MiB
> 
> Unallocated:
>    /dev/sdd1	   1.00MiB
>    /dev/sdb1	-699.50GiB
>    /dev/sdc1	   1.00MiB
>    /dev/sdi1	   1.00MiB
>    /dev/sdh1	 601.01GiB
>    /dev/sda1	   3.32TiB
>    /dev/sdf1	   3.31TiB
>    /dev/sdj1	   3.31TiB
> 
> 
> FI SHOW
> Label: 'Pool1'  uuid: 99935e27-4922-4efa-bf76-5787536dd71f
> 	Total devices 8 FS bytes used 15.19TiB
> 	devid    1 size 2.73TiB used 2.73TiB path /dev/sdd1
> 	devid    2 size 0.00B used 699.50GiB path /dev/sdb1
> 	devid    3 size 2.73TiB used 2.73TiB path /dev/sdc1
> 	devid    5 size 1.36TiB used 1.36TiB path /dev/sdi1
> 	devid    6 size 4.55TiB used 3.96TiB path /dev/sdh1
> 	devid    7 size 7.28TiB used 3.96TiB path /dev/sda1
> 	devid    8 size 7.28TiB used 3.97TiB path /dev/sdf1
> 	devid    9 size 7.28TiB used 3.97TiB path /dev/sdj1
> 
> FI DF
> Data, RAID6: total=15.42TiB, used=15.18TiB
> System, RAID1: total=32.00MiB, used=1.19MiB
> Metadata, RAID1: total=10.00GiB, used=9.69GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Device Delete Stuck
  2020-03-29 18:55 ` Zygo Blaxell
@ 2020-03-29 19:24   ` Jason Clara
  0 siblings, 0 replies; 5+ messages in thread
From: Jason Clara @ 2020-03-29 19:24 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Thanks, I will give it a try.  Your step 1 is actually what I used to detect the errors the first time, when the delete would cause the system to hang completely.  I then deleted all the bad files and restored them from a backup.  I did do a scrub after that, but didn't repeat step 1.

I will try your suggestion and repeat the steps until I see no errors.

Also, I understand the state of RAID 5/6.  This pool has all of its important data backed up to another RAID1 pool daily.  I am actually trying to shrink this pool so I can move the freed device over to the RAID1 pool.

It was previously a RAID1 pool that I converted to RAID6, and since then I have not been able to remove that device.
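
For context, that kind of conversion is normally done with a convert balance, something along these lines (the mount point is a placeholder; judging by the RAID1 metadata shown earlier, only the data profile was converted):

    btrfs balance start -dconvert=raid6 /mnt/pool1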



Thread overview: 5+ messages
2020-03-29 14:13 Device Delete Stuck Jason Clara
2020-03-29 16:40 ` Steven Fosdick
2020-03-29 18:18   ` Jason Clara
2020-03-29 18:55 ` Zygo Blaxell
2020-03-29 19:24   ` Jason Clara
