* Errors after successful disk replace
@ 2021-10-19  3:54 Emil Heimpel
  2021-10-19  5:35 ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Emil Heimpel @ 2021-10-19  3:54 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

One of the drives in my raid 5 btrfs array failed (it was completely dead), so I installed an identical replacement drive. The dead drive was devid 1 and the new drive is /dev/sde. I used the following command to replace the missing drive:

sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/

and it completed successfully without any reported errors (took around 2 weeks though...).

I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.

It showed up after a reboot as follows:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
        Total devices 6 FS bytes used 20.96TiB
        devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
        devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
        devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
        devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
        devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
        devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1

I then tried to mount it, but it failed, so I ran a read-only check, which reported the following problem:

[...]
[2/7] checking extents
ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
ERROR: mounting this fs may fail for newer kernels
ERROR: this can be fixed by 'btrfs rescue fix-device-size'
[3/7] checking free space tree
[...]

So I followed that advice but got the following error:

sudo btrfs rescue fix-device-size /dev/sde1
ERROR: devid 1 is missing or not writeable
ERROR: fixing device size needs all device(s) to be present and writeable

So it seems something went wrong or didn't complete fully.
What can I do to fix this problem?

uname -a
Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux

btrfs --version
btrfs-progs v5.14.2

Regards,
Emil

P.S.: Yes, I know raid5 isn't stable, but it works well enough for me ;)
Metadata is raid1 btw...


* Re: Errors after successful disk replace
  2021-10-19  3:54 Errors after successful disk replace Emil Heimpel
@ 2021-10-19  5:35 ` Qu Wenruo
  2021-10-19 10:49   ` Emil Heimpel
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2021-10-19  5:35 UTC (permalink / raw)
  To: Emil Heimpel, linux-btrfs



On 2021/10/19 11:54, Emil Heimpel wrote:
> Hi all,
>
> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>
> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>
> and it completed successfully without any reported errors (took around 2 weeks though...).
>
> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.

Any dmesg of that time?

>
> It showed up after a reboot as followed:
>
> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>          Total devices 6 FS bytes used 20.96TiB
>          devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>          devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>          devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>          devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>          devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>          devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>
> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:

And dmesg for the failed mount?

Thanks,
Qu
>
> [...]
> [2/7] checking extents
> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
> ERROR: mounting this fs may fail for newer kernels
> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
> [3/7] checking free space tree
> [...]
>
> So I followed that advice but got the following error:
>
> sudo btrfs rescue fix-device-size /dev/sde1
> ERROR: devid 1 is missing or not writeable
> ERROR: fixing device size needs all device(s) to be present and writeable
>
> So it seems something went wrong or didn't complete fully.
> What can I do to fix this problem?
>
> uname -a
> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>
> btrfs --version
> btrfs-progs v5.14.2
>
> Regards,
> Emil
>
> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
> Metadata is raid1 btw...
>


* Re: Errors after successful disk replace
  2021-10-19  5:35 ` Qu Wenruo
@ 2021-10-19 10:49   ` Emil Heimpel
  2021-10-19 11:37     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Emil Heimpel @ 2021-10-19 10:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:

>
>
> On 2021/10/19 11:54, Emil Heimpel wrote:
>> Hi all,
>>
>> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>>
>> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>>
>> and it completed successfully without any reported errors (took around 2 weeks though...).
>>
>> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
>
> Any dmesg of that time?
>

Nothing after the replace finished:

1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
1634560793.169379 BlueQ kernel: zram: Removed device: zram0
1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...




>>
>> It showed up after a reboot as followed:
>>
>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>         Total devices 6 FS bytes used 20.96TiB
>>         devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>>         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>
>> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:
>
> And dmesg for the failed mount?
>

Oops, I must have missed that it failed because of missing devid 1 too...

1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed

> Thanks,
> Qu
>>
>> [...]
>> [2/7] checking extents
>> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
>> ERROR: mounting this fs may fail for newer kernels
>> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
>> [3/7] checking free space tree
>> [...]
>>
>> So I followed that advice but got the following error:
>>
>> sudo btrfs rescue fix-device-size /dev/sde1
>> ERROR: devid 1 is missing or not writeable
>> ERROR: fixing device size needs all device(s) to be present and writeable
>>
>> So it seems something went wrong or didn't complete fully.
>> What can I do to fix this problem?
>>
>> uname -a
>> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>>
>> btrfs --version
>> btrfs-progs v5.14.2
>>
>> Regards,
>> Emil
>>
>> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
>> Metadata is raid1 btw...
>>



* Re: Errors after successful disk replace
  2021-10-19 10:49   ` Emil Heimpel
@ 2021-10-19 11:37     ` Qu Wenruo
  2021-10-19 12:10       ` Emil Heimpel
  2021-10-19 12:16       ` Emil Heimpel
  0 siblings, 2 replies; 11+ messages in thread
From: Qu Wenruo @ 2021-10-19 11:37 UTC (permalink / raw)
  To: Emil Heimpel; +Cc: linux-btrfs



On 2021/10/19 18:49, Emil Heimpel wrote:
>
> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>>
>>
>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>> Hi all,
>>>
>>> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>>>
>>> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>>>
>>> and it completed successfully without any reported errors (took around 2 weeks though...).
>>>
>>> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
>>
>> Any dmesg of that time?
>>
>
> Nothing after the replace finished:
>
> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)

*failed*...

> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)

And ATA command failures.

I don't believe the replace finished without problems, and the involved
device is /dev/sdd.

> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)

You won't want to see this message at all.

This means you're running RAID56, and btrfs has a write-hole problem
which degrades the robustness of RAID56 a little more with each
unclean shutdown.

I guess the write-hole problem had already made the repair fail during
the replace.

Thus after a successful mount, a scrub and manual file checking are
almost a must.

> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>
>
>
>
>>>
>>> It showed up after a reboot as followed:
>>>
>>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>>          Total devices 6 FS bytes used 20.96TiB
>>>          devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>>>          devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>>          devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>>          devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>>          devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>>          devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>>
>>> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:
>>
>> And dmesg for the failed mount?
>>
>
> Oops, I must have missed that it failed because of missing devid 1 too...
>
> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed

This doesn't sound correct.

If a device is properly replaced, it should have the same devid number.

I guess you have tried to add a new device before, and then tried to
replace the missing device, right?


Anyway, have you tried to mount it degraded and then remove the missing
device?

Since you're using RAID56, I guess a degraded mount should work.
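
Roughly, and reusing the device and mount point from earlier in this
thread, that would be something like:

sudo mount -o degraded /dev/sde1 /mnt/btrfsrepair/
sudo btrfs device remove missing /mnt/btrfsrepair/

Just a sketch; adjust the paths to your setup.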

Thanks,
Qu

>
>> Thanks,
>> Qu
>>>
>>> [...]
>>> [2/7] checking extents
>>> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
>>> ERROR: mounting this fs may fail for newer kernels
>>> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
>>> [3/7] checking free space tree
>>> [...]
>>>
>>> So I followed that advice but got the following error:
>>>
>>> sudo btrfs rescue fix-device-size /dev/sde1
>>> ERROR: devid 1 is missing or not writeable
>>> ERROR: fixing device size needs all device(s) to be present and writeable
>>>
>>> So it seems something went wrong or didn't complete fully.
>>> What can I do to fix this problem?
>>>
>>> uname -a
>>> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>>>
>>> btrfs --version
>>> btrfs-progs v5.14.2
>>>
>>> Regards,
>>> Emil
>>>
>>> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
>>> Metadata is raid1 btw...
>>>
>


* Re: Errors after successful disk replace
  2021-10-19 11:37     ` Qu Wenruo
@ 2021-10-19 12:10       ` Emil Heimpel
  2021-10-19 12:16       ` Emil Heimpel
  1 sibling, 0 replies; 11+ messages in thread
From: Emil Heimpel @ 2021-10-19 12:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs


Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:

>
>
> On 2021/10/19 18:49, Emil Heimpel wrote:
>>
>> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>
>>>
>>>
>>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>>> Hi all,
>>>>
>>>> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>>>>
>>>> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>>>>
>>>> and it completed successfully without any reported errors (took around 2 weeks though...).
>>>>
>>>> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
>>>
>>> Any dmesg of that time?
>>>
>>
>> Nothing after the replace finished:
>>
>> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
>> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
>> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
>> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
>> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
>> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
>
> *failed*...
>
>> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
>> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
>> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
>> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
>
> And ATA commands failure.
>
> I don't believe the replace finished without problem, and the involved
> device is /dev/sdd.
>
>> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
>> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
>> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
>> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
>
> You won't want to see this message at all.
>
> This means, you're running RAID56, as btrfs has write-hole problem,
> which will degrade the robust of RAID56 byte by byte for each unclean
> shutdown.
>
> I guess the write hole problem has already make the repair failed for
> the replace.
>

Hm, I never checked dmesg during the replace, because btrfs replace always showed 0 errors... I may have to check my SATA controller, as I don't think it is a problem with sdd. I found the same errors for other drives as well...


> Thus after a successful mount, scrub and manually file checking is
> almost a must.
>
>> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
>> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
>> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
>> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
>> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>>
>>
>>
>>
>>>>
>>>> It showed up after a reboot as followed:
>>>>
>>>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>>>         Total devices 6 FS bytes used 20.96TiB
>>>>         devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>>>>         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>>>         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>>>         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>>>         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>>>         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>>>
>>>> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:
>>>
>>> And dmesg for the failed mount?
>>>
>>
>> Oops, I must have missed that it failed because of missing devid 1 too...
>>
>> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
>> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
>> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
>> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
>> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
>> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
>
> This doesn't sound correct.
>
> If a device is properly replaced, it should have the same devid number.
>
> I guess you have tried to add a new device before, and then tried to
> replace the missing device, right?
>

No. I never added a drive to that array and never tried to remove one. Only replace! This is the third time I have replaced a drive; the first two times the faulty drive was readable and still in the system. This was the first time I had to do a replace without the source drive.

I did replace the first two drives with bigger ones though and did a resize after the replace...
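
(For the record, that resize was something along the lines of

sudo btrfs filesystem resize <devid>:max /mnt/btrfsrepair/

with <devid> standing in for the id of the replaced drive; quoting from
memory.)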
>
> Anyway, have you tried to mount it degraded and then remove the missing
> device?
>
> Since you're using RAID56, I guess degrade mount should work.
>

So, mount degraded and then remove devid 1? I'll try that and report back, thanks!

Emil

> Thanks,
> Qu
>
>>
>>> Thanks,
>>> Qu
>>>>
>>>> [...]
>>>> [2/7] checking extents
>>>> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
>>>> ERROR: mounting this fs may fail for newer kernels
>>>> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
>>>> [3/7] checking free space tree
>>>> [...]
>>>>
>>>> So I followed that advice but got the following error:
>>>>
>>>> sudo btrfs rescue fix-device-size /dev/sde1
>>>> ERROR: devid 1 is missing or not writeable
>>>> ERROR: fixing device size needs all device(s) to be present and writeable
>>>>
>>>> So it seems something went wrong or didn't complete fully.
>>>> What can I do to fix this problem?
>>>>
>>>> uname -a
>>>> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>>>>
>>>> btrfs --version
>>>> btrfs-progs v5.14.2
>>>>
>>>> Regards,
>>>> Emil
>>>>
>>>> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
>>>> Metadata is raid1 btw...
>>>>
>>



* Re: Errors after successful disk replace
  2021-10-19 11:37     ` Qu Wenruo
  2021-10-19 12:10       ` Emil Heimpel
@ 2021-10-19 12:16       ` Emil Heimpel
  2021-10-19 12:20         ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Emil Heimpel @ 2021-10-19 12:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Color me surprised:


[74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
[74713.072755] BTRFS info (device sde1): allowing degraded mounts
[74713.072758] BTRFS info (device sde1): using free space tree
[74713.072760] BTRFS info (device sde1): has skinny extents
[74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
[74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
[74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
[74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
[74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
74.9% done, 0 write errs, 0 uncorr. read errs

I guess I just wait?
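
(If I remember right, leaving out the -1 flag, i.e.

sudo btrfs replace status /mnt/btrfsrepair/

keeps printing progress until the operation finishes, so it can just be
left running instead of re-checking by hand.)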

Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:

> 
> 
> On 2021/10/19 18:49, Emil Heimpel wrote:
>> 
>> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>> 
>>> 
>>> 
>>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>>> Hi all,
>>>> 
>>>> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>>>> 
>>>> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>>>> 
>>>> and it completed successfully without any reported errors (took around 2 weeks though...).
>>>> 
>>>> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
>>> 
>>> Any dmesg of that time?
>>> 
>> 
>> Nothing after the replace finished:
>> 
>> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
>> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
>> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
>> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
>> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
>> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
> 
> *failed*...
> 
>> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
>> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
>> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
>> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
> 
> And ATA commands failure.
> 
> I don't believe the replace finished without problem, and the involved
> device is /dev/sdd.
> 
>> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
>> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
>> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
>> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
> 
> You won't want to see this message at all.
> 
> This means, you're running RAID56, as btrfs has write-hole problem,
> which will degrade the robust of RAID56 byte by byte for each unclean
> shutdown.
> 
> I guess the write hole problem has already make the repair failed for
> the replace.
> 
> Thus after a successful mount, scrub and manually file checking is
> almost a must.
> 
>> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
>> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
>> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
>> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
>> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>> 
>> 
>> 
>> 
>>>> 
>>>> It showed up after a reboot as followed:
>>>> 
>>>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>>>         Total devices 6 FS bytes used 20.96TiB
>>>>         devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>>>>         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>>>         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>>>         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>>>         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>>>         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>>> 
>>>> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:
>>> 
>>> And dmesg for the failed mount?
>>> 
>> 
>> Oops, I must have missed that it failed because of missing devid 1 too...
>> 
>> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
>> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
>> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
>> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
>> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
>> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
> 
> This doesn't sound correct.
> 
> If a device is properly replaced, it should have the same devid number.
> 
> I guess you have tried to add a new device before, and then tried to
> replace the missing device, right?
> 
> 
> Anyway, have you tried to mount it degraded and then remove the missing
> device?
> 
> Since you're using RAID56, I guess degrade mount should work.
> 
> Thanks,
> Qu
> 
>> 
>>> Thanks,
>>> Qu
>>>> 
>>>> [...]
>>>> [2/7] checking extents
>>>> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
>>>> ERROR: mounting this fs may fail for newer kernels
>>>> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
>>>> [3/7] checking free space tree
>>>> [...]
>>>> 
>>>> So I followed that advice but got the following error:
>>>> 
>>>> sudo btrfs rescue fix-device-size /dev/sde1
>>>> ERROR: devid 1 is missing or not writeable
>>>> ERROR: fixing device size needs all device(s) to be present and writeable
>>>> 
>>>> So it seems something went wrong or didn't complete fully.
>>>> What can I do to fix this problem?
>>>> 
>>>> uname -a
>>>> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>>>> 
>>>> btrfs --version
>>>> btrfs-progs v5.14.2
>>>> 
>>>> Regards,
>>>> Emil
>>>> 
>>>> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
>>>> Metadata is raid1 btw...
>>>> 
>> 


* Re: Errors after successful disk replace
  2021-10-19 12:16       ` Emil Heimpel
@ 2021-10-19 12:20         ` Qu Wenruo
  2021-10-19 12:38           ` Emil Heimpel
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2021-10-19 12:20 UTC (permalink / raw)
  To: Emil Heimpel; +Cc: linux-btrfs



On 2021/10/19 20:16, Emil Heimpel wrote:
> Color me suprised:
>
>
> [74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
> [74713.072755] BTRFS info (device sde1): allowing degraded mounts
> [74713.072758] BTRFS info (device sde1): using free space tree
> [74713.072760] BTRFS info (device sde1): has skinny extents
> [74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
> [74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
> [74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
> [74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
> [74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
> [bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
> 74.9% done, 0 write errs, 0 uncorr. read errs
>
> I guess I just wait?

Yep, wait and stay alert; better to also keep an eye on dmesg.

But this also means the previous replace didn't really finish, which
may mean the replace ioctl is not reporting the proper status, and that
could be a possible bug.
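
For example, something like

sudo dmesg -wT

(or journalctl -kf) should follow the kernel log live.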

Thanks,
Qu

>
> Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>>
>>
>> On 2021/10/19 18:49, Emil Heimpel wrote:
>>>
>>> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>
>>>>
>>>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>>>> Hi all,
>>>>>
>>>>> One of my drives of a raid 5 btrfs array failed (was dead completely) so I installed an identical replacement drive. The dead drive was devid 1 and the new drive /dev/sde. I used the following to replace the missing drive:
>>>>>
>>>>> sudo btrfs replace start -B 1 /dev/sde1 /mnt/btrfsrepair/
>>>>>
>>>>> and it completed successfully without any reported errors (took around 2 weeks though...).
>>>>>
>>>>> I then tried to see my array with filesystem show, but it hung (or took longer than I wanted to wait), so I did a reboot.
>>>>
>>>> Any dmesg of that time?
>>>>
>>>
>>> Nothing after the replace finished:
>>>
>>> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
>>> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
>>> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
>>> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
>>> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
>>> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
>>
>> *failed*...
>>
>>> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
>>> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
>>> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
>>> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
>>
>> And ATA commands failure.
>>
>> I don't believe the replace finished without problem, and the involved
>> device is /dev/sdd.
>>
>>> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
>>> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
>>> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
>>> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
>>
>> You won't want to see this message at all.
>>
>> This means, you're running RAID56, as btrfs has write-hole problem,
>> which will degrade the robust of RAID56 byte by byte for each unclean
>> shutdown.
>>
>> I guess the write hole problem has already make the repair failed for
>> the replace.
>>
>> Thus after a successful mount, scrub and manually file checking is
>> almost a must.
>>
>>> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>>> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
>>> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
>>> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
>>> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
>>> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>>>
>>>
>>>
>>>
>>>>>
>>>>> It showed up after a reboot as followed:
>>>>>
>>>>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>>>>          Total devices 6 FS bytes used 20.96TiB
>>>>>          devid    0 size 7.28TiB used 5.46TiB path /dev/sde1
>>>>>          devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>>>>          devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>>>>          devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>>>>          devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>>>>          devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>>>>
>>>>> I then tried to mount it, but it failed, so I run a readonly check and it reported the following problem:
>>>>
>>>> And dmesg for the failed mount?
>>>>
>>>
>>> Oops, I must have missed that it failed because of missing devid 1 too...
>>>
>>> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
>>> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
>>> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
>>> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
>>> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
>>> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
>>
>> This doesn't sound correct.
>>
>> If a device is properly replaced, it should have the same devid number.
>>
>> I guess you have tried to add a new device before, and then tried to
>> replace the missing device, right?
>>
>>
>> Anyway, have you tried to mount it degraded and then remove the missing
>> device?
>>
>> Since you're using RAID56, I guess degrade mount should work.
>>
>> Thanks,
>> Qu
>>
>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>> [...]
>>>>> [2/7] checking extents
>>>>> ERROR: super total bytes 38007432437760 smaller than real device(s) size 46008994590720
>>>>> ERROR: mounting this fs may fail for newer kernels
>>>>> ERROR: this can be fixed by 'btrfs rescue fix-device-size'
>>>>> [3/7] checking free space tree
>>>>> [...]
>>>>>
>>>>> So I followed that advice but got the following error:
>>>>>
>>>>> sudo btrfs rescue fix-device-size /dev/sde1
>>>>> ERROR: devid 1 is missing or not writeable
>>>>> ERROR: fixing device size needs all device(s) to be present and writeable
>>>>>
>>>>> So it seems something went wrong or didn't complete fully.
>>>>> What can I do to fix this problem?
>>>>>
>>>>> uname -a
>>>>> Linux BlueQ 5.14.12-arch1-1 #1 SMP PREEMPT Wed, 13 Oct 2021 16:58:16 +0000 x86_64 GNU/Linux
>>>>>
>>>>> btrfs --version
>>>>> btrfs-progs v5.14.2
>>>>>
>>>>> Regards,
>>>>> Emil
>>>>>
>>>>> P.S.: Yes, I know, raid5 isn't stable but it works good enough for me ;)
>>>>> Metadata is raid1 btw...
>>>>>
>>>


* Re: Errors after successful disk replace
  2021-10-19 12:20         ` Qu Wenruo
@ 2021-10-19 12:38           ` Emil Heimpel
  2021-10-19 12:46             ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Emil Heimpel @ 2021-10-19 12:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

So it finished after 2 minutes?

[Tue Oct 19 14:13:51 2021] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
[Tue Oct 19 14:15:39 2021] BTRFS info (device sde1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished


Now I at least get the expected filesystem show output:

Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
        Total devices 6 FS bytes used 20.96TiB
        devid    1 size 7.28TiB used 5.46TiB path /dev/sde1
        devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
        devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
        devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
        devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
        devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1

And a non-degraded remount worked too.

Thanks,
Emil

Oct 19, 2021 14:20:21 Qu Wenruo <quwenruo.btrfs@gmx.com>:

> 
> 
> On 2021/10/19 20:16, Emil Heimpel wrote:
>> Color me suprised:
>> 
>> 
>> [74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
>> [74713.072755] BTRFS info (device sde1): allowing degraded mounts
>> [74713.072758] BTRFS info (device sde1): using free space tree
>> [74713.072760] BTRFS info (device sde1): has skinny extents
>> [74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>> [74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
>> [74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
>> [74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
>> [74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>> [bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
>> 74.9% done, 0 write errs, 0 uncorr. read errs
>> 
>> I guess I just wait?
> 
> Yep, wait and stay alert, better to also keep an eye on the dmesg.
> 
> But this also means, previous replace didn't really finish, which may
> mean the replace ioctl is not reporting the proper status, and can be a
> possible bug.
> 
> Thanks,
> Qu
> 
>> 
>> Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>> 
>>> 
>>> 
>>> On 2021/10/19 18:49, Emil Heimpel wrote:
>>>> 
>>>> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>> 
>>>>> 
>>>>> 
>>>>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>>>>> …
>>>>> 
>>>>> Any dmesg of that time?
>>>>> 
>>>> 
>>>> Nothing after the replace finished:
>>>> 
>>>> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
>>>> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
>>>> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
>>>> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
>>>> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
>>>> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
>>> 
>>> *failed*...
>>> 
>>>> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
>>>> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
>>>> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
>>>> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
>>> 
>>> And ATA commands failure.
>>> 
>>> I don't believe the replace finished without problem, and the involved
>>> device is /dev/sdd.
>>> 
>>>> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
>>>> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>>> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>>> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
>>>> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
>>>> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
>>> 
>>> You won't want to see this message at all.
>>> 
>>> This means, you're running RAID56, as btrfs has write-hole problem,
>>> which will degrade the robust of RAID56 byte by byte for each unclean
>>> shutdown.
>>> 
>>> I guess the write hole problem has already make the repair failed for
>>> the replace.
>>> 
>>> Thus after a successful mount, scrub and manually file checking is
>>> almost a must.
>>> 
>>>> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>>>> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
>>>> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
>>>> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
>>>> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
>>>> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>>>> 
>>>> 
>>>> 
>>>> 
>>>>>> …
>>>>> 
>>>>> And dmesg for the failed mount?
>>>>> 
>>>> 
>>>> Oops, I must have missed that it failed because of missing devid 1 too...
>>>> 
>>>> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
>>>> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
>>>> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
>>>> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
>>>> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>>> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
>>>> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
>>> 
>>> This doesn't sound correct.
>>> 
>>> If a device is properly replaced, it should have the same devid number.
>>> 
>>> I guess you have tried to add a new device before, and then tried to
>>> replace the missing device, right?
>>> 
>>> 
>>> Anyway, have you tried to mount it degraded and then remove the missing
>>> device?
>>> 
>>> Since you're using RAID56, I guess degrade mount should work.
>>> 
>>> Thanks,
>>> Qu
>>> 
>>>> 
>>>>> Thanks,
>>>>> Qu
>>>>>> …
>>>> 


* Re: Errors after successful disk replace
  2021-10-19 12:38           ` Emil Heimpel
@ 2021-10-19 12:46             ` Qu Wenruo
  2021-10-26 12:16               ` Emil Heimpel
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2021-10-19 12:46 UTC (permalink / raw)
  To: Emil Heimpel; +Cc: linux-btrfs



On 2021/10/19 20:38, Emil Heimpel wrote:
> So it finished after 2 minutes?
>
> [Tue Oct 19 14:13:51 2021] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
> [Tue Oct 19 14:15:39 2021] BTRFS info (device sde1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished

Then this means the previous run only left the dev_replace status not
cleaned up.

Thus it finished pretty quickly.
>
>
> Now I at least have an expected filesystem show:
>
> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>          Total devices 6 FS bytes used 20.96TiB
>          devid    1 size 7.28TiB used 5.46TiB path /dev/sde1
>          devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>          devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>          devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>          devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>          devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>
> And a nondegraded remount worked too.

And it's time for a full fs scrub to find out how consistent the fs is.

For btrfs RAID56, a scrub is strongly recommended after every
unexpected shutdown.

And even routine scrubs would be a plus for btrfs RAID56.
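
For example, with the mount point from earlier in the thread:

sudo btrfs scrub start /mnt/btrfsrepair/
sudo btrfs scrub status /mnt/btrfsrepair/

Just a sketch; scrubbing device by device (passing a device path
instead of the mount point) also works.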

Thanks,
Qu

>
> Thanks,
> Emil
>
> Oct 19, 2021 14:20:21 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>>
>>
>> On 2021/10/19 20:16, Emil Heimpel wrote:
>>> Color me suprised:
>>>
>>>
>>> [74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
>>> [74713.072755] BTRFS info (device sde1): allowing degraded mounts
>>> [74713.072758] BTRFS info (device sde1): using free space tree
>>> [74713.072760] BTRFS info (device sde1): has skinny extents
>>> [74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>> [74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
>>> [74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
>>> [74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
>>> [74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>>> [bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
>>> 74.9% done, 0 write errs, 0 uncorr. read errs
>>>
>>> I guess I just wait?
>>
>> Yep, wait and stay alert, better to also keep an eye on the dmesg.
>>
>> But this also means, previous replace didn't really finish, which may
>> mean the replace ioctl is not reporting the proper status, and can be a
>> possible bug.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>
>>>>
>>>> On 2021/10/19 18:49, Emil Heimpel wrote:
>>>>>
>>>>> Oct 19, 2021 07:35:54 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 2021/10/19 11:54, Emil Heimpel wrote:
>>>>>>> …
>>>>>>
>>>>>> Any dmesg of that time?
>>>>>>
>>>>>
>>>>> Nothing after the replace finished:
>>>>>
>>>>> 1634463961.245751 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663044222976 for dev (efault)
>>>>> 1634463961.255819 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663045795840 for dev (efault)
>>>>> 1634463961.275815 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663046582272 for dev (efault)
>>>>> 1634463961.275922 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663047368704 for dev (efault)
>>>>> 1634463961.339074 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048155136 for dev (efault)
>>>>> 1634463961.339248 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 17663048941568 for dev (efault)
>>>>
>>>> *failed*...
>>>>
>>>>> 1634475910.611261 BlueQ kernel: sd 9:0:2:0: attempting task abort!scmd(0x0000000046fead3f), outstanding for 7120 ms & timeout 7000 ms
>>>>> 1634475910.615126 BlueQ kernel: sd 9:0:2:0: [sdd] tag#840 CDB: ATA command pass through(16) 85 08 2e 00 00 00 01 00 00 00 00 00 00 00 ec 00
>>>>> 1634475910.615429 BlueQ kernel: scsi target9:0:2: handle(0x000b), sas_address(0x4433221105000000), phy(5)
>>>>> 1634475910.615691 BlueQ kernel: scsi target9:0:2: enclosure logical id(0x590b11c022f3fb00), slot(6)
>>>>
>>>> And ATA commands failure.
>>>>
>>>> I don't believe the replace finished without problem, and the involved
>>>> device is /dev/sdd.
>>>>
>>>>> 1634475910.787911 BlueQ kernel: sd 9:0:2:0: task abort: SUCCESS scmd(0x0000000046fead3f)
>>>>> 1634475910.807083 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>>>> 1634475949.877998 BlueQ kernel: sd 9:0:2:0: Power-on or device reset occurred
>>>>> 1634525944.213931 BlueQ kernel: perf: interrupt took too long (3138 > 3137), lowering kernel.perf_event_max_sample_rate to 63600
>>>>> 1634533791.168760 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 22996545634304 for dev (efault)
>>>>> 1634552685.203559 BlueQ kernel: BTRFS error (device sdb1): failed to rebuild valid logical 23816815706112 for dev (efault)
>>>>
>>>> You won't want to see this message at all.
>>>>
>>>> This means, you're running RAID56, as btrfs has write-hole problem,
>>>> which will degrade the robust of RAID56 byte by byte for each unclean
>>>> shutdown.
>>>>
>>>> I guess the write hole problem has already make the repair failed for
>>>> the replace.
>>>>
>>>> Thus after a successful mount, scrub and manually file checking is
>>>> almost a must.
>>>>
>>>>> 1634558977.979621 BlueQ kernel: BTRFS info (device sdb1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>>>>> 1634560793.132731 BlueQ kernel: zram0: detected capacity change from 32610864 to 0
>>>>> 1634560793.169379 BlueQ kernel: zram: Removed device: zram0
>>>>> 1634560883.549481 BlueQ kernel: watchdog: watchdog0: watchdog did not stop!
>>>>> 1634560883.556038 BlueQ systemd-shutdown[1]: Syncing filesystems and block devices.
>>>>> 1634560883.572840 BlueQ systemd-shutdown[1]: Sending SIGTERM to remaining processes...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> …
>>>>>>
>>>>>> And dmesg for the failed mount?
>>>>>>
>>>>>
>>>>> Oops, I must have missed that it failed because of missing devid 1 too...
>>>>>
>>>>> 1634562944.145383 BlueQ kernel: BTRFS info (device sde1): flagging fs with big metadata feature
>>>>> 1634562944.145529 BlueQ kernel: BTRFS info (device sde1): force zstd compression, level 2
>>>>> 1634562944.145650 BlueQ kernel: BTRFS info (device sde1): using free space tree
>>>>> 1634562944.145697 BlueQ kernel: BTRFS info (device sde1): has skinny extents
>>>>> 1634562944.148709 BlueQ kernel: BTRFS error (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>>>> 1634562944.148764 BlueQ kernel: BTRFS error (device sde1): failed to read chunk tree: -2
>>>>> 1634562944.185369 BlueQ kernel: BTRFS error (device sde1): open_ctree failed
>>>>
>>>> This doesn't sound correct.
>>>>
>>>> If a device is properly replaced, it should have the same devid number.
>>>>
>>>> I guess you have tried to add a new device before, and then tried to
>>>> replace the missing device, right?
>>>>
>>>>
>>>> Anyway, have you tried to mount it degraded and then remove the missing
>>>> device?
>>>>
>>>> Since you're using RAID56, I guess degrade mount should work.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>> …
>>>>>


* Re: Errors after successful disk replace
  2021-10-19 12:46             ` Qu Wenruo
@ 2021-10-26 12:16               ` Emil Heimpel
  2021-10-26 12:17                 ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Emil Heimpel @ 2021-10-26 12:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

After reapplying the thermal paste and adding a fan to my storage controller (LSI SAS2008), I haven't encountered any abnormal behavior from my drives. A scrub is running (disk by disk) and everything looks fine so far.
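
A disk-by-disk scrub like this can be driven per member device rather than per
mount point; the device paths below are taken from the filesystem show output
earlier in the thread and are otherwise assumptions:

    # Scrub one member device at a time, in the foreground (-B):
    sudo btrfs scrub start -B /dev/sdf1   # devid 5
    sudo btrfs scrub start -B /dev/sdg1   # devid 3
    # Check the progress of a running scrub:
    sudo btrfs scrub status /dev/sdf1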

I only got one "task blocked" message in dmesg; is that anything I should worry about?


[Sun Oct 24 18:22:11 2021] BTRFS info (device sdg1): scrub: started on devid 5
[Mon Oct 25 15:31:23 2021] BTRFS info (device sdg1): scrub: finished on devid 5 with status: 0
[Mon Oct 25 15:31:26 2021] BTRFS info (device sdg1): scrub: started on devid 3
[Mon Oct 25 21:12:34 2021] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79800
[Mon Oct 25 23:01:43 2021] hrtimer: interrupt took 13811 ns
[Tue Oct 26 02:58:17 2021] BTRFS info (device sdg1): scrub: finished on devid 3 with status: 0
[Tue Oct 26 02:58:21 2021] BTRFS info (device sdg1): scrub: started on devid 6
[Tue Oct 26 12:23:36 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
[Tue Oct 26 12:23:36 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
[Tue Oct 26 12:23:36 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 26 12:23:36 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
[Tue Oct 26 12:23:36 2021] Call Trace:
[Tue Oct 26 12:23:36 2021]  __schedule+0x333/0x1530
[Tue Oct 26 12:23:36 2021]  ? psi_task_switch+0xc2/0x1f0
[Tue Oct 26 12:23:36 2021]  ? autoremove_wake_function+0x2c/0x50
[Tue Oct 26 12:23:36 2021]  schedule+0x59/0xc0
[Tue Oct 26 12:23:36 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:23:36 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:23:36 2021]  ? __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:23:36 2021]  __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:23:36 2021]  do_syscall_64+0x5c/0x80
[Tue Oct 26 12:23:36 2021]  ? create_task_io_context+0xc7/0x110
[Tue Oct 26 12:23:36 2021]  ? get_task_io_context+0x48/0x80
[Tue Oct 26 12:23:36 2021]  ? set_task_ioprio+0x97/0xa0
[Tue Oct 26 12:23:36 2021]  ? __do_sys_ioprio_set+0x5e/0x300
[Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:23:36 2021]  ? do_syscall_64+0x69/0x80
[Tue Oct 26 12:23:36 2021]  ? exit_to_user_mode_prepare+0x77/0x170
[Tue Oct 26 12:23:36 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Tue Oct 26 12:23:36 2021] RIP: 0033:0x7fba22e2559b
[Tue Oct 26 12:23:36 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Oct 26 12:23:36 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
[Tue Oct 26 12:23:36 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
[Tue Oct 26 12:23:36 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
[Tue Oct 26 12:23:36 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
[Tue Oct 26 12:23:36 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640
[Tue Oct 26 12:27:42 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
[Tue Oct 26 12:27:42 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
[Tue Oct 26 12:27:42 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 26 12:27:42 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
[Tue Oct 26 12:27:42 2021] Call Trace:
[Tue Oct 26 12:27:42 2021]  __schedule+0x333/0x1530
[Tue Oct 26 12:27:42 2021]  ? autoremove_wake_function+0x2c/0x50
[Tue Oct 26 12:27:42 2021]  schedule+0x59/0xc0
[Tue Oct 26 12:27:42 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:27:42 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? __update_load_avg_cfs_rq+0x27d/0x2e0
[Tue Oct 26 12:27:42 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? kmem_cache_free+0x107/0x410
[Tue Oct 26 12:27:42 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
[Tue Oct 26 12:27:42 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
[Tue Oct 26 12:27:42 2021]  ? __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:27:42 2021]  __x64_sys_ioctl+0x82/0xb0
[Tue Oct 26 12:27:42 2021]  do_syscall_64+0x5c/0x80
[Tue Oct 26 12:27:42 2021]  ? create_task_io_context+0xc7/0x110
[Tue Oct 26 12:27:42 2021]  ? get_task_io_context+0x48/0x80
[Tue Oct 26 12:27:42 2021]  ? set_task_ioprio+0x97/0xa0
[Tue Oct 26 12:27:42 2021]  ? __do_sys_ioprio_set+0x5e/0x300
[Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
[Tue Oct 26 12:27:42 2021]  ? do_syscall_64+0x69/0x80
[Tue Oct 26 12:27:42 2021]  ? exit_to_user_mode_prepare+0x77/0x170
[Tue Oct 26 12:27:42 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Tue Oct 26 12:27:42 2021] RIP: 0033:0x7fba22e2559b
[Tue Oct 26 12:27:42 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Oct 26 12:27:42 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
[Tue Oct 26 12:27:42 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
[Tue Oct 26 12:27:42 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
[Tue Oct 26 12:27:42 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
[Tue Oct 26 12:27:42 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640

Thanks,
Emil

Oct 19, 2021 14:46:46 Qu Wenruo <quwenruo.btrfs@gmx.com>:

> 
> 
> On 2021/10/19 20:38, Emil Heimpel wrote:
>> So it finished after 2 minutes?
>> 
>> [Tue Oct 19 14:13:51 2021] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>> [Tue Oct 19 14:15:39 2021] BTRFS info (device sde1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
> 
> Then this means that previously only the dev-replace status had not been
> cleaned up.
>
> Thus it ended pretty quickly.
>> 
>> 
>> Now I at least have an expected filesystem show:
>> 
>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>         Total devices 6 FS bytes used 20.96TiB
>>         devid    1 size 7.28TiB used 5.46TiB path /dev/sde1
>>         devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>         devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>         devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>         devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>         devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>> 
>> And a nondegraded remount worked too.
> 
> And it's time for a full fs scrub to find out how consistent the fs is.
> 
> For btrfs RAID56, a scrub is strongly recommended every time an unexpected
> shutdown is hit.
>
> And even a routine scrub would be a plus for btrfs RAID56.
> 
> Thanks,
> Qu
> 
>> 
>> Thanks,
>> Emil
>> 
>> Oct 19, 2021 14:20:21 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>> 
>>> 
>>> 
>>> On 2021/10/19 20:16, Emil Heimpel wrote:
>>>>> Color me surprised:
>>>> 
>>>> 
>>>> [74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
>>>> [74713.072755] BTRFS info (device sde1): allowing degraded mounts
>>>> [74713.072758] BTRFS info (device sde1): using free space tree
>>>> [74713.072760] BTRFS info (device sde1): has skinny extents
>>>> [74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>>> [74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
>>>> [74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
>>>> [74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
>>>> [74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>>>> [bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
>>>> 74.9% done, 0 write errs, 0 uncorr. read errs
>>>> 
>>>> I guess I just wait?
>>> 
>>> Yep, wait and stay alert; it's better to also keep an eye on dmesg.
>>>
>>> But this also means the previous replace didn't really finish, which may
>>> mean the replace ioctl is not reporting the proper status, and that could
>>> be a bug.
>>> 
>>> Thanks,
>>> Qu
>>> 
>>>> 
>>>> Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>> 
>>>>> 
>>>>> 
>>>>> On 2021/10/19 18:49, Emil Heimpel wrote:
>>>>>> …
>>>>> 
>>>>> *failed*...
>>>>> 
>>>>>> …
>>>>> 
>>>>> And ATA command failures.
>>>>>
>>>>> I don't believe the replace finished without problems, and the involved
>>>>> device is /dev/sdd.
>>>>> 
>>>>>> …
>>>>> 
>>>>> You don't want to see this message at all.
>>>>>
>>>>> This is because you're running RAID56: btrfs has a write-hole problem,
>>>>> which erodes the robustness of RAID56 a little more with every unclean
>>>>> shutdown.
>>>>>
>>>>> I guess the write-hole problem has already made the repair fail during
>>>>> the replace.
>>>>>
>>>>> Thus, after a successful mount, a scrub and manual file checking are
>>>>> almost a must.
>>>>> 
>>>>>> …
>>>>> 
>>>>> This doesn't sound correct.
>>>>> 
>>>>> If a device is properly replaced, it should have the same devid number.
>>>>> 
>>>>> I guess you tried to add a new device first, and then tried to
>>>>> replace the missing device, right?
>>>>> 
>>>>> 
>>>>> Anyway, have you tried to mount it degraded and then remove the missing
>>>>> device?
>>>>> 
>>>>> Since you're using RAID56, I guess a degraded mount should work.
>>>>> 
>>>>> Thanks,
>>>>> Qu
>>>>> 
>>>>>> …
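
As a footnote to the routine-scrub recommendation quoted above, one simple way
to schedule it is a cron entry; the file path and mount point below are only
assumptions based on the setup discussed earlier in the thread:

    # /etc/cron.d/btrfs-scrub (hypothetical file): monthly foreground scrub
    0 3 1 * * root /usr/bin/btrfs scrub start -B /mnt/btrfsrepair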

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Errors after successful disk replace
  2021-10-26 12:16               ` Emil Heimpel
@ 2021-10-26 12:17                 ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2021-10-26 12:17 UTC (permalink / raw)
  To: Emil Heimpel; +Cc: linux-btrfs



On 2021/10/26 20:16, Emil Heimpel wrote:
> After reapplying the thermal paste and adding a fan to my storage controller (LSI SAS2008), I haven't encountered any abnormal behavior from my drives. A scrub is running (disk by disk) and everything looks fine so far.
>
> I only got one "task blocked" message in dmesg; is that anything I should worry about?

Sometimes we also hit such blocked-task messages in certain test cases.

As long as the fs is still submitting IO and you see reasonable CPU usage,
it should be fine.
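
A minimal way to check both, assuming the filesystem is mounted at
/mnt/btrfsrepair as earlier in the thread:

    # Scrub progress; the bytes-scrubbed counter should keep increasing:
    sudo btrfs scrub status /mnt/btrfsrepair
    # Per-device throughput and utilization (iostat is from the sysstat package):
    iostat -x 5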

Thanks,
Qu
>
>
> [Sun Oct 24 18:22:11 2021] BTRFS info (device sdg1): scrub: started on devid 5
> [Mon Oct 25 15:31:23 2021] BTRFS info (device sdg1): scrub: finished on devid 5 with status: 0
> [Mon Oct 25 15:31:26 2021] BTRFS info (device sdg1): scrub: started on devid 3
> [Mon Oct 25 21:12:34 2021] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79800
> [Mon Oct 25 23:01:43 2021] hrtimer: interrupt took 13811 ns
> [Tue Oct 26 02:58:17 2021] BTRFS info (device sdg1): scrub: finished on devid 3 with status: 0
> [Tue Oct 26 02:58:21 2021] BTRFS info (device sdg1): scrub: started on devid 6
> [Tue Oct 26 12:23:36 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
> [Tue Oct 26 12:23:36 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
> [Tue Oct 26 12:23:36 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Oct 26 12:23:36 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
> [Tue Oct 26 12:23:36 2021] Call Trace:
> [Tue Oct 26 12:23:36 2021]  __schedule+0x333/0x1530
> [Tue Oct 26 12:23:36 2021]  ? psi_task_switch+0xc2/0x1f0
> [Tue Oct 26 12:23:36 2021]  ? autoremove_wake_function+0x2c/0x50
> [Tue Oct 26 12:23:36 2021]  schedule+0x59/0xc0
> [Tue Oct 26 12:23:36 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
> [Tue Oct 26 12:23:36 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
> [Tue Oct 26 12:23:36 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  ? do_wait_intr_irq+0xa0/0xa0
> [Tue Oct 26 12:23:36 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:23:36 2021]  ? __x64_sys_ioctl+0x82/0xb0
> [Tue Oct 26 12:23:36 2021]  __x64_sys_ioctl+0x82/0xb0
> [Tue Oct 26 12:23:36 2021]  do_syscall_64+0x5c/0x80
> [Tue Oct 26 12:23:36 2021]  ? create_task_io_context+0xc7/0x110
> [Tue Oct 26 12:23:36 2021]  ? get_task_io_context+0x48/0x80
> [Tue Oct 26 12:23:36 2021]  ? set_task_ioprio+0x97/0xa0
> [Tue Oct 26 12:23:36 2021]  ? __do_sys_ioprio_set+0x5e/0x300
> [Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
> [Tue Oct 26 12:23:36 2021]  ? syscall_exit_to_user_mode+0x23/0x40
> [Tue Oct 26 12:23:36 2021]  ? do_syscall_64+0x69/0x80
> [Tue Oct 26 12:23:36 2021]  ? exit_to_user_mode_prepare+0x77/0x170
> [Tue Oct 26 12:23:36 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [Tue Oct 26 12:23:36 2021] RIP: 0033:0x7fba22e2559b
> [Tue Oct 26 12:23:36 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [Tue Oct 26 12:23:36 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
> [Tue Oct 26 12:23:36 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
> [Tue Oct 26 12:23:36 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
> [Tue Oct 26 12:23:36 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
> [Tue Oct 26 12:23:36 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640
> [Tue Oct 26 12:27:42 2021] INFO: task btrfs:341674 blocked for more than 122 seconds.
> [Tue Oct 26 12:27:42 2021]       Tainted: G           OE     5.14.14-arch1-1 #1
> [Tue Oct 26 12:27:42 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Tue Oct 26 12:27:42 2021] task:btrfs           state:D stack:    0 pid:341674 ppid:341658 flags:0x00004000
> [Tue Oct 26 12:27:42 2021] Call Trace:
> [Tue Oct 26 12:27:42 2021]  __schedule+0x333/0x1530
> [Tue Oct 26 12:27:42 2021]  ? autoremove_wake_function+0x2c/0x50
> [Tue Oct 26 12:27:42 2021]  schedule+0x59/0xc0
> [Tue Oct 26 12:27:42 2021]  __scrub_blocked_if_needed+0xa0/0xf0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
> [Tue Oct 26 12:27:42 2021]  scrub_pause_off+0x21/0x50 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  scrub_stripe+0x452/0x1580 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  ? __update_load_avg_cfs_rq+0x27d/0x2e0
> [Tue Oct 26 12:27:42 2021]  ? __btrfs_end_transaction+0xf6/0x210 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  ? kmem_cache_free+0x107/0x410
> [Tue Oct 26 12:27:42 2021]  ? scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  scrub_chunk+0xcd/0x130 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  scrub_enumerate_chunks+0x354/0x790 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  ? do_wait_intr_irq+0xa0/0xa0
> [Tue Oct 26 12:27:42 2021]  btrfs_scrub_dev+0x23d/0x570 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  btrfs_ioctl+0x1410/0x2df0 [btrfs 796168a0fefcfb1a0bc1ff664fee7292a977f032]
> [Tue Oct 26 12:27:42 2021]  ? __x64_sys_ioctl+0x82/0xb0
> [Tue Oct 26 12:27:42 2021]  __x64_sys_ioctl+0x82/0xb0
> [Tue Oct 26 12:27:42 2021]  do_syscall_64+0x5c/0x80
> [Tue Oct 26 12:27:42 2021]  ? create_task_io_context+0xc7/0x110
> [Tue Oct 26 12:27:42 2021]  ? get_task_io_context+0x48/0x80
> [Tue Oct 26 12:27:42 2021]  ? set_task_ioprio+0x97/0xa0
> [Tue Oct 26 12:27:42 2021]  ? __do_sys_ioprio_set+0x5e/0x300
> [Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
> [Tue Oct 26 12:27:42 2021]  ? syscall_exit_to_user_mode+0x23/0x40
> [Tue Oct 26 12:27:42 2021]  ? do_syscall_64+0x69/0x80
> [Tue Oct 26 12:27:42 2021]  ? exit_to_user_mode_prepare+0x77/0x170
> [Tue Oct 26 12:27:42 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [Tue Oct 26 12:27:42 2021] RIP: 0033:0x7fba22e2559b
> [Tue Oct 26 12:27:42 2021] RSP: 002b:00007fba22cf7c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [Tue Oct 26 12:27:42 2021] RAX: ffffffffffffffda RBX: 000056511ecc6bc0 RCX: 00007fba22e2559b
> [Tue Oct 26 12:27:42 2021] RDX: 000056511ecc6bc0 RSI: 00000000c400941b RDI: 0000000000000003
> [Tue Oct 26 12:27:42 2021] RBP: 0000000000000000 R08: 00007fba22cf8640 R09: 0000000000000000
> [Tue Oct 26 12:27:42 2021] R10: 00007fba22cf8640 R11: 0000000000000246 R12: 00007ffec33d061e
> [Tue Oct 26 12:27:42 2021] R13: 00007ffec33d061f R14: 0000000000000000 R15: 00007fba22cf8640
>
> Thanks,
> Emil
>
> Oct 19, 2021 14:46:46 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>
>>
>>
>> On 2021/10/19 20:38, Emil Heimpel wrote:
>>> So it finished after 2 minutes?
>>>
>>> [Tue Oct 19 14:13:51 2021] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>>> [Tue Oct 19 14:15:39 2021] BTRFS info (device sde1): dev_replace from <missing disk> (devid 1) to /dev/sde1 finished
>>
>> Then this means that previously only the dev-replace status had not been
>> cleaned up.
>>
>> Thus it ended pretty quickly.
>>>
>>>
>>> Now I at least have an expected filesystem show:
>>>
>>> Label: 'BlueButter'  uuid: 7e3378e6-da46-4a60-b9b8-1bcc306986e3
>>>          Total devices 6 FS bytes used 20.96TiB
>>>          devid    1 size 7.28TiB used 5.46TiB path /dev/sde1
>>>          devid    2 size 7.28TiB used 5.46TiB path /dev/sdb1
>>>          devid    3 size 2.73TiB used 2.73TiB path /dev/sdg1
>>>          devid    4 size 2.73TiB used 2.73TiB path /dev/sdd1
>>>          devid    5 size 7.28TiB used 4.81TiB path /dev/sdf1
>>>          devid    6 size 7.28TiB used 5.33TiB path /dev/sdc1
>>>
>>> And a nondegraded remount worked too.
>>
>> And it's time for a full fs scrub to find out how consistent the fs is.
>>
>> For btrfs RAID56, a scrub is strongly recommended every time an unexpected
>> shutdown is hit.
>>
>> And even a routine scrub would be a plus for btrfs RAID56.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks,
>>> Emil
>>>
>>> Oct 19, 2021 14:20:21 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>
>>>>
>>>>
>>>> On 2021/10/19 20:16, Emil Heimpel wrote:
>>>>> Color me surprised:
>>>>>
>>>>>
>>>>> [74713.072745] BTRFS info (device sde1): flagging fs with big metadata feature
>>>>> [74713.072755] BTRFS info (device sde1): allowing degraded mounts
>>>>> [74713.072758] BTRFS info (device sde1): using free space tree
>>>>> [74713.072760] BTRFS info (device sde1): has skinny extents
>>>>> [74713.104297] BTRFS warning (device sde1): devid 1 uuid 51645efd-bf95-458d-b5ae-b31623533abb is missing
>>>>> [74714.675001] BTRFS info (device sde1): bdev (efault) errs: wr 52950, rd 8161, flush 0, corrupt 1221, gen 0
>>>>> [74714.675015] BTRFS info (device sde1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 228, gen 0
>>>>> [74714.675025] BTRFS info (device sde1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 140, gen 0
>>>>> [74751.033383] BTRFS info (device sde1): continuing dev_replace from <missing disk> (devid 1) to target /dev/sde1 @74%
>>>>> [bluemond@BlueQ ~]$ sudo btrfs replace status  -1 /mnt/btrfsrepair/
>>>>> 74.9% done, 0 write errs, 0 uncorr. read errs
>>>>>
>>>>> I guess I just wait?
>>>>
>>>> Yep, wait and stay alert; it's better to also keep an eye on dmesg.
>>>>
>>>> But this also means the previous replace didn't really finish, which may
>>>> mean the replace ioctl is not reporting the proper status, and that could
>>>> be a bug.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Oct 19, 2021 13:37:09 Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 2021/10/19 18:49, Emil Heimpel wrote:
>>>>>>> …
>>>>>>
>>>>>> *failed*...
>>>>>>
>>>>>>> …
>>>>>>
>>>>>> And ATA command failures.
>>>>>>
>>>>>> I don't believe the replace finished without problems, and the involved
>>>>>> device is /dev/sdd.
>>>>>>
>>>>>>> …
>>>>>>
>>>>>> You don't want to see this message at all.
>>>>>>
>>>>>> This is because you're running RAID56: btrfs has a write-hole problem,
>>>>>> which erodes the robustness of RAID56 a little more with every unclean
>>>>>> shutdown.
>>>>>>
>>>>>> I guess the write-hole problem has already made the repair fail during
>>>>>> the replace.
>>>>>>
>>>>>> Thus, after a successful mount, a scrub and manual file checking are
>>>>>> almost a must.
>>>>>>
>>>>>>> …
>>>>>>
>>>>>> This doesn't sound correct.
>>>>>>
>>>>>> If a device is properly replaced, it should have the same devid number.
>>>>>>
>>>>>> I guess you tried to add a new device first, and then tried to
>>>>>> replace the missing device, right?
>>>>>>
>>>>>>
>>>>>> Anyway, have you tried to mount it degraded and then remove the missing
>>>>>> device?
>>>>>>
>>>>>> Since you're using RAID56, I guess a degraded mount should work.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>> …
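
For the per-device error counters quoted above (the "bdev ... errs" lines),
btrfs exposes the same numbers through device stats; the mount point below is
an assumption taken from earlier in the thread:

    # Show accumulated write/read/flush/corruption/generation error counters:
    sudo btrfs device stats /mnt/btrfsrepair
    # After verifying the array, the counters can be reset:
    sudo btrfs device stats -z /mnt/btrfsrepair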

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-10-26 12:17 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-19  3:54 Errors after successful disk replace Emil Heimpel
2021-10-19  5:35 ` Qu Wenruo
2021-10-19 10:49   ` Emil Heimpel
2021-10-19 11:37     ` Qu Wenruo
2021-10-19 12:10       ` Emil Heimpel
2021-10-19 12:16       ` Emil Heimpel
2021-10-19 12:20         ` Qu Wenruo
2021-10-19 12:38           ` Emil Heimpel
2021-10-19 12:46             ` Qu Wenruo
2021-10-26 12:16               ` Emil Heimpel
2021-10-26 12:17                 ` Qu Wenruo
