* RAID5 Unable to remove Failing HD
@ 2016-02-10  7:17 Rene Castberg
  2016-02-10  9:00 ` Anand Jain
  0 siblings, 1 reply; 10+ messages in thread
From: Rene Castberg @ 2016-02-10  7:17 UTC (permalink / raw)
  To: linux-btrfs

Hi,

This morning I woke up to a failing disk:

[230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush
503, corrupt 0, gen 0
[230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush
503, corrupt 0, gen 0
[230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
[230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
[230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush
503, corrupt 0, gen 0
[230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush
503, corrupt 0, gen 0
[230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
[230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush
503, corrupt 0, gen 0
[230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush
503, corrupt 0, gen 0
[230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush
503, corrupt 0, gen 0
[230802.000425] nfsd: last server has exited, flushing export cache
[230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush
503, corrupt 0, gen 0
[230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush
503, corrupt 0, gen 0
[230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush
503, corrupt 0, gen 0
[230898.830893] BTRFS info (device sdd): allowing degraded mounts
[230898.830902] BTRFS info (device sdd): disk space caching is enabled

Eventually I remounted it degraded, hoping to prevent any loss of data.
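
For reference, the remount was along these lines (mount point as in the
commands below; the degraded option is the important part):

# mount -o remount,degraded /mnt2/RenesData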

It seems that the btrfs filesystem still hasn't noticed that the disk
has failed:
$ btrfs fi show
Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
        Total devices 5 FS bytes used 5.38TiB
        devid    1 size 2.73TiB used 1.84TiB path /dev/sdb
        devid    2 size 2.73TiB used 1.84TiB path /dev/sde
        devid    3 size 3.64TiB used 1.84TiB path /dev/sdf
        devid    4 size 2.73TiB used 1.84TiB path /dev/sdd
        devid    5 size 3.64TiB used 1.84TiB path /dev/sdc

I tried deleting the device:
# btrfs device delete /dev/sdc /mnt2/RenesData/
ERROR: error removing device '/dev/sdc': Invalid argument

I have been unlucky and already had a failure last Friday, where a
RAID5 array failed after a disk failure. I rebooted, and the data was
unrecoverable. Fortunately that was only temporary data, so the failure
wasn't a real issue.

Can somebody give me some advice on how to delete the failing disk? I
plan on replacing the disk, but unfortunately the system doesn't
support hotplug, so I will need to shut down to swap the disk without
losing any of the data stored on these devices.

Regards

Rene Castberg

# uname -a
Linux midgard 4.3.3-1.el7.elrepo.x86_64 #1 SMP Tue Dec 15 11:18:19 EST
2015 x86_64 x86_64 x86_64 GNU/Linux
[root@midgard ~]# btrfs --version
btrfs-progs v4.3.1
[root@midgard ~]# btrfs fi df  /mnt2/RenesData/
Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


# btrfs device stats /mnt2/RenesData/
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs    0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs   0
[/dev/sdf].read_io_errs    0
[/dev/sdf].flush_io_errs   0
[/dev/sdf].corruption_errs 0
[/dev/sdf].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sdc].write_io_errs   1583
[/dev/sdc].read_io_errs    45652
[/dev/sdc].flush_io_errs   503
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0


* Re: RAID5 Unable to remove Failing HD
  2016-02-10  7:17 RAID5 Unable to remove Failing HD Rene Castberg
@ 2016-02-10  9:00 ` Anand Jain
       [not found]   ` <CAKUFzr___Mc56XSu2nCuKbt11bAWdOdNo4y1LEZ47E5_TDxFGQ@mail.gmail.com>
  2016-04-18  8:59   ` Lionel Bouton
  0 siblings, 2 replies; 10+ messages in thread
From: Anand Jain @ 2016-02-10  9:00 UTC (permalink / raw)
  To: Rene Castberg, linux-btrfs



Rene,

Thanks for the report. Fixes are in the following patch sets:

  concern1:
  Btrfs to fail/offline a device for write/flush error:
    [PATCH 00/15] btrfs: Hot spare and Auto replace

  concern2:
  User should be able to delete a device when device has failed:
    [PATCH 0/7] Introduce device delete by devid

  If you are able to try out these patches, please let us know.
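
  With the devid set applied, deleting by devid instead of by path
  should then work. As a sketch (devid 5 is /dev/sdc in your fi show
  output; exact syntax per the patch set):

    # btrfs device delete 5 /mnt2/RenesData/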

Thanks, Anand


On 02/10/2016 03:17 PM, Rene Castberg wrote:
> [...]


* Re: RAID5 Unable to remove Failing HD
       [not found]   ` <CAKUFzr___Mc56XSu2nCuKbt11bAWdOdNo4y1LEZ47E5_TDxFGQ@mail.gmail.com>
@ 2016-02-10 16:58     ` Rene Castberg
  2016-02-11  4:52       ` Anand Jain
  0 siblings, 1 reply; 10+ messages in thread
From: Rene Castberg @ 2016-02-10 16:58 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

Anand, thanks for the tip. Which kernels are these meant for? I am not
able to apply them cleanly to the kernels I have tried. Or is there a
kernel with these already incorporated?

I have tried rebooting without the disk attached and am unable to
mount the partition; it complains about a bad tree and a failed chunk
read. So at the moment the disk is still readable, though I am not sure
how long that will last.
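
For reference, the mount attempts were along these lines, naming one of
the surviving devices:

# mount -o degraded /dev/sdb /mnt2/RenesData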

I have posted a copy of my messages log, covering only the last couple of days:
https://www.dropbox.com/s/9f05e1q5w4zkp38/messages_trimmed2?dl=0

If you or anybody else has some tips, I would appreciate it.

Regards

On 10 February 2016 at 17:58, Rene Castberg <rene@castberg.org> wrote:
> [...]


* Re: RAID5 Unable to remove Failing HD
  2016-02-10 16:58     ` Rene Castberg
@ 2016-02-11  4:52       ` Anand Jain
  0 siblings, 0 replies; 10+ messages in thread
From: Anand Jain @ 2016-02-11  4:52 UTC (permalink / raw)
  To: Rene Castberg; +Cc: linux-btrfs




On 02/11/2016 12:58 AM, Rene Castberg wrote:
> Arnand, thanks for the tip. What kernels are these meant for? I am not
> able to apply these cleanly to the kernels i have tried. Or is there a
> kernel with these incorporated?

  Trying again just now, they apply cleanly on v4.4-rc8
  (last commit b82dde0230439215b55e545880e90337ee16f51a).

  You are probably missing some unrelated, independent patches.
  To make things easier, I have attached a tar of the patches
  against 4.4-rc8; they are already on the ML, individually and as
  sets where there are dependencies. Please apply them in the
  same order as the directory names.
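
  For example, assuming the tar unpacks into numbered directories
  (names here are placeholders; use the ones from the tarball):

    $ cd linux                                   # your v4.4-rc8 tree
    $ for d in ~/2to5/*/; do git am "$d"*.patch; done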

> I have tried rebooting without the disk attached and am unable to
> mount the partition. Complaining about bad tree and
> failed to read chunk. So at the moment the disk is still readable,
> though not sure how long that will last.

   Please physically remove the disk (/dev/sdc). As you are already
   mounting with -o degraded, please continue to do so.

   You can then delete the missing device.
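
   That is, with your mount point, along the lines of:

     # btrfs device delete missing /mnt2/RenesData/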

Thanks, Anand


[-- Attachment #2: 2to5.tar.gz --]
[-- Type: application/gzip, Size: 22904 bytes --]


* Re: RAID5 Unable to remove Failing HD
  2016-02-10  9:00 ` Anand Jain
       [not found]   ` <CAKUFzr___Mc56XSu2nCuKbt11bAWdOdNo4y1LEZ47E5_TDxFGQ@mail.gmail.com>
@ 2016-04-18  8:59   ` Lionel Bouton
  2016-04-18 14:11     ` Lionel Bouton
  2016-04-19  7:35     ` Duncan
  1 sibling, 2 replies; 10+ messages in thread
From: Lionel Bouton @ 2016-04-18  8:59 UTC (permalink / raw)
  To: Anand Jain, Rene Castberg, linux-btrfs

Hi,

On 10/02/2016 10:00, Anand Jain wrote:
>
>
> Rene,
>
> Thanks for the report. Fixes are in the following patch sets
>
>  concern1:
>  Btrfs to fail/offline a device for write/flush error:
>    [PATCH 00/15] btrfs: Hot spare and Auto replace
>
>  concern2:
>  User should be able to delete a device when device has failed:
>    [PATCH 0/7] Introduce device delete by devid
>
>  If you are able to try out these patches, please let us know.

I just found this thread after digging into a problem similar to mine.

I just got the same error when trying to delete a failed hard drive on a
RAID1 filesystem with a total of 4 devices.

# btrfs device delete 3 /mnt/store/
ERROR: device delete by id failed: Inappropriate ioctl for device

Were the patch sets above for btrfs-progs or for the kernel?
Currently the kernel is 4.1.15-r1 from Gentoo. I used btrfs-progs-4.3.1
(the Gentoo stable version) but it didn't support delete by devid, so I
upgraded to btrfs-progs-4.5.1, which supports it, but got the same
"inappropriate ioctl for device" error when I used the devid.

I don't have any drive available right now for replacing this one (so
no btrfs dev replace is possible right now). The filesystem's data
could fit on only 2 of the 4 drives (in fact I just added 2 old drives
that were previously used with md and rebalanced, which is most
probably what triggered one of the new drives' failure). So I can't use
replace and would prefer not to lose redundancy while waiting for new
drives to arrive.

So the obvious thing to do in this circumstance is to delete the drive,
forcing the filesystem to create the missing replicas in the process and
only reboot if needed (no hotplug). Unfortunately I'm not sure of the
conditions where this is possible (which kernel version supports this,
if any?). If there is a minimum kernel version where device delete
works, can https://btrfs.wiki.kernel.org/index.php/Gotchas be updated?
I don't have a wiki account yet, but I'm willing to do it myself if I
can get reliable information.

I can reboot this system and I expect the current drive to appear
missing (it doesn't even respond to smartctl), and I suppose "device
delete missing" will work then. But should I/must I upgrade the kernel
to avoid this problem in the future, and if yes, which version(s)
support(s) failed-device delete?
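
In other words the fallback, as I understand it, would be roughly:

# mount -o degraded /dev/sdX /mnt/store
# btrfs device delete missing /mnt/store

with /dev/sdX being any surviving member.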

Best regards,

Lionel


* Re: RAID5 Unable to remove Failing HD
  2016-04-18  8:59   ` Lionel Bouton
@ 2016-04-18 14:11     ` Lionel Bouton
  2016-04-19  7:35     ` Duncan
  1 sibling, 0 replies; 10+ messages in thread
From: Lionel Bouton @ 2016-04-18 14:11 UTC (permalink / raw)
  To: Anand Jain, Rene Castberg, linux-btrfs

On 18/04/2016 10:59, Lionel Bouton wrote:
> [...]
> So the obvious thing to do in this circumstance is to delete the drive,
> forcing the filesystem to create the missing replicas in the process and
> only reboot if needed (no hotplug). Unfortunately I'm not sure of the
> conditions where this is possible (which kernel version supports this,
> if any?). If there is a minimum kernel version where device delete
> works, can https://btrfs.wiki.kernel.org/index.php/Gotchas be updated?
> I don't have a wiki account yet, but I'm willing to do it myself if I
> can get reliable information.

Note that whatever the best course of action is, I think the wiki
should probably be updated with clear instructions on it. I'm willing
to document this myself, and probably other gotchas too (like how to
fix a 4-device RAID10 filesystem when one of them fails, based on the
recent discussion I've seen here), but I'm not sure I know all the
details and wouldn't want to put incomplete information in the wiki, so
I'll wait for answers before starting to work on this.

The data on this filesystem isn't critical and I have backups of the
most important files, so I can live with a "degraded" state for a while
until I'm sure of the best way to proceed.

Best regards,

Lionel Bouton


* Re: RAID5 Unable to remove Failing HD
  2016-04-18  8:59   ` Lionel Bouton
  2016-04-18 14:11     ` Lionel Bouton
@ 2016-04-19  7:35     ` Duncan
  2016-04-19  9:13       ` Anand Jain
  1 sibling, 1 reply; 10+ messages in thread
From: Duncan @ 2016-04-19  7:35 UTC (permalink / raw)
  To: linux-btrfs

Lionel Bouton posted on Mon, 18 Apr 2016 10:59:35 +0200 as excerpted:

> Hi,
> 
> On 10/02/2016 10:00, Anand Jain wrote:
>>
>> Thanks for the report. Fixes are in the following patch sets
>>
>>  concern1:
>>  Btrfs to fail/offline a device for write/flush error:
>>    [PATCH 00/15] btrfs: Hot spare and Auto replace
>>
>>  concern2:
>>  User should be able to delete a device when device has failed:
>>    [PATCH 0/7] Introduce device delete by devid
>>
>>  If you are able to try out these patches, please let us know.
> 
> I just found this thread after digging into a problem similar to mine.
> 
> I just got the same error when trying to delete a failed hard drive on a
> RAID1 filesystem with a total of 4 devices.
> 
> # btrfs device delete 3 /mnt/store/
> ERROR: device delete by id failed: Inappropriate ioctl for device
> 
> Were the patch sets above for btrfs-progs or for the kernel?

Looks like you're primarily interested in the concern2 patches, device 
delete by devid.

A quick search of the list back-history reveals that an updated patch
set, now 00/15 (look for [PATCH 00/15] Device delete by id), was posted
by dsterba on the 15th of February. It contained the kernel patches and
was slated for the kernel 4.6 dev cycle. However, the patch set was
pulled at that time due to test failures, though those were suspected
to actually come from something else.

I haven't updated to kernel 4.6 git yet (though I'm on 4.5 and
generally run git post-rc4 or so; rc4 was just released, so I'll
probably update shortly), so I can't check whether it ultimately made
it in or not. But if it's not in 4.6, it certainly won't be in anything
earlier, as stable patches must be in the devel mainline first.

So I'd say check 4.6 devel, and if it's not there, as appears likely,
you'll have to grab the patches off the list and apply them yourself.
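
One way to check, for anyone with a kernel git clone handy, is
something like:

$ git log --oneline v4.5..v4.6-rc4 -- fs/btrfs/ | grep -i delete

to see whether the delete-by-devid commits landed in the 4.6 cycle.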

> Currently the kernel is 4.1.15-r1 from Gentoo. I used btrfs-progs-4.3.1
> (the Gentoo stable version) but it didn't support delete by devid, so I
> upgraded to btrfs-progs-4.5.1, which supports it, but got the same
> "inappropriate ioctl for device" error when I used the devid.

FWIW, I'm a Gentooer also, but on ~amd64, not stable, and as I said I
run current stable and later devel kernels. I also often update the
(often unfortunately lagging, even on ~arch) btrfs-progs ebuild to the
latest version as announced here, and normally run that.

And FWIW I run btrfs raid1 mode also, but on only two ssds, which
simplifies things since btrfs raid1 is only 2-way mirroring anyway. I
also partition up the ssds and run multiple independent btrfs, the
largest only 24 GiB usable (24 GiB partitions on two devices, raid1),
so my data eggs aren't all in one btrfs basket and it's easier to
recover from just one btrfs failing. As an example, my / is only 8 GiB
and contains everything installed by portage except a few bits of /var
which need to be writable at runtime, because I keep my / mounted
read-only by default, only mounting it writable to update. An 8 GiB
root is easy to duplicate elsewhere for backup; indeed, my first backup
is another set of 8 GiB partitions in btrfs raid1 on the same ssds, and
the second backup is an 8 GiB reiserfs partition on spinning rust, with
all three bootable from grub (installed separately to each of the three
physical devices, each of which has its own /boot, with the one that's
booted selected from grub), should it be needed.

> I don't have any drive available right now for replacing this one (so
> no btrfs dev replace is possible right now). The filesystem's data
> could fit on only 2 of the 4 drives (in fact I just added 2 old drives
> that were previously used with md and rebalanced, which is most
> probably what triggered one of the new drives' failure). So I can't use
> replace and would prefer not to lose redundancy while waiting for new
> drives to arrive.

I did have to use btrfs replace for one of the ssds, but as it happens
I had a spare, as the old netbook I intended to put it in died before I
got it installed. And the failing ssd wasn't entirely dead, just
needing more and more frequent scrubs as sectors failed, so the replace
(replaces, actually, as I have multiple btrfs on the pair of ssds) went
quite well... and fast, on the ssds. =:^)
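
For reference, each of those replaces was just the stock command, along
the lines of:

# btrfs replace start /dev/old-part /dev/new-part /mnt
# btrfs replace status /mnt

with the real partitions filled in.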

> So the obvious thing to do in this circumstance is to delete the drive,
> forcing the filesystem to create the missing replicas in the process and
> only reboot if needed (no hotplug). Unfortunately I'm not sure of the
> conditions where this is possible (which kernel version supports this,
> if any?). If there is a minimum kernel version where device delete
> works, can https://btrfs.wiki.kernel.org/index.php/Gotchas be updated?
> I don't have a wiki account yet, but I'm willing to do it myself if I
> can get reliable information.

As I said, it'd be 4.6 if it's even there.  Otherwise you'll have to 
apply the patches yourself.

> I can reboot this system and I expect the current drive to appear
> missing (it doesn't even respond to smartctl), and I suppose "device
> delete missing" will work then. But should I/must I upgrade the kernel
> to avoid this problem in the future, and if yes, which version(s)
> support(s) failed-device delete?

It's good to see (I think it was in your followup) that you have the
critical stuff backed up already, and that you're not too worried about
losing what isn't backed up. Despite btrfs not being entirely stable
yet, it's surprising how many cases we see where that's not so.

So kudos for being a wise sysadmin and appreciating that data that's
not backed up is data that, by your actions, you're defining as not
worth the trouble of a backup. Far too many people appreciate that only
after reality takes them up on that definition and they actually lose
what wasn't backed up (or at least potentially lose it; btrfs restore
can sometimes get them out of the hole they dug themselves into). =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID5 Unable to remove Failing HD
  2016-04-19  7:35     ` Duncan
@ 2016-04-19  9:13       ` Anand Jain
  2016-04-19  9:45         ` Duncan
  2016-04-19 10:49         ` Lionel Bouton
  0 siblings, 2 replies; 10+ messages in thread
From: Anand Jain @ 2016-04-19  9:13 UTC (permalink / raw)
  To: Duncan, linux-btrfs


>> # btrfs device delete 3 /mnt/store/
>> ERROR: device delete by id failed: Inappropriate ioctl for device
>>
>> Were the patch sets above for btrfs-progs or for the kernel?
>
> Looks like you're primarily interested in the concern2 patches, device
> delete by devid.
>
> A quick search of the list back-history reveals that an updated patch
> set, now 00/15 (look for [PATCH 00/15] Device delete by id), was posted
> by dsterba on the 15th of February. It contained the kernel patches and
> was slated for the kernel 4.6 dev cycle. However, the patch set was
> pulled at that time due to test failures, though those were suspected
> to actually come from something else.

  Thanks Duncan. Yep, the reported issue did not point to any
  of the patches in that set. But I am keeping a tab open for
  new issues/test cases; anything that is found will help.

  By the way, for Lionel's issue, delete missing should work, right?
  That does not need any additional patch.

Thanks, Anand



* Re: RAID5 Unable to remove Failing HD
  2016-04-19  9:13       ` Anand Jain
@ 2016-04-19  9:45         ` Duncan
  2016-04-19 10:49         ` Lionel Bouton
  1 sibling, 0 replies; 10+ messages in thread
From: Duncan @ 2016-04-19  9:45 UTC (permalink / raw)
  To: linux-btrfs

Anand Jain posted on Tue, 19 Apr 2016 17:13:04 +0800 as excerpted:

>>> # btrfs device delete 3 /mnt/store/
>>> ERROR: device delete by id failed: Inappropriate ioctl for device
>>>
>>> Were the patch sets above for btrfs-progs or for the kernel?
>>
>> Looks like you're primarily interested in the concern2 patches, device
>> delete by devid.
>>
>> A quick search of the list back-history reveals that an updated patch
>> set, now 00/15 (look for [PATCH 00/15] Device delete by id), was
>> posted by dsterba on the 15th of February. It contained the kernel
>> patches and was slated for the kernel 4.6 dev cycle. However, the
>> patch set was pulled at that time due to test failures, though those
>> were suspected to actually come from something else.
> 
>   Thanks Duncan. Yep, the reported issue did not point to any of the
>   patches in that set. But I am keeping a tab open for new issues/test
>   cases; anything that is found will help.
> 
>   By the way, for Lionel's issue, delete missing should work, right?
>   That does not need any additional patch.

Were the issues with btrfs delete missing fixed? There were some
issues a kernel cycle or two ago with it literally trying to delete a
device named "missing" or something like that, and of course not
finding one to delete, and AFAIK delete by ID was originally proposed
as a fix for that. If I was reading the comments correctly, though, the
problem there was introduced with the switch to libblockdev or some
such, so he might be able to get around it by using older releases, as
long as the filesystem isn't using newer features that would block
that.

So delete missing may or may not work now; I've lost track. But he was
reluctant to unmount and reboot, and of course with btrfs not yet
offlining failed devices, it doesn't know the device is actually
missing yet. So even if delete missing does work for him, it's not
going to work until a reboot and a degraded remount, and he was hoping
to avoid that with the delete by ID.




-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID5 Unable to remove Failing HD
  2016-04-19  9:13       ` Anand Jain
  2016-04-19  9:45         ` Duncan
@ 2016-04-19 10:49         ` Lionel Bouton
  1 sibling, 0 replies; 10+ messages in thread
From: Lionel Bouton @ 2016-04-19 10:49 UTC (permalink / raw)
  To: Anand Jain, Duncan, linux-btrfs

Hi,

On 19/04/2016 11:13, Anand Jain wrote:
>
>>> # btrfs device delete 3 /mnt/store/
>>> ERROR: device delete by id failed: Inappropriate ioctl for device
>>>
>>> Were the patch sets above for btrfs-progs or for the kernel?
>> [...]
>
>  By the way, for Lionel's issue, delete missing should work, right?
>  That does not need any additional patch.

Delete missing works with 4.1.15 and btrfs-progs 4.5.1 (see below),
but the device can't be marked missing online, so there's no way to
maintain redundancy without downtime. I was a little surprised: I
half-expected something like this because, reading this list, RAID
recovery still seems to be a pain point. But this isn't documented
anywhere, and after looking around, the relevant information seems to
be only in this thread (and many people come from md and don't read
this list, so they won't expect this behavior at all).

While I was waiting for directions, the system crashed with a kernel
panic (clearly linked to IO errors according to the panic message, but
I couldn't capture the whole stacktrace), and it couldn't boot properly
(it panicked shortly after mounting the filesystem on each boot) until
I removed the faulty drive (apparently the drive was somehow readable
enough to be recognized, but not enough to be usable). After removing
the faulty drive, delete missing worked and a balance is currently
running. (By the way, it seems the drive bay was faulty: the drive was
not firmly fixed, and its cage could move around a bit in the chassis;
it was the only one like that. I didn't expect this, and from
experience it's probably a factor in the hardware failure.)
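
For anyone hitting this later, the sequence that finally worked after
physically removing the drive was essentially a degraded mount followed
by the delete:

# mount -o degraded /dev/sdX /mnt/store
# btrfs device delete missing /mnt/store

with /dev/sdX being any surviving member.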

There may have been fixes since 4.1.15 to prevent the kernel panic
(there was only one device with IO errors, so ideally it shouldn't have
been able to bring down the kernel), so it may not be worth further
analysis. That said, I'll have 2 new drives next week (one replacement,
one spare), and I have a chassis lying around where I could try to
replicate failures with various kernels on a RAID1 filesystem built
with a brand new drive and the faulty drive (until the faulty drive
completely dies, which they usually do in my experience). So if someone
wants some tests done with 4.6-rcX, or even 4.6-rcX plus patches, I can
spend some time on it next week.

Lionel

