* Issues while doing btrfs delete missing in raid6
@ 2017-11-20  6:43 Jérôme Carretero
  2017-11-20  6:54 ` Jérôme Carretero
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Jérôme Carretero @ 2017-11-20  6:43 UTC (permalink / raw)
  To: linux-btrfs

Hi,


While doing a test (to evaluate drives), where I'm filling a bunch of
drives in RAID6, one of the disks failed in the process.
(System with v4.14 / ECC).
I remounted the array in degraded, launched a "btrfs delete missing"
as I have no replacement device.

The command (takes ages and) fails with:
 ERROR: error removing device 'missing': Input/output error

and klog says:

 [631517.263313] BTRFS info (device dm-18): relocating block group 1411883335680 flags data|raid6
 [631547.556527] btrfs_print_data_csum_error: 151 callbacks suppressed
 [631547.556530] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559653376 csum 0x2e827bb4 expected csum 0xda9c34d6 mirror 2
 [631547.562727] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559657472 csum 0x6722cd32 expected csum 0x3ca2ce6f mirror 2
 [631547.562730] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559661568 csum 0x90368636 expected csum 0xf55a0410 mirror 2
 [631547.562732] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559665664 csum 0x3e38aeb2 expected csum 0x6c80a970 mirror 2
 [631547.562746] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559669760 csum 0x77d73f2d expected csum 0xe62cfbe8 mirror 2
 [631547.562747] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559673856 csum 0xb03d1632 expected csum 0xe9a3f0e6 mirror 2
 [631547.562756] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559677952 csum 0xeea04377 expected csum 0x8819aaf7 mirror 2
 [631547.562758] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559682048 csum 0xe46ab546 expected csum 0xacc16686 mirror 2
 [631547.562775] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559690240 csum 0x956a74d7 expected csum 0x99e29858 mirror 2
 [631547.562788] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559686144 csum 0xb09a35ae expected csum 0x5f61fa99 mirror 2

Since this is RAID6, I wasn't expecting to not be able to recover
from a checksum issue, also it's not very practical to bail out on the first
error of this kind during a delete... the offending blocks could be
left as is.

I then try to work around the issue by removing the offending file
(yes it's a test, but filling the drives takes a lot of time),
finding it with "btrfs inspect-internal inode-resolve 1177", and somehow:
 ERROR: ino paths ioctl: No such file or directory


Regards,

-- 
Jérôme


* Re: Issues while doing btrfs delete missing in raid6
  2017-11-20  6:43 Issues while doing btrfs delete missing in raid6 Jérôme Carretero
@ 2017-11-20  6:54 ` Jérôme Carretero
  2017-11-20  7:13 ` Qu Wenruo
  2017-11-20 21:57 ` Duncan
  2 siblings, 0 replies; 5+ messages in thread
From: Jérôme Carretero @ 2017-11-20  6:54 UTC (permalink / raw)
  Cc: linux-btrfs

On Mon, 20 Nov 2017 01:43:44 -0500
Jérôme Carretero <cJ-ko@zougloub.eu> wrote:

> Hi,
> 
> 
> While doing a test (to evaluate drives), where I'm filling a bunch of
> drives in RAID6, one of the disks failed in the process.
> (System with v4.14 / ECC).
> I remounted the array in degraded, launched a "btrfs delete missing"
> as I have no replacement device.
> 
> The command (takes ages and) fails with:
>  ERROR: error removing device 'missing': Input/output error

> Since this is RAID6, I wasn't expecting to not be able to recover
> from a checksum issue, also it's not very practical to bail out on
> the first error of this kind during a delete... the offending blocks
> could be left as is.

While doing a "tar c /mnt/test | pv >/dev/null" I see the csum errors,
but they get corrected on read.
I guess I'll try a scrub and see. But there's probably a bug if
delete/replace/balance can't do the same.


Regards,

-- 
Jérôme



* Re: Issues while doing btrfs delete missing in raid6
  2017-11-20  6:43 Issues while doing btrfs delete missing in raid6 Jérôme Carretero
  2017-11-20  6:54 ` Jérôme Carretero
@ 2017-11-20  7:13 ` Qu Wenruo
  2017-11-20 21:57 ` Duncan
  2 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2017-11-20  7:13 UTC (permalink / raw)
  To: linux-btrfs





On 2017-11-20 14:43, Jérôme Carretero wrote:
> Hi,
> 
> 
> While doing a test (to evaluate drives), where I'm filling a bunch of
> drives in RAID6, one of the disks failed in the process.
> (System with v4.14 / ECC).
> I remounted the array in degraded, launched a "btrfs delete missing"
> as I have no replacement device.
> 
> The command (takes ages and) fails with:
>  ERROR: error removing device 'missing': Input/output error
> 
> and klog says:
> 
>  [631517.263313] BTRFS info (device dm-18): relocating block group 1411883335680 flags data|raid6
>  [631547.556527] btrfs_print_data_csum_error: 151 callbacks suppressed
>  [631547.556530] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559653376 csum 0x2e827bb4 expected csum 0xda9c34d6 mirror 2

Root -9 means it's the data reloc tree, so its ino number is not a real
inode number.

To delete the file, you need to convert the offset into a bytenr, then
find the owner.
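
A minimal sketch of the "offset into bytenr" step, under an assumption
that is not stated in this thread: that the file offset reported for the
data reloc inode is the extent's logical address minus the start of the
block group being relocated. The function name is hypothetical; the
numbers come from the log lines quoted above.

```python
def reloc_offset_to_logical(block_group_start: int, reloc_file_offset: int) -> int:
    """Map a data reloc tree file offset to a logical address (bytenr).

    ASSUMPTION: the reloc inode's file offsets are relative to the start
    of the block group being relocated. Verify this against the kernel
    source for your version before relying on it.
    """
    return block_group_start + reloc_file_offset


# Values from the log: "relocating block group 1411883335680" and the
# first "csum failed ... off 3559653376" line.
block_group_start = 1411883335680
first_bad_offset = 3559653376

logical = reloc_offset_to_logical(block_group_start, first_bad_offset)
print(logical)
```

If the assumption holds, the resulting logical address could then be
passed to "btrfs inspect-internal logical-resolve <logical> <mountpoint>"
to look for the owning file.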

>  [631547.562727] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559657472 csum 0x6722cd32 expected csum 0x3ca2ce6f mirror 2
>  [631547.562730] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559661568 csum 0x90368636 expected csum 0xf55a0410 mirror 2
>  [631547.562732] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559665664 csum 0x3e38aeb2 expected csum 0x6c80a970 mirror 2
>  [631547.562746] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559669760 csum 0x77d73f2d expected csum 0xe62cfbe8 mirror 2
>  [631547.562747] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559673856 csum 0xb03d1632 expected csum 0xe9a3f0e6 mirror 2
>  [631547.562756] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559677952 csum 0xeea04377 expected csum 0x8819aaf7 mirror 2
>  [631547.562758] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559682048 csum 0xe46ab546 expected csum 0xacc16686 mirror 2
>  [631547.562775] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559690240 csum 0x956a74d7 expected csum 0x99e29858 mirror 2
>  [631547.562788] BTRFS warning (device dm-18): csum failed root -9 ino 1177 off 3559686144 csum 0xb09a35ae expected csum 0x5f61fa99 mirror 2
> 
> Since this is RAID6, I wasn't expecting to not be able to recover
> from a checksum issue,

Currently btrfs RAID6 can't ensure that recovered data matches its csum.

That is to say, if there is some other error, such as real data corruption
on another disk, RAID6 could in theory still recover the data, but in
practice it may use the corrupted disk for the rebuild, resulting in a bad
checksum.
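
A toy illustration of that failure mode, not btrfs code: it uses a single
XOR parity stripe instead of RAID6's two parities, and zlib.crc32 as a
stand-in for the crc32c that btrfs uses. All names here are hypothetical.
It only shows that rebuilding a missing stripe from a silently corrupted
neighbour yields data whose checksum no longer matches.

```python
import zlib
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data stripes plus one XOR parity stripe (simplified; not RAID6).
d0, d1, d2 = b"A" * 8, b"B" * 8, b"C" * 8
parity = xor_blocks(d0, d1, d2)

# Checksum recorded for d0 when it was written (crc32 as a stand-in).
csum_d0 = zlib.crc32(d0)

# The disk holding d0 is missing, and d1 has been silently corrupted.
d1_corrupt = b"B" * 7 + b"X"

# Rebuilding from intact stripes recovers d0 exactly.
rebuilt_ok = xor_blocks(parity, d1, d2)

# Rebuilding with the corrupted stripe yields wrong data for d0.
rebuilt_bad = xor_blocks(parity, d1_corrupt, d2)

print(rebuilt_ok == d0)                    # rebuild from good stripes
print(zlib.crc32(rebuilt_bad) == csum_d0)  # checksum check on bad rebuild
```

With a second parity, a rebuild could in principle be retried with a
different combination of stripes until the checksum matches; the point
above is that the code at the time did not guarantee this.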

Thanks,
Qu

> also it's not very practical to bail out on the first
> error of this kind during a delete... the offending blocks could be
> left as is.
> 
> I then try to work around the issue by removing the offending file
> (yes it's a test, but filling the drives takes a lot of time),
> finding it with "btrfs inspect-internal inode-resolve 1177", and somehow:
>  ERROR: ino paths ioctl: No such file or directory
> 
> 
> Regards,
> 



* Re: Issues while doing btrfs delete missing in raid6
  2017-11-20  6:43 Issues while doing btrfs delete missing in raid6 Jérôme Carretero
  2017-11-20  6:54 ` Jérôme Carretero
  2017-11-20  7:13 ` Qu Wenruo
@ 2017-11-20 21:57 ` Duncan
  2017-11-21  1:06   ` Jérôme Carretero
  2 siblings, 1 reply; 5+ messages in thread
From: Duncan @ 2017-11-20 21:57 UTC (permalink / raw)
  To: linux-btrfs

Jérôme Carretero posted on Mon, 20 Nov 2017 01:43:44 -0500 as excerpted:

> While doing a test (to evaluate drives), where I'm filling a bunch of
> drives in RAID6, one of the disks failed in the process.
> (System with v4.14 / ECC).

FWIW, see raid56 status in the status page (table and below raid56 note).

https://btrfs.wiki.kernel.org/index.php/Status

Basically, after the fixes in 4.12, it mostly works as long as things 
don't go too badly wrong, but due to the write hole and corner cases such 
as the checksum repair failure you ran into, it's not something people on 
this list can in good conscience recommend for general use, because it 
simply lacks the reliability people tend to want raid56 for, at least in 
combination with the file-integrity/checksumming features btrfs may be 
chosen for.  The two together simply aren't as reliable as the separate 
features might imply they should be, and there are known to be better 
alternatives.

Unfortunately that's likely to remain the case for awhile due to the 
complexity of a real fix, despite the 4.12 fixes to the worst of the 
problems.

One reasonably performant and reliable alternative, tho it's more 
directly an alternative to btrfs raid10, where it's better performing due 
to btrfs raid10 not yet being performance optimized, is btrfs raid1 on 
top of two raid0s (mdraid0, for instance).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Issues while doing btrfs delete missing in raid6
  2017-11-20 21:57 ` Duncan
@ 2017-11-21  1:06   ` Jérôme Carretero
  0 siblings, 0 replies; 5+ messages in thread
From: Jérôme Carretero @ 2017-11-21  1:06 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

Hi Duncan,

On Mon, 20 Nov 2017 21:57:47 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Jérôme Carretero posted on Mon, 20 Nov 2017 01:43:44 -0500 as
> excerpted:
> 
> > While doing a test (to evaluate drives), where I'm filling a bunch
> > of drives in RAID6, one of the disks failed in the process.
> > (System with v4.14 / ECC).  
> 
> FWIW, see raid56 status in the status page (table and below raid56
> note).
> 
> https://btrfs.wiki.kernel.org/index.php/Status
> 
> Basically, after the fixes in 4.12, it mostly works as long as things 
> don't go too badly wrong, but due to the write hole and corner cases
> such as the checksum repair failure you ran into, it's not something
> people on this list can in good conscience recommend for general use,
> because it simply lacks the reliability people tend to want raid56
> for, at least in combination with the file-integrity/checksumming
> features btrfs may be chosen for.  The two together simply aren't as
> reliable as the separate features might imply they should be, and
> there are known to be better alternatives.
> 
> Unfortunately that's likely to remain the case for awhile due to the 
> complexity of a real fix, despite the 4.12 fixes to the worst of the 
> problems.
> 
> One reasonably performant and reliable alternative, tho it's more 
> directly an alternative to btrfs raid10, where it's better performing
> due to btrfs raid10 not yet being performance optimized, is btrfs
> raid1 on top of two raid0s (mdraid0, for instance).

I normally use btrfs RAID1, but wanted to see what's new with RAID6
while "priming" some new disks. There was some click-bait on Phoronix,
the wiki page (status or https://btrfs.wiki.kernel.org/index.php/RAID56)
looked quite vague and not very up-to-date, and there weren't that many
horror stories on LKML...

Anyway, this was a 200-hour experiment, and apart from the failure, the
speed was low, really far from RAID1 (and there was plenty of CPU left
to compute parity), and "delete missing" was unexpectedly slow, running
at perhaps 10 MB/s average.

TL;DR: As of v4.14 RAID6 is as reliable as RAID0, but slower =)


-- 
Jérôme
