* Failure propagation of concatenated raids ?
@ 2016-06-14 21:43 Nicolas Noble
  2016-06-14 22:41 ` Andreas Klauer
  2016-06-15  1:37 ` John Stoffel
  0 siblings, 2 replies; 12+ messages in thread
From: Nicolas Noble @ 2016-06-14 21:43 UTC (permalink / raw)
  To: linux-raid

Hello,

  I have a somewhat convoluted question, which may take me some lines
to explain, but the TL;DR version of it is somewhat along the lines of
"How can I safely concatenate two raids into a single filesystem, and
avoid drastic corruption when one of the two underlying raids fails
and goes read only ?" I have done extensive research about that, but I
haven't been able to get any answer to it.

  My basic expectation when using raids is that if something goes
wrong, the whole thing goes read-only in order to prevent any further
damage from writing inconsistent or incomplete metadata. At that
point, human intervention can try and recover the raid. If the damage
was caused by a partial power failure, or simply a raid controller
that died, recovery is usually fairly straightforward, with little to
no errors, and filesystem checks only for the last few writes that
failed to get committed properly. This works well with a simple 1:1
path between the raid and the filesystem (or whichever process is
using the md device), but if I try anything outside that simple path,
not everything turns read only: the kernel will happily continue
writing to half of its filesystem, and heavy filesystem corruption may
build up between the time the failure starts and the time human
intervention begins shutting everything down.

  Here's a reproducible scenario that explains what I'm talking about,
using approximately 100MB of disk space.

0. Setting up 8x10MB loopback devices:

# dd if=/dev/zero of=mdadm-tests bs=10240 count=$((10*1024))
10240+0 records in
10240+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.121364 s, 864 MB/s
# for p in `seq 1 8` ; do sgdisk -n $p:+0:+10M mdadm-tests ; done > /dev/null
# kpartx -a -v -s mdadm-tests
add map loop2p1 (254:7): 0 20480 linear 7:2 2048
add map loop2p2 (254:8): 0 20480 linear 7:2 22528
add map loop2p3 (254:9): 0 20480 linear 7:2 43008
add map loop2p4 (254:10): 0 20480 linear 7:2 63488
add map loop2p5 (254:11): 0 20480 linear 7:2 83968
add map loop2p6 (254:12): 0 20480 linear 7:2 104448
add map loop2p7 (254:13): 0 20480 linear 7:2 124928
add map loop2p8 (254:14): 0 20480 linear 7:2 145408



1. The typical, properly working, situation, 1 raid device, 1 process:

First, creating the raid, and "formatting it" (writing zeroes on it):
# mdadm --create test-single --raid-devices=8 --level=5 /dev/mapper/loop2p[12345678]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/test-single started.
# mdadm --detail /dev/md/test-single | grep State.:
         State : clean
# shred -v -n0 -z /dev/md/test-single
shred: /dev/md/test-single: pass 1/1 (000000)...

We can then "read" the device properly, and grab a status:
# md5sum /dev/md/test-single
764ae0318bbdb835b4fa939b70babd4c  /dev/md/test-single

Now we fail the raid device by pulling two drives out of it:
# mdadm /dev/md/test-single --fail /dev/mapper/loop2p[78]

We can see that the raid has successfully been put into failed mode:
# mdadm --detail /dev/md/test-single | grep State.:
         State : clean, FAILED

Now we can try writing to it with random data, but it'll produce a lot
of write errors:
# shred -n1 /dev/md/test-single 2> /dev/null

We stop the raid, examine it, repair it, and re-assemble it:
# mdadm --stop /dev/md/test-single
mdadm: stopped /dev/md/test-single
# mdadm --assemble test-single -f /dev/mapper/loop2p[1234567]
mdadm: forcing event count in /dev/mapper/loop2p7(6) from 18 upto 38
mdadm: clearing FAULTY flag for device 6 in /dev/md/test-single for
/dev/mapper/loop2p7
mdadm: Marking array /dev/md/test-single as 'clean'
mdadm: /dev/md/test-single has been started with 7 drives (out of 8).

And we can start recovering data - nothing changed, as basically
expected in that scenario:
# md5sum /dev/md/test-single
764ae0318bbdb835b4fa939b70babd4c  /dev/md/test-single


Preparing for the next round of commands:

# mdadm --stop /dev/md/test-single
mdadm: stopped /dev/md/test-single
# for p in `seq 1 8` ; do shred -n0 -z -v /dev/mapper/loop2p$p ; done
shred: /dev/mapper/loop2p1: pass 1/1 (000000)...
shred: /dev/mapper/loop2p2: pass 1/1 (000000)...
shred: /dev/mapper/loop2p3: pass 1/1 (000000)...
shred: /dev/mapper/loop2p4: pass 1/1 (000000)...
shred: /dev/mapper/loop2p5: pass 1/1 (000000)...
shred: /dev/mapper/loop2p6: pass 1/1 (000000)...
shred: /dev/mapper/loop2p7: pass 1/1 (000000)...
shred: /dev/mapper/loop2p8: pass 1/1 (000000)...



2. Concatenating two raids, or when things fail hard:

First, let's create two raids, of different sizes:
# mdadm --create test-part1 --raid-devices=3 --level=5 /dev/mapper/loop2p[123]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/test-part1 started.
# mdadm --create test-part2 --raid-devices=5 --level=5 /dev/mapper/loop2p[45678]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/test-part2 started.

Then, let's create a super-raid made of these two. Note that this can
also be done using lvm2, with similar results.
# mdadm --create supertest --level=0 --raid-devices=2 /dev/md/test-part[12]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/supertest started.
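
For reference, the lvm2 flavor of this concatenation would be roughly
the following - volume group and volume names are made up, and the
rest of this log sticks with the raid0 version:

# pvcreate /dev/md/test-part1 /dev/md/test-part2
# vgcreate supervg /dev/md/test-part1 /dev/md/test-part2
# lvcreate -l 100%FREE -n supertest supervg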

As before, we "format" it (write zeroes to it), and we read it:
# shred -n0 -z -v /dev/md/supertest
shred: /dev/md/supertest: pass 1/1 (000000)...
# md5sum /dev/md/supertest
57f366e889970e90c22594d859f7847b  /dev/md/supertest

Now, we're going to fail only the second raid, again by pulling two
drives out of it:
# mdadm /dev/md/test-part2 --fail /dev/mapper/loop2p[78]

And here's really the issue I have: the failure doesn't cascade to the
superset of the two above:
# mdadm --detail /dev/md/test-part1 | grep State.:
         State : clean
# mdadm --detail /dev/md/test-part2 | grep State.:
         State : clean, FAILED
# mdadm --detail /dev/md/supertest | grep State.:
         State : clean

Not that it seems it even could, as manually failing a raid0 member
isn't supposed to work, which kind of bothers me:
# mdadm /dev/md/supertest --fail /dev/md/test-part2
mdadm: set device faulty failed for /dev/md/test-part2:  Device or resource busy

So, when we try writing random data to the raid, a good portion of the
writes are being refused with write errors, but the ones on the first
raid are making it through:
# shred -n1 /dev/md/supertest 2> /dev/null

Now, if we try recovering our cascading raids as before...
# mdadm --stop /dev/md/supertest
mdadm: stopped /dev/md/supertest
# mdadm --stop /dev/md/test-part2
mdadm: stopped /dev/md/test-part2
# mdadm --assemble test-part2 -f /dev/mapper/loop2p[4567]
# mdadm --assemble supertest -f /dev/md/test-part[12]
mdadm: /dev/md/supertest has been started with 2 drives.

... then its content has changed:

# md5sum /dev/md/supertest
78a213cbc76b9c1f78e7f35bc7ae3b73  /dev/md/supertest

And upon inspecting it further (using a simple hexdump -C on the
device), one can see that the whole of the first raid has been filled
with data, while the second one is still fully empty - it's not just a
few lingering writes that were pending before the failures.
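
(For completeness, the whole scratch setup can be torn down afterwards
with something like the following - not shown in the logs above:)

# mdadm --stop /dev/md/supertest
# mdadm --stop /dev/md/test-part1
# mdadm --stop /dev/md/test-part2
# kpartx -d -v mdadm-tests
# rm mdadm-tests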


  As said during this quite long log, concatenating the two raid
devices by putting them into the same volume group using lvm2 instead
of creating a raid0 will yield the same kind of results, albeit a bit
different: with a raid0 on top of the two raid5s, one can see the
damage as stripes of interlaced data, which makes sense given the way
data is organized in a raid0. With the lvm2 concatenation, you would
instead get two big chunks: one with the altered content of the first
raid, and one with the original content of the second, which also
makes sense given the way lvm2 organizes its data. Concatenating raids
using lvm2 seems more natural, but I would expect both mechanisms to
behave the same way.

  Now, the above log is shrunk down drastically, but is inspired by
real events, where a portion of a ~40TB filesystem turned read only
because of a controller failure, and went unnoticed for several hours
before the kernel turned the filesystem readonly after detecting an
inconsistency failure in the filesystem metadata. After rebooting, the
filesystem was so badly corrupted that mounting it in emergency mode
was barely possible, and it took several days of a quite painful
recovery process, involving tape backups. I strongly believe that said
recovery process would've been much faster / easier if the whole of
the filesystem had turned readonly when one of its two portions failed.

  So, after this quite long demonstration, I'll reiterate my question
at the bottom of this e-mail: is there a way to safely concatenate two
software raids into a single filesystem under Linux, so that my basic
expectation of "everything goes suddenly read only in case of failure"
is being met ?

  Thanks


* Re: Failure propagation of concatenated raids ?
  2016-06-14 21:43 Failure propagation of concatenated raids ? Nicolas Noble
@ 2016-06-14 22:41 ` Andreas Klauer
  2016-06-14 23:35   ` Nicolas Noble
  2016-06-15  1:37 ` John Stoffel
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Klauer @ 2016-06-14 22:41 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: linux-raid

On Tue, Jun 14, 2016 at 02:43:27PM -0700, Nicolas Noble wrote:
> "How can I safely concatenate two raids into a single filesystem, and
> avoid drastic corruption when one of the two underlying raids fails
> and goes read only ?"

I think there may be a misunderstanding at this point: the RAID does not 
go read only when a disk fails. It still happily writes to the 
remaining disks, giving you time to add a new disk with no harm done 
to the filesystem.

If too many disks fail it does not go read only either. Once there 
are not enough disks left to run the array, it's gone completely. 
Once there are not enough disks to make the RAID work at all, 
you can neither read nor write.

Going read only is something filesystems might do when they encounter 
I/O errors, which might happen on RAID if you have the bad block list 
enabled, or if your filesystem spans several RAIDs and one of them 
goes away completely.

In this case your situation is no different from using a filesystem 
on a single disk that develops a failure zone... you can only hope 
for the best at this point.

Filesystems go read only as soon as they notice an error (as soon 
as it matters); this should already be nearly optimal. If you want 
to improve on that (hit the brakes as soon as the md layer goes south, 
before the filesystem notices), you might be able to do something with 
a udev rule or by defining a custom PROGRAM in mdadm.conf that does 
some shenanigans on certain failure events...
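
For instance - an untested sketch, the path and event list are just
examples - mdadm.conf could point at a handler:

PROGRAM /usr/local/sbin/md-brake

which mdadm --monitor then invokes on events:

# mdadm --monitor --scan --daemonise

with the handler doing something like:

#!/bin/sh
# /usr/local/sbin/md-brake - hypothetical sketch, untested.
# mdadm --monitor calls it as: <event> <md device> [<component device>]
event="$1"; array="$2"
case "$event" in
    Fail|DegradedArray|DeviceDisappeared)
        logger "md-brake: $event on $array, forcing filesystems read-only"
        # Blunt instrument: emergency remount of everything read-only,
        # the same thing Alt-SysRq-u does. A real script would want to
        # be more selective about which filesystems it touches.
        echo u > /proc/sysrq-trigger
        ;;
esac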

> Now we fail the raid device by pulling two drives out of it:
> # mdadm /dev/md/test-single --fail /dev/mapper/loop2p[78]

It should be gone completely at this point, not just "read-only".

> Now we can try writing to it with random data, but it'll produce a lot
> of write errors:
> # shred -n1 /dev/md/test-single 2> /dev/null

With it gone completely, none of the writes succeed...
 
> And we can start recovering data - nothing changed

> First, let's create two raids, of different sizes:
> Then, let's create a super-raid made of these two.
> # mdadm --create supertest --level=0 --raid-devices=2 /dev/md/test-part[12]

Doing this with RAID0 is of course, super horrible. You lose everything 
even though only one side of your RAID died. Also RAID on RAID can cause 
assembly problems, things have to be done in the right order. Consider LVM.

> Now, we're going to fail only the second raid, again by pulling two
> drives out of it:
> # mdadm /dev/md/test-part2 --fail /dev/mapper/loop2p[78]
> 
> And here's really the issue I have: the failure doesn't cascade to the
> superset of the two above:

It will cascade... as soon as the upper layer tries to write on the 
lower layer which is no longer there. Maybe md could be smarter at 
this point but who will consider obscure md on md cases?

There doesn't seem to be a generic mechanism that informs higher layers of 
failures; each layer has to find out by itself by encountering I/O errors.

> # mdadm /dev/md/supertest --fail /dev/md/test-part2
> mdadm: set device faulty failed for /dev/md/test-part2:  Device or resource busy

Not sure about this one.

> So, when we try writing random data to the raid, a good portion of the
> writes are being refused with write errors, but the ones on the first
> raid are making it through:

If that really writes random data to every other RAID chunk without 
ever failing the missing RAID0 disk... it might be a bug that needs 
looking at.

Until then I'd file it under oddities that are bound to happen when 
using obscure md on md setups. ;) No one does this, so who tests for 
such error cases...?

> ... then its content has changed:
> 
> # md5sum /dev/md/supertest
> 78a213cbc76b9c1f78e7f35bc7ae3b73  /dev/md/supertest

A change is expected if the first write (first chunk) succeeds 
(if you failed the wrong half of the RAID0). If shred managed 
to write more than one chunk then the RAID didn't fail itself, 
that would be unexpected.

If it wasn't shred which just aggressively keeps writing stuff, 
but a filesystem, you might still be fine since the filesystem 
is still nice enough to go read-only in this scenario, as long 
as the RAID0 reports those I/O errors upwards...

> With the lvm2 concatenation, you would instead get two big chunks:
> one with the altered content of the first raid, and one with the
> original content of the second, which also makes sense given the way
> lvm2 organizes its data.

Same here, if it were a filesystem instead of shred, there should 
be less damage; shred writes aggressively, filesystems try to keep 
things intact on their own.

>   Now, the above log is shrunk down drastically, but is inspired by
> real events, where a portion of a ~40TB filesystem turned read only
> because of a controller failure, and went unnoticed for several hours
> before the kernel turned the filesystem readonly after detecting an
> inconsistency failure in the filesystem metadata.

If the filesystem didn't notice the problem for a long time, 
the problem shouldn't have mattered for a long time. Each filesystem 
has huge amounts of data that aren't ever checked as long as you 
don't visit the files stored there. If your hard disk has a cavity 
right in the middle of your aunt's 65th birthday party you won't 
notice until you watch the video which you never do...

That's why regular self-checks are so important, if you don't 
run checks you won't notice errors, and won't replace your broken 
disks until it's too late.

Filesystem turning read only as soon as it notices a problem, 
should still be considered very good. But a read only filesystem 
will always cause a lot of damage. Any outstanding writes are lost, 
anything not saved until this point is lost, any database in the 
middle of a transaction may not be able to cope properly, ...

Going read only does not fix problems, it causes them too.

Even if you write your own event script that hits the brakes on 
a failure even before the filesystem notices the problem, it's 
probably not possible to avoid such damages. It depends a lot on 
what's actually happening in this filesystem.

If you write a PROGRAM to handle such error conditions basically 
what you need to think about is not just `mount remount,ro` but 
more like what `shutdown` does, how to get things to end gracefully 
under the circumstances.
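
Something in that spirit - hypothetical sketch, untested, every name
below is an example:

#!/bin/sh
# Emergency stop for a filesystem spanning several arrays.
MNT=/mnt/bigfs
VG=bigvg

fuser -km "$MNT"                    # kill whatever is still writing there
sync                                # flush what can still be flushed
umount "$MNT" || umount -l "$MNT"   # fall back to a lazy unmount
vgchange -an "$VG"                  # deactivate the LVM layer
mdadm --stop --scan                 # stop every md array no longer in use

Note that killing the writers is exactly the kind of damage mentioned
above; there is no free lunch here.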

> So, after this quite long demonstration, I'll reiterate my question
> at the bottom of this e-mail: is there a way to safely concatenate two
> software raids into a single filesystem under Linux, so that my basic
> expectation of "everything goes suddenly read only in case of failure"
> is being met ?

I would make every effort to prevent such a situation from happening 
in the first place. RAID failure is not a nice situation to be in, 
there is no magical remedy.

My own filesystems also span several RAID arrays; I do not have any 
special measures in place to react on individual RAID failures. 
I do have regular RAID checks, selective SMART self-tests, and I'm 
prepared to replace disks as soon as they step one toe out of line. 

I'm not taking any chances with reallocated/pending/uncorrectable sectors, 
if you keep those disks around IMHO you're gambling.

Since you mentioned RAID controllers, if you have several of them, you 
could use one disk (with RAID-6 maybe 2 disks) per controller for your 
arrays, so a controller failure would not actually kill a RAID. I think 
backblaze did something like this in one of their older storage pods...

Regards
Andreas Klauer


* Re: Failure propagation of concatenated raids ?
  2016-06-14 22:41 ` Andreas Klauer
@ 2016-06-14 23:35   ` Nicolas Noble
  2016-06-15  0:48     ` Andreas Klauer
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolas Noble @ 2016-06-14 23:35 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid

> If too many disks fail it does not go read only either. Once there
> are not enough disks left to run the array, it's gone completely.
> Once there are not enough disks to make the RAID work at all,
> you can neither read nor write.

In my experience, that's not the case:

Create a raid:
# mdadm --create test-single --raid-devices=8 --level=5 /dev/mapper/loop2p[12345678]

Fill it with random data:
# shred -v -n1 /dev/md/test-single

Fail it to the point it should be gone:
# mdadm /dev/md/test-single --fail /dev/mapper/loop2p[78]

But I can still read the online stripes, with read errors occurring
when encountering offline stripes:
# hexdump -C /dev/md/test-single |& less
[ works, until it encounters an offline stripe, failing with 'hexdump:
/dev/md/test-single: Input/output error' ]

Any write on any stripe would get refused, which is what I mean by
"read only mode", even with some portions being unreadable. That
behavior has actually been a boon for me in the past, to recover
partial data.

>> Now we fail the raid device by pulling two drives out of it:
>> # mdadm /dev/md/test-single --fail /dev/mapper/loop2p[78]
>
> It should be gone completely at this point, not just "read-only".

No, see above.

> With it gone completely, none of the writes succeed...

Correct, but reads still work for some portions.

> Doing this with RAID0 is of course, super horrible. You lose everything
> even though only one side of your RAID died. Also RAID on RAID can cause
> assembly problems, things have to be done in the right order. Consider LVM.

See below - I was trying to show the behavior using a single tool, but
the same occurs with lvm - albeit with more complicated chains of
command lines.

> It will cascade... as soon as the upper layer tries to write on the
> lower layer which is no longer there. Maybe md could be smarter at
> this point but who will consider obscure md on md cases?
>
> There doesn't seem to be a generic mechanism that informs higher layers of
> failures; each layer has to find out by itself by encountering I/O errors.

No, it really doesn't cascade :-) The writes on the lower layers will
occasionally fail, but the upper layer will happily ignore them and
stay online all day long if necessary. And that really, REALLY is the
whole point of this e-mail.

> If that really writes random data to every other RAID chunk without
> ever failing the missing RAID0 disk... it might be a bug that needs
> looking at.

Yes, it really writes random data to every other raid chunk without
ever failing the missing RAID0 disk. And that would also happen with
LVM: LVM will happily continue to ignore the chunk that failed,
sending it writes that get completely lost into the void, without
ever failing the logical volume.

> A change is expected if the first write (first chunk) succeeds
> (if you failed the wrong half of the RAID0). If shred managed
> to write more than one chunk then the RAID didn't fail itself,
> that would be unexpected.

Shred managed to write to every single chunk in the first raid that
was still online.

> If it wasn't shred which just aggressively keeps writing stuff,
> but a filesystem, you might still be fine since the filesystem
> is still nice enough to go read-only in this scenario, as long
> as the RAID0 reports those I/O errors upwards...

It does report the I/O errors upwards according to dmesg logs, but
that doesn't really prevent anything. The kernel continues writing to
the filesystem as if nothing really special happened.

> If the filesystem didn't notice the problem for a long time,
> the problem shouldn't have mattered for a long time. Each filesystem
> has huge amounts of data that aren't ever checked as long as you
> don't visit the files stored there. If your hard disk has a cavity
> right in the middle of your aunt's 65th birthday party you won't
> notice until you watch the video which you never do...
>
> That's why regular self-checks are so important, if you don't
> run checks you won't notice errors, and won't replace your broken
> disks until it's too late.
>
> Filesystem turning read only as soon as it notices a problem,
> should still be considered very good. But a read only filesystem
> will always cause a lot of damage. Any outstanding writes are lost,
> anything not saved until this point is lost, any database in the
> middle of a transaction may not be able to cope properly, ...
>
> Going read only does not fix problems, it causes them too.
>
> Even if you write your own event script that hits the brakes on
> a failure even before the filesystem notices the problem, it's
> probably not possible to avoid such damages. It depends a lot on
> what's actually happening in this filesystem.
>
> If you write a PROGRAM to handle such error conditions basically
> what you need to think about is not just `mount remount,ro` but
> more like what `shutdown` does, how to get things to end gracefully
> under the circumstances.

So, the case I'm talking about involved a volume storing streaming
videos at ~2MB/s, generating hundreds of thumbnails on the way, within
deep subfolder trees. The filesystem WAS really, really busy, and lots
and lots of damage was caused. Various directories collided with each
other during that time, and several gigabytes ended up in lost+found
after a few days of intense fsck. Tape backups only recovered up to a
certain point in time, and trying to recover what got created after
the last backup was almost a lost cause. That was a filesystem
concatenated using lvm, by the way.

>
>> So, after this quite long demonstration, I'll reiterate my question
>> at the bottom of this e-mail: is there a way to safely concatenate two
>> software raids into a single filesystem under Linux, so that my basic
>> expectation of "everything goes suddenly read only in case of failure"
>> is being met ?
>
> I would make every effort to prevent such a situation from happening
> in the first place. RAID failure is not a nice situation to be in,
> there is no magical remedy.
>
> My own filesystems also span several RAID arrays; I do not have any
> special measures in place to react on individual RAID failures.
> I do have regular RAID checks, selective SMART self-tests, and I'm
> prepared to replace disks as soon as they step one toe out of line.
>
> I'm not taking any chances with reallocated/pending/uncorrectable sectors,
> if you keep those disks around IMHO you're gambling.
>
> Since you mentioned RAID controllers, if you have several of them, you
> could use one disk (with RAID-6 maybe 2 disks) per controller for your
> arrays, so a controller failure would not actually kill a RAID. I think
> backblaze did something like this in one of their older storage pods...

The failed controller was a normal, non-RAID SATA controller. The
disks are used directly by the software raid under Linux. The dmesg
log indicated that the 4 disks plugged into that SATA controller went
offline suddenly, and one of the two RAIDs went into failure, being
"read only" as I described above (that is, read errors on offline
stripes, reads working on online stripes, write failures on
everything), but the above lvm layer still continued being online for
quite some time - about 5 hours with around 10000 files created, and
about 30GB of fresh data being created, a good half of which
eventually ended up in lost+found. Right after the initial controller
failure, the volume reported lots and lots of write failures, but the
kernel continued happily nonetheless, until it realized there was a
big inconsistency in the filesystem, and decided to shut the
filesystem down. Remounting after bringing the failed controller and
disks back online was next to impossible. In fact, I had to upgrade to
experimental e2fs tools in order to be able to do anything with it.


* Re: Failure propagation of concatenated raids ?
  2016-06-14 23:35   ` Nicolas Noble
@ 2016-06-15  0:48     ` Andreas Klauer
  2016-06-15  9:11       ` Nicolas Noble
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Klauer @ 2016-06-15  0:48 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: linux-raid

On Tue, Jun 14, 2016 at 04:35:13PM -0700, Nicolas Noble wrote:
> But I can still read the online stripes, with read errors occurring
> when encountering offline stripes:
> # hexdump -C /dev/md/test-single |& less
> [ works, until it encounters an offline stripe, failing with 'hexdump:
> /dev/md/test-single: Input/output error' ]

Wow.

> No, it really doesn't cascade :-)

I stand corrected on both counts...

> the above lvm layer still continued being online for
> quite some time - about 5 hours with around 10000 files created, and
> about 30GB of fresh data being created

Why didn't the filesystem go into read-only, that's what baffles me the most.

I just tried it with this setup:

| Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
| md42 : active raid5 loop0p1[0] loop0p9[10] loop0p8[8] loop0p7[7] loop0p6[6] loop0p5[5] loop0p4[4] loop0p3[3] loop0p2[2] loop0p10[1]
|       82944 blocks super 1.2 level 5, 512k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
|       
| md43 : active raid5 loop1p1[0] loop1p9[10](F) loop1p8[8](F) loop1p7[7] loop1p6[6] loop1p5[5] loop1p4[4] loop1p3[3] loop1p2[2] loop1p10[1]
|       82944 blocks super 1.2 level 5, 512k chunk, algorithm 2 [10/8] [UUUUUUUU__]
|       
| md44 : active raid0 md42[0] md43[1]
|       163840 blocks super 1.2 512k chunks

mkfs works fine:
      
# mkfs.ext4 /dev/md44
Creating filesystem with 163840 1k blocks and 40960 inodes
Filesystem UUID: 9cc7f9db-f8d8-4155-b728-61d4998e03ec
Superblock backups stored on blocks: 
	8193, 24577, 40961, 57345, 73729

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done 

It even mounts fine:

# mount /dev/md44 loop/
# mount
/dev/md44 on /dev/shm/loop type ext4 (rw,relatime,stripe=512,data=ordered)

Creating files?

# yes | split --bytes=1M 
split: xfq: No space left on device

This thing really doesn't go read-only. Wow.

It believes it has written all these, but once you drop caches, 
you get I/O errors when trying to read those files back.

[59381.302517] EXT4-fs (md44): mounted filesystem with ordered data mode. Opts: (null)
[59559.959199] EXT4-fs warning (device md44): ext4_end_bio:315: I/O error -5 writing to inode 12 (offset 0 size 1048576 starting block 6144)
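
(To reproduce that, something along these lines works, assuming the x*
files from split ended up under loop/:)

# sync ; echo 3 > /proc/sys/vm/drop_caches
# md5sum loop/x* > /dev/null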

Not what I expected... hope you can find a solution.

Regards
Andreas Klauer


* Re: Failure propagation of concatenated raids ?
  2016-06-14 21:43 Failure propagation of concatenated raids ? Nicolas Noble
  2016-06-14 22:41 ` Andreas Klauer
@ 2016-06-15  1:37 ` John Stoffel
  2016-06-15  9:18   ` Nicolas Noble
  1 sibling, 1 reply; 12+ messages in thread
From: John Stoffel @ 2016-06-15  1:37 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: linux-raid


Nicolas>   I have a somewhat convoluted question, which may take me
Nicolas> some lines to explain, but the TL;DR version of it is
Nicolas> somewhat along the lines of "How can I safely concatenate two
Nicolas> raids into a single filesystem, and avoid drastic corruption
Nicolas> when one of the two underlying raids fails and goes read only
Nicolas> ?" I have done extensive research about that, but I haven't
Nicolas> been able to get any answer to it.

You basically can't.  When half your filesystem goes away, or randomly
goes readonly, you're screwed.  

If you're trying to take smaller chunks and make it so that data can
span and balance across them without your having to do it, then it
*might* make sense to look at ceph or some other distributed
filesystem.  Having the data spread across multiple backends, with
redundancy, is possible, but once too much of the underlying device(s)
goes away, no filesystem I know handles that without either going
readonly, or totally locking up.



* Re: Failure propagation of concatenated raids ?
  2016-06-15  0:48     ` Andreas Klauer
@ 2016-06-15  9:11       ` Nicolas Noble
  0 siblings, 0 replies; 12+ messages in thread
From: Nicolas Noble @ 2016-06-15  9:11 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid

> Why didn't the filesystem go into read-only, that's what baffles me the most.
>
> This thing really doesn't go read-only. Wow.

Yes. And if you push further, by creating lots of nested
subdirectories prior to the failure and then continuing to do so after
the failure, just like in the previous experience I am talking about,
chances are that you might also corrupt the filesystem to the same
level - so that it's unmountable without emergency-mode repairs. The
"I don't care, I continue to write and ignore failures" behavior is
really worrisome, and I don't really know at which level this needs
fixing.

>
> It believes it has written all these, but once you drop caches,
> you get I/O errors when trying to read those files back.
>
> [59381.302517] EXT4-fs (md44): mounted filesystem with ordered data mode. Opts: (null)
> [59559.959199] EXT4-fs warning (device md44): ext4_end_bio:315: I/O error -5 writing to inode 12 (offset 0 size 1048576 starting block 6144)

Yes, this is similar behavior, and similar dmesg messages were seen
during the outage. The last messages before the filesystem decided to
lock itself up after 5 hours were about various inconsistency checks
that failed.

> Not what I expected... hope you can find a solution.

Precisely the reason I decided to write to the linux-raid kernel
mailing list :-) Either these should be considered bugs, or maybe
there's a hidden solution somewhere down the line that I haven't been
able to discover. To be fair, something I've left out of the various
experiments is that btrfs seems to be doing the right thing. It'll
attempt all of the writes for the new files, directories, or file
changes, but when it realizes it can't actually write to the disk's
metadata structures, it'll roll the whole transaction back - which is
somewhat good. With an intense subdirectory-creating script, I have
been able to reproduce the same kind of corruption on ext4 and xfs,
but btrfs seems to hold itself up. In only one instance did I manage
to get a btrfs to become fully unmountable and unrecoverable, but I
was really pushing the envelope in terms of corruption. I can try to
clean up and share these testing scripts if it's helpful to the
community, but I want to do more testing on btrfs myself in order to
see if that'd be a better filesystem in that situation.

Caching really is an issue, and I need to do more testing on that.
Basically, it seems possible to create folder A before the failure,
then folder B inside folder A after the failure, in a way that the
transaction needs to be rolled back on disk - but the kernel cache
still sees B inside of A. Then, while B is still in the cache, create
C inside it - which ends up in a writable online stripe. The net
result is that you have a "healthy" folder, but completely dangling -
hence the tens of thousands of lost+found items after the recovery.
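
In shell terms, the sequence I am describing is roughly this -
/mnt/test being a hypothetical mount point of the concatenated volume,
and whether each mkdir really lands on a failed or still-writable
stripe depends on where the filesystem places its metadata:

# mkdir /mnt/test/A                  <- before the failure, committed to disk
# mdadm /dev/md/test-part2 --fail /dev/mapper/loop2p[78]
# mkdir /mnt/test/A/B                <- rolled back on disk, but B stays
                                        visible in the kernel cache
# mkdir /mnt/test/A/B/C              <- may land on a still-writable stripe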

Either way, I feel this is a worrisome situation. In the general case
(one raid, one filesystem), an actual RAID failure amounts to little
damage, and recovery is usually very possible, granted that some of
the disks can be brought back online somehow. But in that kind of more
specific case (multiple raids grouped into an lvm2 volume), filesystem
metadata corruption seems to be much, much more likely, and prone to
complete data loss. And the responses so far on this thread aren't
making me feel warm and fuzzy inside - it's really not good if you
can't write anything in the "how do we prevent this from happening
again in the future ?" section of your post mortem. What worries me
even more is that the model I was describing (multiple raids, one
giant lvm2 volume) seems to be getting adopted more and more. For
instance, as far as I understand, this is exactly what Synology does
in their hybrid raid solution, meaning it may potentially be prone to
the same kind of failures and corruption:
https://www.synology.com/en-us/knowledgebase/DSM/tutorial/Storage/What_is_Synology_Hybrid_RAID_SHR
- I am not going to purchase one of their appliances to test my
theory, but this is a level of corruption that I haven't found
documented anywhere before, which puzzles me to great lengths.


* Re: Failure propagation of concatenated raids ?
  2016-06-15  1:37 ` John Stoffel
@ 2016-06-15  9:18   ` Nicolas Noble
  2016-06-15  9:29     ` Benjamin ESTRABAUD
  2016-06-15 14:56     ` John Stoffel
  0 siblings, 2 replies; 12+ messages in thread
From: Nicolas Noble @ 2016-06-15  9:18 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid

> it
> *might* make sense to look at ceph or some other distributed
> filesystem.

I was trying to avoid that, mainly because that doesn't seem to be as
supported as a more straightforward raids+lvm2 scenario. But I might
be willing to reconsider my position in light of such data losses.

> no filesystem I know handles that without either going
> readonly, or totally locking up.

Which, to be fair, is exactly what I'm looking for. I'd rather see the
filesystem lock itself up, until a human tries to restore the failed
raid back online. But my recent experience and experiments show me
that the filesystems actually don't lock themselves up, and don't go
read only for quite some time, and heavy heavy data corruption will
then happen. I'd be much more happy if the behavior was that the
filesystem locks itself up instead of self destroying over time.


* Re: Failure propagation of concatenated raids ?
  2016-06-15  9:18   ` Nicolas Noble
@ 2016-06-15  9:29     ` Benjamin ESTRABAUD
  2016-06-15  9:49       ` Nicolas Noble
  2016-06-15 14:56     ` John Stoffel
  1 sibling, 1 reply; 12+ messages in thread
From: Benjamin ESTRABAUD @ 2016-06-15  9:29 UTC (permalink / raw)
  To: Nicolas Noble, John Stoffel; +Cc: linux-raid

On 15/06/16 10:18, Nicolas Noble wrote:
>> it
>> *might* make sense to look at ceph or some other distributed
>> filesystem.
>
> I was trying to avoid that, mainly because that doesn't seem to be as
> supported as a more straightforward raids+lvm2 scenario. But I might
> be willing to reconsider my position in light of such data losses.
>
>> no filesystem I know handles that without either going
>> readonly, or totally locking up.
>
> Which, to be fair, is exactly what I'm looking for. I'd rather see the
> filesystem lock itself up, until a human tries to restore the failed
> raid back online. But my recent experience and experiments show me
> that the filesystems actually don't lock themselves up, and don't go
> read only for quite some time, and heavy heavy data corruption will
> then happen. I'd be much more happy if the behavior was that the
> filesystem locks itself up instead of self destroying over time.
Hi Nicolas,

I have limited experience in that domain, but I've usually observed 
that if the filesystem (say xfs) is unable to read or write its 
superblock, it immediately goes into read only mode. MD will remain 
online and provide "best service" whenever possible, but as you pointed 
out this can be risky if you still think your RAID offers parity 
protection while degraded. I think in your case you're better off 
stopping an array that has lost more drives than its parity can cover, 
either using a udev rule or using mdadm --monitor.

Regards,
Ben.




* Re: Failure propagation of concatenated raids ?
  2016-06-15  9:29     ` Benjamin ESTRABAUD
@ 2016-06-15  9:49       ` Nicolas Noble
  2016-06-15 14:45         ` Benjamin ESTRABAUD
  2016-06-15 14:59         ` John Stoffel
  0 siblings, 2 replies; 12+ messages in thread
From: Nicolas Noble @ 2016-06-15  9:49 UTC (permalink / raw)
  To: Benjamin ESTRABAUD; +Cc: John Stoffel, linux-raid

> I think in your case you're better off stopping an array that has
> lost more drives than its parity can cover, either using a udev rule
> or using mdadm --monitor.

I actually have been unsuccessful in these attempts so far. What
happens is that you very quickly get processes that get indefinitely
stuck (indefinitely as in 'waiting on a very very long kernel
timeout') trying to write something, so that the ext4fs layer becomes
unresponsive on these threads, or take a very long time. Killing the
processes takes a very long time because they are stuck in a kernel
operation. And if potentially more processes can spawn back up, the
automated script starts an interesting game of whack-a-mole in order
to unmount the filesystem.

And you can't stop the underlying arrays without first stopping the
whole chain (umount, stop the lvm volume, etc...), otherwise you
simply get "device is busy" errors, hence the whack-a-mole process
killing. The only working method I've managed to successfully
implement is to programatically loop over the list of all the drives
involved in the filesystem, on all the raids involved, and flag all of
them as failed drives. This way, you get to really put "emergency
brakes" on. I find that to be a very, very scary method however.


* Re: Failure propagation of concatenated raids ?
  2016-06-15  9:49       ` Nicolas Noble
@ 2016-06-15 14:45         ` Benjamin ESTRABAUD
  2016-06-15 14:59         ` John Stoffel
  1 sibling, 0 replies; 12+ messages in thread
From: Benjamin ESTRABAUD @ 2016-06-15 14:45 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: John Stoffel, linux-raid

On 15/06/16 10:49, Nicolas Noble wrote:
>> I think in your case you're better off stopping an array that has
>> lost more drives than its parity can cover, either using a udev rule
>> or using mdadm --monitor.
>
> I actually have been unsuccessful in these attempts so far. What
> happens is that you very quickly get processes that get indefinitely
> stuck (indefinitely as in 'waiting on a very very long kernel
> timeout') trying to write something, so that the ext4fs layer becomes
> unresponsive on these threads, or take a very long time. Killing the
> processes takes a very long time because they are stuck in a kernel
> operation. And if potentially more processes can spawn back up, the
> automated script starts an interesting game of whack-a-mole in order
> to unmount the filesystem.
>
> And you can't stop the underlying arrays without first stopping the
> whole chain (umount, stop the lvm volume, etc...), otherwise you
> simply get "device is busy" errors, hence the whack-a-mole process
> killing. The only working method I've managed to successfully
> implement is to programatically loop over the list of all the drives
> involved in the filesystem, on all the raids involved, and flag all of
> them as failed drives. This way, you get to really put "emergency
> brakes" on. I find that to be a very, very scary method however.
>
I understand your concern, but I remember a thread where it was 
explained that a RAID0 or linear array basically behaves like a hard 
drive would: since there is no parity and the data is distributed, if 
say half of the devices of the RAID0 are unavailable, the LBAs on the 
other half of that RAID will work fine, as if you had an SSD with half 
of its cells broken. So your issue seems to be more related to dealing 
with I/O errors than anything else. I would imagine that if the 
filesystem's superblock were to become unreadable/unwritable (if it 
were on the missing RAID) then the filesystem would "fail" (be 
remounted as readonly). Other than that there's not much to be done 
apart from instructing your program to stop the I/Os and/or fiddling 
with timeouts to speed up the failure of the process.

The "emergency brake" as you put it would work similarly to a RAID5 
losing more drives than it can tolerate: the array will error every 
write sent to it. Alternatively you could disconnect the drives from 
Linux using the "delete" sysfs property. If you use a journalled 
filesystem you shouldn't lose any data over any of this anyway, so 
that seems safe.
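
(For reference, the sysfs route is, for a SCSI/SATA disk sdX -
placeholder name:)

# echo 1 > /sys/block/sdX/device/delete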

HTH,

Regards,
Ben.


* Re: Failure propagation of concatenated raids ?
  2016-06-15  9:18   ` Nicolas Noble
  2016-06-15  9:29     ` Benjamin ESTRABAUD
@ 2016-06-15 14:56     ` John Stoffel
  1 sibling, 0 replies; 12+ messages in thread
From: John Stoffel @ 2016-06-15 14:56 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: John Stoffel, linux-raid

>>>>> "Nicolas" == Nicolas Noble <nicolas@nobis-crew.org> writes:

>> it
>> *might* make sense to look at ceph or some other distributed
>> filesystem.

Nicolas> I was trying to avoid that, mainly because that doesn't seem
Nicolas> to be as supported as a more straightforward raids+lvm2
Nicolas> scenario. But I might be willing to reconsider my position in
Nicolas> light of such data losses.

If you are building multiple RAID sets, and then striping across them
using LVM and then putting filesystems on top of them, you should be
ok if your underlying RAID is robust.

By that I mean splitting members across controllers, so as to avoid
single points of failure.  You would also use RAID6 with hot spares as
well.  Once you have a robust foundation, then the filesystem layered
on top doesn't have to worry as much about part of the storage going
away.

But if you're not willing, or can't afford the cost of true no single
point of failure, then you have to take your chances.  This is why I
tend to mirror my system at home and even do triple mirrors at points
for data I really care about.

>> no filesystem I know handles that without either going
>> readonly, or totally locking up.

Nicolas> Which, to be fair, is exactly what I'm looking for. I'd
Nicolas> rather see the filesystem lock itself up, until a human tries
Nicolas> to restore the failed raid back online. But my recent
Nicolas> experience and experiments show me that the filesystems
Nicolas> actually don't lock themselves up, and don't go read only for
Nicolas> quite some time, and heavy heavy data corruption will then
Nicolas> happen. I'd be much more happy if the behavior was that the
Nicolas> filesystem locks itself up instead of self destroying over
Nicolas> time.

Part of the problem is that if the filesystem isn't writing to that
section of the device, it might not know about the failure in time,
especially if they're separate devices.  Now I would think that LVM
would notice that a PV in a VG has gone away, but that then needs to
percolate up to the LV(s) on that PV, which then need to notify the
filesystem.

I agree it should work, and should be more robust, and it might
actually be possible to tweak the system to be more hair-trigger about
going into lockdown mode.
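
One knob that already exists - assuming ext4, and the device name
below is just an example - is the filesystem's error behaviour, though
it only kicks in once ext4 itself flags an inconsistency, which in
your case took hours:

# tune2fs -e remount-ro /dev/bigvg/biglv          # go read-only on detected errors
# mount -o errors=panic /dev/bigvg/biglv /mnt/big   # or panic outright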


Of course the other option is for you to shard your data across
multiple filesystems, and put the resiliency into your application, so
that if some of the data can't be found, it just keeps going.  But
that's a different sort of complexity as well.

John


* Re: Failure propagation of concatenated raids ?
  2016-06-15  9:49       ` Nicolas Noble
  2016-06-15 14:45         ` Benjamin ESTRABAUD
@ 2016-06-15 14:59         ` John Stoffel
  1 sibling, 0 replies; 12+ messages in thread
From: John Stoffel @ 2016-06-15 14:59 UTC (permalink / raw)
  To: Nicolas Noble; +Cc: Benjamin ESTRABAUD, John Stoffel, linux-raid

>>>>> "Nicolas" == Nicolas Noble <nicolas@nobis-crew.org> writes:

>> I think in your case you're better off stopping an array that has
>> lost more drives than its parity can cover, either using a udev rule
>> or using mdadm --monitor.

Nicolas> I actually have been unsuccessful in these attempts so far. What
Nicolas> happens is that you very quickly get processes that get indefinitely
Nicolas> stuck (indefinitely as in 'waiting on a very very long kernel
Nicolas> timeout') trying to write something, so that the ext4fs layer becomes
Nicolas> unresponsive on these threads, or take a very long time. Killing the
Nicolas> processes takes a very long time because they are stuck in a kernel
Nicolas> operation. And if potentially more processes can spawn back up, the
Nicolas> automated script starts an interesting game of whack-a-mole in order
Nicolas> to unmount the filesystem.

Nicolas> And you can't stop the underlying arrays without first
Nicolas> stopping the whole chain (umount, stop the lvm volume,
Nicolas> etc...), otherwise you simply get "device is busy" errors,
Nicolas> hence the whack-a-mole process killing. The only working
Nicolas> method I've managed to successfully implement is to
Nicolas> programatically loop over the list of all the drives involved
Nicolas> in the filesystem, on all the raids involved, and flag all of
Nicolas> them as failed drives. This way, you get to really put
Nicolas> "emergency brakes" on. I find that to be a very, very scary
Nicolas> method however.

I think this is the wrong idea.  You do want MD to re-try errors on
underlying devices, because some drives will return an error, and if
MD has long enough timeouts, it can recover and try to re-write the
bad sector(s) on the drive, which early on will let the bad block be
mapped out and a new block put in place.

But you're looking for a solution for when one device in a striped
RAID0 goes away, and what happens to the filesystem then.  And in that
case you're shit out of luck.  No filesystem is designed to cope with
that type of failure.

So there might be ext4 or xfs or jfs options which will help you in
this case, but it's not a simple thing to program around.  Especially
once the size of the volume gets really big.

John
