Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays

From: NeilBrown <neilb@suse.com>
To: "Guilherme G. Piccoli" <gpiccoli@canonical.com>,
	linux-raid@vger.kernel.org, shli@kernel.org
Cc: kernel@gpiccoli.net, jay.vosburgh@canonical.com,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Date: Fri, 03 Aug 2018 07:37:57 +1000
Message-ID: <87h8kcphze.fsf@notabene.neil.brown.name>
In-Reply-To: <7cfa220e-c04d-b69e-fe39-5b9277f0538d@canonical.com>

On Thu, Aug 02 2018, Guilherme G. Piccoli wrote:

> On 01/08/2018 22:51, NeilBrown wrote:
>>> [...] 
>> If you have a hard drive and some sectors or tracks stop working, I
>> think you would still expect IO to the other sectors or tracks to keep
>> working.
>> 
>> For this reason, the behaviour of md/raid0 is to continue to serve IO to
>> working devices, and only fail IO to failed/missing devices.
>> 
>
> Hi Neil, thanks for your quick response. I agree with you about the
> potential sector failure, it shouldn't automatically fail the entire
> array for a single failed write.
>
> The check I'm using in the patch is against device request queue - if a
> raid0 member queue is dying/dead, then we consider the device as dead,
> and as a consequence, the array is marked dead.
>
> In my understanding of raid0/striping, data is split among N devices,
> called raid members. If one member has failed, for sure the data
> written to the array will be corrupted, since that "portion" of data
> going to the failed device won't be stored.
>
> Regarding the current behavior, one test I made was to remove 1 device
> of a 2-disk raid0 array and after that, write a file. Write completed
> normally (no errors from the userspace perspective), and I hashed the
> file using md5. I then rebooted the machine, raid0 was back with the 2
> devices, and guess what?
> The written file was there, but corrupted (with a different hash). I
> don't think this is acceptable: a user could have written important
> data and not realized it was getting corrupted while writing.

In your test, did you "fsync" the file after writing to it?  That is
essential for data security.
If fsync succeeded even though the data wasn't written, that is
certainly a bug.  If it doesn't succeed, then you know there is a
problem with your data.
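
To make the check concrete, here is a minimal userspace sketch of the
write-then-fsync pattern I mean.  The helper name write_and_sync() is
made up for illustration; only open/write/fsync/close are real calls.
It reports success only if fsync() itself succeeded.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from any library): write len bytes to path
 * and return 0 only once fsync() has reported success.  Returns -1 with
 * errno set on any failure. */
int write_and_sync(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int saved;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	/* write() may be short or interrupted, so loop until done. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}

	/* A successful write() only means the page cache accepted the
	 * data.  An I/O error during writeback is reported by fsync(). */
	if (fsync(fd) < 0)
		goto fail;

	return close(fd);

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}
```

If this returns -1 on the degraded array, the application has been told
about the problem.  If it returns 0 and the file is still corrupt after
the reboot, that is the bug worth chasing.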

>
>
>> It seems reasonable that you might want a different behaviour, but I
>> think that should be optional.  i.e. you would need to explicitly set a
>> "one-out-all-out" flag on the array.  I'm not sure if this should cause
>> reads to fail, but it seems quite reasonable that it would cause all
>> writes to fail.
>> 
>> I would only change the kernel to recognise the flag and refuse any
>> writes after any error has been seen.
>> I would use udev/mdadm to detect a device removal and to mark the
>> relevant component device as missing.
>>
>
> Using the udev/mdadm to notice a member has failed and the array must be
> stopped might work, it was my first approach. The main issue here is
> timing: it takes "some time" until userspace is aware of the failure, so
> we have a window in which writes were sent between
>
> (A) the array member failed/got removed and
> (B) mdadm notices and instructs the driver to refuse new writes;

I don't think the delay is relevant.
If writes are happening, then the kernel will get write errors from the
failed devices and can flag the array as faulty.
If writes aren't happening, then there is no important cost in the
"device is removed" message going up to user-space and back.
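
As a rough illustration of that write path, here is a userspace toy
model.  It is not md code, and every name in it (toy_array, toy_write,
one_out_all_out, ...) is invented for the example.  It assumes a simple
raid0 layout with equal-sized members.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MEMBERS 8

/* Toy userspace model, not md code; all names are hypothetical. */
struct toy_array {
	size_t nr_members;
	size_t chunk_sectors;                /* chunk size in sectors */
	bool member_failed[TOY_MAX_MEMBERS];
	bool one_out_all_out;                /* the optional flag */
	bool faulty;                         /* set by the first write error */
};

/* Which member holds a given sector in a simple striped layout. */
size_t toy_member_for(const struct toy_array *a, size_t sector)
{
	return (sector / a->chunk_sectors) % a->nr_members;
}

/* Returns 0 on success, -EIO on failure. */
int toy_write(struct toy_array *a, size_t sector)
{
	/* With the flag set, refuse every write once an error was seen. */
	if (a->one_out_all_out && a->faulty)
		return -EIO;

	/* The write error itself is what flags the array; no round trip
	 * through user-space is needed while writes are happening. */
	if (a->member_failed[toy_member_for(a, sector)]) {
		a->faulty = true;
		return -EIO;
	}
	return 0;
}
```

Without the flag, writes to working members keep succeeding, which is
the current behaviour.  With the flag, the first failed write closes the
array to all further writes.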

NeilBrown

>
> between (A) and (B), those writes are seen as completed, since they are
> indeed complete (at least, they are fine from the page cache point of
> view). Then, writeback will try to write those, which will cause
> problems or they will complete in a corrupted form (the file will
> be present in the array's filesystem after array is restored, but
> corrupted).
>
> So, the in-kernel mechanism avoided most of the window (A)-(B),
> although it seems we still have some problems when nesting arrays,
> due to this same window, even with the in-kernel mechanism (given the
> fact it takes some time to remove the top array when a pretty "far"
> bottom-member has failed).
>
> More suggestions on how to deal with this in a definitive manner are
> highly appreciated.
> Thanks,
>
>
> Guilherme
