On Wed, Aug 01 2018, Guilherme G. Piccoli wrote:

> Currently the md driver completely relies in the userspace to stop an
> array in case of some failure. There's an interesting case for raid0: if
> we remove a raid0 member, like PCI hot(un)plugging an NVMe device, and
> the raid0 array is _mounted_, mdadm cannot stop the array, since the tool
> tries to open the block device (to perform the ioctl) with O_EXCL flag.
>
> So, in this case the array is still alive - users may write to this
> "broken-yet-alive" array and unless they check the kernel log or some
> other monitor tool, everything will seem fine and the writes are completed
> with no errors. Being more precise, direct writes will not work, but since
> usually writes are done in a regular form, i.e., backed by the page
> cache, the most common scenario is an user being able to regularly write
> to a broken raid0, and get all their data corrupted.
>
> PROPOSAL:
> The idea proposed here to fix this behavior is mimic other block devices:
> if one have a filesystem mounted in a block device on top of an NVMe or
> SCSI disk and the disk gets removed, writes are prevented, errors are
> observed and it's obvious something is wrong. Same goes for USB sticks,
> which are sometimes even removed physically from the machine without
> getting their filesystem unmounted before.
>
> We believe right now the md driver is not behaving properly for raid0
> arrays (it is handling these errors for other levels though). The approach
> took for raid-0 is basically an emergency removal procedure, in which I/O
> is blocked from the device, the regular clean-up happens and the associate
> disk is deleted. It went to extensive testing, as detailed below.
>
> Not all are roses, we have some caveats that need to be resolved.
> Feedback is _much appreciated_.

If you have a hard drive and some sectors or tracks stop working, I think
you would still expect IO to the other sectors or tracks to keep working.
For this reason, the behaviour of md/raid0 is to continue to serve IO to
working devices, and only fail IO to failed/missing devices.

It seems reasonable that you might want a different behaviour, but I
think that should be optional, i.e. you would need to explicitly set a
"one-out-all-out" flag on the array. I'm not sure if this should cause
reads to fail, but it seems quite reasonable that it would cause all
writes to fail.

I would only change the kernel to recognise the flag and refuse any
writes after any error has been seen.
I would use udev/mdadm to detect a device removal and to mark the
relevant component device as missing.

NeilBrown
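
The two behaviours discussed above can be compared with a rough user-space
model. This is a sketch in Python, not kernel code, and nothing in it is md's
real API: the class name Raid0Model, the one_out_all_out field, the
mark_missing() helper and the round-robin chunk mapping are all made up for
illustration. It also assumes that marking a member missing counts as "an
error has been seen", and it leaves reads at the default behaviour because
that point is left open above.

```python
from dataclasses import dataclass, field


@dataclass
class Raid0Model:
    """Toy model of a striped array (hypothetical, not md's data structures)."""
    n_devices: int
    chunk_sectors: int = 8
    one_out_all_out: bool = False            # hypothetical opt-in flag
    missing: set = field(default_factory=set)
    error_seen: bool = False

    def device_for(self, sector: int) -> int:
        # Simplified round-robin striping over equally sized members.
        return (sector // self.chunk_sectors) % self.n_devices

    def mark_missing(self, dev: int) -> None:
        # Stands in for udev/mdadm noticing a removal and marking the member.
        # Assumption: this also counts as an error having been seen.
        self.missing.add(dev)
        self.error_seen = True

    def write(self, sector: int) -> bool:
        # With the flag set, refuse every write once any error has been seen.
        if self.one_out_all_out and self.error_seen:
            return False
        # Default behaviour: only IO that maps to a missing member fails.
        return self.device_for(sector) not in self.missing

    def read(self, sector: int) -> bool:
        # Reads keep the default behaviour in this sketch.
        return self.device_for(sector) not in self.missing


if __name__ == "__main__":
    for flag in (False, True):
        array = Raid0Model(n_devices=2, one_out_all_out=flag)
        array.mark_missing(1)
        print(f"one_out_all_out={flag}:",
              "write to member 0 ok =", array.write(0),
              "| write to member 1 ok =", array.write(8))
```

In this model the default array still accepts writes that land on the
surviving member, while the flagged array rejects all writes after the first
removal, which is the distinction the flag is meant to make explicit.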