On Wed, 1 Oct 2014 20:00:45 +0400 Andrey Kuzmin wrote:

> On Wed, Oct 1, 2014 at 6:56 AM, NeilBrown wrote:
> > On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen
> > wrote:
> >
> >>
> >> Martin,
> >>
> >> thanks for the good explanation of the state of the discard union.
> >> Do you have an ETA for the 'zeroout, deallocate' ... support you
> >> mentioned?
> >>
> >> I was planning to have a followup patch for dm-raid supporting a
> >> dm-raid table line argument to prohibit discard passdown.
> >>
> >> In lieu of the fuzzy field situation wrt SSD fw and
> >> discard_zeroes_data support related to RAID4/5/6, we need that in
> >> upstream together with the initial patch.
> >>
> >> That 'no_discard_passdown' table line can be added to dm-raid
> >> RAID4/5/6 table lines to avoid possible data corruption but can be
> >> avoided on RAID1/10 table lines, because the latter are not suffering
> >> from any discard_zeroes_data flaw.
> >>
> >>
> >> Neil,
> >>
> >> are you going to disable discards in RAID4/5/6 shortly
> >> or rather go with your bitmap solution?
> >
> > Can I just close my eyes and hope it goes away?
> >
> > The idea of a bitmap of uninitialised areas is not a short-term solution.
> > But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> > would mean that people with good sensible hardware wouldn't be able to use
> > it properly.
> >
> > I would really rather that discard_zeroes_data were only set on devices where
> > it was actually true. Then it wouldn't be my problem any more.
> >
> > Maybe I could do a loud warning
> >   "Not enabling DISCARD on RAID5 because we cannot trust committees.
> >    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
> >    sectors as zeros"
> >
> > and add an appropriate module parameter......
> >
> > While we are on the topic, maybe I should write down my thoughts about the
> > bitmap thing in case someone wants to contribute.
> >
> > There are 3 states that a 'region' can be in:
> >   1- known to be in-sync
> >   2- possibly not in sync, but it should be
> >   3- probably not in sync, contains no valuable data.
> >
> > A read from '3' should return zeroes.
> > A write to '3' should change the region to be '2'. It could either
> >   write zeros before allowing the write to start, or it could just start
> >   a normal resync.
> >
> > Here is a question: if a region has been discarded, are we guaranteed that
> > reads are at least stable. i.e. if I read twice will I definitely get the
> > same value?
>
> Not sure with other specs, but an NVMe-compliant SSD that supports
> discard (Dataset Management command with Deallocate attribute, in NVMe
> parlance) is, per spec, required to be deterministic when deallocated
> range is subsequently read. That's what the spec (1.1) says:
>
> The value read from a deallocated LBA shall be deterministic;
> specifically, the value returned by subsequent reads of that LBA shall
> be the same until a write occurs to that LBA. The values read from a
> deallocated LBA and its metadata (excluding protection information)
> shall be all zeros, all ones, or the last data written to the
> associated LBA and its metadata. The values read from an unwritten or
> deallocated LBA’s protection information field shall be all ones
> (indicating the protection information shall not be checked).
>

That's good to know - thanks.

NeilBrown
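
For anyone who wants to experiment with the bitmap idea, the three region states quoted above could be modelled roughly as follows. This is only a toy sketch in Python, not md or dm-raid code: the names RegionState, RegionMap, resync_done and discard are made up for illustration, the in-memory `data` list merely stands in for the array contents, and treating a discard as a transition to state 3 is an assumption drawn from the discussion rather than something the thread spells out.

```python
from enum import Enum


class RegionState(Enum):
    IN_SYNC = 1        # 1- known to be in-sync
    NEEDS_SYNC = 2     # 2- possibly not in sync, but it should be
    UNINITIALISED = 3  # 3- probably not in sync, contains no valuable data


class RegionMap:
    """Hypothetical per-region state table (illustration only)."""

    def __init__(self, nr_regions, region_size):
        self.region_size = region_size
        # Assumption: a fresh array starts with every region in state 3.
        self.state = [RegionState.UNINITIALISED] * nr_regions
        # Stand-in for whatever the member devices actually hold.
        self.data = [bytes(region_size) for _ in range(nr_regions)]

    def read(self, region):
        # A read from state 3 returns zeroes, whatever the devices contain.
        if self.state[region] is RegionState.UNINITIALISED:
            return bytes(self.region_size)
        return self.data[region]

    def write(self, region, payload):
        # Simplification: writes always cover a whole region.
        if len(payload) != self.region_size:
            raise ValueError("payload must cover the whole region")
        # A write to state 3 changes the region to state 2. Zero-filling or
        # starting a resync before the write proceeds is not modelled here.
        if self.state[region] is RegionState.UNINITIALISED:
            self.state[region] = RegionState.NEEDS_SYNC
        self.data[region] = payload

    def resync_done(self, region):
        # Once resync has covered the region it is known to be in-sync.
        if self.state[region] is RegionState.NEEDS_SYNC:
            self.state[region] = RegionState.IN_SYNC

    def discard(self, region):
        # Assumption: discarding a region puts it back in state 3.
        self.state[region] = RegionState.UNINITIALISED
```

The point of routing reads through the state table is that a region in state 3 reads back as zeroes without relying on discard_zeroes_data being truthful on the underlying devices.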