* [PATCH] dm-raid: add RAID discard support
@ 2014-09-23 16:51 heinzm
  2014-09-23 21:52 ` Brassow Jonathan
  2014-09-23 23:07 ` Martin K. Petersen
  0 siblings, 2 replies; 28+ messages in thread
From: heinzm @ 2014-09-23 16:51 UTC (permalink / raw)
  To: dm-devel; +Cc: Heinz Mauelshagen

From: Heinz Mauelshagen <heinzm@redhat.com>

Support discard in all RAID levels 1,10,4,5 and 6.

In case of RAID levels 4,5 and 6 we have to check for the capability
to zero data on discards to avoid stripe data corruption.

If that capability is missing on any disk in a RAID set,
we have to disable discards.
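
For reference, what the block layer currently reports for a given component
device can be checked via sysfs (the device name is just an example):

  cat /sys/block/sda/queue/discard_zeroes_data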


Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>

---
 drivers/md/dm-raid.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 4880b69..4edb2c7 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (C) 2010-2011 Neil Brown
- * Copyright (C) 2010-2011 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2010-2014 Red Hat, Inc. All rights reserved.
  *
  * This file is released under the GPL.
  */
@@ -1150,6 +1150,43 @@ static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
 }
 
 /*
+ * Enable/disable discard support on RAID set depending
+ * on RAID level and discard properties of underlying devices
+ * (i.e. the legs of the set).
+ */
+static void raid_check_discard(struct dm_target *ti, struct raid_set *rs)
+{
+	int i;
+	bool discard_supported = true;
+	/* RAID levels 4,5,6 require discard_zeroes_data for data integrity! */
+	bool raid1_or_10 = rs->md.level == 1 || rs->md.level == 10;
+	bool zeroes_data_mandatory = !raid1_or_10;
+
+	for (i = 0; i < rs->md.raid_disks; i++) {
+		struct request_queue *q = bdev_get_queue(rs->dev[i].rdev.bdev);
+
+		if (!q ||
+		    !blk_queue_discard(q) ||
+		    (zeroes_data_mandatory && !q->limits.discard_zeroes_data)) {
+			discard_supported = false;
+			break;
+		}
+	}
+
+	if (discard_supported) {
+		ti->discards_supported = true;
+		/*
+		 * raid1 and raid10 personalities require bio splitting,
+		 * raid4/5/6 don't and process large discard bios properly.
+		 */
+		ti->split_discard_bios = raid1_or_10;
+		ti->num_discard_bios = 1;
+
+	} else
+		ti->discards_supported = false;
+}
+
+/*
  * Construct a RAID4/5/6 mapping:
  * Args:
  *	<raid_type> <#raid_params> <raid_params>		\
@@ -1231,6 +1268,11 @@ static int raid_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	ti->private = rs;
 	ti->num_flush_bios = 1;
 
+	/*
+	 * Disable/enable discard support on RAID set.
+	 */
+	raid_check_discard(ti, rs);
+
 	mutex_lock(&rs->md.reconfig_mutex);
 	ret = md_run(&rs->md);
 	rs->md.in_sync = 0; /* Assume already marked dirty */
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-23 16:51 [PATCH] dm-raid: add RAID discard support heinzm
@ 2014-09-23 21:52 ` Brassow Jonathan
  2014-09-23 23:07 ` Martin K. Petersen
  1 sibling, 0 replies; 28+ messages in thread
From: Brassow Jonathan @ 2014-09-23 21:52 UTC (permalink / raw)
  To: device-mapper development; +Cc: Heinz Mauelshagen


On Sep 23, 2014, at 11:51 AM, heinzm@redhat.com wrote:

> From: Heinz Mauelshagen <heinzm@redhat.com>
> 
> Support discard in all RAID levels 1,10,4,5 and 6.
> 
> In case of RAID levels 4,5 and 6 we have to check for the capability
> to zero data on discards to avoid stripe data corruption.
> 
> If that capability is missing on any disk in a RAID set,
> we have to disable discards.
> 
> 
> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

 brassow

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-23 16:51 [PATCH] dm-raid: add RAID discard support heinzm
  2014-09-23 21:52 ` Brassow Jonathan
@ 2014-09-23 23:07 ` Martin K. Petersen
  2014-09-23 23:33   ` NeilBrown
  2014-09-24 14:21   ` Christoph Hellwig
  1 sibling, 2 replies; 28+ messages in thread
From: Martin K. Petersen @ 2014-09-23 23:07 UTC (permalink / raw)
  To: device-mapper development; +Cc: Heinz Mauelshagen

>>>>> "Heinz" == heinzm  <heinzm@redhat.com> writes:

Heinz> In case of RAID levels 4,5 and 6 we have to check for the
Heinz> capability to zero data on discards to avoid stripe data
Heinz> corruption.

I'm quite concerned about relying on discard_zeroes_data since it's a
hint and not a hard guarantee.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-23 23:07 ` Martin K. Petersen
@ 2014-09-23 23:33   ` NeilBrown
  2014-09-24  2:20     ` Martin K. Petersen
  2014-09-24 14:21   ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-09-23 23:33 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Heinz Mauelshagen, device-mapper development



On Tue, 23 Sep 2014 19:07:35 -0400 "Martin K. Petersen"
<martin.petersen@oracle.com> wrote:

> >>>>> "Heinz" == heinzm  <heinzm@redhat.com> writes:
> 
> Heinz> In case of RAID levels 4,5 and 6 we have to check for the
> Heinz> capability to zero data on discards to avoid stripe data
> Heinz> corruption.
> 
> I'm quite concerned about relying on discard_zeroes_data since it's a
> hint and not a hard guarantee.
> 

If it is a "hint", then why isn't it "discard_sometimes_zeroes_data"?

md/raid5 already depends on this.  If a read from a discarded region doesn't
reliably return zeros, then raid5 cannot support discard.
Should I turn off discard support in raid5?

Also, could you help me understand when such a hint-but-no-guarantee would be
useful?

Thanks,
NeilBrown




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-23 23:33   ` NeilBrown
@ 2014-09-24  2:20     ` Martin K. Petersen
  2014-09-24  4:05       ` Brassow Jonathan
  0 siblings, 1 reply; 28+ messages in thread
From: Martin K. Petersen @ 2014-09-24  2:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen

>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:

Neil,

Several people have pointed out the issues with the reliability of
discard_zeroes_data in the past. My objection with Heinz' patch is
purely that we are perpetuating an interface that we have come to learn
that we can't entirely trust.

Neil> If it is a "hint", then why isn't it
Neil> "discard_sometimes_zeroes_data"?

Because at the time we put it in it was expected to be a solid
guarantee. However, as it turns out the spec is vaguely worded and only
says that the device must return zeroes if a block has been successfully
discarded. But since discard itself is only a hint the device is free to
silently ignore all or parts of the request. *sigh*

There were several approaches being entertained in both T10 and T13
about how to tighten this up, including a FORCE bit that would cause the
device to return an error if parts of the request were being ignored. I
was hoping that we would be able to use that. However, the consensus at
the T10 meeting this summer was that WRITE SAME is the only reliable way
to ensure that zeroed blocks are returned. And the spec was amended to
clarify that.

Neil> If a read from a discarded region doesn't reliably return zeros,
Neil> then raid5 cannot support discard.  Should I turn off discard
Neil> support in raid5?

I've been on the fence about this for a while.

The good news is that while the spec has been vague and many battles
fought in the standards bodies, most vendors are reasonably sane when it
comes to implementing this. We haven't had any corruption horror stories
from the field that I'm aware of. The first couple of generations of
SSDs were a travesty in this department but current devices behave
reasonably well. Hardware RAID vendors generally solve the RAID5 SSD
problem by explicitly whitelisting drives that are known to do the right
thing. Sadly we don't have a good device list. And for most drives the
discards only get dropped in rare corner cases so it's not trivial to
write a test tool to generate such a list.

I was really hoping we'd get better guarantees from the standards folks
this summer. But as the spec is currently written switching to a WRITE
SAME bias in our discard heuristics may have the opposite effect of what
we desire. That's because the spec mandates that if a device can not
discard the blocks in the request it must write zeroes to them. IOW, we
get the determinism at the expense of potentially ending up writing the
blocks in question, thus increasing write wear. Probably not what we
want for the discard use case.

I am in the process of tweaking the discard code to adhere to the August
27th T10 SBC-4 draft. The net takeaway is that if you depend on reading
zeroes back you should be using blkdev_issue_zeroout() and not
blkdev_issue_discard(). raid5.c effectively needs to translate discard
requests from above to zeroouts below.

I am torn between making blkdev_issue_zeroout() unprovision if possible
or having an explicit allocate flag. Some users only care about getting
zeroes back but others use blkdev_issue_zeroout() to preallocate blocks
(which is the current behavior).

Right now we have:

  discard		Unprovision, may or may not return zeroes
  zeroout		Provision, return zeroes

We'll have to move to a model where we have a third variant:

  discard		Unprovision, may or may not return zeroes
  zeroout, deallocate	Unprovision if possible, return zeroes
  zeroout, allocate     Provision, return zeroes

The zeroout, deallocate variant would be enabled if the device supports
WRITE SAME w/UNMAP and has the LBPRZ flag set (and is not our libata
SATL).
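
As a rough sketch only (the flag below does not exist today; its name is made
up purely for illustration), a caller that depends on reading zeroes back
would then do something like:

  /* unprovision if possible, but zeroes must be readable afterwards (sketch) */
  err = blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_NOFS,
                             BLKDEV_ZERO_DEALLOCATE);

  /* plain discard keeps its "may or may not zero" semantics */
  err = blkdev_issue_discard(bdev, sector, nr_sects, GFP_NOFS, 0);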

Discard is the bane of my existence :(

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24  2:20     ` Martin K. Petersen
@ 2014-09-24  4:05       ` Brassow Jonathan
  2014-09-24  4:21         ` NeilBrown
  0 siblings, 1 reply; 28+ messages in thread
From: Brassow Jonathan @ 2014-09-24  4:05 UTC (permalink / raw)
  To: device-mapper development
  Cc: Heinz Mauelshagen, Shaohua Li, Martin K. Petersen


On Sep 23, 2014, at 9:20 PM, Martin K. Petersen wrote:

>>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:
> 
> Neil,
> 
> Several people have pointed out the issues with the reliability of
> discard_zeroes_data in the past. My objection with Heinz' patch is
> purely that we are perpetuating an interface that we have come to learn
> that we can't entirely trust.
> 
> Neil> If it is a "hint", then why isn't it
> Neil> "discard_sometimes_zeroes_data"?
> 
> Because at the time we put it in it was expected to be a solid
> guarantee. However, as it turns out the spec is vaguely worded and only
> says that the device must return zeroes if a block has been successfully
> discarded. But since discard itself is only a hint the device is free to
> silently ignore all or parts of the request. *sigh*
> 
> There were several approaches being entertained in both T10 and T13
> about how to tighten this up, including a FORCE bit that would cause the
> device to return an error if parts of the request were being ignored. I
> was hoping that we would be able to use that. However, the consensus at
> the T10 meeting this summer was that WRITE SAME is the only reliable way
> to ensure that zeroed blocks are returned. And the spec was amended to
> clarify that.
> 
> Neil> If a read from a discarded region doesn't reliably return zeros,
> Neil> then raid5 cannot support discard.  Should I turn off discard
> Neil> support in raid5?
> 
> I've been on the fence about this for a while.
> 
> The good news is that while the spec has been vague and many battles
> fought in the standards bodies, most vendors are reasonably sane when it
> comes to implementing this. We haven't had any corruption horror stories
> from the field that I'm aware of. The first couple of generations of
> SSDs were a travesty in this department but current devices behave
> reasonably well. Hardware RAID vendors generally solve the RAID5 SSD
> problem by explicitly whitelisting drives that are known to do the right
> thing. Sadly we don't have a good device list. And for most drives the
> discards only get dropped in rare corner cases so it's not trivial to
> write a test tool to generate such a list.
> 
> I was really hoping we'd get better guarantees from the standards folks
> this summer. But as the spec is currently written switching to a WRITE
> SAME bias in our discard heuristics may have the opposite effect of what
> we desire. That's because the spec mandates that if a device can not
> discard the blocks in the request it must write zeroes to them. IOW, we
> get the determinism at the expense of potentially ending up writing the
> blocks in question, thus increasing write wear. Probably not what we
> want for the discard use case.
> 
> I am in the process of tweaking the discard code to adhere to the August
> 27th T10 SBC-4 draft. The net takeaway is that if you depend on reading
> zeroes back you should be using blkdev_issue_zeroout() and not
> blkdev_issue_discard(). raid5.c effectively needs to translate discard
> requests from above to zeroouts below.
> 
> I am torn between making blkdev_issue_zeroout() unprovision if possible
> or having an explicit allocate flag. Some users only care about getting
> zeroes back but others use blkdev_issue_zeroout() to preallocate blocks
> (which is the current behavior).
> 
> Right now we have:
> 
>  discard		Unprovision, may or may not return zeroes
>  zeroout		Provision, return zeroes
> 
> We'll have to move to a model where we have a third variant:
> 
>  discard		Unprovision, may or may not return zeroes
>  zeroout, deallocate	Unprovision if possible, return zeroes
>  zeroout, allocate     Provision, return zeroes
> 
> The zeroout, deallocate variant would be enabled if the device supports
> WRITE SAME w/UNMAP and has the LBPRZ flag set (and is not our libata
> SATL).
> 
> Discard is the bane of my existence :(

It's not really so bad if it doesn't return zeros, is it?  As long as it returns the previous contents.  You mentioned that the discard was a hint - the drive may or may not discard the entire portion.  If that means the portions that are discarded will always return zero and the portions that aren't will always return their previous contents, is there any problem?  The real issue is randomness, no?  Could this be why "we haven't had any corruption horror stories", or is it because we got lucky with sane vendors?

 brassow

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24  4:05       ` Brassow Jonathan
@ 2014-09-24  4:21         ` NeilBrown
  2014-09-24  4:35           ` Brassow Jonathan
  2014-09-24 11:02           ` Heinz Mauelshagen
  0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2014-09-24  4:21 UTC (permalink / raw)
  To: Brassow Jonathan
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen



On Tue, 23 Sep 2014 23:05:51 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

> 
> On Sep 23, 2014, at 9:20 PM, Martin K. Petersen wrote:
> 
> >>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:
> > 
> > Neil,
> > 
> > Several people have pointed out the issues with the reliability of
> > discard_zeroes_data in the past. My objection with Heinz' patch is
> > purely that we are perpetuating an interface that we have come to learn
> > that we can't entirely trust.
> > 
> > Neil> If it is a "hint", then why isn't it
> > Neil> "discard_sometimes_zeroes_data"?
> > 
> > Because at the time we put it in it was expected to be a solid
> > guarantee. However, as it turns out the spec is vaguely worded and only
> > says that the device must return zeroes if a block has been successfully
> > discarded. But since discard itself is only a hint the device is free to
> > silently ignore all or parts of the request. *sigh*
> > 
> > There were several approaches being entertained in both T10 and T13
> > about how to tighten this up, including a FORCE bit that would cause the
> > device to return an error if parts of the request were being ignored. I
> > was hoping that we would be able to use that. However, the consensus at
> > the T10 meeting this summer was that WRITE SAME is the only reliable way
> > to ensure that zeroed blocks are returned. And the spec was amended to
> > clarify that.
> > 
> > Neil> If a read from a discarded region doesn't reliably return zeros,
> > Neil> then raid5 cannot support discard.  Should I turn off discard
> > Neil> support in raid5?
> > 
> > I've been on the fence about this for a while.
> > 
> > The good news is that while the spec has been vague and many battles
> > fought in the standards bodies, most vendors are reasonably sane when it
> > comes to implementing this. We haven't had any corruption horror stories
> > from the field that I'm aware of. The first couple of generations of
> > SSDs were a travesty in this department but current devices behave
> > reasonably well. Hardware RAID vendors generally solve the RAID5 SSD
> > problem by explicitly whitelisting drives that are known to do the right
> > thing. Sadly we don't have a good device list. And for most drives the
> > discards only get dropped in rare corner cases so it's not trivial to
> > write a test tool to generate such a list.
> > 
> > I was really hoping we'd get better guarantees from the standards folks
> > this summer. But as the spec is currently written switching to a WRITE
> > SAME bias in our discard heuristics may have the opposite effect of what
> > we desire. That's because the spec mandates that if a device can not
> > discard the blocks in the request it must write zeroes to them. IOW, we
> > get the determinism at the expense of potentially ending up writing the
> > blocks in question, thus increasing write wear. Probably not what we
> > want for the discard use case.
> > 
> > I am in the process of tweaking the discard code to adhere to the August
> > 27th T10 SBC-4 draft. The net takeaway is that if you depend on reading
> > zeroes back you should be using blkdev_issue_zeroout() and not
> > blkdev_issue_discard(). raid5.c effectively needs to translate discard
> > requests from above to zeroouts below.
> > 
> > I am torn between making blkdev_issue_zeroout() unprovision if possible
> > or having an explicit allocate flag. Some users only care about getting
> > zeroes back but others use blkdev_issue_zeroout() to preallocate blocks
> > (which is the current behavior).
> > 
> > Right now we have:
> > 
> >  discard		Unprovision, may or may not return zeroes
> >  zeroout		Provision, return zeroes
> > 
> > We'll have to move to a model where we have a third variant:
> > 
> >  discard		Unprovision, may or may not return zeroes
> >  zeroout, deallocate	Unprovision if possible, return zeroes
> >  zeroout, allocate     Provision, return zeroes
> > 
> > The zeroout, deallocate variant would be enabled if the device supports
> > WRITE SAME w/UNMAP and has the LBPRZ flag set (and is not our libata
> > SATL).
> > 
> > Discard is the bane of my existence :(

Thanks for the detailed explanation Martin!!
It seems that "discard" still comes with a strong "Caveat Emptor"
Sad.

> 
> It's not really so bad if it doesn't return zero's, is it?  As long as it returns the previous contents.  You mentioned that the discard was a hint - the drive may or may not discard the entire portion.  If that means the portions that are discarded will always return zero and the portions that aren't will always return their previous contents, is there any problem?  The real issue is randomness, no?  Could this be why "we haven't had any corruption horror stories", or is it because we got lucky with sane vendors?

For RAID5 it is not acceptable to return "zeros or the original content".
This is because of the possibility of RMW cycles to update the parity block.
As "discard" the doesn't treat all blocks in a stripe in an identical way
will cause the parity block to become incorrect, and it will stay incorrect
through multiple RMW cycles until an RCW corrects it.
A device failure while the parity block is incorrect will lead to data
corruption.

The only real solution I can think of is to maintain an 'allocated' bitmap
and clear bits when DISCARDing.  When a write happens to a location that is
not 'allocated' the whole bit's worth gets resynced (or maybe zeroed-out).

Grumble.

Thanks again,
NeilBrown




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24  4:21         ` NeilBrown
@ 2014-09-24  4:35           ` Brassow Jonathan
  2014-09-24 11:02           ` Heinz Mauelshagen
  1 sibling, 0 replies; 28+ messages in thread
From: Brassow Jonathan @ 2014-09-24  4:35 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen


On Sep 23, 2014, at 11:21 PM, NeilBrown wrote:

> 
> For RAID5 it is not acceptable to return "zeros or the original content".
> This is because of the possibility of RMW cycles to update the parity block.
> As "discard" the doesn't treat all blocks in a stripe in an identical way
> will cause the parity block to become incorrect, and it will stay incorrect
> through multiple RMW cycles until an RCW corrects it.
> A device failure while the parity block is incorrect will lead to data
> corruption.
> 
> The only real solution I can think of is to maintain an 'allocated' bitmap
> and clear bits when DISCARDing.  When a write happens to a location that is
> not 'allocated' the whole  bit worth gets resynced (or maybe zeroed-out).
> 
> Grumble.

I see, sure.

Are there problems with this 'allocated' bitmap idea when bitrot is considered?  Scrubbing (check/repair) assumes that the bitmap cannot be trusted and does a complete scan of the array - dealing with mismatches.  If bitrot happens in the 'allocated' bitmap...

 brassow

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24  4:21         ` NeilBrown
  2014-09-24  4:35           ` Brassow Jonathan
@ 2014-09-24 11:02           ` Heinz Mauelshagen
  2014-10-01  2:56             ` NeilBrown
  1 sibling, 1 reply; 28+ messages in thread
From: Heinz Mauelshagen @ 2014-09-24 11:02 UTC (permalink / raw)
  To: device-mapper development, Brassow Jonathan
  Cc: Shaohua Li, Martin K. Petersen




Martin,

thanks for the good explanation of the state of the discard union.
Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?

I was planning to have a followup patch for dm-raid supporting a dm-raid
table line argument to prohibit discard passdown.

Given the fuzzy field situation wrt SSD fw and discard_zeroes_data support
related to RAID4/5/6, we need that upstream together with the initial patch.

That 'no_discard_passdown' argument can be added to dm-raid RAID4/5/6 table
lines to avoid possible data corruption but can be omitted on RAID1/10 table
lines, because the latter do not suffer from any discard_zeroes_data flaw.
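
For illustration only (the argument name and its position among the raid
parameters are not final), such a RAID5 table line might look like:

  0 41943040 raid raid5_ls 2 128 no_discard_passdown 3 - /dev/sda1 - /dev/sdb1 - /dev/sdc1

i.e. two raid parameters (chunk size 128 sectors plus the new flag) followed
by the usual metadata/data device pairs.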


Neil,

are you going to disable discards in RAID4/5/6 shortly
or rather go with your bitmap solution?

Heinz


On 09/24/2014 06:21 AM, NeilBrown wrote:
> On Tue, 23 Sep 2014 23:05:51 -0500 Brassow Jonathan <jbrassow@redhat.com>
> wrote:
>
>> On Sep 23, 2014, at 9:20 PM, Martin K. Petersen wrote:
>>
>>>>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:
>>> Neil,
>>>
>>> Several people have pointed out the issues with the reliability of
>>> discard_zeroes_data in the past. My objection with Heinz' patch is
>>> purely that we are perpetuating an interface that we have come to learn
>>> that we can't entirely trust.
>>>
>>> Neil> If it is a "hint", then why isn't it
>>> Neil> "discard_sometimes_zeroes_data"?
>>>
>>> Because at the time we put it in it was expected to be a solid
>>> guarantee. However, as it turns out the spec is vaguely worded and only
>>> says that the device must return zeroes if a block has been successfully
>>> discarded. But since discard itself is only a hint the device is free to
>>> silently ignore all or parts of the request. *sigh*
>>>
>>> There were several approaches being entertained in both T10 and T13
>>> about how to tighten this up, including a FORCE bit that would cause the
>>> device to return an error if parts of the request were being ignored. I
>>> was hoping that we would be able to use that. However, the consensus at
>>> the T10 meeting this summer was that WRITE SAME is the only reliable way
>>> to ensure that zeroed blocks are returned. And the spec was amended to
>>> clarify that.
>>>
>>> Neil> If a read from a discarded region doesn't reliably return zeros,
>>> Neil> then raid5 cannot support discard.  Should I turn off discard
>>> Neil> support in raid5?
>>>
>>> I've been on the fence about this for a while.
>>>
>>> The good news is that while the spec has been vague and many battles
>>> fought in the standards bodies, most vendors are reasonably sane when it
>>> comes to implementing this. We haven't had any corruption horror stories
>>> from the field that I'm aware of. The first couple of generations of
>>> SSDs were a travesty in this department but current devices behave
>>> reasonably well. Hardware RAID vendors generally solve the RAID5 SSD
>>> problem by explicitly whitelisting drives that are known to do the right
>>> thing. Sadly we don't have a good device list. And for most drives the
>>> discards only get dropped in rare corner cases so it's not trivial to
>>> write a test tool to generate such a list.
>>>
>>> I was really hoping we'd get better guarantees from the standards folks
>>> this summer. But as the spec is currently written switching to a WRITE
>>> SAME bias in our discard heuristics may have the opposite effect of what
>>> we desire. That's because the spec mandates that if a device can not
>>> discard the blocks in the request it must write zeroes to them. IOW, we
>>> get the determinism at the expense of potentially ending up writing the
>>> blocks in question, thus increasing write wear. Probably not what we
>>> want for the discard use case.
>>>
>>> I am in the process of tweaking the discard code to adhere to the August
>>> 27th T10 SBC-4 draft. The net takeaway is that if you depend on reading
>>> zeroes back you should be using blkdev_issue_zeroout() and not
>>> blkdev_issue_discard(). raid5.c effectively needs to translate discard
>>> requests from above to zeroouts below.
>>>
>>> I am torn between making blkdev_issue_zeroout() unprovision if possible
>>> or having an explicit allocate flag. Some users only care about getting
>>> zeroes back but others use blkdev_issue_zeroout() to preallocate blocks
>>> (which is the current behavior).
>>>
>>> Right now we have:
>>>
>>>   discard		Unprovision, may or may not return zeroes
>>>   zeroout		Provision, return zeroes
>>>
>>> We'll have to move to a model where we have a third variant:
>>>
>>>   discard		Unprovision, may or may not return zeroes
>>>   zeroout, deallocate	Unprovision if possible, return zeroes
>>>   zeroout, allocate     Provision, return zeroes
>>>
>>> The zeroout, deallocate variant would be enabled if the device supports
>>> WRITE SAME w/UNMAP and has the LBPRZ flag set (and is not our libata
>>> SATL).
>>>
>>> Discard is the bane of my existence :(
> Thanks for the detailed explanation Martin!!
> It seems that "discard" still comes with a strong "Caveat Emptor"
> Sad.
>
>> It's not really so bad if it doesn't return zero's, is it?  As long as it returns the previous contents.  You mentioned that the discard was a hint - the drive may or may not discard the entire portion.  If that means the portions that are discarded will always return zero and the portions that aren't will always return their previous contents, is there any problem?  The real issue is randomness, no?  Could this be why "we haven't had any corruption horror stories", or is it because we got lucky with sane vendors?
> For RAID5 it is not acceptable to return "zeros or the original content".
> This is because of the possibility of RMW cycles to update the parity block.
> As "discard" the doesn't treat all blocks in a stripe in an identical way
> will cause the parity block to become incorrect, and it will stay incorrect
> through multiple RMW cycles until an RCW corrects it.
> A device failure while the parity block is incorrect will lead to data
> corruption.
>
> The only real solution I can think of is to maintain an 'allocated' bitmap
> and clear bits when DISCARDing.  When a write happens to a location that is
> not 'allocated' the whole  bit worth gets resynced (or maybe zeroed-out).
>
> Grumble.
>
> Thanks again,
> NeilBrown
>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

-- 
===============================================================
Heinz Mauelshagen                               +49 2626 141200
Consulting Development Engineer             FAX +49 2626 924446
Red Hat GmbH
Am Sonnenhang 11
56242 Marienrachdorf
Germany                                       heinzm@redhat.com
===============================================================





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-23 23:07 ` Martin K. Petersen
  2014-09-23 23:33   ` NeilBrown
@ 2014-09-24 14:21   ` Christoph Hellwig
  2014-09-24 14:38     ` Heinz Mauelshagen
  2014-09-24 15:11     ` Martin K. Petersen
  1 sibling, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2014-09-24 14:21 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Heinz Mauelshagen, device-mapper development

On Tue, Sep 23, 2014 at 07:07:35PM -0400, Martin K. Petersen wrote:
> >>>>> "Heinz" == heinzm  <heinzm@redhat.com> writes:
> 
> Heinz> In case of RAID levels 4,5 and 6 we have to check for the
> Heinz> capability to zero data on discards to avoid stripe data
> Heinz> corruption.
> 
> I'm quite concerned about relying on discard_zeroes_data since it's a
> hint and not a hard guarantee.

If we want to provide useful discards through raid we need to make
it a hard guarantee.  That is:

 - only set it when using a WRITE SAME variant on scsi
 - start adding a whitelist for ATA

Relying on zeroout which also works when it doesn't unmap below is not
the answer, especially if it may end up doing a physical write of zeroes
on disk devices.  Alternatively we could get rid of discard_zeroes_data
and stop the pretence of any sane behavior from discard, and instead
expose an unmap bit through the blkdev_issue_zeroout stack for those
callers who want to do a discard, but rely on zeroes.  The end effect
will be the same, just a different presentation.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24 14:21   ` Christoph Hellwig
@ 2014-09-24 14:38     ` Heinz Mauelshagen
  2014-09-24 15:11     ` Martin K. Petersen
  1 sibling, 0 replies; 28+ messages in thread
From: Heinz Mauelshagen @ 2014-09-24 14:38 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen; +Cc: device-mapper development


On 09/24/2014 04:21 PM, Christoph Hellwig wrote:
> On Tue, Sep 23, 2014 at 07:07:35PM -0400, Martin K. Petersen wrote:
>>>>>>> "Heinz" == heinzm  <heinzm@redhat.com> writes:
>> Heinz> In case of RAID levels 4,5 and 6 we have to check for the
>> Heinz> capability to zero data on discards to avoid stripe data
>> Heinz> corruption.
>>
>> I'm quite concerned about relying on discard_zeroes_data since it's a
>> hint and not a hard guarantee.
> If we want to provide useful discards through raid we need to make
> it a hard guarantee.  That is:
>
>   - only set it when using a WRITE SAME variant on scsi
>   - start adding a whitelist for ATA
>
> Relying on zeroout which also works when it doesn't unmap below is not
> the answer, especially if it may end up doing a physical write of zeroes
> on disk devices.  Alternatively we could get rid of discard_zeroes_data
> and stop the pretence of any sane behavior from discard, and instead
> expose an unmap bit through the blkdev_issue_zeroout stack for those
> callers who want to do a discard, but rely on zeroes.  The end effect
> will be the same, just a different presentation.

Neil already came up with an unmap bitmap to add to the RAID4/5/6
personalities in order to harden MD in this regard.
In order to avoid any related data corruption with any future disk types,
it seems to be the appropriate solution.

Waiting for him to comment more on this.

Heinz

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24 14:21   ` Christoph Hellwig
  2014-09-24 14:38     ` Heinz Mauelshagen
@ 2014-09-24 15:11     ` Martin K. Petersen
  1 sibling, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2014-09-24 15:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Heinz Mauelshagen, device-mapper development, Martin K. Petersen

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph>  - start adding a whitelist for ATA

As far as I know the sanity bits have been discussed in T13 but haven't
been agreed upon yet.

The hard part about the whitelist is that bad things only happen in rare
circumstances. However, we may be able to take some hints from here:

http://www.lsi.com/downloads/Public/RAID%20Controllers/RAID%20Controllers%20Common%20Files/MegaRAID_SAS3_936x-938x_Compatibility_List.pdf

Christoph> Alternatively we could get rid of discard_zeroes_data and
Christoph> stop the pretence of any sane behavior from discard, and
Christoph> instead expose an unmap bit through the blkdev_issue_zeroout
Christoph> stack for those callers who want to do a discard, but rely on
Christoph> zeroes. The end effect will be the same, just a different
Christoph> presentation.

Yep.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-09-24 11:02           ` Heinz Mauelshagen
@ 2014-10-01  2:56             ` NeilBrown
  2014-10-01 11:13               ` Heinz Mauelshagen
                                 ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: NeilBrown @ 2014-10-01  2:56 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: device-mapper development, Shaohua Li, Martin K. Petersen



On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
wrote:

> 
> Martin,
> 
> thanks for the good explanation of the state of the discard union.
> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
> 
> I was planning to have a followup patch for dm-raid supporting a dm-raid 
> table
> line argument to prohibit discard passdown.
> 
> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data 
> support
> related to RAID4/5/6, we need that in upstream together with the initial 
> patch.
> 
> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6 
> table
> lines to avoid possible data corruption but can be avoided on RAID1/10 
> table lines,
> because the latter are not suffering from any  discard_zeroes_data flaw.
> 
> 
> Neil,
> 
> are you going to disable discards in RAID4/5/6 shortly
> or rather go with your bitmap solution?

Can I just close my eyes and hope it goes away?

The idea of a bitmap of uninitialised areas is not a short-term solution.
But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
would  mean that people with good sensible hardware wouldn't be able to use
it properly.

I would really rather that discard_zeroes_data were only set on devices where
it was actually true.  Then it wouldn't be my problem any more.

Maybe I could do a loud warning
  "Not enabling DISCARD on RAID5 because we cannot trust committees.
   Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
   sectors as zeros"

and add an appropriate module parameter......
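
Sketch only, name and permissions obviously up for debate:

  static bool willing_to_risk_discard;
  module_param(willing_to_risk_discard, bool, S_IRUGO | S_IWUSR);
  MODULE_PARM_DESC(willing_to_risk_discard,
                   "Only set this if your devices read discarded sectors as zeros");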

While we are on the topic, maybe I should write down my thoughts about the
bitmap thing in case someone wants to contribute.

  There are 3 states that a 'region' can be in:
    1- known to be in-sync
    2- possibly not in sync, but it should  be
    3- probably not in sync, contains no valuable data.

 A read from '3' should return zeroes.
 A write to '3' should change the region to be '2'.  It could either
    write zeros before allowing the write to start, or it could just start
    a normal resync.

 Here is a question:  if a region has been discarded, are we guaranteed that
 reads are at least stable.  i.e. if I read twice will I definitely get the
 same value?

 It would be simplest to store all the states in the one bitmap. i.e. have a
 new version of the current bitmap (which distinguishes between '1' and '2')
 which allows the 3rd state.  Probably just 2 bits per region, though we could
 squeeze state for 20 regions into 32 bits if we really wanted :-)

 Then we set the bitmap to all '3's when it is created and set regions to
 '3' when discarding.

 One problem with this approach is that it forces the same region size.
 Regions for the current bitmap can conveniently be 10s of Megabytes.  For
 state '3', we ideally want one region per stripe.
 For 4TB drives with a 512K chunk size (and smaller is not uncommon) that is
 8 million regions which is a megabyte of bits.

 I guess that isn't all that much.  One problem with small regions for the
 1/2 distinction is that we end up updating the bitmap more often, but
  that could be avoided by setting multiple bits at once.
 i.e. keep the internal counters with a larger granularity, and when a
 counter (of outstanding writes) becomes non-zero, set quite a few bits in
 the bitmap.

 So that is probably what I would do:
  - new version for bitmap which has 2 bits per region and encodes 3 states
  - bitmap granularity matches chunk size (by default)
  - decouple region size of bitmap from region size for internal 'dirty'
    accounting
  - write to a 'state 3' region sets it to 'state 2' and kicks off resync
  - 'discard' sets state to '3'.
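
 A very rough sketch of the encoding and the two interesting transitions
 (all names below are invented, nothing like this exists yet):

  /* two bits per region; three of the four possible values are used */
  enum region_state {
          REGION_IN_SYNC    = 1,  /* known to be in-sync */
          REGION_NEEDS_SYNC = 2,  /* possibly not in sync, but should be */
          REGION_DISCARDED  = 3,  /* probably not in sync, no valuable data */
  };

  /* a write into a discarded region marks it dirty and kicks off resync */
  if (get_region_state(bitmap, region) == REGION_DISCARDED) {
          set_region_state(bitmap, region, REGION_NEEDS_SYNC);
          start_region_resync(mddev, region);
  }

  /* DISCARD just records that the region holds no valuable data */
  set_region_state(bitmap, region, REGION_DISCARDED);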

NeilBrown




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01  2:56             ` NeilBrown
@ 2014-10-01 11:13               ` Heinz Mauelshagen
  2014-10-03  1:12                 ` Martin K. Petersen
  2014-10-01 13:32               ` Mike Snitzer
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Heinz Mauelshagen @ 2014-10-01 11:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Shaohua Li, Martin K. Petersen



On 10/01/2014 04:56 AM, NeilBrown wrote:
> On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
> wrote:
>
>> Martin,
>>
>> thanks for the good explanation of the state of the discard union.
>> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
>>
>> I was planning to have a followup patch for dm-raid supporting a dm-raid
>> table
>> line argument to prohibit discard passdown.
>>
>> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data
>> support
>> related to RAID4/5/6, we need that in upstream together with the initial
>> patch.
>>
>> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6
>> table
>> lines to avoid possible data corruption but can be avoided on RAID1/10
>> table lines,
>> because the latter are not suffering from any  discard_zeroes_data flaw.
>>
>>
>> Neil,
>>
>> are you going to disable discards in RAID4/5/6 shortly
>> or rather go with your bitmap solution?
> Can I just close my eyes and hope it goes away?

I'd think that it actually would.

Sane deployments are unlikely to base MD RAID4/5/6 mappings on 
ancient/unreliable SSDs.
Your "md_mod.willing_to_risk_discard=Y" mentioned below is the proper
way against all insane odds.

BTW:
we're referring to SSD discard_zeroes_data issues only so far it seems?
Can we be sure that SCSI unmap with the TPRZ bit set will always return
zeroes on reads for arbitrary storage devices?

>
> The idea of a bitmap of uninitialised areas is not a short-term solution.
> But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> would  mean that people with good sensible hardware wouldn't be able to use
> it properly.
>
> I would really rather that discard_zeroes_data were only set on devices where
> it was actually true.  Then it wouldn't be my problem any more.
>
> Maybe I could do a loud warning
>    "Not enabling DISCARD on RAID5 because we cannot trust committees.
>     Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
>     sectors as zeros"

Yes, we are thinking the same for dm-raid (like a "force_discard" 
parameter).

>
> and add an appropriate module parameter......
>
> While we are on the topic, maybe I should write down my thoughts about the
> bitmap thing in case someone wants to contribute.
>
>    There are 3 states that a 'region' can be in:
>      1- known to be in-sync
>      2- possibly not in sync, but it should  be
>      3- probably not in sync, contains no valuable data.
>
>   A read from '3' should return zeroes.
>   A write to '3' should change the region to be '2'.  It could either
>      write zeros before allowing the write to start, or it could just start
>      a normal resync.
>
>   Here is a question:  if a region has been discarded, are we guaranteed that
>   reads are at least stable.  i.e. if I read twice will I definitely get the
>   same value?

We can never be sure taking the standards and implementation hassle
into account.
Thus RAID system OEMs have whitelists for supported drives.

>
>   It would be simplest to store all the states in the one bitmap. i.e. have a
>   new version of the current bitmap (which distinguishes between '1' and '2')
>   which allows the 3rd state.  Probably just 2 bits per region, though we could
>   squeeze state for 20 regions into 32 bits if we really wanted :-)
>
>   Then we set the bitmap to all '3's when it is created and set regions to
>   '3' when discarding.
>
>   One problem with this approach is that it forces the same region size.
>   Regions for the current bitmap can conveniently be 10s of Megabytes.  For
>   state '3', we ideally want one region per stripe.
>   For 4TB drives with a 512K chunk size (and smaller is not uncommon) that is
>   8 million regions which is a megabyte of bits.
>
>   I guess that isn't all that much.  One problem with small regions for the
>   1/2 distinction is that we end up updating the bitmap more often, but
>   that could be avoided by setting multiple bits at one.
>   i.e. keep the internal counters with a larger granularity, and when a
>   counter (of outstanding writes) becomes non-zero, set quite a few bits in
>   the bitmap.

How close can the bitmap update patterns of that new approach get
to what we have with the current one?
Will the difference be negligible, or will it impact performance?


>
>   So that is probably what I would do:
>    - new version for bitmap which has 2 bits per region and encodes 3 states
>    - bitmap granularity matches chunk size (by default)
>    - decouple region size of bitmap from region size for internal 'dirty'
>      accounting
>    - write to a 'state 3' region sets it to 'state 2' and kicks off resync
>    - 'discard' sets state to '3'.

Makes sense but we'd rather avoid it based on the aforementioned 
"willing" option.

Heinz

>
> NeilBrown
>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-01  2:56             ` NeilBrown
  2014-10-01 11:13               ` Heinz Mauelshagen
@ 2014-10-01 13:32               ` Mike Snitzer
  2014-10-01 23:34                 ` NeilBrown
  2014-10-01 16:00               ` [PATCH] " Andrey Kuzmin
                                 ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2014-10-01 13:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen

On Tue, Sep 30 2014 at 10:56pm -0400,
NeilBrown <neilb@suse.de> wrote:

> On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
> wrote:
> 
> > 
> > Martin,
> > 
> > thanks for the good explanation of the state of the discard union.
> > Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
> > 
> > I was planning to have a followup patch for dm-raid supporting a dm-raid 
> > table
> > line argument to prohibit discard passdown.
> > 
> > In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data 
> > support
> > related to RAID4/5/6, we need that in upstream together with the initial 
> > patch.
> > 
> > That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6 
> > table
> > lines to avoid possible data corruption but can be avoided on RAID1/10 
> > table lines,
> > because the latter are not suffering from any  discard_zeroes_data flaw.
> > 
> > 
> > Neil,
> > 
> > are you going to disable discards in RAID4/5/6 shortly
> > or rather go with your bitmap solution?
> 
> Can I just close my eyes and hope it goes away?
> 
> The idea of a bitmap of uninitialised areas is not a short-term solution.
> But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> would  mean that people with good sensible hardware wouldn't be able to use
> it properly.
> 
> I would really rather that discard_zeroes_data were only set on devices where
> it was actually true.  Then it wouldn't be my problem any more.
> 
> Maybe I could do a loud warning
>   "Not enabling DISCARD on RAID5 because we cannot trust committees.
>    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
>    sectors as zeros"
> 
> and add an appropriate module parameter......

I had the same thought and would be happy with this too.  I was going to
update Heinz's patch to have the same default off but allow user to
enable:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=8e0cff64f35971135a6de7907bbc8c3a010aff8f

But I'd love to just follow what you arrive at with MD (using the same
name for the module param in dm-raid).

I'm open to getting this done now and included in 3.18 if you are.

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01  2:56             ` NeilBrown
  2014-10-01 11:13               ` Heinz Mauelshagen
  2014-10-01 13:32               ` Mike Snitzer
@ 2014-10-01 16:00               ` Andrey Kuzmin
  2014-10-01 23:15                 ` NeilBrown
  2014-10-01 18:57               ` Brassow Jonathan
  2014-10-03  1:09               ` Martin K. Petersen
  4 siblings, 1 reply; 28+ messages in thread
From: Andrey Kuzmin @ 2014-10-01 16:00 UTC (permalink / raw)
  To: device-mapper development
  Cc: Heinz Mauelshagen, Shaohua Li, Martin K. Petersen

On Wed, Oct 1, 2014 at 6:56 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
> wrote:
>
>>
>> Martin,
>>
>> thanks for the good explanation of the state of the discard union.
>> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
>>
>> I was planning to have a followup patch for dm-raid supporting a dm-raid
>> table
>> line argument to prohibit discard passdown.
>>
>> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data
>> support
>> related to RAID4/5/6, we need that in upstream together with the initial
>> patch.
>>
>> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6
>> table
>> lines to avoid possible data corruption but can be avoided on RAID1/10
>> table lines,
>> because the latter are not suffering from any  discard_zeroes_data flaw.
>>
>>
>> Neil,
>>
>> are you going to disable discards in RAID4/5/6 shortly
>> or rather go with your bitmap solution?
>
> Can I just close my eyes and hope it goes away?
>
> The idea of a bitmap of uninitialised areas is not a short-term solution.
> But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> would  mean that people with good sensible hardware wouldn't be able to use
> it properly.
>
> I would really rather that discard_zeroes_data were only set on devices where
> it was actually true.  Then it wouldn't be my problem any more.
>
> Maybe I could do a loud warning
>   "Not enabling DISCARD on RAID5 because we cannot trust committees.
>    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
>    sectors as zeros"
>
> and add an appropriate module parameter......
>
> While we are on the topic, maybe I should write down my thoughts about the
> bitmap thing in case someone wants to contribute.
>
>   There are 3 states that a 'region' can be in:
>     1- known to be in-sync
>     2- possibly not in sync, but it should  be
>     3- probably not in sync, contains no valuable data.
>
>  A read from '3' should return zeroes.
>  A write to '3' should change the region to be '2'.  It could either
>     write zeros before allowing the write to start, or it could just start
>     a normal resync.
>
>  Here is a question:  if a region has been discarded, are we guaranteed that
>  reads are at least stable.  i.e. if I read twice will I definitely get the
>  same value?

Not sure with other specs, but an NVMe-compliant SSD that supports
discard (Dataset Management command with Deallocate attribute, in NVMe
parlance) is, per spec, required to be deterministic when a deallocated
range is subsequently read. That's what the spec (1.1) says:

The value read from a deallocated LBA shall be deterministic;
specifically, the value returned by subsequent reads of that LBA shall
be the same until a write occurs to that LBA. The values read from a
deallocated LBA and its metadata (excluding protection information)
shall be all zeros, all ones, or the last data written to the
associated LBA and its metadata. The values read from an unwritten or
deallocated LBA’s protection information field shall be all ones
(indicating the protection information shall not be checked).

Regards,
Andrey

>
>  It would be simplest to store all the states in the one bitmap. i.e. have a
>  new version of the current bitmap (which distinguishes between '1' and '2')
>  which allows the 3rd state.  Probably just 2 bits per region, though we could
>  squeeze state for 20 regions into 32 bits if we really wanted :-)
>
>  Then we set the bitmap to all '3's when it is created and set regions to
>  '3' when discarding.
>
>  One problem with this approach is that it forces the same region size.
>  Regions for the current bitmap can conveniently be 10s of Megabytes.  For
>  state '3', we ideally want one region per stripe.
>  For 4TB drives with a 512K chunk size (and smaller is not uncommon) that is
>  8 million regions which is a megabyte of bits.
>
>  I guess that isn't all that much.  One problem with small regions for the
>  1/2 distinction is that we end up updating the bitmap more often, but
>  that could be avoided by setting multiple bits at one.
>  i.e. keep the internal counters with a larger granularity, and when a
>  counter (of outstanding writes) becomes non-zero, set quite a few bits in
>  the bitmap.
>
>  So that is probably what I would do:
>   - new version for bitmap which has 2 bits per region and encodes 3 states
>   - bitmap granularity matches chunk size (by default)
>   - decouple region size of bitmap from region size for internal 'dirty'
>     accounting
>   - write to a 'state 3' region sets it to 'state 2' and kicks off resync
>   - 'discard' sets state to '3'.
>
> NeilBrown
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01  2:56             ` NeilBrown
                                 ` (2 preceding siblings ...)
  2014-10-01 16:00               ` [PATCH] " Andrey Kuzmin
@ 2014-10-01 18:57               ` Brassow Jonathan
  2014-10-01 23:18                 ` NeilBrown
  2014-10-03  1:09               ` Martin K. Petersen
  4 siblings, 1 reply; 28+ messages in thread
From: Brassow Jonathan @ 2014-10-01 18:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen


On Sep 30, 2014, at 9:56 PM, NeilBrown wrote:

> So that is probably what I would do:
>  - new version for bitmap which has 2 bits per region and encodes 3 states
>  - bitmap granularity matches chunk size (by default)
>  - decouple region size of bitmap from region size for internal 'dirty'
>    accounting
>  - write to a 'state 3' region sets it to 'state 2' and kicks off resync
>  - 'discard' sets state to '3'.

There are scenarios where we do not trust the bitmap and perform exhaustive searches (scrubbing).  When we do this, we assume that the bitmap has been correctly handled, but the data has been corrupted somehow.  When we scrub, we now could presumably use the bitmap to avoid those regions that have been discarded.  However, are there problems to be considered when the corruption is the other way around - i.e. when the bitmap/superblock are corrupted instead of the data area?  Do we consider problems like this unlikely because of the frequency of superblock writes?  (The bitrot that I am familiar with usually happens in areas that are not touched very often - especially if they exist near areas that /are/ touched very often.)

 brassow

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01 16:00               ` [PATCH] " Andrey Kuzmin
@ 2014-10-01 23:15                 ` NeilBrown
  0 siblings, 0 replies; 28+ messages in thread
From: NeilBrown @ 2014-10-01 23:15 UTC (permalink / raw)
  To: Andrey Kuzmin
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen



On Wed, 1 Oct 2014 20:00:45 +0400 Andrey Kuzmin <andrey.v.kuzmin@gmail.com>
wrote:

> On Wed, Oct 1, 2014 at 6:56 AM, NeilBrown <neilb@suse.de> wrote:
> > On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
> > wrote:
> >
> >>
> >> Martin,
> >>
> >> thanks for the good explanation of the state of the discard union.
> >> Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
> >>
> >> I was planning to have a followup patch for dm-raid supporting a dm-raid
> >> table
> >> line argument to prohibit discard passdown.
> >>
> >> In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data
> >> support
> >> related to RAID4/5/6, we need that in upstream together with the initial
> >> patch.
> >>
> >> That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6
> >> table
> >> lines to avoid possible data corruption but can be avoided on RAID1/10
> >> table lines,
> >> because the latter are not suffering from any  discard_zeroes_data flaw.
> >>
> >>
> >> Neil,
> >>
> >> are you going to disable discards in RAID4/5/6 shortly
> >> or rather go with your bitmap solution?
> >
> > Can I just close my eyes and hope it goes away?
> >
> > The idea of a bitmap of uninitialised areas is not a short-term solution.
> > But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> > would  mean that people with good sensible hardware wouldn't be able to use
> > it properly.
> >
> > I would really rather that discard_zeroes_data were only set on devices where
> > it was actually true.  Then it wouldn't be my problem any more.
> >
> > Maybe I could do a loud warning
> >   "Not enabling DISCARD on RAID5 because we cannot trust committees.
> >    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
> >    sectors as zeros"
> >
> > and add an appropriate module parameter......
> >
> > While we are on the topic, maybe I should write down my thoughts about the
> > bitmap thing in case someone wants to contribute.
> >
> >   There are 3 states that a 'region' can be in:
> >     1- known to be in-sync
> >     2- possibly not in sync, but it should  be
> >     3- probably not in sync, contains no valuable data.
> >
> >  A read from '3' should return zeroes.
> >  A write to '3' should change the region to be '2'.  It could either
> >     write zeros before allowing the write to start, or it could just start
> >     a normal resync.
> >
> >  Here is a question:  if a region has been discarded, are we guaranteed that
> >  reads are at least stable.  i.e. if I read twice will I definitely get the
> >  same value?
> 
> Not sure with other specs, but an NVMe-compliant SSD that supports
> discard (Dataset Management command with Deallocate attribute, in NVMe
> parlance) is, per spec, required to be deterministic when deallocated
> range is subsequently read. That's what the spec (1.1) says:
> 
> The value read from a deallocated LBA shall be deterministic;
> specifically, the value returned by subsequent reads of that LBA shall
> be the same until a write occurs to that LBA. The values read from a
> deallocated LBA and its metadata (excluding protection information)
> shall be all zeros, all ones, or the last data written to the
> associated LBA and its metadata. The values read from an unwritten or
> deallocated LBA’s protection information field shall be all ones
> (indicating the protection information shall not be checked).
> 

That's good to know - thanks.

NeilBrown





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01 18:57               ` Brassow Jonathan
@ 2014-10-01 23:18                 ` NeilBrown
  0 siblings, 0 replies; 28+ messages in thread
From: NeilBrown @ 2014-10-01 23:18 UTC (permalink / raw)
  To: Brassow Jonathan
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen



On Wed, 1 Oct 2014 13:57:24 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

> 
> On Sep 30, 2014, at 9:56 PM, NeilBrown wrote:
> 
> > So that is probably what I would do:
> >  - new version for bitmap which has 2 bits per region and encodes 3 states
> >  - bitmap granularity matches chunk size (by default)
> >  - decouple region size of bitmap from region size for internal 'dirty'
> >    accounting
> >  - write to a 'state 3' region sets it to 'state 2' and kicks off resync
> >  - 'discard' sets state to '3'.
> 
> There are scenarios where we do not trust the bitmap and perform exhaustive searches (scrubbing).  When we do this, we assume that the bitmap has been correctly handled, but the data has been corrupted somehow.  When we scrub, we now could presumably use the bitmap to avoid those regions that have been discarded.  However, are there problems to be considered when the corruption is the other way around - i.e. when the bitmap/superblock are corrupted instead of the data area?  Do we consider problems like this unlikely because of the frequency of superblock writes?  (The bitrot that I am familiar with usually happens in areas that are not touched very often - especially if they exist near areas that /are/ touched very often.)
> 
>  brassow

Good question....

I suspect the frequent writes that you mention would reduce the chance of
bitrot - and would make it quickly disappear if it did happen.

Maybe we could compare bitmaps across all devices as the array is assembled
and take some action if there are differences(??).

NeilBrown
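
A minimal sketch of the 2-bits-per-region encoding discussed above, with the
write/discard transitions Neil outlines.  Names and layout here are assumptions
for illustration only; this is not md's bitmap code.

/* Sketch: 2 bits per region encoding the three states from the proposal. */
#include <stdint.h>
#include <stdio.h>

enum region_state {
	R_IN_SYNC   = 1,	/* state 1: known to be in sync */
	R_NEED_SYNC = 2,	/* state 2: possibly out of sync, should be in sync */
	R_DISCARDED = 3,	/* state 3: no valuable data, reads return zeroes */
};

static unsigned get_state(const uint8_t *map, unsigned region)
{
	return (map[region / 4] >> ((region % 4) * 2)) & 0x3;
}

static void set_state(uint8_t *map, unsigned region, unsigned state)
{
	unsigned shift = (region % 4) * 2;

	map[region / 4] &= ~(0x3u << shift);
	map[region / 4] |= state << shift;
}

static void on_write(uint8_t *map, unsigned region)
{
	/* A write to a 'state 3' region sets it to 'state 2' and would
	 * kick off resync of that region. */
	if (get_state(map, region) == R_DISCARDED)
		set_state(map, region, R_NEED_SYNC);
}

static void on_discard(uint8_t *map, unsigned region)
{
	/* 'discard' sets the region to 'state 3'. */
	set_state(map, region, R_DISCARDED);
}

int main(void)
{
	uint8_t map[2] = { 0 };		/* 8 regions, 2 bits each */

	set_state(map, 5, R_IN_SYNC);
	on_discard(map, 5);		/* 1 -> 3 */
	on_write(map, 5);		/* 3 -> 2 */
	printf("region 5 state: %u\n", get_state(map, 5));
	return 0;
}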




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-01 13:32               ` Mike Snitzer
@ 2014-10-01 23:34                 ` NeilBrown
  2014-10-02  1:31                   ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-10-01 23:34 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen



On Wed, 1 Oct 2014 09:32:37 -0400 Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Sep 30 2014 at 10:56pm -0400,
> NeilBrown <neilb@suse.de> wrote:
> 
> > On Wed, 24 Sep 2014 13:02:28 +0200 Heinz Mauelshagen <heinzm@redhat.com>
> > wrote:
> > 
> > > 
> > > Martin,
> > > 
> > > thanks for the good explanation of the state of the discard union.
> > > Do you have an ETA for the 'zeroout, deallocate' ... support you mentioned?
> > > 
> > > I was planning to have a followup patch for dm-raid supporting a dm-raid 
> > > table
> > > line argument to prohibit discard passdown.
> > > 
> > > In lieu of the fuzzy field situation wrt SSD fw and discard_zeroes_data 
> > > support
> > > related to RAID4/5/6, we need that in upstream together with the initial 
> > > patch.
> > > 
> > > That 'no_discard_passdown' table line can be added to dm-raid RAID4/5/6 
> > > table
> > > lines to avoid possible data corruption but can be avoided on RAID1/10 
> > > table lines,
> > > because the latter are not suffering from any  discard_zeroes_data flaw.
> > > 
> > > 
> > > Neil,
> > > 
> > > are you going to disable discards in RAID4/5/6 shortly
> > > or rather go with your bitmap solution?
> > 
> > Can I just close my eyes and hope it goes away?
> > 
> > The idea of a bitmap of uninitialised areas is not a short-term solution.
> > But I'm not really keen on simply disabling discard for RAID4/5/6 either. It
> > would  mean that people with good sensible hardware wouldn't be able to use
> > it properly.
> > 
> > I would really rather that discard_zeroes_data were only set on devices where
> > it was actually true.  Then it wouldn't be my problem any more.
> > 
> > Maybe I could do a loud warning
> >   "Not enabling DISCARD on RAID5 because we cannot trust committees.
> >    Set "md_mod.willing_to_risk_discard=Y" if your devices reads discarded
> >    sectors as zeros"
> > 
> > and add an appropriate module parameter......
> 
> I had the same thought and would be happy with this too.  I was going to
> update Heinz's patch to have the same default off but allow user to
> enable:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=8e0cff64f35971135a6de7907bbc8c3a010aff8f
> 
> But I'd love to just follow what you arrive at with MD (using the same
> name for the module param in dm-raid).
> 
> I'm open to getting this done now and included in 3.18 if you are.
> 
> Mike

How about something like this?
I want to keep it well away from regular API stuff as I hope it is just a
temporary hack until a more general solution can be found and implemented.

Thanks,
NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 183588b11fc1..3ed668c5378c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -64,6 +64,10 @@
 #define cpu_to_group(cpu) cpu_to_node(cpu)
 #define ANY_GROUP NUMA_NO_NODE
 
+static bool devices_handle_discard_safely = false;
+module_param(devices_handle_discard_safely, bool, false);
+MODULE_PARM_DESC(devices_handle_discard_safely,
+		 "Set to Y if all devices in array reliably return zeroes on reads from discarded regions");
 static struct workqueue_struct *raid5_wq;
 /*
  * Stripe cache
@@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
 		mddev->queue->limits.discard_granularity = stripe;
 		/*
 		 * unaligned part of discard request will be ignored, so can't
-		 * guarantee discard_zerors_data
+		 * guarantee discard_zeroes_data
 		 */
 		mddev->queue->limits.discard_zeroes_data = 0;
 
@@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
 			    !bdev_get_queue(rdev->bdev)->
 						limits.discard_zeroes_data)
 				discard_supported = false;
+			/* Unfortunately, discard_zeroes_data is not currently
+			 * a guarantee - just a hint.  So we only allow DISCARD
+			 * if the sysadmin has confirmed that only safe devices
+			 * are in use but setting a module parameter.
+			 */
+			if (!devices_handle_discard_safely) {
+				if (discard_supported) {
+					pr_info("md/raid456: discard support disabled due to uncertainty.\n");
+					pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
+				}
+				discard_supported = false;
+			}
 		}
 
 		if (discard_supported &&




^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-01 23:34                 ` NeilBrown
@ 2014-10-02  1:31                   ` Mike Snitzer
  2014-10-02  2:00                     ` NeilBrown
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2014-10-02  1:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen

Hi,

On Wed, Oct 01 2014 at  7:34pm -0400,
NeilBrown <neilb@suse.de> wrote:

> On Wed, 1 Oct 2014 09:32:37 -0400 Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > 
> > I had the same thought and would be happy with this too.  I was going to
> > update Heinz's patch to have the same default off but allow user to
> > enable:
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=8e0cff64f35971135a6de7907bbc8c3a010aff8f
> > 
> > But I'd love to just follow what you arrive at with MD (using the same
> > name for the module param in dm-raid).
> > 
> > I'm open to getting this done now and included in 3.18 if you are.
> > 
> > Mike
> 
> How about something like this?
> I want to keep it well away from regular API stuff as I hope it is just a
> temporary hack until a more general solution can be found and implemented.
> 
> Thanks,
> NeilBrown
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 183588b11fc1..3ed668c5378c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -64,6 +64,10 @@
>  #define cpu_to_group(cpu) cpu_to_node(cpu)
>  #define ANY_GROUP NUMA_NO_NODE
>  
> +static bool devices_handle_discard_safely = false;
> +module_param(devices_handle_discard_safely, bool, false);
> +MODULE_PARM_DESC(devices_handle_discard_safely,
> +		 "Set to Y if all devices in array reliably return zeroes on reads from discarded regions");
>  static struct workqueue_struct *raid5_wq;
>  /*
>   * Stripe cache
> @@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
>  		mddev->queue->limits.discard_granularity = stripe;
>  		/*
>  		 * unaligned part of discard request will be ignored, so can't
> -		 * guarantee discard_zerors_data
> +		 * guarantee discard_zeroes_data
>  		 */
>  		mddev->queue->limits.discard_zeroes_data = 0;
>  
> @@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
>  			    !bdev_get_queue(rdev->bdev)->
>  						limits.discard_zeroes_data)
>  				discard_supported = false;
> +			/* Unfortunately, discard_zeroes_data is not currently
> +			 * a guarantee - just a hint.  So we only allow DISCARD
> +			 * if the sysadmin has confirmed that only safe devices
> +			 * are in use but setting a module parameter.
> +			 */
> +			if (!devices_handle_discard_safely) {
> +				if (discard_supported) {
> +					pr_info("md/raid456: discard support disabled due to uncertainty.\n");
> +					pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
> +				}
> +				discard_supported = false;
> +			}
>  		}
>  
>  		if (discard_supported &&


There is a typo in the new block comment above: "are in use but setting
a module parameter".  s/but/by/

But thinking further: should this be a per array override?  E.g. for DM
this could easily be a dm-raid table parameter.  But I know the MD
implementation would likely be more cumbersome (superblock update?).

Though given the (hopefully) temporary nature of this, maybe it isn't
worth it for MD.  Maybe be a bit more precise in the MODULE_PARM_DESC
with: 
"Set to Y if all devices in each array reliably returns zeroes on reads
from discarded regions" ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-02  1:31                   ` Mike Snitzer
@ 2014-10-02  2:00                     ` NeilBrown
  2014-10-02  4:04                       ` [dm-devel] " NeilBrown
  0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-10-02  2:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen



On Wed, 1 Oct 2014 21:31:36 -0400 Mike Snitzer <snitzer@redhat.com> wrote:

> Hi,
> 
> On Wed, Oct 01 2014 at  7:34pm -0400,
> NeilBrown <neilb@suse.de> wrote:
> 
> > On Wed, 1 Oct 2014 09:32:37 -0400 Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > 
> > > I had the same thought and would be happy with this too.  I was going to
> > > update Heinz's patch to have the same default off but allow user to
> > > enable:
> > > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=8e0cff64f35971135a6de7907bbc8c3a010aff8f
> > > 
> > > But I'd love to just follow what you arrive at with MD (using the same
> > > name for the module param in dm-raid).
> > > 
> > > I'm open to getting this done now and included in 3.18 if you are.
> > > 
> > > Mike
> > 
> > How about something like this?
> > I want to keep it well away from regular API stuff as I hope it is just a
> > temporary hack until a more general solution can be found and implemented.
> > 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 183588b11fc1..3ed668c5378c 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -64,6 +64,10 @@
> >  #define cpu_to_group(cpu) cpu_to_node(cpu)
> >  #define ANY_GROUP NUMA_NO_NODE
> >  
> > +static bool devices_handle_discard_safely = false;
> > +module_param(devices_handle_discard_safely, bool, false);
> > +MODULE_PARM_DESC(devices_handle_discard_safely,
> > +		 "Set to Y if all devices in array reliably return zeroes on reads from discarded regions");
> >  static struct workqueue_struct *raid5_wq;
> >  /*
> >   * Stripe cache
> > @@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
> >  		mddev->queue->limits.discard_granularity = stripe;
> >  		/*
> >  		 * unaligned part of discard request will be ignored, so can't
> > -		 * guarantee discard_zerors_data
> > +		 * guarantee discard_zeroes_data
> >  		 */
> >  		mddev->queue->limits.discard_zeroes_data = 0;
> >  
> > @@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
> >  			    !bdev_get_queue(rdev->bdev)->
> >  						limits.discard_zeroes_data)
> >  				discard_supported = false;
> > +			/* Unfortunately, discard_zeroes_data is not currently
> > +			 * a guarantee - just a hint.  So we only allow DISCARD
> > +			 * if the sysadmin has confirmed that only safe devices
> > +			 * are in use but setting a module parameter.
> > +			 */
> > +			if (!devices_handle_discard_safely) {
> > +				if (discard_supported) {
> > +					pr_info("md/raid456: discard support disabled due to uncertainty.\n");
> > +					pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
> > +				}
> > +				discard_supported = false;
> > +			}
> >  		}
> >  
> >  		if (discard_supported &&
> 
> 
> There is a typo in the new block comment above: "are in use but setting
> a module parameter".  s/but/by/

Fixed, thanks.


> 
> But thinking further: should this be a per array override?  E.g. for DM
> this could easily be a dm-raid table parameter.  But I know the MD
> implementation would likely be more cumbersome (superblock update?).

If you want to use discard selectively on some arrays, you can mount
filesystems with appropriate options, or be careful in your use of 'fstrim'.

I see the module option as something to force people to think or ask before
"blindly" using DISCARD.  I don't see it is a configuration tool.

> 
> Though given the (hopefully) temporary nature of this, maybe it isn't
> worth it for MD.  Maybe be a bit more precise in the MODULE_PARM_DESC
> with: 
> "Set to Y if all devices in each array reliably returns zeroes on reads
> from discarded regions" ?

I added the 'each'.  I don't agree with the 's' on 'returns' :-(

Thanks,
NeilBrown






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] dm-raid: add RAID discard support
  2014-10-02  2:00                     ` NeilBrown
@ 2014-10-02  4:04                       ` NeilBrown
  2014-10-02 13:52                         ` Mike Snitzer
  2014-10-03  1:14                         ` [dm-devel] " Martin K. Petersen
  0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2014-10-02  4:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen, linux RAID, lkml



I plan to submit this to Linus tomorrow, hopefully for 3.7, unless there are
complaints. It is in my for-next branch now.

Thanks,
NeilBrown


From aec6f821ed92fac5ae4f1db50279a3999de5872a Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 2 Oct 2014 13:45:00 +1000
Subject: [PATCH] md/raid5: disable 'DISCARD' by default due to safety
 concerns.

It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 183588b11fc1..9f0fbecd1eb5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -64,6 +64,10 @@
 #define cpu_to_group(cpu) cpu_to_node(cpu)
 #define ANY_GROUP NUMA_NO_NODE
 
+static bool devices_handle_discard_safely = false;
+module_param(devices_handle_discard_safely, bool, 0644);
+MODULE_PARM_DESC(devices_handle_discard_safely,
+		 "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
 static struct workqueue_struct *raid5_wq;
 /*
  * Stripe cache
@@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
 		mddev->queue->limits.discard_granularity = stripe;
 		/*
 		 * unaligned part of discard request will be ignored, so can't
-		 * guarantee discard_zerors_data
+		 * guarantee discard_zeroes_data
 		 */
 		mddev->queue->limits.discard_zeroes_data = 0;
 
@@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
 			    !bdev_get_queue(rdev->bdev)->
 						limits.discard_zeroes_data)
 				discard_supported = false;
+			/* Unfortunately, discard_zeroes_data is not currently
+			 * a guarantee - just a hint.  So we only allow DISCARD
+			 * if the sysadmin has confirmed that only safe devices
+			 * are in use by setting a module parameter.
+			 */
+			if (!devices_handle_discard_safely) {
+				if (discard_supported) {
+					pr_info("md/raid456: discard support disabled due to uncertainty.\n");
+					pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
+				}
+				discard_supported = false;
+			}
 		}
 
 		if (discard_supported &&
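
To make the read-modify-write argument in the commit message concrete, here is
a toy XOR-parity example (an illustration with invented values, not raid5.c
logic):

/* Toy 3-device RAID5 stripe: d0, d1 data, p parity (p = d0 ^ d1).
 * Shows why RMW corrupts reconstruction when a discarded block does
 * not actually read back as zero.  Illustration only, not md code. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* After a full-stripe DISCARD, md assumes everything reads as zero,
	 * so data and parity are taken to be consistent (0 ^ 0 == 0). */
	uint8_t assumed_d0 = 0x00, assumed_p = 0x00;

	/* But suppose d1's device ignores discard_zeroes_data and keeps
	 * returning stale data for the discarded block. */
	uint8_t d1_on_disk = 0x5a;

	/* A later write of 0x37 to d0 takes the read-modify-write path:
	 * new_p = old_p ^ old_d0 ^ new_d0, without ever reading d1. */
	uint8_t d0_new = 0x37;
	uint8_t p_new = assumed_p ^ assumed_d0 ^ d0_new;	/* 0x37 */

	/* If d0's device now fails, d0 is rebuilt from d1 and parity. */
	uint8_t d0_rebuilt = d1_on_disk ^ p_new;		/* 0x6d != 0x37 */

	printf("wrote d0=0x%02x, rebuilt d0=0x%02x -> %s\n",
	       d0_new, d0_rebuilt,
	       d0_rebuilt == d0_new ? "ok" : "corrupted");
	return 0;
}

Per the commit message, opting in is done with
raid456.devices_handle_discard_safely=Y on the kernel command line or as a
module option; with the 0644 permission above, the setting should also be
reachable at runtime via
/sys/module/raid456/parameters/devices_handle_discard_safely (the runtime path
is inferred from the patch, not stated in the thread).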


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-02  4:04                       ` [dm-devel] " NeilBrown
@ 2014-10-02 13:52                         ` Mike Snitzer
  2014-10-02 18:00                           ` Mike Snitzer
  2014-10-03  1:14                         ` [dm-devel] " Martin K. Petersen
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2014-10-02 13:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen, linux RAID, lkml

On Thu, Oct 02 2014 at 12:04am -0400,
NeilBrown <neilb@suse.de> wrote:

> 
> I plan to submit this to Linus tomorrow, hopefully for 3.7, unless there are
> complaints. It is in my for-next branch now.
> 
> Thanks,
> NeilBrown
> 
> 
> From aec6f821ed92fac5ae4f1db50279a3999de5872a Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 2 Oct 2014 13:45:00 +1000
> Subject: [PATCH] md/raid5: disable 'DISCARD' by default due to safety
>  concerns.
> 
> It has come to my attention (thanks Martin) that 'discard_zeroes_data'
> is only a hint.  Some devices in some cases don't do what it
> says on the label.
> 
> The use of DISCARD in RAID5 depends on reads from discarded regions
> being predictably zero.  If a write to a previously discarded region
> performs a read-modify-write cycle it assumes that the parity block
> was consistent with the data blocks.  If all were zero, this would
> be the case.  If some are and some aren't this would not be the case.
> This could lead to data corruption after a device failure when
> data needs to be reconstructed from the parity.
> 
> As we cannot trust 'discard_zeroes_data', ignore it by default
> and so disallow DISCARD on all raid4/5/6 arrays.
> 
> As many devices are trustworthy, and as there are benefits to using
> DISCARD, add a module parameter to over-ride this caution and cause
> DISCARD to work if discard_zeroes_data is set.
> 
> If a site want to enable DISCARD on some arrays but not on others they
> should select DISCARD support at the filesystem level, and set the
> raid456 module parameter.
>     raid456.devices_handle_discard_safely=Y
> 
> As this is a data-safety issue, I believe this patch is suitable for
> -stable.
> DISCARD support for RAID456 was added in 3.7
> 
> Cc: Shaohua Li <shli@kernel.org>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Heinz Mauelshagen <heinzm@redhat.com>
> Cc: stable@vger.kernel.org (3.7+)
> Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
> Signed-off-by: NeilBrown <neilb@suse.de>

Acked-by: Mike Snitzer <snitzer@redhat.com>

Thanks Neil, I'll get something comparable staged for dm-raid for 3.18
(less urgency given that dm-raid doesn't yet support discards).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dm-raid: add RAID discard support
  2014-10-02 13:52                         ` Mike Snitzer
@ 2014-10-02 18:00                           ` Mike Snitzer
  0 siblings, 0 replies; 28+ messages in thread
From: Mike Snitzer @ 2014-10-02 18:00 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen, linux RAID, lkml

On Thu, Oct 02 2014 at  9:52am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> Thanks Neil, I'll get something comparable staged for dm-raid for 3.18
> (less urgency given that dm-raid doesn't yet support discards).

FYI, I've staged the following dm-raid change in linux-next for 3.18:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=bfc3f335f9886b68d281240aab751b96259d8860
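
(Assuming the staged dm-raid change keeps the same parameter name as the md
patch, enabling it would look like dm_raid.devices_handle_discard_safely=Y on
the kernel command line, or at runtime:

	echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely

This is inferred from the commits referenced above rather than spelled out in
the thread.)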

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01  2:56             ` NeilBrown
                                 ` (3 preceding siblings ...)
  2014-10-01 18:57               ` Brassow Jonathan
@ 2014-10-03  1:09               ` Martin K. Petersen
  4 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2014-10-03  1:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Heinz Mauelshagen, device-mapper development, Shaohua Li,
	Martin K. Petersen

>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:

Neil> I would really rather that discard_zeroes_data were only set on
Neil> devices where it was actually true.  Then it wouldn't be my
Neil> problem any more.

Yeah. It's too late for 3.18 but I'll try to get a whitelist going for
3.19.

Neil>  Here is a question: if a region has been discarded, are we
Neil>  guaranteed that reads are at least stable.  i.e. if I read twice
Neil>  will I definitely get the same value?

If discard_zeroes_data=1, then yes.
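
(For reference: the hint in question is exported per device as
/sys/block/<dev>/queue/discard_zeroes_data, e.g.

	cat /sys/block/sdX/queue/discard_zeroes_data	# 1 = device claims zeroes

where sdX is a placeholder; this mirrors the q->limits.discard_zeroes_data
check in the patches above.)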

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] dm-raid: add RAID discard support
  2014-10-01 11:13               ` Heinz Mauelshagen
@ 2014-10-03  1:12                 ` Martin K. Petersen
  0 siblings, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2014-10-03  1:12 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: device-mapper development, Shaohua Li, Martin K. Petersen

>>>>> "Heinz" == Heinz Mauelshagen <heinzm@redhat.com> writes:

Heinz> BTW: we're referring to SSD discard_zeroes_data issues only so
Heinz> far it seems?  Can we be sure that SCSI unmap with the TPRZ bit
Heinz> set will always return zeroes on reads for arbitrary storage
Heinz> devices?

Only if we used WRITE SAME to discard.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] dm-raid: add RAID discard support
  2014-10-02  4:04                       ` [dm-devel] " NeilBrown
  2014-10-02 13:52                         ` Mike Snitzer
@ 2014-10-03  1:14                         ` Martin K. Petersen
  1 sibling, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2014-10-03  1:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: Mike Snitzer, Heinz Mauelshagen, device-mapper development,
	Shaohua Li, Martin K. Petersen, linux RAID, lkml

>>>>> "Neil" == NeilBrown  <neilb@suse.de> writes:

Neil> I plan to submit this to Linus tomorrow, hopefully for 3.7, unless
Neil> there are complaints. It is in my for-next branch now.

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-10-03  1:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-23 16:51 [PATCH] dm-raid: add RAID discard support heinzm
2014-09-23 21:52 ` Brassow Jonathan
2014-09-23 23:07 ` Martin K. Petersen
2014-09-23 23:33   ` NeilBrown
2014-09-24  2:20     ` Martin K. Petersen
2014-09-24  4:05       ` Brassow Jonathan
2014-09-24  4:21         ` NeilBrown
2014-09-24  4:35           ` Brassow Jonathan
2014-09-24 11:02           ` Heinz Mauelshagen
2014-10-01  2:56             ` NeilBrown
2014-10-01 11:13               ` Heinz Mauelshagen
2014-10-03  1:12                 ` Martin K. Petersen
2014-10-01 13:32               ` Mike Snitzer
2014-10-01 23:34                 ` NeilBrown
2014-10-02  1:31                   ` Mike Snitzer
2014-10-02  2:00                     ` NeilBrown
2014-10-02  4:04                       ` [dm-devel] " NeilBrown
2014-10-02 13:52                         ` Mike Snitzer
2014-10-02 18:00                           ` Mike Snitzer
2014-10-03  1:14                         ` [dm-devel] " Martin K. Petersen
2014-10-01 16:00               ` [PATCH] " Andrey Kuzmin
2014-10-01 23:15                 ` NeilBrown
2014-10-01 18:57               ` Brassow Jonathan
2014-10-01 23:18                 ` NeilBrown
2014-10-03  1:09               ` Martin K. Petersen
2014-09-24 14:21   ` Christoph Hellwig
2014-09-24 14:38     ` Heinz Mauelshagen
2014-09-24 15:11     ` Martin K. Petersen
