* RAID creation resync behaviors
@ 2017-05-03 20:27 Shaohua Li
  2017-05-03 21:06 ` David Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Shaohua Li @ 2017-05-03 20:27 UTC (permalink / raw)
  To: linux-raid; +Cc: jes.sorensen, neilb

Hi,

Currently we have different resync behaviors in array creation.

- raid1: copy data from disk 0 to disk 1 (overwrite)
- raid10: read both disks, compare and write if there is difference (compare-write)
- raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
- raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)

Writing the whole disk is very unfriendly to SSDs, because it reduces their
lifetime. And if the user has already done a trim before creation, the
unnecessary writes could make the SSD slower in the future. Could we prefer
compare-write to overwrite if mdadm detects the disks are SSDs? Surely
compare-write is sometimes slower than overwrite, so maybe add a new option to
mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.

Thanks,
Shaohua


* Re: RAID creation resync behaviors
  2017-05-03 20:27 RAID creation resync behaviors Shaohua Li
@ 2017-05-03 21:06 ` David Brown
  2017-05-04  1:54   ` Shaohua Li
  2017-05-03 23:58 ` Andreas Klauer
  2017-05-04  1:07 ` NeilBrown
  2 siblings, 1 reply; 27+ messages in thread
From: David Brown @ 2017-05-03 21:06 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: jes.sorensen, neilb

On 03/05/17 22:27, Shaohua Li wrote:
> Hi,
>
> Currently we have different resync behaviors in array creation.
>
> - raid1: copy data from disk 0 to disk 1 (overwrite)
> - raid10: read both disks, compare and write if there is difference (compare-write)
> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>
> Writing the whole disk is very unfriendly to SSDs, because it reduces their
> lifetime. And if the user has already done a trim before creation, the
> unnecessary writes could make the SSD slower in the future. Could we prefer
> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
> compare-write is sometimes slower than overwrite, so maybe add a new option to
> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>

When doing the first sync, md tracks how far its sync has got, keeping a 
record in the metadata in case it has to be restarted (such as due to a 
reboot while syncing).  Why not simply /not/ sync stripes until you 
first write to them?  It may be that a counter of synced stripes is not 
enough, and you need a bitmap (like the write intent bitmap), but it 
would reduce the creation sync time to 0 and avoid any writes at all.
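
To make that concrete, a rough sketch (the names and the per-bit granularity
here are invented for illustration, nothing to do with the real md code or
metadata):

/* Hypothetical first-write map: a clear bit means "never written since
 * creation", so that range has never needed to be made consistent. */
#include <stdbool.h>
#include <stdint.h>

#define STRIPES_PER_BIT 4096                 /* granularity: a tuning choice */

struct first_write_map {
	uint64_t nr_bits;
	unsigned char *bits;                 /* 1 = range written (and synced) */
};

static bool range_written(const struct first_write_map *m, uint64_t stripe)
{
	uint64_t bit = stripe / STRIPES_PER_BIT;
	return m->bits[bit / 8] & (1u << (bit % 8));
}

static void mark_written(struct first_write_map *m, uint64_t stripe)
{
	uint64_t bit = stripe / STRIPES_PER_BIT;
	m->bits[bit / 8] |= (1u << (bit % 8));
}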

David




* Re: RAID creation resync behaviors
  2017-05-03 20:27 RAID creation resync behaviors Shaohua Li
  2017-05-03 21:06 ` David Brown
@ 2017-05-03 23:58 ` Andreas Klauer
  2017-05-04  2:22   ` Shaohua Li
  2017-05-04  1:07 ` NeilBrown
  2 siblings, 1 reply; 27+ messages in thread
From: Andreas Klauer @ 2017-05-03 23:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On Wed, May 03, 2017 at 01:27:48PM -0700, Shaohua Li wrote:
> Writing the whole disk is very unfriendly to SSDs, because it reduces their
> lifetime. And if the user has already done a trim before creation, the
> unnecessary writes could make the SSD slower in the future.

I'm not a kernel developer so maybe I shouldn't reply. Feel free to ignore.

I don't see this as a big issue; whoever uses SSDs will likely also run fstrim,
so all SSDs will know about free blocks regardless of how the drive was added
to the RAID.

You don't resync every day, and once populated with data you just can't help
but have many writes when adding / replacing drives. No way around it.

> An option to let mdadm trim SSD before creation sounds reasonable too.

This is my personal opinion but - there is way too much trim in Linux. 

On an HDD, if you did a botched mkfs on the wrong device you still had a chance
to recover data; with an SSD it's all gone in the blink of an eye, because mkfs.ext4
and other programs unfortunately do trim without asking. Lots of people 
come to this list only after already playing with mdadm --create and if 
mdadm simply started trimming SSDs too, then all would be lost.
LVM has these nice metadata backups but they're rendered useless 
if lvm.conf has issue_discards set to 1. Etc...

And it's entirely superfluous. There was a big hullabaloo when SSDs were
new and everyone was concerned about how quickly they'd die when written to,
but tests show their endurance is considerably greater than advertised.
A single RAID resync won't put a dent in even a consumer SSD's lifetime.

At the same time you have two utilities, blkdiscard and fstrim, so anyone
who desires to trim can already easily do so with little effort. For SSDs
that return zero after TRIM you can already create like this:

blkdiscard device1
blkdiscard device2
blkdiscard device3
echo 3 > /proc/sys/vm/drop_caches # optional: Linux caches trimmed data
mdadm --create --assume-clean /dev/md ... device1 device2 device3

If you wanted mdadm to do that directly, how about an mdadm --create --trim
which implies --assume-clean? But in my opinion it should not happen unasked.
If it were up to me I'd even add a prompt asking to confirm data loss...

As for overwrite vs. compare-write, I don't know if it's possible or 
how painful it would be to implement but could you start out comparing, 
continue while the data actually matches, but switch to presumably much 
faster overwrite mode once there are sufficient mismatches? Perhaps with a 
fallback option so it can go back to compare later if data starts to match.

So kind of a smart-compare-overwrite mode which would go something like:

Compare. Match.
Compare. Match.
Compare. Mismatch. Overwrite.
Compare. Mismatch. Overwrite x2.
Compare. Mismatch. Overwrite x4.
Compare. Match.
Compare. Mismatch. Overwrite x8.
Compare. Mismatch. Overwrite x16.

Perhaps cap the overwrite multiplier at a certain point...

Maybe a silly idea, I don't know.
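
To make the idea a bit more concrete, a rough sketch in C (illustrative only -
chunks_match() and overwrite_chunk() are imaginary placeholders for the real
per-chunk resync operations, and the cap is arbitrary):

#include <stdint.h>

#define MAX_RUN 64                /* arbitrary cap on the overwrite multiplier */

/* Placeholders for the real per-chunk resync operations. */
extern int chunks_match(uint64_t chunk);     /* read + compare all copies */
extern void overwrite_chunk(uint64_t chunk); /* regenerate and write */

void smart_resync(uint64_t nr_chunks)
{
	uint64_t run = 0;         /* chunks still to overwrite without comparing */
	uint64_t next_run = 1;    /* size of the next overwrite burst: 1, 2, 4, ... */

	for (uint64_t c = 0; c < nr_chunks; c++) {
		if (run) {
			overwrite_chunk(c);   /* in a mismatching region: skip the compare */
			run--;
			continue;
		}
		if (chunks_match(c)) {
			next_run = 1;         /* data matches again: keep comparing */
		} else {
			overwrite_chunk(c);
			run = next_run;       /* overwrite the next few chunks blindly */
			if (next_run < MAX_RUN)
				next_run *= 2;
		}
	}
}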

Regards
Andreas Klauer


* Re: RAID creation resync behaviors
  2017-05-03 20:27 RAID creation resync behaviors Shaohua Li
  2017-05-03 21:06 ` David Brown
  2017-05-03 23:58 ` Andreas Klauer
@ 2017-05-04  1:07 ` NeilBrown
  2017-05-04  2:04   ` Shaohua Li
  2 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-05-04  1:07 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: jes.sorensen, neilb


On Wed, May 03 2017, Shaohua Li wrote:

> Hi,
>
> Currently we have different resync behaviors in array creation.
>
> - raid1: copy data from disk 0 to disk 1 (overwrite)
> - raid10: read both disks, compare and write if there is difference (compare-write)
> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)

The approach taken for raid1 and raid4/5 provides the fastest sync for
an array built on uninitialised spinning devices.
RAID6 could use the same approach but would involve more CPU and so
the original author of the RAID6 code (hpa) chose to go for the low-CPU
cost option.  I don't know if tests were done, or if they would still be
valid on new hardware.
The raid10 approach comes from "it is too hard to optimize in general
because different RAID10 layouts have different trade-offs, so just
take the easy way out."

>
> Writing the whole disk is very unfriendly to SSDs, because it reduces their
> lifetime. And if the user has already done a trim before creation, the
> unnecessary writes could make the SSD slower in the future. Could we prefer
> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
> compare-write is sometimes slower than overwrite, so maybe add a new option to
> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.

An option to ask mdadm to trim the data space and then --assume-clean
certainly sounds reasonable.

One possible approach would be to use compare-write until some
threshold of writes was crossed, then switch to over-write.  That could
work well for RAID1, but could be awkward to manage for RAID5.
Possibly mdadm could read the first few megabytes of each device in RAID5
and try to guess if many writes will be needed.  If they will, the
current approach is best.  If not, assemble the array so that
compare-write is used.
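
As a sketch of what that guess might look like on the mdadm side (illustrative
only - the 4MB sample size and the "mostly zero" threshold are arbitrary, and
nothing like this exists in mdadm today):

/* Sketch: sample the start of a member device and guess whether
 * compare-write would mostly find matching (blank) data. */
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define SAMPLE_BYTES (4 * 1024 * 1024)  /* "first few megabytes" */

static bool device_looks_blank(const char *path)
{
	unsigned char buf[65536];
	uint64_t nonzero = 0, total = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return false;
	while (total < SAMPLE_BYTES) {
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n <= 0)
			break;
		for (ssize_t i = 0; i < n; i++)
			if (buf[i])
				nonzero++;
		total += n;
	}
	close(fd);
	/* Arbitrary threshold: call it blank if <0.1% of sampled bytes are set. */
	return total && nonzero * 1000 < total;
}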

I'm in favour of providing options and making the defaults "not
terrible".  I think they currently are "not terrible", but maybe they
can be better in some cases.

NeilBrown



* Re: RAID creation resync behaviors
  2017-05-03 21:06 ` David Brown
@ 2017-05-04  1:54   ` Shaohua Li
  2017-05-04  7:37     ` David Brown
  2017-05-04 15:50     ` Wols Lists
  0 siblings, 2 replies; 27+ messages in thread
From: Shaohua Li @ 2017-05-04  1:54 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid, jes.sorensen, neilb

On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
> On 03/05/17 22:27, Shaohua Li wrote:
> > Hi,
> > 
> > Currently we have different resync behaviors in array creation.
> > 
> > - raid1: copy data from disk 0 to disk 1 (overwrite)
> > - raid10: read both disks, compare and write if there is difference (compare-write)
> > - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
> > - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
> > 
> > Writing the whole disk is very unfriendly to SSDs, because it reduces their
> > lifetime. And if the user has already done a trim before creation, the
> > unnecessary writes could make the SSD slower in the future. Could we prefer
> > compare-write to overwrite if mdadm detects the disks are SSDs? Surely
> > compare-write is sometimes slower than overwrite, so maybe add a new option to
> > mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
> > 
> 
> When doing the first sync, md tracks how far its sync has got, keeping a
> record in the metadata in case it has to be restarted (such as due to a
> reboot while syncing).  Why not simply /not/ sync stripes until you first
> write to them?  It may be that a counter of synced stripes is not enough,
> and you need a bitmap (like the write intent bitmap), but it would reduce
> the creation sync time to 0 and avoid any writes at all.

For RAID 4/5/6, this means we must always do a full stripe write for any normal
write that hits a range which hasn't been synced. This would harm the performance
of normal writes. For RAID 1/10, this sounds more appealing, but each bit in
the bitmap will stand for a range. If only part of the range is written by
normal IO, we have two choices: sync the range immediately and clear the bit,
in which case the sync impacts normal IO; or don't sync immediately, but then,
since the bit is still set (which means the range isn't synced), read IO can
only access the first disk, which is harmful too.

Thanks,
Shaohua


* Re: RAID creation resync behaviors
  2017-05-04  1:07 ` NeilBrown
@ 2017-05-04  2:04   ` Shaohua Li
  2017-05-09 18:39     ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: Shaohua Li @ 2017-05-04  2:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, jes.sorensen, neilb

On Thu, May 04, 2017 at 11:07:01AM +1000, Neil Brown wrote:
> On Wed, May 03 2017, Shaohua Li wrote:
> 
> > Hi,
> >
> > Currently we have different resync behaviors in array creation.
> >
> > - raid1: copy data from disk 0 to disk 1 (overwrite)
> > - raid10: read both disks, compare and write if there is difference (compare-write)
> > - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
> > - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
> 
> The approach taken for raid1 and raid4/5 provides the fastest sync for
> an array built on uninitialised spinning devices.
> RAID6 could use the same approach but would involve more CPU and so
> the original author of the RAID6 code (hpa) chose to go for the low-CPU
> cost option.  I don't know if tests were done, or if they would still be
> valid on new hardware.
> The raid10 approach comes from "it is too hard to optimize in general
> because different RAID10 layouts have different trade-offs, so just
> take the easy way out."

ok, thanks for the explanation! 
> >
> > Writing the whole disk is very unfriendly to SSDs, because it reduces their
> > lifetime. And if the user has already done a trim before creation, the
> > unnecessary writes could make the SSD slower in the future. Could we prefer
> > compare-write to overwrite if mdadm detects the disks are SSDs? Surely
> > compare-write is sometimes slower than overwrite, so maybe add a new option to
> > mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
> 
> An option to ask mdadm to trim the data space and then --assume-clean
> certainly sounds reasonable.

This doesn't work well. Reads return 0 for trimmed data space on some SSDs, but
not on all of them. If they don't, we will have trouble.

> One possible approach would be to use compare-write until some
> threshold of writes was crossed, then switch to over-write.  That could
> work well for RAID1, but could be awkward to manage for RAID5.
> Possibly mdadm could read the first few megabytes of each device in RAID5
> and try to guess if many writes will be needed.  If they will, the
> current approach is best.  If not, assemble the array so that
> compare-write is used.

I think this makes sense if we do a trim first, assuming that on most SSDs reads
return 0 for trimmed space. Maybe trim first, and check whether reads return 0.
If they do, do compare-write (or even assume-clean), otherwise overwrite.

> I'm in favour of providing options and making the defaults "not
> terrible".  I think they currently are "not terrible", but maybe they
> can be better in some cases.

Agree, more options are required.

Thanks,
Shaohua


* Re: RAID creation resync behaviors
  2017-05-03 23:58 ` Andreas Klauer
@ 2017-05-04  2:22   ` Shaohua Li
  2017-05-04  7:55     ` Andreas Klauer
  0 siblings, 1 reply; 27+ messages in thread
From: Shaohua Li @ 2017-05-04  2:22 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: linux-raid, jes.sorensen, neilb

On Thu, May 04, 2017 at 01:58:56AM +0200, Andreas Klauer wrote:
> On Wed, May 03, 2017 at 01:27:48PM -0700, Shaohua Li wrote:
> > Writing the whole disk is very unfriendly to SSDs, because it reduces their
> > lifetime. And if the user has already done a trim before creation, the
> > unnecessary writes could make the SSD slower in the future.
> 
> I'm not a kernel developer so maybe I shouldn't reply. Feel free to ignore.
> 
> I don't see this as a big issue; whoever uses SSDs will likely also run fstrim,
> so all SSDs will know about free blocks regardless of how the drive was added
> to the RAID.
> 
> You don't resync every day, and once populated with data you just can't help
> but have many writes when adding / replacing drives. No way around it.

I agree, but fstrim doesn't make the issue smaller: there are still extra writes,
which we should avoid if we can do so in an easy way.

> > An option to let mdadm trim SSD before creation sounds reasonable too.
> 
> This is my personal opinion but - there is way too much trim in Linux. 
> 
> On an HDD, if you did a botched mkfs on the wrong device you still had a chance
> to recover data; with an SSD it's all gone in the blink of an eye, because mkfs.ext4
> and other programs unfortunately do trim without asking. Lots of people 
> come to this list only after already playing with mdadm --create and if 
> mdadm simply started trimming SSDs too, then all would be lost.
> LVM has these nice metadata backups but they're rendered useless 
> if lvm.conf has issue_discards set to 1. Etc...

I totally understand the concerns. I think a new option is required for this,
and it should not be the default.

> And it's entirely superfluous. There was a big hullabaloo when SSDs were
> new and everyone was concerned about how quickly they'd die when written to,
> but tests show their endurance is considerably greater than advertised.
> A single RAID resync won't put a dent in even a consumer SSD's lifetime.

In my experience, if the filesystem on an SSD is nearly full, the system becomes
unstable with at least one type of SSD. Fully writing an SSD not only reduces its
lifetime, it also gives the SSD firmware a higher chance of failing.

> At the same time you have two utilities, blkdiscard and fstrim, so anyone
> who desires to trim can already easily do so with little effort. For SSDs
> that return zero after TRIM you can already create like this:
> 
> blkdiscard device1
> blkdiscard device2
> blkdiscard device3
> echo 3 > /proc/sys/vm/drop_caches # optional: Linux caches trimmed data
> mdadm --create --assume-clean /dev/md ... device1 device2 device3

Unfortunately not all SSDs return zero after trim.
 
> If you wanted mdadm to do that directly, how about an mdadm --create --trim
> which implies --assume-clean? But in my opinion it should not happen unasked.
> If it were up to me I'd even add a prompt asking to confirm data loss...
> 
> As for overwrite vs. compare-write, I don't know if it's possible or 
> how painful it would be to implement but could you start out comparing, 
> continue while the data actually matches, but switch to presumably much 
> faster overwrite mode once there are sufficient mismatches? Perhaps with a 
> fallback option so it can go back to compare later if data starts to match.
> 
> So kind of a smart-compare-overwrite mode which would go something like:
> 
> Compare. Match.
> Compare. Match.
> Compare. Mismatch. Overwrite.
> Compare. Mismatch. Overwrite x2.
> Compare. Mismatch. Overwrite x4.
> Compare. Match.
> Compare. Mismatch. Overwrite x8.
> Compare. Mismatch. Overwrite x16.
> 
> Perhaps cap the overwrite multiplier at a certain point...
> 
> Maybe a silly idea, I don't know.

This certainly is an interesting idea. I'm not sure we should put complex
heuristics into the kernel side, though. If there is an easy approach on the
mdadm side, it will definitely be preferred.

Thanks,
Shaohua


* Re: RAID creation resync behaviors
  2017-05-04  1:54   ` Shaohua Li
@ 2017-05-04  7:37     ` David Brown
  2017-05-04 16:02       ` Wols Lists
  2017-05-04 21:57       ` NeilBrown
  2017-05-04 15:50     ` Wols Lists
  1 sibling, 2 replies; 27+ messages in thread
From: David Brown @ 2017-05-04  7:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On 04/05/17 03:54, Shaohua Li wrote:
> On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
>> On 03/05/17 22:27, Shaohua Li wrote:
>>> Hi,
>>>
>>> Currently we have different resync behaviors in array creation.
>>>
>>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>>
>>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>> lifetime. And if the user has already done a trim before creation, the
>>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>>
>>
>> When doing the first sync, md tracks how far its sync has got, keeping a
>> record in the metadata in case it has to be restarted (such as due to a
>> reboot while syncing).  Why not simply /not/ sync stripes until you first
>> write to them?  It may be that a counter of synced stripes is not enough,
>> and you need a bitmap (like the write intent bitmap), but it would reduce
>> the creation sync time to 0 and avoid any writes at all.
> 
> For RAID 4/5/6, this means we must always do a full stripe write for any normal
> write that hits a range which hasn't been synced. This would harm the performance
> of normal writes.

Agreed.  The unused sectors could be set to 0, rather than read from the
disks - that would reduce the latency and be friendly to high-end SSDs
with compression (zero blocks compress quite well!).

> For RAID 1/10, this sounds more appealing, but each bit in
> the bitmap will stand for a range. If only part of the range is written by
> normal IO, we have two choices: sync the range immediately and clear the bit,
> in which case the sync impacts normal IO; or don't sync immediately, but then,
> since the bit is still set (which means the range isn't synced), read IO can
> only access the first disk, which is harmful too.
> 

This could be done in a more sophisticated manner.  (Yes, I appreciate
that "sophisticated" or "complex" are a serious disadvantage - I'm just
throwing up ideas that could be considered.)

Divide the array into "sync blocks", each covering a range of stripes,
with a bitmap of three states - unused, partially synced, fully synced.
 All blocks start off unused.  If a write is made to a previously unused
block, that block becomes partially synced, and the write has to be done
as a full stripe write.  For a partially synced block, keep a list of
ranges of synced stripes (a list will normally be smaller than a bitmap
here).  Whenever there are partially synced blocks in the array, have a
low priority process (like the normal array creation sync process, or
rebuild processes) sync the stripes until the block is finished as a
fully synced block.

That should let you delay the time-consuming and write intensive
creation sync until you actually need to sync the blocks, without /too/
much overhead in metadata or in delays when using the disk.
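
To sketch the bookkeeping this implies (types invented purely for illustration,
not a proposal for the actual metadata format):

#include <stdint.h>

/* One "sync block" covers a fixed range of stripes.  Partially synced blocks
 * carry a small list of stripe ranges that are already consistent. */
enum sync_state {
	SYNC_UNUSED,             /* never written, never synced */
	SYNC_PARTIAL,            /* some full-stripe writes have landed here */
	SYNC_FULL,               /* background sync has finished this block */
};

struct synced_range {
	uint64_t first_stripe;
	uint64_t last_stripe;
	struct synced_range *next;
};

struct sync_block {
	enum sync_state state;
	struct synced_range *synced; /* only meaningful while SYNC_PARTIAL */
};

struct sync_map {
	uint64_t stripes_per_block;
	uint64_t nr_blocks;
	struct sync_block *blocks;
};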


I have another couple of questions that might be relevant, but I am
really not sure about the correct answers.

First, if you have a stripe that you know is unused - it has not been
written to since the array was created - could the raid layer safely
return all zeros if an attempt was made to read the stripe?

Second, when syncing an unused stripe (such as during creation), rather
than reading the old data and copying it or generating parities, could
we simply write all zeros to all the blocks in the stripes?  For many
SSDs, this is very efficient.

Best regards,

David



* Re: RAID creation resync behaviors
  2017-05-04  2:22   ` Shaohua Li
@ 2017-05-04  7:55     ` Andreas Klauer
  2017-05-04  8:06       ` Roman Mamedov
  2017-05-04 15:20       ` Brad Campbell
  0 siblings, 2 replies; 27+ messages in thread
From: Andreas Klauer @ 2017-05-04  7:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On Wed, May 03, 2017 at 07:22:58PM -0700, Shaohua Li wrote:
> 
> Unfortunately not all SSDs return zero after trim.
>  

For example? I was under the impression that pretty much all of them do. 
Even the ones that don't advertise it returned zero after trim for me.
[Sometimes you get original data but that's Linux; gone after drop_caches]
(Of course, I don't have access to that many different models of SSD...)

But what do they actually return then? Original data? This might seem 
fine at first but it can't be the case, right? I mean, even if the SSD 
does not erase a trimmed block right away, sooner or later it would, 
in order to re-use it for new data and wear leveling. If it's never 
re-used what is the point of trimming it in the first place?

So it seems to me that even if an SSD attempted to keep returning original data
for a while, sooner or later it would be gone, and if that happens randomly,
as far as your RAID is concerned you will be in mismatch hell anyway.
So after a full trim, you can skip the initial sync either way.

Mismatches might also be an issue for SSDs that do return zero after trim,
but have different erase block sizes / trim boundaries / partition offsets.
Or does read zero after trim actually mean it works at 4K page resolution? 
Never mind, then. :-)

Regards
Andreas Klauer


* Re: RAID creation resync behaviors
  2017-05-04  7:55     ` Andreas Klauer
@ 2017-05-04  8:06       ` Roman Mamedov
  2017-05-04 15:20       ` Brad Campbell
  1 sibling, 0 replies; 27+ messages in thread
From: Roman Mamedov @ 2017-05-04  8:06 UTC (permalink / raw)
  To: Andreas Klauer; +Cc: Shaohua Li, linux-raid, jes.sorensen, neilb

On Thu, 4 May 2017 09:55:51 +0200
Andreas Klauer <Andreas.Klauer@metamorpher.de> wrote:

> For example? I was under the impression that pretty much all of them do. 
> Even the ones that don't advertise it returned zero after trim for me.
> [Sometimes you get original data but that's Linux; gone after drop_caches]
> (Of course, I don't have access to that many different models of SSD...)
> 
> But what do they actually return then? Original data?

Consult Wikipedia at least, if not the original ATA standards documents:
https://en.wikipedia.org/wiki/Trim_(computing)#ATA
Can return "undefined", "something that is the same every time" or "zeroes".
Apparently the OS can query which is to be expected with a particular device,
but firstly, you can't necessarily trust that 100%, and secondly, that's
operating on a level lower (ATA) than where md is.

-- 
With respect,
Roman


* Re: RAID creation resync behaviors
  2017-05-04  7:55     ` Andreas Klauer
  2017-05-04  8:06       ` Roman Mamedov
@ 2017-05-04 15:20       ` Brad Campbell
  1 sibling, 0 replies; 27+ messages in thread
From: Brad Campbell @ 2017-05-04 15:20 UTC (permalink / raw)
  To: Andreas Klauer, Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On 04/05/17 15:55, Andreas Klauer wrote:
> On Wed, May 03, 2017 at 07:22:58PM -0700, Shaohua Li wrote:
>>
>> Unfortunately not all SSDs return zero after trim.
>>
>
> For example? I was under the impression that pretty much all of them do.

Pretty much every Samsung except the 840 Pro it seems. Certainly my 830s 
and 850s don't.

> Even the ones that don't advertise it returned zero after trim for me.

You got lucky.



* Re: RAID creation resync behaviors
  2017-05-04  1:54   ` Shaohua Li
  2017-05-04  7:37     ` David Brown
@ 2017-05-04 15:50     ` Wols Lists
  2017-05-04 22:00       ` NeilBrown
  1 sibling, 1 reply; 27+ messages in thread
From: Wols Lists @ 2017-05-04 15:50 UTC (permalink / raw)
  To: Shaohua Li, David Brown; +Cc: linux-raid, jes.sorensen, neilb

On 04/05/17 02:54, Shaohua Li wrote:
> On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
>> On 03/05/17 22:27, Shaohua Li wrote:
>>> Hi,
>>>
>>> Currently we have different resync behaviors in array creation.
>>>
>>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>>
>>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>> lifetime. And if the user has already done a trim before creation, the
>>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>>
>>
>> When doing the first sync, md tracks how far its sync has got, keeping a
>> record in the metadata in case it has to be restarted (such as due to a
>> reboot while syncing).  Why not simply /not/ sync stripes until you first
>> write to them?  It may be that a counter of synced stripes is not enough,
>> and you need a bitmap (like the write intent bitmap), but it would reduce
>> the creation sync time to 0 and avoid any writes at all.
> 
> For RAID 4/5/6, this means we must always do a full stripe write for any normal
> write that hits a range which hasn't been synced. This would harm the performance
> of normal writes. For RAID 1/10, this sounds more appealing, but each bit in
> the bitmap will stand for a range. If only part of the range is written by
> normal IO, we have two choices: sync the range immediately and clear the bit,
> in which case the sync impacts normal IO; or don't sync immediately, but then,
> since the bit is still set (which means the range isn't synced), read IO can
> only access the first disk, which is harmful too.
> 
We're creating the array, right? So the user is sitting in front of
mdadm looking at its output, right?

So we just print a message saying "the disks aren't sync'd. If you don't
want a performance hit in normal use, fire up a sync now and take the
hit up front".

The question isn't "how do we avoid a performance hit?", it's "we're
going to take a hit, do we take it up-front on creation or defer it
until we're using the array?".

Cheers,
Wol



* Re: RAID creation resync behaviors
  2017-05-04  7:37     ` David Brown
@ 2017-05-04 16:02       ` Wols Lists
  2017-05-04 21:57       ` NeilBrown
  1 sibling, 0 replies; 27+ messages in thread
From: Wols Lists @ 2017-05-04 16:02 UTC (permalink / raw)
  To: David Brown, Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On 04/05/17 08:37, David Brown wrote:
> On 04/05/17 03:54, Shaohua Li wrote:
>> > On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
>>> >> On 03/05/17 22:27, Shaohua Li wrote:
>>>> >>> Hi,
>>>> >>>
>>>> >>> Currently we have different resync behaviors in array creation.
>>>> >>>
>>>> >>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>>> >>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>>> >>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>>> >>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>>> >>>
>>>> >>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>>> >>> lifetime. And if the user has already done a trim before creation, the
>>>> >>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>>> >>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>>> >>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>>> >>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>>> >>>
>>> >>
>>> >> When doing the first sync, md tracks how far its sync has got, keeping a
>>> >> record in the metadata in case it has to be restarted (such as due to a
>>> >> reboot while syncing).  Why not simply /not/ sync stripes until you first
>>> >> write to them?  It may be that a counter of synced stripes is not enough,
>>> >> and you need a bitmap (like the write intent bitmap), but it would reduce
>>> >> the creation sync time to 0 and avoid any writes at all.
>> > 
>> > For RAID 4/5/6, this means we must always do a full stripe write for any normal
>> > write that hits a range which hasn't been synced. This would harm the performance
>> > of normal writes.
> Agreed.  The unused sectors could be set to 0, rather than read from the
> disks - that would reduce the latency and be friendly to high-end SSDs
> with compression (zero blocks compress quite well!).
> 
>> > For RAID 1/10, this sounds more appealing, but each bit in
>> > the bitmap will stand for a range. If only part of the range is written by
>> > normal IO, we have two choices: sync the range immediately and clear the bit,
>> > in which case the sync impacts normal IO; or don't sync immediately, but then,
>> > since the bit is still set (which means the range isn't synced), read IO can
>> > only access the first disk, which is harmful too.
>> > 
> This could be done in a more sophisticated manner.  (Yes, I appreciate
> that "sophisticated" or "complex" are a serious disadvantage - I'm just
> throwing up ideas that could be considered.)
> 
> Divide the array into "sync blocks", each covering a range of stripes,
> with a bitmap of three states - unused, partially synced, fully synced.
>  All blocks start off unused.  If a write is made to a previously unused
> block, that block becomes partially synced, and the write has to be done
> as a full stripe write.  For a partially synced block, keep a list of
> ranges of synced stripes (a list will normally be smaller than a bitmap
> here).  Whenever there are partially synced blocks in the array, have a
> low priority process (like the normal array creation sync process, or
> rebuild processes) sync the stripes until the block is finished as a
> fully synced block.
> 
> That should let you delay the time-consuming and write intensive
> creation sync until you actually need to sync the blocks, without /too/
> much overhead in metadata or in delays when using the disk.

I was thinking along those lines. You mentioned earlier what I would
think of as a "high water mark" - or "how far have we used the array".
The only snag I can think of there is if you start writing in the middle
of the array, so your idea of blocks sounds a lot better.

The other thing - and this would probably be a synonym of "--assume-clean" -
would be to create a flag "--new-array". This would have to be an opt-in - it
tells mdadm that whatever is on the disk is garbage, and when it does
sync it can safely just stream zeroes to the disk - no reads or parity
checks required ... :-) (This idea might need a few tweaks :-)

Cheers,
Wol


* Re: RAID creation resync behaviors
  2017-05-04  7:37     ` David Brown
  2017-05-04 16:02       ` Wols Lists
@ 2017-05-04 21:57       ` NeilBrown
  2017-05-05  6:46         ` David Brown
  1 sibling, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-05-04 21:57 UTC (permalink / raw)
  To: David Brown, Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb


On Thu, May 04 2017, David Brown wrote:

>
> I have another couple of questions that might be relevant, but I am
> really not sure about the correct answers.
>
> First, if you have a stripe that you know is unused - it has not been
> written to since the array was created - could the raid layer safely
> return all zeros if an attempt was made to read the stripe?

"know is unused" and "it has not been written to since the array was
created" are not necessarily the same thing.

If I have some devices which used to have a RAID5 array but for which
the metadata got destroyed, I might carefully "create" a RAID5 over the
devices and then have access to my data.  This has been done more than
once - it is not just theoretical.

But if you really "know" it is unused, then returning zeros should be fine.

>
> Second, when syncing an unused stripe (such as during creation), rather
> than reading the old data and copying it or generating parities, could
> we simply write all zeros to all the blocks in the stripes?  For many
> SSDs, this is very efficient.

If you were happy to destroy whatever was there before (see above
recovery example for when you wouldn't), then it might be possible to
make this work.
You would need to be careful not to write zeros over a region that the
filesystem has already used.
That means you either disable all writes until the initialization
completes (waste of time), or you add complexity to track which strips
have been written and which haven't, and only initialise strips that have
not been written.  This complexity would only be used once in the entire
life of the RAID.  That might not be best use of resources.

NeilBrown



* Re: RAID creation resync behaviors
  2017-05-04 15:50     ` Wols Lists
@ 2017-05-04 22:00       ` NeilBrown
  0 siblings, 0 replies; 27+ messages in thread
From: NeilBrown @ 2017-05-04 22:00 UTC (permalink / raw)
  To: Wols Lists, Shaohua Li, David Brown; +Cc: linux-raid, jes.sorensen, neilb


On Thu, May 04 2017, Wols Lists wrote:

> On 04/05/17 02:54, Shaohua Li wrote:
>> On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
>>> On 03/05/17 22:27, Shaohua Li wrote:
>>>> Hi,
>>>>
>>>> Currently we have different resync behaviors in array creation.
>>>>
>>>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>>>
>>>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>>> lifetime. And if the user has already done a trim before creation, the
>>>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>>>
>>>
>>> When doing the first sync, md tracks how far its sync has got, keeping a
>>> record in the metadata in case it has to be restarted (such as due to a
>>> reboot while syncing).  Why not simply /not/ sync stripes until you first
>>> write to them?  It may be that a counter of synced stripes is not enough,
>>> and you need a bitmap (like the write intent bitmap), but it would reduce
>>> the creation sync time to 0 and avoid any writes at all.
>> 
>> For RAID 4/5/6, this means we must always do a full stripe write for any normal
>> write that hits a range which hasn't been synced. This would harm the performance
>> of normal writes. For RAID 1/10, this sounds more appealing, but each bit in
>> the bitmap will stand for a range. If only part of the range is written by
>> normal IO, we have two choices: sync the range immediately and clear the bit,
>> in which case the sync impacts normal IO; or don't sync immediately, but then,
>> since the bit is still set (which means the range isn't synced), read IO can
>> only access the first disk, which is harmful too.
>> 
> We're creating the array, right? So the user is sitting in front of
> mdadm looking at its output, right?

No, it might be anaconda or yast or some other sysadmin tool that is
running mdadm under the hood.

Presumably those tools could ask the question themselves.

NeilBrown

>
> So we just print a message saying "the disks aren't sync'd. If you don't
> want a performance hit in normal use, fire up a sync now and take the
> hit up front".
>
> The question isn't "how do we avoid a performance hit?", it's "we're
> going to take a hit, do we take it up-front on creation or defer it
> until we're using the array?".
>
> Cheers,
> Wol



* Re: RAID creation resync behaviors
  2017-05-04 21:57       ` NeilBrown
@ 2017-05-05  6:46         ` David Brown
  0 siblings, 0 replies; 27+ messages in thread
From: David Brown @ 2017-05-05  6:46 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li; +Cc: linux-raid, jes.sorensen, neilb

On 04/05/17 23:57, NeilBrown wrote:
> On Thu, May 04 2017, David Brown wrote:
> 
>>
>> I have another couple of questions that might be relevant, but I am
>> really not sure about the correct answers.
>>
>> First, if you have a stripe that you know is unused - it has not been
>> written to since the array was created - could the raid layer safely
>> return all zeros if an attempt was made to read the stripe?
> 
> "know is unused" and "it has not been written to since the array was
> created" are not necessarily the same thing.
> 
> If I have some devices which used to have a RAID5 array but for which
> the metadata got destroyed, I might carefully "create" a RAID5 over the
> devices and then have access to my data.  This has been done more than
> once - it is not just theoretical.

That is true, of course - anything like this would have to be optional
(command line switches in mdadm, for example).

There is also the opposite situation - when you /have/ had something
written to the array, but now you know it is unused (due to a trim).
Knowing the stripe is unused might make a later partial write a little
faster, and it would certainly speed up a scrub or other consistency
check since unused stripes can be skipped.

> 
> But if you really "know" it is unused, then returning zeros should be fine.
> 
>>
>> Second, when syncing an unused stripe (such as during creation), rather
>> than reading the old data and copying it or generating parities, could
>> we simply write all zeros to all the blocks in the stripes?  For many
>> SSDs, this is very efficient.
> 
> If you were happy to destroy whatever was there before (see above
> recovery example for when you wouldn't), then it might be possible to
> make this work.

As above, this would have to be option-controlled.  (I have had occasion
to pull disks from one dead server to recover them on another machine -
it's nerve-racking enough at the best of times, without fearing that you
will zero out your remaining good disks!)

> You would need to be careful not to write zeros over a region that the
> filesystem has already used.

Yes, but that should not be a difficult problem - the array is created
before the filesystem.

> That means you either disable all writes until the initialization
> completes (waste of time), or you add complexity to track which strips
> have been written and which haven't, and only initialise strips that have
> not been written.  This complexity would only be used once in the entire
> life of the RAID.  That might not be best use of resources.
> 

I am not sure I see how this would be a problem.  But it is something
that would need to be considered carefully when looking at details of
implementing these ideas (if anyone thinks they would be worth
implementing).

mvh.,

David



* Re: RAID creation resync behaviors
  2017-05-04  2:04   ` Shaohua Li
@ 2017-05-09 18:39     ` Jes Sorensen
  2017-05-09 20:30       ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2017-05-09 18:39 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown; +Cc: linux-raid, neilb

On 05/03/2017 10:04 PM, Shaohua Li wrote:
> On Thu, May 04, 2017 at 11:07:01AM +1000, Neil Brown wrote:
>> On Wed, May 03 2017, Shaohua Li wrote:
>>
>>> Hi,
>>>
>>> Currently we have different resync behaviors in array creation.
>>>
>>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>
>> The approach taken for raid1 and raid4/5 provides the fastest sync for
>> an array built on uninitialised spinning devices.
>> RAID6 could use the same approach but would involve more CPU and so
>> the original author of the RAID6 code (hpa) chose to go for the low-CPU
>> cost option.  I don't know if tests were done, or if they would still be
>> valid on new hardware.
>> The raid10 approach comes from "it is too hard to optimize in general
>> because different RAID10 layouts have different trade-offs, so just
>> take the easy way out."
> 
> ok, thanks for the explanation!
>>>
>>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>> lifetime. And if the user has already done a trim before creation, the
>>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>
>> An option to ask mdadm to trim the data space and then --assume-clean
>> certainly sounds reasonable.
> 
> This doesn't work well. Reads return 0 for trimmed data space on some SSDs, but
> not on all of them. If they don't, we will have trouble.

/sys/block/<device>/queue/discard_zeroes_data

We could use this as an indicator for what to do.

Jes


* Re: RAID creation resync behaviors
  2017-05-09 18:39     ` Jes Sorensen
@ 2017-05-09 20:30       ` NeilBrown
  2017-05-09 20:49         ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2017-05-09 20:30 UTC (permalink / raw)
  To: Jes Sorensen, Shaohua Li; +Cc: linux-raid, neilb


On Tue, May 09 2017, Jes Sorensen wrote:

> On 05/03/2017 10:04 PM, Shaohua Li wrote:
>> On Thu, May 04, 2017 at 11:07:01AM +1000, Neil Brown wrote:
>>> On Wed, May 03 2017, Shaohua Li wrote:
>>>
>>>> Hi,
>>>>
>>>> Currently we have different resync behaviors in array creation.
>>>>
>>>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>>> - raid10: read both disks, compare and write if there is difference (compare-write)
>>>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>>> - raid6: read all disks, calculate parity and compare, and write if there is difference (compare-write)
>>>
>>> The approach taken for raid1 and raid4/5 provides the fastest sync for
>>> an array built on uninitialised spinning devices.
>>> RAID6 could use the same approach but would involve more CPU and so
>>> the original author of the RAID6 code (hpa) chose to go for the low-CPU
>>> cost option.  I don't know if tests were done, or if they would still be
>>> valid on new hardware.
>>> The raid10 approach comes from "it is too hard to optimize in general
>>> because different RAID10 layouts have different trade-offs, so just
>>> take the easy way out."
>> 
>> ok, thanks for the explanation!
>>>>
>>>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>>> lifetime. And if the user has already done a trim before creation, the
>>>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>>> compare-write to overwrite if mdadm detects the disks are SSDs? Surely
>>>> compare-write is sometimes slower than overwrite, so maybe add a new option to
>>>> mdadm. An option to let mdadm trim the SSDs before creation sounds reasonable too.
>>>
>>> An option to ask mdadm to trim the data space and then --assume-clean
>>> certainly sounds reasonable.
>> 
>> This doesn't work well. Reads return 0 for trimmed data space on some SSDs, but
>> not on all of them. If they don't, we will have trouble.
>
> /sys/block/<device>/queue/discard_zeroes_data
>
> We could use this as an indicator for what to do.
>
According to

Documentation/ABI/testing/sysfs-block

Description:
                Will always return 0.  Don't rely on any specific behavior
                for discards, and don't read this file.

See also
 Commit: 48920ff2a5a9 ("block: remove the discard_zeroes_data flag")

NeilBrown



* Re: RAID creation resync behaviors
  2017-05-09 20:30       ` NeilBrown
@ 2017-05-09 20:49         ` Jes Sorensen
  2017-05-09 21:03           ` Martin K. Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2017-05-09 20:49 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li; +Cc: linux-raid, neilb

On 05/09/2017 04:30 PM, NeilBrown wrote:
> On Tue, May 09 2017, Jes Sorensen wrote:
> 
>> On 05/03/2017 10:04 PM, Shaohua Li wrote:
>>> On Thu, May 04, 2017 at 11:07:01AM +1000, Neil Brown wrote:
>>>> On Wed, May 03 2017, Shaohua Li wrote:
>>> This doesn't work well. Reads return 0 for trimmed data space on some SSDs, but
>>> not on all of them. If they don't, we will have trouble.
>>
>> /sys/block/<device>/queue/discard_zeroes_data
>>
>> We could use this as an indicator for what to do.
>>
> According to
> 
> Documentation/ABI/testing/sysfs-block
> 
> Description:
>                  Will always return 0.  Don't rely on any specific behavior
>                  for discards, and don't read this file.
> 
> See also
>   Commit: 48920ff2a5a9 ("block: remove the discard_zeroes_data flag")

Crap!

Back to the drawing board :(

Jes



* Re: RAID creation resync behaviors
  2017-05-09 20:49         ` Jes Sorensen
@ 2017-05-09 21:03           ` Martin K. Petersen
  2017-05-09 21:11             ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2017-05-09 21:03 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: NeilBrown, Shaohua Li, linux-raid, neilb


Jes,

>> According to
>>
>> Documentation/ABI/testing/sysfs-block
>>
>> Description:
>>                  Will always return 0.  Don't rely on any specific behavior
>>                  for discards, and don't read this file.
>>
>> See also
>>   Commit: 48920ff2a5a9 ("block: remove the discard_zeroes_data flag")
>
> Crap!
>
> Back to the drawing board :(

Discard is now a deallocate hint like it was originally intended.
Behavior is non-deterministic and no guarantees are made wrt. block
contents on subsequent reads.

To zero a block range you should be issuing blkdev_issue_zeroout().
This will use the best zeroing approach given the device characteristics
(TRIM/UNMAP if the device provides hard guarantees, or regular WRITE
SAME which also does the right thing on some SSDs). If none of the fancy
zeroing commands work, you'll fall back to writing zeroes manually.
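
From userspace the same facility is reachable through the BLKZEROOUT ioctl; a
minimal sketch (error handling trimmed, and the range must be aligned to
512-byte sectors):

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>            /* BLKZEROOUT */

/* Zero [start, start + len) on a block device; the kernel picks the best
 * method available (zeroing offload / WRITE SAME, or plain zero writes). */
int zero_range(const char *dev, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };
	int fd = open(dev, O_WRONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKZEROOUT, &range) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}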

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID creation resync behaviors
  2017-05-09 21:03           ` Martin K. Petersen
@ 2017-05-09 21:11             ` Jes Sorensen
  2017-05-09 21:16               ` Martin K. Petersen
  0 siblings, 1 reply; 27+ messages in thread
From: Jes Sorensen @ 2017-05-09 21:11 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: NeilBrown, Shaohua Li, linux-raid, neilb

On 05/09/2017 05:03 PM, Martin K. Petersen wrote:
> 
> Jes,
> 
>>> According to
>>>
>>> Documentation/ABI/testing/sysfs-block
>>>
>>> Description:
>>>                   Will always return 0.  Don't rely on any specific behavior
>>>                   for discards, and don't read this file.
>>>
>>> See also
>>>    Commit: 48920ff2a5a9 ("block: remove the discard_zeroes_data flag")
>>
>> Crap!
>>
>> Back to the drawing board :(
> 
> Discard is now a deallocate hint like it was originally intended.
> Behavior is non-deterministic and no guarantees are made wrt. block
> contents on subsequent reads.
> 
> To zero a block range you should be issuing blkdev_issue_zeroout().
> This will use the best zeroing approach given the device characteristics
> (TRIM/UNMAP if the device provides hard guarantees, or regular WRITE
> SAME which also does the right thing on some SSDs). If none of the fancy
> zeroing commands work, you'll fall back to writing zeroes manually.

Martin,

This is fine within the kernel; however, it is not overly useful for
mdadm to determine which strategy to apply when syncing devices.

Jes



* Re: RAID creation resync behaviors
  2017-05-09 21:11             ` Jes Sorensen
@ 2017-05-09 21:16               ` Martin K. Petersen
  2017-05-09 21:22                 ` Jes Sorensen
  0 siblings, 1 reply; 27+ messages in thread
From: Martin K. Petersen @ 2017-05-09 21:16 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Martin K. Petersen, NeilBrown, Shaohua Li, linux-raid, neilb


Jes,

> This is fine within the kernel; however, it is not overly useful for
> mdadm to determine which strategy to apply when syncing devices.

BLKZEROOUT

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID creation resync behaviors
  2017-05-09 21:16               ` Martin K. Petersen
@ 2017-05-09 21:22                 ` Jes Sorensen
  2017-05-09 23:56                   ` Martin K. Petersen
                                     ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Jes Sorensen @ 2017-05-09 21:22 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: NeilBrown, Shaohua Li, linux-raid, neilb

On 05/09/2017 05:16 PM, Martin K. Petersen wrote:
> 
> Jes,
> 
>> This is fine within the kernel; however, it is not overly useful for
>> mdadm to determine which strategy to apply when syncing devices.
> 
> BLKZEROOUT
> 

Trying to read the code, as this ioctl doesn't seem to be documented 
anywhere I can find.... it looks like this ioctl zeroes out a device.

It doesn't help me obtain the information I need to make a decision in 
mdadm as to whether to overwrite all or compare+write when resyncing a RAID
array.

Jes


* Re: RAID creation resync behaviors
  2017-05-09 21:22                 ` Jes Sorensen
@ 2017-05-09 23:56                   ` Martin K. Petersen
  2017-05-10  5:58                   ` Hannes Reinecke
  2017-05-10 17:30                   ` Shaohua Li
  2 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2017-05-09 23:56 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Martin K. Petersen, NeilBrown, Shaohua Li, linux-raid, neilb


Jes,

>> BLKZEROOUT
>>
>
> Trying to read the code, as this ioctl doesn't seem to be documented
> anywhere I can find.... it looks like this ioctl zeroes out a device.
>
> It doesn't help me obtain the information I need to make a decision in
> mdadm as to whether to overwrite all or compare+write when resyncing a RAID
> array.

I wasn't trying to solve your policy decision problem. I was merely
responding to Shaohua's concerns about discard vs. zeroes and wearing
out the media.

If you want to act based on the media type, the best heuristic we have
right now is the rotational sysfs attribute / BLKROTATIONAL ioctl. It'll
be one for spinning rust and zero for pretty much everything else.
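
A sketch of checking that attribute from userspace (simplified - "name" must be
the whole-disk name such as "sda", and a partition would need its parent disk
looked up first):

#include <stdio.h>

/* Returns 1 for rotational (spinning rust), 0 for non-rotational, -1 if unknown. */
int device_is_rotational(const char *name)
{
	char path[256];
	int rot = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &rot) != 1)
		rot = -1;
	fclose(f);
	return rot;
}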

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RAID creation resync behaviors
  2017-05-09 21:22                 ` Jes Sorensen
  2017-05-09 23:56                   ` Martin K. Petersen
@ 2017-05-10  5:58                   ` Hannes Reinecke
  2017-05-10 22:20                     ` Martin K. Petersen
  2017-05-10 17:30                   ` Shaohua Li
  2 siblings, 1 reply; 27+ messages in thread
From: Hannes Reinecke @ 2017-05-10  5:58 UTC (permalink / raw)
  To: Jes Sorensen, Martin K. Petersen; +Cc: NeilBrown, Shaohua Li, linux-raid, neilb

On 05/09/2017 11:22 PM, Jes Sorensen wrote:
> On 05/09/2017 05:16 PM, Martin K. Petersen wrote:
>>
>> Jes,
>>
>>> This is fine within the kernel; however, it is not overly useful for
>>> mdadm to determine which strategy to apply when syncing devices.
>>
>> BLKZEROOUT
>>
> 
> Trying to read the code, as this ioctl doesn't seem to be documented
> anywhere I can find.... it looks like this ioctl zeroes out a device.
> 
> It doesn't help me obtain the information I need to make a decision in
> mdadm as to whether to overwrite all or compare+write when resyncing a RAID
> array.
> 
What you actually want is the COMPARE AND WRITE SCSI command :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: RAID creation resync behaviors
  2017-05-09 21:22                 ` Jes Sorensen
  2017-05-09 23:56                   ` Martin K. Petersen
  2017-05-10  5:58                   ` Hannes Reinecke
@ 2017-05-10 17:30                   ` Shaohua Li
  2 siblings, 0 replies; 27+ messages in thread
From: Shaohua Li @ 2017-05-10 17:30 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: Martin K. Petersen, NeilBrown, linux-raid, neilb

On Tue, May 09, 2017 at 05:22:57PM -0400, Jes Sorensen wrote:
> On 05/09/2017 05:16 PM, Martin K. Petersen wrote:
> > 
> > Jes,
> > 
> > > This is fine within the kernel; however, it is not overly useful for
> > > mdadm to determine which strategy to apply when syncing devices.
> > 
> > BLKZEROOUT
> > 
> 
> Trying to read the code, as this ioctl doesn't seem to be documented
> anywhere I can find.... it looks like this ioctl zeroes out a device.
> 
> It doesn't help me obtain the information I need to make a decision in mdadm
> as to whether to overwrite all or compare+write when resyncing a RAID array.

The best way I can think of is: if we find the disks are SSDs, trim the disks and
read some random places. If all reads return 0, the guess is that the SSD returns
0 after trim, so do a compare-write resync; otherwise overwrite. Surely this needs
an extra mdadm option to enable it. We can only reliably detect that a disk is an
SSD; the other attributes lie.
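
A sketch of what that probe could look like (illustrative only: the sample count
is arbitrary, the whole-device BLKDISCARD destroys any existing data, and a real
version would also need to bypass or drop the page cache before reading back):

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>            /* BLKDISCARD, BLKGETSIZE64 */

/* Discard the whole device, then sample a few offsets and check whether the
 * device reads back zeroes after TRIM.  Destroys all data on the device! */
static bool reads_zero_after_trim(int fd)
{
	uint64_t size, range[2];
	unsigned char buf[4096], zero[4096] = { 0 };

	if (ioctl(fd, BLKGETSIZE64, &size) < 0)
		return false;
	range[0] = 0;
	range[1] = size;
	if (ioctl(fd, BLKDISCARD, &range) < 0)
		return false;

	for (int i = 0; i < 16; i++) {                    /* arbitrary sample count */
		uint64_t off = ((uint64_t)rand() % (size / 4096)) * 4096;
		if (pread(fd, buf, sizeof(buf), (off_t)off) != (ssize_t)sizeof(buf))
			return false;
		if (memcmp(buf, zero, sizeof(buf)))
			return false;             /* non-zero data: resync by overwriting */
	}
	return true;                              /* looks like zero-after-trim */
}

If this returns true, mdadm could assemble the array so that compare-write (or
even assume-clean) is used; otherwise it would keep the current overwrite
behavior.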

Thanks,
Shaohua


* Re: RAID creation resync behaviors
  2017-05-10  5:58                   ` Hannes Reinecke
@ 2017-05-10 22:20                     ` Martin K. Petersen
  0 siblings, 0 replies; 27+ messages in thread
From: Martin K. Petersen @ 2017-05-10 22:20 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jes Sorensen, Martin K. Petersen, NeilBrown, Shaohua Li,
	linux-raid, neilb


Hannes,

>> It doesn't help me obtain the information I need to make a decision in
>> mdadm as whether to overwrite all or compare+write when resyncing a RAID
>> array.
>> 
> What you actually want is the COMPARE AND WRITE SCSI command

Actually, I think what Jes needs is a MISCOMPARE AND WRITE command :)

-- 
Martin K. Petersen	Oracle Linux Engineering

