* Software RAID and TRIM
@ 2011-06-28 15:31 Tom De Mulder
  2011-06-28 16:11 ` Mathias Burén
                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-06-28 15:31 UTC (permalink / raw)
  To: linux-raid

Hi,


I'm investigating SSD performance on Linux, in particular for RAID 
devices.

As I understand it—and please correct me if I'm wrong—currently software 
RAID does not pass through TRIM to the underlying devices. TRIM is 
essential for the continued high performance of SSDs, which otherwise 
degrade over time.

I don't think there would be any harm in passing this command through to 
the underlying devices: if they don't support it they would just ignore 
it, and if they do, it would make high-performance software RAID of SSDs 
a possibility.


Is this something that's in the works?



Many thanks,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 28/06/2011 : The Moon is Waning Crescent (22% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 15:31 Software RAID and TRIM Tom De Mulder
@ 2011-06-28 16:11 ` Mathias Burén
  2011-06-29 10:32   ` Tom De Mulder
  2011-06-29 10:33   ` Tom De Mulder
  2011-06-28 16:17 ` Johannes Truschnigg
  2011-06-28 16:40 ` David Brown
  2 siblings, 2 replies; 45+ messages in thread
From: Mathias Burén @ 2011-06-28 16:11 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: linux-raid

On 28 June 2011 16:31, Tom De Mulder <tdm27@cam.ac.uk> wrote:
> Hi,
>
>
> I'm investigating SSD performance on Linux, in particular for RAID devices.
>
> As I understand it—and please correct me if I'm wrong—currently software
> RAID does not pass through TRIM to the underlying devices. TRIM is essential
> for the continued high performance of SSDs, which otherwise degrade over
> time.
>
> I don't think there would be any harm in this command being passed through
> to underlying devices if they don't support it (they would just ignore it),
> and if they do it would make high-performance software RAID of SSDs a
> possibility.
>
>
> Is this something that's in the works?
>
>
>
> Many thanks,
>
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 28/06/2011 : The Moon is Waning Crescent (22% of Full)


IIRC md can already pass TRIM down, but I think the filesystem needs
to know about the underlying architecture, or something, for TRIM to
work in RAID. There's numerous discussions on this in the archives of
this mailing list.
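
For what it's worth, a quick way to see whether a given block device
(including an md array) advertises discard support at all is to look at
the queue limits in sysfs (a rough check; device names are placeholders
and the files assume a reasonably recent kernel):

    # 0 means the device will not accept discard/TRIM requests
    cat /sys/block/sda/queue/discard_granularity
    cat /sys/block/sda/queue/discard_max_bytes
    # the same files exist for the array itself, e.g. /sys/block/md0/queue/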

/M
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 15:31 Software RAID and TRIM Tom De Mulder
  2011-06-28 16:11 ` Mathias Burén
@ 2011-06-28 16:17 ` Johannes Truschnigg
  2011-06-28 16:40 ` David Brown
  2 siblings, 0 replies; 45+ messages in thread
From: Johannes Truschnigg @ 2011-06-28 16:17 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: linux-raid

Hi Tom,
On Tue, Jun 28, 2011 at 04:31:35PM +0100, Tom De Mulder wrote:
> Hi,
> [...]
> Is this something that's in the works?

Iirc, dm-raid supports passthru of DSM/TRIM commands for its provided RAID0
and RAID1 levels. Maybe that's already enough for your purposes?
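
If you want to try that route, a minimal sketch with LVM would look
something like the following (device and volume names are placeholders,
and whether discards actually reach the disks depends on your kernel's
device-mapper targets):

    pvcreate /dev/sda1 /dev/sdb1
    vgcreate vg_ssd /dev/sda1 /dev/sdb1
    # striped (RAID0-like) logical volume across both SSDs:
    lvcreate -L 100G -i 2 -n lv_data vg_ssd
    # or a mirrored (RAID1-like) logical volume instead:
    # lvcreate -L 100G -m 1 --mirrorlog core -n lv_data vg_ssd
    mkfs.ext4 /dev/vg_ssd/lv_data
    mount -o discard /dev/vg_ssd/lv_data /mnt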

I don't know if there's any development going on on the md side of things
in that regard. Others on this list will surely be able to answer that
question, however.

Have a nice day!
-- 
with best regards: 
- Johannes Truschnigg ( johannes@truschnigg.info )

www:  http://johannes.truschnigg.info/ 
phone: +43 650 2 133337 
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 15:31 Software RAID and TRIM Tom De Mulder
  2011-06-28 16:11 ` Mathias Burén
  2011-06-28 16:17 ` Johannes Truschnigg
@ 2011-06-28 16:40 ` David Brown
  2011-07-17 21:52   ` Lutz Vieweg
  2 siblings, 1 reply; 45+ messages in thread
From: David Brown @ 2011-06-28 16:40 UTC (permalink / raw)
  To: linux-raid

On 28/06/11 17:31, Tom De Mulder wrote:
> Hi,
>
>
> I'm investigating SSD performance on Linux, in particular for RAID devices.
>
> As I understand it—and please correct me if I'm wrong—currently software
> RAID does not pass through TRIM to the underlying devices. TRIM is
> essential for the continued high performance of SSDs, which otherwise
> degrade over time.
>
> I don't think there would be any harm in this command being passed
> through to underlying devices if they don't support it (they would just
> ignore it), and if they do it would make high-performance software RAID
> of SSDs a possibility.
>
>
> Is this something that's in the works?
>
>

I don't think you are wrong about software raid not passing TRIM down to 
the device (IIRC, it /can/ be passed down through LVM raid setups, but 
they are slower and less flexible than md raid).

However, AFAIUI, you are wrong about TRIM being essential for the 
continued high performance of SSDs.  As long as your SSDs have some 
over-provisioning (or you only partition something like 90% of the 
drive), and it's got good garbage collection, then TRIM will have 
minimal effect.
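
(As a rough illustration of that kind of manual over-provisioning - the
device name is a placeholder, and the drive should ideally be
secure-erased first so the unpartitioned area really is free:)

    parted /dev/sdX mklabel gpt
    parted /dev/sdX mkpart primary 1MiB 90%
    # the last ~10% is never partitioned or written, so the controller
    # can use it as extra spare area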

TRIM only makes a big difference in benchmarks which fill up most of a 
disk, then erase the files, then start writing them again, and even then 
it is mainly with older flash controllers.

I think other SSD-optimisations, such as those in BTRFS, are much more 
important.  These include bypassing or disabling code that is aimed at 
optimising disk access and minimising head movement - such code is of 
great benefit with hard disks, but helps little and adds latency on SSD 
systems.

(I haven't done any benchmarks to justify this opinion, nor have I 
direct links - it's based on my understanding of TRIM and how SSDs work, 
and how SSD controllers have changed between early devices and current 
ones.)

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 16:11 ` Mathias Burén
@ 2011-06-29 10:32   ` Tom De Mulder
  2011-06-29 10:45     ` NeilBrown
  2011-07-17 21:57     ` Lutz Vieweg
  2011-06-29 10:33   ` Tom De Mulder
  1 sibling, 2 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-06-29 10:32 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid

On Tue, 28 Jun 2011, Mathias Burén wrote:

> IIRC md can already pass TRIM down, but I think the filesystem needs
> to know about the underlying architecture, or something, for TRIM to
> work in RAID.

Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM 
command, and that's what ext4 can do. I have it working just fine on 
single drives, but for reasons of service reliability I need to get it 
working with RAID.

I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two 
drive RAID1 md and it definitely didn't work (the blocks didn't get marked 
as unused and zeroed).
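
(The check is roughly along these lines - a crude sketch, assuming
hdparm's --fibmap/--read-sector options, with placeholder names; on md
the reported LBAs are relative to the array, so the member partition
start and data offset have to be added before reading the raw sector:)

    dd if=/dev/urandom of=/mnt/raid/testfile bs=1M count=1 && sync
    hdparm --fibmap /mnt/raid/testfile      # note a begin_LBA of the file
    rm /mnt/raid/testfile && sync && sleep 30
    hdparm --read-sector <LBA> /dev/sdX     # all zeroes would mean trimmed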

> There's numerous discussions on this in the archives of
> this mailing list.

Given how fast things move in the world of SSDs at the moment, I wanted to 
check if any progress was made since. :-) I don't seem to be able to find 
any reference to this in recent kernel source commits (but I'm a complete 
amateur when it comes to git).


Thanks,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 29/06/2011 : The Moon is Waning Crescent (18% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 16:11 ` Mathias Burén
  2011-06-29 10:32   ` Tom De Mulder
@ 2011-06-29 10:33   ` Tom De Mulder
  2011-06-29 12:42     ` David Brown
  2011-07-17 22:00     ` Lutz Vieweg
  1 sibling, 2 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-06-29 10:33 UTC (permalink / raw)
  To: linux-raid

On 28/06/11, David Brown wrote:

> However, AFAIUI, you are wrong about TRIM being essential for the
> continued high performance of SSDs.  As long as your SSDs have some
> over-provisioning (or you only partition something like 90% of the
> drive), and it's got good garbage collection, then TRIM will have
> minimal effect.

While you are mostly correct, over time even consumer SSDs will end up in 
this state.

Maybe I should have specified--my particular aim is to try and use (fairly 
high-end) consumer SSDs for "enterprise" server applications, hence the 
research into RAID. Most hardware RAID controllers that I know of don't 
pass on the TRIM command (for various reasons), so I was hoping to have 
more luck with software RAID.


Best,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 29/06/2011 : The Moon is Waning Crescent (18% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:32   ` Tom De Mulder
@ 2011-06-29 10:45     ` NeilBrown
  2011-06-29 11:10       ` Tom De Mulder
                         ` (3 more replies)
  2011-07-17 21:57     ` Lutz Vieweg
  1 sibling, 4 replies; 45+ messages in thread
From: NeilBrown @ 2011-06-29 10:45 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: Mathias Burén, linux-raid

On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder <tdm27@cam.ac.uk>
wrote:

> On Tue, 28 Jun 2011, Mathias Burén wrote:
> 
> > IIRC md can already pass TRIM down, but I think the filesystem needs
> > to know about the underlying architecture, or something, for TRIM to
> > work in RAID.
> 
> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM 
> command, and that's what ext4 can do. I have it working just fine on 
> single drives, but for reasons of service reliability would need to get 
> RAID to work.
> 
> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two 
> drive RAID1 md and it definitely didn't work (the blocks didn't get marked 
> as unused and zeroed).
> 
> > There's numerous discussions on this in the archives of
> > this mailing list.
> 
> Given how fast things move in the world of SSDs at the moment, I wanted to 
> check if any progress was made since. :-) I don't seem to be able to find 
> any reference to this in recent kernel source commits (but I'm a complete 
> amateur when it comes to git).


Trim support for md is a long way down my list of interesting projects (and
no-one else has volunteered).

It is not at all straightforward to implement.

For stripe/parity RAID, (RAID4/5/6) it is only safe to discard full stripes at
a time, and the md layer would need to keep a record of which stripes had been
discarded so that it didn't risk trusting data (and parity) read from those
stripes.  So you would need some sort of bitmap of invalid stripes, and you
would need the fs to discard in very large chunks for it to be useful at all.

For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
isn't the same risk of reading and trusting discarded parity, but a resync
which didn't know about discarded ranges would undo the discard for you.

So it basically requires another bitmap to be stored with the metadata, and a
fairly fine-grained bitmap it would need to be.  Then every read and resync
checks the bitmap and ignores or returns 0 for discarded ranges, and every
write needs to check it and, if the range was discarded, clear the bit and
write to the whole range.

So: do-able, but definitely non-trivial.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:45     ` NeilBrown
@ 2011-06-29 11:10       ` Tom De Mulder
  2011-06-29 11:48         ` Scott E. Armitage
  2011-06-29 12:46       ` David Brown
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Tom De Mulder @ 2011-06-29 11:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, 29 Jun 2011, NeilBrown wrote:

> It is not at all straight forward to implement.
>
> For stripe/parity RAID, (RAID4/5/6) it is only safe to discard full stripes at
> a time, and the md layer would need to keep a record of which stripes had been
> discarded so that it didn't risk trusting data (and parity) read from those
> stripes.  So you would need some sort of bitmap of invalid stripes, and you
> would need the fs to discard in very large chunks for it to be useful at all.
>
> For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
> isn't the same risk of reading and trusting discarded parity, but a resync
> which didn't know about discarded ranges would undo the discard for you.

However, that might not necessarily be a problem; tools exist that can be 
run manually (slightly fsck-like) and tell the drive which blocks can be 
erased.
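
(One such tool is wiper.sh, which ships in the hdparm sources - a rough
usage sketch, with a placeholder device name; by default it only does a
dry run:)

    wiper.sh /dev/sdX1            # dry run: report what would be trimmed
    wiper.sh --commit /dev/sdX1   # actually TRIM the free space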

> So: do-able, but definitely non-trivial.

Thanks very much for your response, you make some very good points.

I shall, for the time being, chop my SSDs in half and let them treat the 
empty half as spare area, which should make performance degradation a 
non-issue. I hope.


Cheers,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 29/06/2011 : The Moon is Waning Crescent (18% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 11:10       ` Tom De Mulder
@ 2011-06-29 11:48         ` Scott E. Armitage
  2011-06-29 12:46           ` Roberto Spadim
  0 siblings, 1 reply; 45+ messages in thread
From: Scott E. Armitage @ 2011-06-29 11:48 UTC (permalink / raw)
  To: linux-raid

On Wed, Jun 29, 2011 at 7:10 AM, Tom De Mulder <tdm27@cam.ac.uk> wrote:
> However, that might not necessarily be a problem; tools exist that can be run manually (slightly fsck-like) and tell the drive which blocks can be erased.

For RAID5/6 at least, md will still require knowledge of what stripes
are and are not in use by the filesystem. In the current
implementation, the entire array must be consistent, regardless of
whether or not a particular block is in use. As far as my
understanding goes, any level of TRIM support for parity arrays would
be a fundamental shift in the way md treats the array.

The simplest solution I see is to do as Neil suggested, and mimic TRIM
support at the RAID level, and pass commands down as necessary. An
alternative solution would be to add a second TRIM layer, where md
maintains a list of what is or is not in use, and once an entire
stripe has been discarded by the filesystem, it can send a single TRIM
command to each member drive to drop the entire stripe contents. This
adds abstraction for the filesystem layer, allowing it to treat the
RAID array like a regular SSD, but adds significant complexity to md
itself.

-Scott

p.s. Sorry if you receive this twice; Majordomo rejected the first one
on HTML subpart basis.

--
Scott Armitage, B.A.Sc., M.A.Sc. candidate
Space Flight Laboratory
University of Toronto Institute for Aerospace Studies
4925 Dufferin Street, Toronto, Ontario, Canada, M3H 5T6

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:33   ` Tom De Mulder
@ 2011-06-29 12:42     ` David Brown
  2011-06-29 12:55       ` Tom De Mulder
  2011-07-17 22:00     ` Lutz Vieweg
  1 sibling, 1 reply; 45+ messages in thread
From: David Brown @ 2011-06-29 12:42 UTC (permalink / raw)
  To: linux-raid

On 29/06/2011 12:33, Tom De Mulder wrote:
> On 28/06/11, David Brown wrote:
>
>> However, AFAIUI, you are wrong about TRIM being essential for the
>> continued high performance of SSDs. As long as your SSDs have some
>> over-provisioning (or you only partition something like 90% of the
>> drive), and it's got good garbage collection, then TRIM will have
>> minimal effect.
>
> While you are mostly correct, over time even consumer SSDs will end up
> in this state.
>

I don't quite follow you here - what state will consumer SSDs end up in?

> Maybe I should have specified--my particular aim is to try and use
> (fairly high-end) consumer SSDs for "enterprise" server applications,
> hence the research into RAID. Most hardware RAID controllers that I know
> of don't pass on the TRIM command (for various reasons), so I was hoping
> to have more luck with software RAID.
>
>

Now you know /why/ hardware RAID controllers don't implement TRIM!


Have you tried any real-world benchmarking with realistic loads with a 
single SSD, ext4, and TRIM on and off?  Almost every article I've seen 
on the subject is using very synthetic benchmarks, almost always on 
windows, few are done with current garbage-collecting SSDs.  It seems to 
be accepted wisdom from the early days of SSDs that TRIM makes a big 
difference - and few people challenge that with real numbers or real 
thought, even though the internal structure of the flash has changed 
dramatically (transparent compression, for example, gives a completely 
different effect).

Of course, if you /do/ try it yourself and can show clear figures, then 
I'm willing to change my mind :-)  If I had a spare SSD, I'd do the 
testing myself.
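
(Something along these lines would do as a starting point - a sketch
using fio, with made-up sizes and paths; compare the latencies with the
drive freshly trimmed and again after it has been filled and rewritten a
few times:)

    fio --name=randwrite --filename=/mnt/ssd/fio.dat --size=8G \
        --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
        --direct=1 --runtime=600 --time_based --group_reporting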




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:45     ` NeilBrown
  2011-06-29 11:10       ` Tom De Mulder
@ 2011-06-29 12:46       ` David Brown
  2011-06-30  0:28         ` NeilBrown
  2011-06-29 13:39       ` Namhyung Kim
  2011-07-17 22:11       ` Lutz Vieweg
  3 siblings, 1 reply; 45+ messages in thread
From: David Brown @ 2011-06-29 12:46 UTC (permalink / raw)
  To: linux-raid

On 29/06/2011 12:45, NeilBrown wrote:
> On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder<tdm27@cam.ac.uk>
> wrote:
>
>> On Tue, 28 Jun 2011, Mathias Burén wrote:
>>
>>> IIRC md can already pass TRIM down, but I think the filesystem needs
>>> to know about the underlying architecture, or something, for TRIM to
>>> work in RAID.
>>
>> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM
>> command, and that's what ext4 can do. I have it working just fine on
>> single drives, but for reasons of service reliability would need to get
>> RAID to work.
>>
>> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two
>> drive RAID1 md and it definitely didn't work (the blocks didn't get marked
>> as unused and zeroed).
>>
>>> There's numerous discussions on this in the archives of
>>> this mailing list.
>>
>> Given how fast things move in the world of SSDs at the moment, I wanted to
>> check if any progress was made since. :-) I don't seem to be able to find
>> any reference to this in recent kernel source commits (but I'm a complete
>> amateur when it comes to git).
>
>
> Trim support for md is a long way down my list of interesting projects (and
> no-one else has volunteered).
>
> It is not at all straight forward to implement.
>
> For stripe/parity RAID, (RAID4/5/6) it is only safe to discard full stripes at
> a time, and the md layer would need to keep a record of which stripes had been
> discarded so that it didn't risk trusting data (and parity) read from those
> stripes.  So you would need some sort of bitmap of invalid stripes, and you
> would need the fs to discard in very large chunks for it to be useful at all.
>
> For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
> isn't the same risk of reading and trusting discarded parity, but a resync
> which didn't know about discarded ranges would undo the discard for you.
>
> So is basically requires another bitmap to be stored with the metadata, and a
> fairly fine-grained bitmap it would need to be.  Then every read and resync
> checks the bitmap and ignores or returns 0 for discarded ranges, and every
> write needs to check and if the range was discard, clear the bit and write to
> the whole range.
>
> So: do-able, but definitely non-trivial.
>

Wouldn't the sync/no-sync tracking you already have planned be usable 
for tracking discarded areas?  Or will that not be fine-grained enough 
for the purpose?

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 11:48         ` Scott E. Armitage
@ 2011-06-29 12:46           ` Roberto Spadim
  0 siblings, 0 replies; 45+ messages in thread
From: Roberto Spadim @ 2011-06-29 12:46 UTC (permalink / raw)
  To: Scott E. Armitage; +Cc: linux-raid

Some ideas...

Maybe for a test only: we could send TRIM commands down on RAID1 arrays
only, or on 'raid0 linear', since they don't stripe, so this could be
'easy' to develop. When the filesystem sends a TRIM, we send it to the
underlying device (/dev/sdX99). There's a problem of offsets (for RAID1);
maybe some devices only accept TRIM in 4096-byte blocks, maybe not. We
could implement it and put it in a beta/alpha release to test, like the
ext4 guys are doing with the discard command (it's a user option today).


2011/6/29 Scott E. Armitage <launchpad@scott.armitage.name>:
> On Wed, Jun 29, 2011 at 7:10 AM, Tom De Mulder <tdm27@cam.ac.uk> wrote:
>> However, that might not necessarily be a problem; tools exist that can be run manually (slightly fsck-like) and tell the drive which blocks can be erased.
>
> For RAID5/6 at least, md will still require knowledge of what stripes
> are and are not in use by the filesystem. In the current
> implementation, the entire array must be consistent, regardless of
> whether or not a particular block is in use. As far as my
> understanding goes, any level of TRIM support for parity arrays would
> be a fundamental shift in the way md treats the array.
>
> The simplest solution I see is to do as Niel suggested, and mimic TRIM
> support at the RAID level, and pass commands down as necessary. An
> alternative solution would be to add a second TRIM layer, where md
> maintains a list of what is or is not in use, and once an entire
> stripe has been discarded by the filesystem, it can send a single TRIM
> command to each member drive to drop the entire stripe contents. This
> adds abstraction for the filesystem layer, allowing it to treat the
> RAID array like a regular SSD, but adds significant complexity to md
> itself.
>
> -Scott
>
> p.s. Sorry if you receive this twice; Majordomo rejected the first one
> on HTML subpart basis.
>
> --
> Scott Armitage, B.A.Sc., M.A.Sc. candidate
> Space Flight Laboratory
> University of Toronto Institute for Aerospace Studies
> 4925 Dufferin Street, Toronto, Ontario, Canada, M3H 5T6
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:42     ` David Brown
@ 2011-06-29 12:55       ` Tom De Mulder
  2011-06-29 13:02         ` Roberto Spadim
                           ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-06-29 12:55 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Wed, 29 Jun 2011, David Brown wrote:

>> While you are mostly correct, over time even consumer SSDs will end up
>> in this state.
> I don't quite follow you here - what state will consumer SSDs end up in?

Sorry, I meant to say "SSDs in typical consumer desktop machines". The 
state where writes are very slow.

> Have you tried any real-world benchmarking with realistic loads with a single 
> SSD, ext4, and TRIM on and off?  Almost every article I've seen on the subject 
> is using very synthetic benchmarks, almost always on windows, few are done 
> with current garbage-collecting SSDs.  It seems to be accepted wisdom from the 
> early days of SSDs that TRIM makes a big difference - and few people challenge 
> that with real numbers or real thought, even though the internal structure of 
> the flash has changed dramatically (transparent compression, for example, 
> gives a completely different effect).
>
> Of course, if you /do/ try it yourself and can show clear figures, then I'm 
> willing to change my mind :-)  If I had a spare SSD, I'd do the testing 
> myself.

I have a set of 4 Intel 510 SSDs purely for testing, and I have used these 
to simulate the kinds of workload I would expect them to experience in a 
server environment (focused mainly on database access). So far, those 
tests have focused on using single drives (ie. without RAID) on a variety 
of controllers.

Once the drives get fuller (something which does happen on servers) I do 
indeed see write latencies that are in the order of several seconds (I saw 
from 1500µs to 6000µs), as the drive suddenly struggles to free entire 
blocks, where initially latency was in the single digits.

I am hoping to get my hands on some Sandforce controller-based SSDs as 
well, to compare, but even they show degradation as they get fuller in 
AnandTech's tests (and those tests seem, IME, trustworthy).

My current plan is to sacrifice half the capacity by partitioning, stick 2 
of them in an md RAID1 (so, without TRIM), and run benchmarks over them 
over the next few days to see what the end result is.
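
(Concretely, something like the following - a sketch with placeholder
device names:)

    # partition only the first half of each SSD, leave the rest untouched
    parted /dev/sda mklabel gpt
    parted /dev/sda mkpart primary 1MiB 50%
    parted /dev/sdb mklabel gpt
    parted /dev/sdb mkpart primary 1MiB 50%
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.ext4 /dev/md0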


Best,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 29/06/2011 : The Moon is Waning Crescent (18% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:55       ` Tom De Mulder
@ 2011-06-29 13:02         ` Roberto Spadim
  2011-06-29 13:10         ` David Brown
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Roberto Spadim @ 2011-06-29 13:02 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: David Brown, linux-raid

Nice.
Does anyone know if FreeBSD, NetBSD or another OS already has this (RAID
TRIM), so we could do some benchmarks without spending time developing it?

2011/6/29 Tom De Mulder <tdm27@cam.ac.uk>:
> On Wed, 29 Jun 2011, David Brown wrote:
>
>>> While you are mostly correct, over time even consumer SSDs will end up
>>> in this state.
>>
>> I don't quite follow you here - what state will consumer SSDs end up in?
>
> Sorry, I meant to say "SSDs in typical consumer desktop machines". The state
> where writes are very slow.
>
>> Have you tried any real-world benchmarking with realistic loads with a
>> single SSD, ext4, and TRIM on and off?  Almost every article I've seen on
>> the subject is using very synthetic benchmarks, almost always on windows,
>> few are done with current garbage-collecting SSDs.  It seems to be accepted
>> wisdom from the early days of SSDs that TRIM makes a big difference - and
>> few people challenge that with real numbers or real thought, even though the
>> internal structure of the flash has changed dramatically (transparent
>> compression, for example, gives a completely different effect).
>>
>> Of course, if you /do/ try it yourself and can show clear figures, then
>> I'm willing to change my mind :-)  If I had a spare SSD, I'd do the testing
>> myself.
>
> I have a set of 4 Intel 510 SSDs purely for testing, and I have used these
> to simulate the kinds of workload I would expect them to experience in a
> server environment (focused mainly on database access). So far, those tests
> have focused on using single drives (ie. without RAID) on a variety of
> controllers.
>
> Once the drives get fuller (something which does happen on servers) I do
> indeed see write latencies that are in the order of several seconds (I saw
> from 1500µs to 6000µs), as the drive suddenly struggles to free entire
> blocks, where initially latency was in the single digits.
>
> I am hoping to get my hands on some Sandforce controller-based SSDs as well,
> to compare, but even they show degradation as they get fuller in AnandTech's
> tests (and those tests seem, IME, trustworthy).
>
> My current plan is to sacrifice half the capacity by partitioning, stick 2
> of them in md RAID1 (so, without TRIM) and over the next few days to run
> benchmarks over them, to see what the end result is.
>
>
> Best,
>
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 29/06/2011 : The Moon is Waning Crescent (18% of Full)



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:55       ` Tom De Mulder
  2011-06-29 13:02         ` Roberto Spadim
@ 2011-06-29 13:10         ` David Brown
  2011-06-30  5:51         ` Mikael Abrahamsson
  2011-07-17 22:16         ` Lutz Vieweg
  3 siblings, 0 replies; 45+ messages in thread
From: David Brown @ 2011-06-29 13:10 UTC (permalink / raw)
  To: linux-raid

On 29/06/2011 14:55, Tom De Mulder wrote:
> On Wed, 29 Jun 2011, David Brown wrote:
>
>>> While you are mostly correct, over time even consumer SSDs will end up
>>> in this state.
>> I don't quite follow you here - what state will consumer SSDs end up in?
>
> Sorry, I meant to say "SSDs in typical consumer desktop machines". The
> state where writes are very slow.
>

Well, many consumer level systems use older or cheaper SSDs which don't 
have the benefit of newer garbage collection, and don't have much 
over-provisioning (you can always do that yourself by leaving some space 
unpartitioned - but "consumer" users would typically not do that).  And 
remember that users in this class, who will probably have small SSDs 
to keep costs down, will have fairly full drives - making TRIM almost 
useless.

>> Have you tried any real-world benchmarking with realistic loads with a
>> single SSD, ext4, and TRIM on and off? Almost every article I've seen
>> on the subject is using very synthetic benchmarks, almost always on
>> windows, few are done with current garbage-collecting SSDs. It seems
>> to be accepted wisdom from the early days of SSDs that TRIM makes a
>> big difference - and few people challenge that with real numbers or
>> real thought, even though the internal structure of the flash has
>> changed dramatically (transparent compression, for example, gives a
>> completely different effect).
>>
>> Of course, if you /do/ try it yourself and can show clear figures,
>> then I'm willing to change my mind :-) If I had a spare SSD, I'd do
>> the testing myself.
>
> I have a set of 4 Intel 510 SSDs purely for testing, and I have used
> these to simulate the kinds of workload I would expect them to
> experience in a server environment (focused mainly on database access).
> So far, those tests have focused on using single drives (ie. without
> RAID) on a variety of controllers.
>
> Once the drives get fuller (something which does happen on servers) I do
> indeed see write latencies that are in the order of several seconds (I
> saw from 1500µs to 6000µs), as the drive suddenly struggles to free
> entire blocks, where initially latency was in the single digits.
>
> I am hoping to get my hands on some Sandforce controller-based SSDs as
> well, to compare, but even they show degradation as they get fuller in
> AnandTech's tests (and those tests seem, IME, trustworthy).
>
> My current plan is to sacrifice half the capacity by partitioning, stick
> 2 of them in md RAID1 (so, without TRIM) and over the next few days to
> run benchmarks over them, to see what the end result is.
>

Well, try it and see - and let us know the results.  50% manual 
over-provisioning seems excessive, but I guess that's what you'll find 
out with the tests.

>
> Best,
>
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 29/06/2011 : The Moon is Waning Crescent (18% of Full)


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:45     ` NeilBrown
  2011-06-29 11:10       ` Tom De Mulder
  2011-06-29 12:46       ` David Brown
@ 2011-06-29 13:39       ` Namhyung Kim
  2011-06-30  0:27         ` NeilBrown
  2011-07-17 22:11       ` Lutz Vieweg
  3 siblings, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2011-06-29 13:39 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown <neilb@suse.de> writes:

> On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder <tdm27@cam.ac.uk>
> wrote:
>
>> On Tue, 28 Jun 2011, Mathias Burén wrote:
>> 
>> > IIRC md can already pass TRIM down, but I think the filesystem needs
>> > to know about the underlying architecture, or something, for TRIM to
>> > work in RAID.
>> 
>> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM 
>> command, and that's what ext4 can do. I have it working just fine on 
>> single drives, but for reasons of service reliability would need to get 
>> RAID to work.
>> 
>> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two 
>> drive RAID1 md and it definitely didn't work (the blocks didn't get marked 
>> as unused and zeroed).
>> 
>> > There's numerous discussions on this in the archives of
>> > this mailing list.
>> 
>> Given how fast things move in the world of SSDs at the moment, I wanted to 
>> check if any progress was made since. :-) I don't seem to be able to find 
>> any reference to this in recent kernel source commits (but I'm a complete 
>> amateur when it comes to git).
>
>
> Trim support for md is a long way down my list of interesting projects (and
> no-one else has volunteered).
>

Just out of curiosity, what is on your list? :)


-- 
Regards,
Namhyung Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 13:39       ` Namhyung Kim
@ 2011-06-30  0:27         ` NeilBrown
  0 siblings, 0 replies; 45+ messages in thread
From: NeilBrown @ 2011-06-30  0:27 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-raid

On Wed, 29 Jun 2011 22:39:24 +0900 Namhyung Kim <namhyung@gmail.com> wrote:

> NeilBrown <neilb@suse.de> writes:
> 
> > On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder <tdm27@cam.ac.uk>
> > wrote:
> >
> >> On Tue, 28 Jun 2011, Mathias Burén wrote:
> >> 
> >> > IIRC md can already pass TRIM down, but I think the filesystem needs
> >> > to know about the underlying architecture, or something, for TRIM to
> >> > work in RAID.
> >> 
> >> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM 
> >> command, and that's what ext4 can do. I have it working just fine on 
> >> single drives, but for reasons of service reliability would need to get 
> >> RAID to work.
> >> 
> >> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two 
> >> drive RAID1 md and it definitely didn't work (the blocks didn't get marked 
> >> as unused and zeroed).
> >> 
> >> > There's numerous discussions on this in the archives of
> >> > this mailing list.
> >> 
> >> Given how fast things move in the world of SSDs at the moment, I wanted to 
> >> check if any progress was made since. :-) I don't seem to be able to find 
> >> any reference to this in recent kernel source commits (but I'm a complete 
> >> amateur when it comes to git).
> >
> >
> > Trim support for md is a long way down my list of interesting projects (and
> > no-one else has volunteered).
> >
> 
> Just out of curiosity, what are there in your list? :)
> 
> 

   http://neil.brown.name/blog/20110216044002

I have code for the first - the bad block log - and it seems to work.  But I
really need to design and then perform some more testing.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:46       ` David Brown
@ 2011-06-30  0:28         ` NeilBrown
  2011-06-30  7:50           ` David Brown
  0 siblings, 1 reply; 45+ messages in thread
From: NeilBrown @ 2011-06-30  0:28 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Wed, 29 Jun 2011 14:46:08 +0200 David Brown <david@westcontrol.com> wrote:

> On 29/06/2011 12:45, NeilBrown wrote:
> > On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder<tdm27@cam.ac.uk>
> > wrote:
> >
> >> On Tue, 28 Jun 2011, Mathias Burén wrote:
> >>
> >>> IIRC md can already pass TRIM down, but I think the filesystem needs
> >>> to know about the underlying architecture, or something, for TRIM to
> >>> work in RAID.
> >>
> >> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM
> >> command, and that's what ext4 can do. I have it working just fine on
> >> single drives, but for reasons of service reliability would need to get
> >> RAID to work.
> >>
> >> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two
> >> drive RAID1 md and it definitely didn't work (the blocks didn't get marked
> >> as unused and zeroed).
> >>
> >>> There's numerous discussions on this in the archives of
> >>> this mailing list.
> >>
> >> Given how fast things move in the world of SSDs at the moment, I wanted to
> >> check if any progress was made since. :-) I don't seem to be able to find
> >> any reference to this in recent kernel source commits (but I'm a complete
> >> amateur when it comes to git).
> >
> >
> > Trim support for md is a long way down my list of interesting projects (and
> > no-one else has volunteered).
> >
> > It is not at all straight forward to implement.
> >
> > For stripe/parity RAID, (RAID4/5/6) it is only safe to discard full stripes at
> > a time, and the md layer would need to keep a record of which stripes had been
> > discarded so that it didn't risk trusting data (and parity) read from those
> > stripes.  So you would need some sort of bitmap of invalid stripes, and you
> > would need the fs to discard in very large chunks for it to be useful at all.
> >
> > For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
> > isn't the same risk of reading and trusting discarded parity, but a resync
> > which didn't know about discarded ranges would undo the discard for you.
> >
> > So is basically requires another bitmap to be stored with the metadata, and a
> > fairly fine-grained bitmap it would need to be.  Then every read and resync
> > checks the bitmap and ignores or returns 0 for discarded ranges, and every
> > write needs to check and if the range was discard, clear the bit and write to
> > the whole range.
> >
> > So: do-able, but definitely non-trivial.
> >
> 
> Wouldn't the sync/no-sync tracking you already have planned be usable 
> for tracking discarded areas?  Or will that not be find-grained enough 
> for the purpose?

That would be a necessary precursor to DISCARD support: yes.
DISCARD would probably require a much finer grain than I would otherwise
suggest but I would design the feature to allow a range of granularities.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:55       ` Tom De Mulder
  2011-06-29 13:02         ` Roberto Spadim
  2011-06-29 13:10         ` David Brown
@ 2011-06-30  5:51         ` Mikael Abrahamsson
  2011-07-04  9:13           ` Tom De Mulder
  2011-07-17 22:16         ` Lutz Vieweg
  3 siblings, 1 reply; 45+ messages in thread
From: Mikael Abrahamsson @ 2011-06-30  5:51 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: David Brown, linux-raid

On Wed, 29 Jun 2011, Tom De Mulder wrote:

> I have a set of 4 Intel 510 SSDs purely for testing, and I have used 
> these to simulate the kinds of workload I would expect them to 
> experience in a server environment (focused mainly on database access). 
> So far, those tests have focused on using single drives (ie. without 
> RAID) on a variety of controllers.

From the tests I have read, the Intel 510 is actually worse than the 
Intel X-25 G1/G2/320 models, with exactly the symptoms you're describing. 
It's fast for linear reads and writes, but not so good for random writes, 
especially not when it's getting full.

> Once the drives get fuller (something which does happen on servers) I do 
> indeed see write latencies that are in the order of several seconds (I 
> saw from 1500µs to 6000µs), as the drive suddenly struggles to free 
> entire blocks, where initially latency was in the single digits.

Yeah, this is a common problem, especially for older drives. A lot has 
happened with garbage collection, but the fact is still that a lot of SSD 
vendors have too little spare area, so the recommendation you make 
regarding leaving a large area unused is something I do as well, and it 
works.

> I am hoping to get my hands on some Sandforce controller-based SSDs as well, 
> to compare, but even they show degradation as they get fuller in AnandTech's 
> tests (and those tests seem, IME, trustworthy).

Include the Intel 320 as well; I think it should be viable for your usage 
pattern.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-30  0:28         ` NeilBrown
@ 2011-06-30  7:50           ` David Brown
  0 siblings, 0 replies; 45+ messages in thread
From: David Brown @ 2011-06-30  7:50 UTC (permalink / raw)
  To: linux-raid

On 30/06/2011 02:28, NeilBrown wrote:
> On Wed, 29 Jun 2011 14:46:08 +0200 David Brown<david@westcontrol.com>  wrote:
>
>> On 29/06/2011 12:45, NeilBrown wrote:
>>> On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder<tdm27@cam.ac.uk>
>>> wrote:
>>>
>>>> On Tue, 28 Jun 2011, Mathias Burén wrote:
>>>>
>>>>> IIRC md can already pass TRIM down, but I think the filesystem needs
>>>>> to know about the underlying architecture, or something, for TRIM to
>>>>> work in RAID.
>>>>
>>>> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM
>>>> command, and that's what ext4 can do. I have it working just fine on
>>>> single drives, but for reasons of service reliability would need to get
>>>> RAID to work.
>>>>
>>>> I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two
>>>> drive RAID1 md and it definitely didn't work (the blocks didn't get marked
>>>> as unused and zeroed).
>>>>
>>>>> There's numerous discussions on this in the archives of
>>>>> this mailing list.
>>>>
>>>> Given how fast things move in the world of SSDs at the moment, I wanted to
>>>> check if any progress was made since. :-) I don't seem to be able to find
>>>> any reference to this in recent kernel source commits (but I'm a complete
>>>> amateur when it comes to git).
>>>
>>>
>>> Trim support for md is a long way down my list of interesting projects (and
>>> no-one else has volunteered).
>>>
>>> It is not at all straight forward to implement.
>>>
>>> For stripe/parity RAID, (RAID4/5/6) it is only safe to discard full stripes at
>>> a time, and the md layer would need to keep a record of which stripes had been
>>> discarded so that it didn't risk trusting data (and parity) read from those
>>> stripes.  So you would need some sort of bitmap of invalid stripes, and you
>>> would need the fs to discard in very large chunks for it to be useful at all.
>>>
>>> For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
>>> isn't the same risk of reading and trusting discarded parity, but a resync
>>> which didn't know about discarded ranges would undo the discard for you.
>>>
>>> So is basically requires another bitmap to be stored with the metadata, and a
>>> fairly fine-grained bitmap it would need to be.  Then every read and resync
>>> checks the bitmap and ignores or returns 0 for discarded ranges, and every
>>> write needs to check and if the range was discard, clear the bit and write to
>>> the whole range.
>>>
>>> So: do-able, but definitely non-trivial.
>>>
>>
>> Wouldn't the sync/no-sync tracking you already have planned be usable
>> for tracking discarded areas?  Or will that not be find-grained enough
>> for the purpose?
>
> That would be a necessary precursor to DISCARD support: yes.
> DISCARD would probably require a much finer grain than I would otherwise
> suggest but I would design the feature to allow a range of granularities.
>

I suppose the big win for the sync/no-sync tracking is when initialising 
an array - arrays that haven't been written don't need to be in sync. 
But you will probably be best with a list of sync (or no-sync) areas for 
that job, rather than a bitmap, as there won't be very many such blocks 
(a few dozen, perhaps, for multiple partitions and filesystems like XFS 
that write in different areas) and as the disk gets used, the "no-sync" 
areas will decrease in size and number.  For DISCARD, however, you'd get 
no-sync areas scattered around the disk.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-30  5:51         ` Mikael Abrahamsson
@ 2011-07-04  9:13           ` Tom De Mulder
  2011-07-04 16:26             ` Werner Fischer
  0 siblings, 1 reply; 45+ messages in thread
From: Tom De Mulder @ 2011-07-04  9:13 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: David Brown, linux-raid

On Thu, 30 Jun 2011, Mikael Abrahamsson wrote:

> From the tests I have read, the Intel 510 are actually worse than the Intel 
> X-25 G1/G2/320 models, with exactly the symptoms you're describing. It's fast 
> for linear reads and writes, but not so good for random writes, especially not 
> when it's getting full.

Yes; that's why I'm looking forward to also getting some SandForce 22xx 
based drives (probably OCZ Vertex 3) to test.

> Include the Intel 320 as well, I think it should be viable for your usage 
> pattern.

I wasn't too impressed by the Anandtech review of the 320, and (as 
everywhere) my funds are limited. :-)


--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 04/07/2011 : The Moon is Waxing Crescent (17% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-04  9:13           ` Tom De Mulder
@ 2011-07-04 16:26             ` Werner Fischer
  2011-07-17 22:31               ` Lutz Vieweg
  0 siblings, 1 reply; 45+ messages in thread
From: Werner Fischer @ 2011-07-04 16:26 UTC (permalink / raw)
  To: Tom De Mulder; +Cc: linux-raid

Hi Tom,

1) regarding Software RAID and TRIM:
there is a script, raid1ext4trim.sh-1.4, from Chris Caputo that does a
TRIM for Ext4 file systems on a software RAID 1. According to the
comments in the script, it only supports RAID volumes which reside on
complete disks (e.g. /dev/sdb and /dev/sdc), not on RAID partitions
(e.g. /dev/sdb1 and /dev/sdc1).
The script is shipped with hdparm: get hdparm 9.37 at
http://sourceforge.net/projects/hdparm/ and you'll find the script in
the subfolder hdparm-9.37/wiper/contrib/.
I have not tested the script yet; maybe I can do some tests tomorrow.

2) regarding choosing the right SSD:
I would strongly recommend an SSD with integrated power-outage
protection; Intel's 320 series has this built in:
http://newsroom.intel.com/servlet/JiveServlet/download/38-4324/Intel_SSD_320_Series_Enhance_Power_Loss_Technology_Brief.pdf
I have done some power-outage tests today, including a Vertex-3 and an
Intel 320 series. I used diskchecker.pl from
http://brad.livejournal.com/2116715.html

result:
-> for the Vertex 3 diskchecker.pl reported lost data:
        [root@f15-ocz-vertex3 ~]# ./diskchecker.pl -s 10.10.30.199 verify testfile2
         verifying: 0.00%
         verifying: 1.42%
          Error at page 52141, 0 seconds before end.
         verifying: 6.31%
          Error at page 83344, 0 seconds before end.
         verifying: 11.12%
          Error at page 163555, 0 seconds before end.
        [...]
        Total errors: 12
        Histogram of seconds before end:
             0   12
        [root@f15-ocz-vertex3 ~]# 
-> for the Intel 320 Series diskchecker.pl did not report data loss:
        [root@f15-intel-320 ~]# ./diskchecker.pl -s 10.10.30.199 verify
        testfile2 
         verifying: 0.00%
         verifying: 0.12%
         [...]
         verifying: 99.82%
         verifying: 100.00%
        Total errors: 0
        [root@f15-intel-320 ~]# 
I did the tests multiple times. I also had some runs on the Vertex 3
without errors, but with the Intel 320 Series not a single test reported
an error.

I did the tests with fedora 15 on the SSDs, here are the details of
hdparm -I

OCZ Vertex 3:
	Model Number:       OCZ-VERTEX3                             
	Serial Number:      OCZ-OQZF2I45DYZ47T3C
	Firmware Revision:  2.06
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
	[...]
	device size with M = 1000*1000:      120034 MBytes (120 GB)

Intel 320 Series:
	Model Number:       INTEL SSDSA2CW160G3                     
	Serial Number:      CVPR112601AL160DGN  
	Firmware Revision:  4PC10302
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
	[...]
	device size with M = 1000*1000:      160041 MBytes (160 GB)

Regards,
Werner

On Mon, 2011-07-04 at 10:13 +0100, Tom De Mulder wrote:
> On Thu, 30 Jun 2011, Mikael Abrahamsson wrote:
> 
> > From the tests I have read, the Intel 510 are actually worse than the Intel 
> > X-25 G1/G2/320 models, with exactly the symptoms you're describing. It's fast 
> > for linear reads and writes, but not so good for random writes, especially not 
> > when it's getting full.
> 
> Yes; that's why I'm looking forward to also getting some SandForce 22xx 
> based drives (probably OCZ Vertex 3) to test.
> 
> > Include the Intel 320 as well, I think it should be viable for your usage 
> > pattern.
> 
> I wasn't too impressed by the Anandtech review of the 320, and (as 
> everywhere) my funds are limited. :-)
> 
> 
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 04/07/2011 : The Moon is Waxing Crescent (17% of Full)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-28 16:40 ` David Brown
@ 2011-07-17 21:52   ` Lutz Vieweg
  2011-07-18  5:14     ` Mikael Abrahamsson
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 21:52 UTC (permalink / raw)
  To: linux-raid

David Brown wrote:
> However, AFAIUI, you are wrong about TRIM being essential for the 
> continued high performance of SSDs.  As long as your SSDs have some 
> over-provisioning (or you only partition something like 90% of the 
> drive), and it's got good garbage collection, then TRIM will have 
> minimal effect.

I beg to differ.

We are using SSDs in very much the way that Tom de Mulder intends,
and from our extensive performance measurements over many months
now I can say that (at least if you do have significant amounts
of write operations) it _does_ make a lot of difference whether you
periodically discard the unused sectors or not.
(For us, the write performance was measured to be about half as good
when there are no free erase blocks available anymore.)

Of course, you can only benefit from discards if your filesystem
is not full (because then there is nothing to discard). But any
kind of "garbage collection" by the SSD itself will not have the
same effect, since it cannot know which blocks are in use by the
filesystem.

> I think other SSD-optimisations, such as those in BTRFS, are much more 
> important.

Actually, (apart from btrfs still being in development, not really
ready for production use, yet), XFS (-o delaylog,barrier) performs
better on our SSDs than btrfs - without any SSD-specific options.

What is really an important factor for SSD performance: The controller.
The same SSDs perform with significantly lower latency for us when
connected to SATA controller channels than when connected to SAS
controllers (and they perform abysmally when used as hardware-RAID
constituents, in comparison).

Regards,

Lutz Vieweg


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:32   ` Tom De Mulder
  2011-06-29 10:45     ` NeilBrown
@ 2011-07-17 21:57     ` Lutz Vieweg
  1 sibling, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 21:57 UTC (permalink / raw)
  To: linux-raid

Tom De Mulder wrote:
> Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM 
> command

Well, for us voluntary (cron-triggered) batch discards have proven
to be the better option. If you leave it to the filesystem to
trigger the discards, then you might lose write performance
when you need it most.

In comparison, a voluntarily triggered discard in some low-usage time
is painless.
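
(For example, a weekly fstrim from cron - a sketch with a placeholder
mount point; fstrim is part of util-linux and needs a kernel and
filesystem with FITRIM support:)

    # /etc/cron.d/fstrim: batch-discard free space every Sunday at 04:00
    0 4 * * 0  root  /usr/sbin/fstrim -v /data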

Regards,

Lutz Vieweg


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:33   ` Tom De Mulder
  2011-06-29 12:42     ` David Brown
@ 2011-07-17 22:00     ` Lutz Vieweg
  1 sibling, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 22:00 UTC (permalink / raw)
  To: linux-raid

Tom De Mulder wrote:
> Maybe I should have specified--my particular aim is to try and use 
> (fairly high-end) consumer SSDs for "enterprise" server applications

That's exactly what we do.
After all, "RAID" is still the acronym for "Redundant Array of 
_Inexpensive_ Disks", no matter how many times big-$$$ will try to tell 
you otherwise.

And a software RAID built from some cheap consumer SSDs easily 
outperforms those overpriced "enterprise class" SSD devices they try to 
sell you.


> Most hardware RAID controllers that I know 
> of don't pass on the TRIM command 

Not only that, they also add a lot of latency to the
SSD communication.
There simply is no reason anymore to use hardware RAID at all.

Regards,

Lutz Vieweg


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 10:45     ` NeilBrown
                         ` (2 preceding siblings ...)
  2011-06-29 13:39       ` Namhyung Kim
@ 2011-07-17 22:11       ` Lutz Vieweg
  3 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 22:11 UTC (permalink / raw)
  To: linux-raid

NeilBrown wrote:
> Trim support for md is a long way down my list of interesting projects (and
> no-one else has volunteered).

That's a pity.

Actually, we were desperate enough about being able to discard unused 
sectors from our SSDs "behind" MD that we implemented a user-space 
work-around (using fallocate and BLKDISCARD ioctls after finding out 
which physical devices are hidden behind the RAID), but that is awkward 
in comparison to just using "fstrim" or the like, as it means that during 
the discards the filesystem appears "almost full", and the work-around 
supports only RAID-1.

> It is not at all straight forward to implement.

For RAID5/6, I understand that. But supporting RAID 0/1, and maybe even 
RAID 10, should not be that difficult. (dm-raid does support this, 
though we don't like dm-raid too much for several other reasons.)

If today somebody is investing into SSDs, it is for speed. So if you are 
setting up an SSD based RAID, it's unlikely that you'll aim for RAID5/6, 
anyway.

> For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
> isn't the same risk of reading and trusting discarded parity, but a resync
> which didn't know about discarded ranges would undo the discard for you.

That is true, but not really a problem. Yes, the write-performance will 
suffer until the next "fstrim" is done, but the performance suffers from 
the resync anyway, so that's not something extra, and SSD users will 
certainly issue "fstrim" periodically, anyway.

I guess you would make many people happy if MD-raid supported passing 
through discards, even if it was only for RAID 0/1, and even if a resync 
meant you'd have to issue an additional "fstrim".

Regards,

Lutz Vieweg




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-06-29 12:55       ` Tom De Mulder
                           ` (2 preceding siblings ...)
  2011-06-30  5:51         ` Mikael Abrahamsson
@ 2011-07-17 22:16         ` Lutz Vieweg
  3 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 22:16 UTC (permalink / raw)
  To: linux-raid

Tom De Mulder wrote:
> I have a set of 4 Intel 510 SSDs purely for testing, and I have used 
> these to simulate the kinds of workload I would expect them to 
> experience in a server environment

Beware: The Intel SSDs are documented to voluntarily throttle their write 
speed if they detect a lot of writing going on, in order to meet their 
advertised lifetime.

(I have not read anything like that in the documentation of 
marvell/micron/indilinx/sandforce controllers, and indeed, when wiped 
once per week, our SSDs keep up their initial performance. And yes, I 
find it acceptable that they might wear out after >= 3 years :-)

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-04 16:26             ` Werner Fischer
@ 2011-07-17 22:31               ` Lutz Vieweg
  0 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-17 22:31 UTC (permalink / raw)
  To: linux-raid

Werner Fischer wrote:
> 1) regarding Software RAID and TRIM:
> there is a script raid1ext4trim.sh-1.4 from Chris Caputo that does a
> TRIM for Ext4 file systems on a software RAID 1. According to the
> comments in the script it only supports RAID volumes which reside on
> complete disks (e.g. /dev/sdb and /dev/sdc), not on RAID partitions
> (e.g. /dev/sdb1 and /dev/sdc1)
> The script is shipped with hdparm

I wonder why people would use the "hdparm" tool to issue TRIM commands 
at such a low level, when you can do this much more portably by using 
the BLKDISCARD ioctl...
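
To illustrate (a minimal sketch, not a tool we actually use; "/dev/sdb"
and the 1 GiB range are placeholders - and note that BLKDISCARD really
does throw away the data in that range, so don't run this on a device
you care about):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>        /* BLKDISCARD */

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* range[0] = start offset, range[1] = length, both in bytes */
        uint64_t range[2] = { 0, 1ULL << 30 };   /* discard the first 1 GiB */

        if (ioctl(fd, BLKDISCARD, &range) != 0)
            perror("BLKDISCARD");

        close(fd);
        return 0;
    }

The same ioctl works on any block device whose driver supports
discards, which is exactly what makes it more portable than driving
ATA TRIM directly through hdparm.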

> I would strongly recommend a SSD with integrated power-outage
> protection

Your results seem to indicate differences, but how is that evidence
for SSDs corrupting filesystems? As long as the SSD actually tells the 
truth about draining its caches when asked to, the journaling of the 
filesystem will keep the meta-data intact - but not necessarily the data 
inside the files: for very plausible performance reasons, most 
filesystems will _not_ try to sync non-meta-data by default!

Nevertheless, sensitivity against power-outage situations has been a 
subject of many SSD updates for different controllers, so there may have 
been real issues, too.

Regards,

Lutz Vieweg


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-17 21:52   ` Lutz Vieweg
@ 2011-07-18  5:14     ` Mikael Abrahamsson
  2011-07-18 10:35     ` David Brown
  2011-07-18 10:53     ` Tom De Mulder
  2 siblings, 0 replies; 45+ messages in thread
From: Mikael Abrahamsson @ 2011-07-18  5:14 UTC (permalink / raw)
  To: Lutz Vieweg; +Cc: linux-raid

On Sun, 17 Jul 2011, Lutz Vieweg wrote:

> David Brown wrote:
>> However, AFAIUI, you are wrong about TRIM being essential for the continued 
>> high performance of SSDs.  As long as your SSDs have some over-provisioning 
>> (or you only partition something like 90% of the drive), and it's got good 
>> garbage collection, then TRIM will have minimal effect.
>
> I beg to differ.
>
> Of course, you can only benefit from discards if your filesystem
> is not full (because then there is nothing to discard). But any
> kind of "garbage collection" by the SSD itself will not have the
> same effect, since it cannot know which blocks are in use by the
> filesystem.

Well, that's what you gain from only using 90% of the drive space for data 
(be it via a partition or some other means): you increase the 
overprovisioning, and thus the drive has more empty space to play with, 
even if you fill up the FS to 100%.

So yes, TRIM is nice, but if you want consistent performance then you need 
to assume that your FS is going to be 100% full anyway, so you have 
to limit the FS block use to 80-90% of the total drive space.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-17 21:52   ` Lutz Vieweg
  2011-07-18  5:14     ` Mikael Abrahamsson
@ 2011-07-18 10:35     ` David Brown
  2011-07-18 10:48       ` Tom De Mulder
  2011-07-18 18:09       ` Lutz Vieweg
  2011-07-18 10:53     ` Tom De Mulder
  2 siblings, 2 replies; 45+ messages in thread
From: David Brown @ 2011-07-18 10:35 UTC (permalink / raw)
  To: linux-raid

On 17/07/2011 23:52, Lutz Vieweg wrote:
> David Brown wrote:
>> However, AFAIUI, you are wrong about TRIM being essential for the
>> continued high performance of SSDs. As long as your SSDs have some
>> over-provisioning (or you only partition something like 90% of the
>> drive), and it's got good garbage collection, then TRIM will have
>> minimal effect.
>
> I beg to differ.
>

Well, I don't have your experience here (I have a couple of 60G SSD's in 
RAID0, without TRIM, but that's hardly in the same class).  So I don't 
expect you to put much weight on my opinions.  But maybe it will give 
you reason for more testing.

> We are using SSDs in very much the way that Tom de Mulder intends,
> and from our extensive performance measurements over many months
> now I can say that (at least if you do have significant amounts
> of write operations) it _does_ make a lot of difference whether you
> periodically discard the unused sectors or not.
> (For us, the write performance measured to be about half as good
> when there are no free erase blocks available anymore.)
>

If there are no free erase blocks, then your SSD's don't have enough 
over-provisioning.  This is, after all, the whole point of having more 
physical flash than the logical disk size would suggest.  Depending on 
the quality of the SSD (more expensive ones have more 
over-provisioning), and the usage patterns (if you have lots of small 
random writes, you'll need more extra space), you might have to 
"manually" over-provision the disk by only partitioning about 90% of the 
disk.  Of course, you must make sure that the remaining 10% is 
"discarded", or left untouched from new, and that you use the partition 
for your RAID and not the whole disk.

So now you have plenty of erase blocks at any time, and your write 
performance will be good.


TRIM, on the other hand, does not give you any extra free erase blocks. 
  If you think it does, you've misunderstood it.

TRIM exists to make garbage collection a little more efficient - when 
garbage collecting an erase block that contains TRIM'ed blocks, the 
TRIM'ed blocks don't need to be copied.  This saves a small amount of 
time in the copying, and allows slightly denser packing.  It may 
sometimes lead to saving whole erase blocks, but that's seldom the case 
in practice except when erasing large files.

If your disks are reasonably full, then TRIM will not help much because 
the garbage collection will be desperately trying to piece together 
small bits into complete erase blocks, and your performance will drop 
through the floor.  If you have plenty of overprovisioning, then the SSD 
still has lots of completely free erase blocks whenever it needs them.

If your filesystem re-uses (logical) blocks, then TRIM will not help. 
It is /always/ more efficient for the FS to simply write new data to the 
same block, rather than TRIM'ing it first.

TRIM is a very expensive command - it acts a bit like a write, but it is 
not a queued command.  Thus the block layer must wait for /all/ IO 
commands to have completed, then issue the TRIM, then wait for it to 
complete, and then carry on with new commands.  On some SSD's, it will 
(according to something I read) trigger garbage collection, which may 
slow down the SSD.  Even without that, the performance of most meta-data 
operations (such as delete) will drop considerably when they also need 
to do TRIM.

<http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_discard.html>

<http://lwn.net/Articles/347511/>

<http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=116034&threadid=115697&roomid=2>


On the other hand, your off-line batch TRIM during low use periods could 
well be a win.  The cost of these discards is not going to be an issue, 
and large batched discards are going to be far more useful to the SSD 
than small scattered ones.  I believe that there has been work on a 
similar system in XFS - I don't know what happened to that, or if there 
is any way to make it work in concert with md raid.


What will make a big difference to using SSD's in md raid is the 
sync/no-sync tracking.  This will avoid a lot of unnecessary writes, 
especially with a new array, and leave the SSD with more free blocks (at 
least until the disk is getting full of data).  It is also much higher 
up the things-to-do list, because it will be useful for all uses of md 
raid, and is a prerequisite to general discard support.  (Strictly 
speaking it is not needed for SSD's that guarantee a zero return on 
TRIM'ed blocks - but only some SSD's give that guarantee.)


> Of course, you can only benefit from discards if your filesystem
> is not full (because then there is nothing to discard). But any
> kind of "garbage collection" by the SSD itself will not have the
> same effect, since it cannot know which blocks are in use by the
> filesystem.
>

Garbage collection will recycle blocks that have been overwritten.  The 
filesystem knows which logical blocks are in use, and which are free. 
Filesystems already heavily re-use blocks, in the aim of preferring 
faster outer tracks on HD's, and minimizing head movement.  So when a 
file is erased, there's a good chance that those same logical blocks 
will be re-used soon - TRIM is of no benefit in that case.

>> I think other SSD-optimisations, such as those in BTRFS, are much more
>> important.
>
> Actually, (apart from btrfs still being in development, not really
> ready for production use, yet), XFS (-o delaylog,barrier) performs
> better on our SSDs than btrfs - without any SSD-specific options.
>

btrfs is ready for some uses, but is not mature and real-world tested 
enough for serious systems (and its tools are still lacking somewhat). 
But more generally, different filesystems are faster and slower for 
different usage patterns.

One SSD optimisation that many filesystems could implement is to be less 
concerned about fragmentation.  Most modern filesystems go out of their 
way to try to reduce fragmentation, which is great for HD use.  But on 
SSD's, you should be happy to fragment files if it promotes re-use of 
erased blocks, as long as fragments aim to fill complete erase blocks 
(in size and alignment).


> What is really an important factor for SSD performance: The controller.
> The same SSDs perform with significantly lower latency for us when
> connected to SATA controller channels than when connected to SAS
> controllers (and they perform abysmal when used as hardware-RAID
> constituents, in comparison).

That is /very/ interesting to know, and is a data point I haven't read 
elsewhere (though I knew about poor performance of hardware RAID with 
SSD).  Thanks for sharing that.


>
> Regards,
>
> Lutz Vieweg
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-18 10:35     ` David Brown
@ 2011-07-18 10:48       ` Tom De Mulder
  2011-07-18 18:09       ` Lutz Vieweg
  1 sibling, 0 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-07-18 10:48 UTC (permalink / raw)
  To: linux-raid

On Mon, 18 Jul 2011, David Brown wrote:

First, I'd like to say that I've done more testing, and found that even 
after very prolonged, sustained heavy use, the (Intel 510) SSDs I 
partitioned 50/50 with half left unused didn't show any degradation in 
performance. That's after about a week of constant writing/erasing.

> If your disks are reasonably full, then TRIM will not help much because the 
> garbage collection will be desperately trying to piece together small bits 
> into complete erase blocks, and your performance will drop through the floor.

However, it won't drop as low as it would without TRIM in the same 
situation. But with a continuous heavy workload, even TRIM won't help, and 
over-provisioning is the way to go.

Best,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 18/07/2011 : The Moon is Waning Gibbous (83% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-17 21:52   ` Lutz Vieweg
  2011-07-18  5:14     ` Mikael Abrahamsson
  2011-07-18 10:35     ` David Brown
@ 2011-07-18 10:53     ` Tom De Mulder
  2011-07-18 12:13       ` Werner Fischer
  2 siblings, 1 reply; 45+ messages in thread
From: Tom De Mulder @ 2011-07-18 10:53 UTC (permalink / raw)
  To: linux-raid

On Sun, 17 Jul 2011, Lutz Vieweg wrote:

> What is really an important factor for SSD performance: The controller.
> The same SSDs perform with significantly lower latency for us when
> connected to SATA controller channels than when connected to SAS
> controllers (and they perform abysmal when used as hardware-RAID
> constituents, in comparison).

Interesting.

I think it depends a lot on the controller. On a Dell server with PERC5/i 
RAID controller (actually made by LSI) I saw some performance degradation 
but not enough that I'd consider it a deal-breaker for situations where I 
really cared about the RAID functionality, more than about the loss of 
performance. After all, the latency is still massively lower than it is 
with spinning disk.

I have a really great Areca RAID controller in a different server, but 
unfortunately it's in use and it'll be a while before I get another one I 
can use for testing. Given how well it does in other respects, I have high 
hopes for it.


Best,

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 18/07/2011 : The Moon is Waning Gibbous (83% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-18 10:53     ` Tom De Mulder
@ 2011-07-18 12:13       ` Werner Fischer
  0 siblings, 0 replies; 45+ messages in thread
From: Werner Fischer @ 2011-07-18 12:13 UTC (permalink / raw)
  To: linux-raid

On Mon, 2011-07-18 at 11:53 +0100, Tom De Mulder wrote:
> On Sun, 17 Jul 2011, Lutz Vieweg wrote:
> 
> > What is really an important factor for SSD performance: The controller.
> > The same SSDs perform with significantly lower latency for us when
> > connected to SATA controller channels than when connected to SAS
> > controllers (and they perform abysmal when used as hardware-RAID
> > constituents, in comparison).
> 
> Interesting.
> 
> I think it depends a lot on the controller. On a Dell server with PERC5/i 
> RAID controller (actually made by LSI) I saw some performance degradation 
> but not enough that I'd consider it a deal-breaker for situations where I 
> really cared about the RAID functionality, more than about the loss of 
> performance. After all, the latency is still massively lower than it is 
> with spinning disk.
> 
> I have a really great Areca RAID controller in a different server, but 
> unfortunately it's in use and it'll be a while before I get another one I 
> can use for testing. Given how well it does in other respects, I have high 
> hopes for it.

I agree that the controller can influence performance:
1. SATA controller: direct communication
2. SAS controller: Serial ATA Tunneling Protocol (STP) is used,
   this can have an impact on performance
3. Hardware RAID controller: depending on the controller, the performance 
   impact can range from low to very high

Regards,
Werner

> 
> 
> Best,
> 
> --
> Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
> +44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
> -> 18/07/2011 : The Moon is Waning Gibbous (83% of Full)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
: Werner Fischer
: Technology Specialist
: Thomas-Krenn.AG | The server-experts
: http://www.thomas-krenn.com | http://www.thomas-krenn.com/wiki


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-18 10:35     ` David Brown
  2011-07-18 10:48       ` Tom De Mulder
@ 2011-07-18 18:09       ` Lutz Vieweg
  2011-07-18 20:18         ` David Brown
  1 sibling, 1 reply; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-18 18:09 UTC (permalink / raw)
  To: linux-raid

On 07/18/2011 12:35 PM, David Brown wrote:
> If there are no free erase blocks, then your SSD's don't have enough over-provisioning.

When you think about "How many free erase blocks are enough?" you'll come to the conclusion that 
this simply depends on the usage pattern.

Ideally, you'll want every write to a SSD to go to a completely free erase block, because if it 
doesn't, it's both slower and will probably also lead to a higher average number of write cycles 
(because more than one read-modify-write cycle per erase block may be required to fill it with new 
data, if that new data cannot be buffered in the SSD's RAM.)

If the goal is to have every write go to a free erase block, then you need to free up at least as 
many erase blocks per time period as data will be written during that time period (assuming the 
worst case that all writes will _not_ go to blocks that have been written to before).

Of course you can accomplish this by over-providing so much flash space that the SSD will always be 
capable of re-arranging the used data blocks such that they are tightly packed into fully used erase 
blocks, while the rest of the erase blocks are completely empty.
But that is a pretty expensive approach; essentially this requires 100% over-provisioning (or: 50% 
usable capacity, or twice the price for the storage).
And, you still have to trust that the SSD will use that over-provisioned space the way you want 
(e.g. the SSD firmware could be inclined to only re-arrange erase blocks that have a certain ratio 
of unused sectors within them).

One good thing about explicitly discarding sectors, while using most of the offered space, is 
(besides the significant cost argument) that your SSD will likely invest effort to re-arrange 
sectors into fully allocated and fully free erase blocks exactly at the time when this makes most 
sense for you. It will have to copy only data that is actually still valid (reducing wear), and you 
may even choose a time at which you know that significant amounts of data have been deleted.

> Depending on the quality of the SSD (more expensive ones have more over-provisioning)

Alas, manufacturers tend to ask twice the price for much less than twice the over-provisioning,
so it's still advisable to buy the cheaper SSD and choose over-provisioning ratio by using
only part of it...


> TRIM, on the other hand, does not give you any extra free erase blocks. If you think it does, you've
> misunderstood it.

I have to disagree on this :-)

Imagine an SSD with a capacity of 10 erase blocks, each having room for 10 sectors.
Let's assume the SSD advertises only 90 sectors total capacity, over-providing one erase block.
Now I write 8 files, each 10 sectors in size, onto the SSD, then delete 2 of the 8 files.

If the SSD now performs some "garbage collection", it will not have more than 2 free erase blocks.

But if I discard/TRIM the unused sectors, and the SSD does the right thing about it, there will be 4 
free erase blocks.

So, yes, TRIM can gain you extra free erase blocks, but of course only if there is unused space in 
the filesystem.


> It may sometimes lead to saving
> whole erase blocks, but that's seldom the case in practice except when erasing large files.

Our different perception may result from our use-case involving frequent deletion of files, while 
yours doesn't.

But this is not about "large files" only. Obviously, all modern SSDs are capable of 
re-arranging data into fully allocated and fully free erase-blocks, and this process can benefit 
from every single sector that has been discarded.


> If your filesystem re-uses (logical) blocks, then TRIM will not help.

If the only thing the filesystem does is overwriting blocks that held valid data right until they 
are overwritten with newer valid data, then TRIM will certainly not help.

But every discard that happens in between an invalidation of data and the overwriting of the same 
logical block can potentially benefit from a TRIM in between. Imagine a file of 1000 sectors, all 
valid data. Now your application decides to overwrite that file with 1000 sectors of newer data. 
Let's assume the FS is clever enough to use the same 1000 logical sectors for this. But let's also 
assume the RAM-cache of the SSD is only 20 logical sectors in size, and one erase-block is 10
sectors in size. Now the SSD needs to start writing from its RAM buffer to flash at least after 20 
sectors of data have been processed. If you are lucky, and everything was written in sequence, and 
well aligned, then the SSD may just need to erase and overwrite flash blocks that were formerly used 
for the same logical sectors. But if you are unlucky, the logical sectors to write are spread across 
different flash erase blocks. Thus the SSD can at best only mark them "unused" and has to write the 
data to a different (hopefully completely free) erase block. Again, if lucky (or heavily 
over-provisioned), you had >= 100 free erase blocks available when you started writing, and after 
they were written, 100 other erase blocks that held the older data can be freed after all 1000 
sectors have been written. But if you are unlucky, not that many free erase blocks were available 
when starting to write. Then, to write the new data, the SSD needs to read data from 
non-completely-free erase blocks, fill the unused sectors within them with the new data, and write 
back the erase-blocks - which means much lower performance, and more wear.
Now the same procedure with a "TRIM": After laying out the logical sectors to write to (but before 
writing to them), the filesystem can issue a "discard" on all those sectors. This will enable the 
SSD to mark all 100 erase blocks as completely free - even without additional "re-arranging". The 
following write operation to 1000 sectors may require erase-before write (if no pre-existing 
completely free erase-blocks can be used), but that is much better than having to do 
"read-modify-erase-write" cycles to the flash (and a larger number of that, since data has to be 
copied that the SSD cannot know to be obsolete).

So: While re-arranging of valid data into erase-blocks may be expensive enough to do it only 
"batched" from time to time, even the simple marking of sectors as discarded can help the 
performance and endurance of a SSD.

> It is /always/ more efficient
> for the FS to simply write new data to the same block, rather than TRIM'ing it first.

Depends on how expensive the marking of sectors as free is for the SSD, and how likely it is that newly written 
data that fits into the SSD's cache will cause the freeing of complete erase blocks.


> TRIM is a very expensive command

That seems to depend a lot on the firmware of different drives.
But I agree that it might not be a good idea to rely on it being cheap.

From the behaviour of the SSDs we like best it seems that TRIM is often only causing cheap "marking 
as free" operations, while sometimes, every few weeks, the SSD is actually doing a lot of 
re-arranging ("garbage collecting"?) stuff after the discards have been issued.
(Certainly also depends a lot on the usage pattern.)

> I believe that there has been work on a similar system
> in XFS

Yes, XFS supports that now, but alas, we cannot use it with MD, as MD will discard the discards :-)


> What will make a big difference to using SSD's in md raid is the sync/no-sync tracking. This will
> avoid a lot of unnecessary writes, especially with a new array, and leave the SSD with more free
> blocks (at least until the disk is getting full of data).

Hmmm... the sync/no-sync tracking will save you exactly one write to all sectors. That's certainly a 
good thing, but since a single "fstrim" after the sync will restore the "good performance" 
situation, I don't consider that an urgent feature.


> Filesystems already heavily re-use blocks, in the aim
> of preferring faster outer tracks on HD's, and minimizing head movement. So when a file is erased,
> there's a good chance that those same logical blocks will be re-used soon - TRIM is of no benefit in
> that case.

It is of benefit - to the performance of exactly those writes that go to the formerly used logical 
blocks.


> btrfs is ready for some uses, but is not mature and real-world tested enough for serious systems
> (and its tools are still lacking somewhat).

Let's not divert the discussion too much. I'll happily re-try btrfs when the developers say it's not 
experimental anymore, and when there's a "fsck"-like utility to check its integrity.

Regards,

Lutz Vieweg




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-18 18:09       ` Lutz Vieweg
@ 2011-07-18 20:18         ` David Brown
  2011-07-19  9:29           ` Lutz Vieweg
  0 siblings, 1 reply; 45+ messages in thread
From: David Brown @ 2011-07-18 20:18 UTC (permalink / raw)
  To: linux-raid

On 18/07/11 20:09, Lutz Vieweg wrote:
> On 07/18/2011 12:35 PM, David Brown wrote:
>> If there are no free erase blocks, then your SSD's don't have enough
>> over-provisioning.
>
> When you think about "How many free erase blocks are enough?" you'll
> come to the conclusion that this simply depends on the usage pattern.
>

Yes.

> Ideally, you'll want every write to a SSD to go to a completely free
> erase block, because if it doesn't, it's both slower and will probably
> also lead to a higher average number of write cycles (because more than
> one read-modify-write cycle per erase block may be required to fill it
> with new data, if that new data cannot be buffered in the SSDs RAM.)
>

No.

You don't need to fill an erase block for writing - writes are done as 
write blocks (I think 4K is the norm).  That's the odd thing about flash 
- erase is done in much larger blocks than writes.

> If the goal is to have every write go to a free erase block, then you
> need to free up at least as many erase blocks per time period as data
> will be written during that time period (assuming the worst case that
> all writes will _not_ go to blocks that have been written to before).
>

Again, no - since you don't have to write to whole erase blocks.

> Of course you can accomplish this by over-providing so much flash space
> that the SSD will always be capable of re-arranging the used data blocks
> such that they are tightly packed into fully used erase blocks, while
> the rest of the erase blocks are completely empty.
> But that is a pretty expensive approach, essentially this requires 100%
> over-provisioing (or: 50 usable capacity, or twice the price for the
> storage).

The level of over-provisioning that can be useful will depend on the 
usage patterns, such as how much and how scattered your deletes are. 
There will be diminishing returns for increased overprovisioning - the 
balance is up to the user, but I can't imagine 50% being sensible.

I wonder if you are mixing up the theoretical peak write speeds to a new 
SSD with real-world write speeds to a disk in use.  These are not the 
same, and no amount of TRIM'ing or over-provisioning will let you see 
those speeds in anything but a synthetic benchmark.  Your aim is /not/ 
to go mad trying to reach the marketing-claimed speeds in a real 
application, but to balance /good/ and /consistent/ speeds with a 
sensible cost.  Understand that SSD's are very fast, but not as fast as 
a marketer or an initial benchmark suggests, and you will be much 
happier with your disks.

> And, you still have to trust that the SSD will use that over-provisioned
> space the way you want (e.g. the SSD firmware could be inclined to only
> re-arrange erase blocks that have a certain ratio of unused sectors
> within them).
>

You want to pick an SSD with good garbage collection, if that's what you 
mean.


> One good thing abort explicitely discarding sectors, while using most of
> the offered space is (besides the significant cost argument) that your
> SSD will likely invest effort to re-arrange sectors into fully allocated
> and fully free erase blocks exactly at the time when this makes most
> sense for you. It will have to copy only data that is actually still
> valid (reducing wear), and you may even choose a time at which you know
> that significant amounts of data have been deleted.
>

The reality is that for most applications and usage patterns, logical 
blocks that are deleted and not re-used are in the minority.  It is true 
that when garbage-collecting a block, the SSD can hop over the discarded 
blocks.  But since they are in the minority, it's a small effect.  It 
could even be a detrimental effect - it could encourage the SSD to 
garbage-collect a block that would otherwise be left untouched, leading 
to extra effort and wear (but giving you a little more free space).  Any 
effort done by the SSD on TRIM'ed blocks is wasted if these (logical) 
blocks are overwritten by the filesystem later, except if the SSD was 
otherwise short on free blocks.

Again, the use of explicit batch discards gives a better effect than 
automatic TRIMs on deletes.

>> Depending on the quality of the SSD (more expensive ones have more
>> over-provisioning)
>
> Alas, manufacturers tend to ask twice the price for much less than twice
> the over-provisioning,
> so it's still advisable to buy the cheaper SSD and choose
> over-provisioning ratio by using
> only part of it...
>

Fair enough.

>
>> TRIM, on the other hand, does not give you any extra free erase
>> blocks. If you think it does, you've
>> misunderstood it.
>
> I have to disagree on this :-)
>
> Imagine a SSD with 10 erase blocks capacity, each having place for 10
> sectors.
> Let's assume the SSD advertises only 90 sectors total capacity,
> over-providing one erase block.
> Now I write 8 files each of 10 sectors size on the SSDs, then delete 2
> of the 8 files.
>
> If the SSD now performs some "garbage collection", it will not have more
> than 2 free erase blocks.
>
> But if I discard/TRIM the unused sectors, and the SSD does the right
> thing about it, there will be 4 free erase blocks.
>
> So, yes, TRIM can gain you extra free erase blocks, but of course only
> if there is unused space in the filesystem.
>

OK, let me rephrase - TRIM does not give you /significantly/ more free 
erase blocks /in real life/.  You can construct arrangements, like you 
described, where the SSD can get noticeably more erase blocks through 
the use of TRIM.  But under use, things are different as blocks are 
written and re-written.  Your example would break as soon as you take 
into account the writing of the directory to the disk, messing up your 
neat blocks.

And again, appropriately scheduled batch TRIM will give better results 
than automatic TRIM, and /may/ be worth the effort.

>
>> It may sometimes lead to saving
>> whole erase blocks, but that's seldom the case in practice except when
>> erasing large files.
>
> Our different perception may result from our use-case involving frequent
> deletion of files, while yours doesn't.
>

Perhaps.  The nature of most filesystems is to grow - more data gets 
written than erased.  But many of the effects here are usage pattern 
dependent.

> But this is not only about "large files", only. Obviously, all modern
> SSDs are capable of re-arranging data into fully allocated and fully
> free erase-blocks, and this process can benefit from every single sector
> that has been discarded.
>
>
>> If your filesystem re-uses (logical) blocks, then TRIM will not help.
>
> If the only thing the filesystem does is overwriting blocks that held
> valid data right until they are overwritten with newer valid data, then
> TRIM will certainly not help.
>
> But every discard that happens in between an invalidation of data and
> the overwriting of the same logical block can potentially benefit from a
> TRIM in between. Imagine a file of 1000 sectors, all valid data. Now
> your application decides to overwrite that file with 1000 sectors of
> newer data. Let's assume the FS is clever enough to use the same 1000
> logical sectors for this. But let's also assume the RAM-cache of the SSD
> is only 20 logical sectors in size, and one erase-block is 10
> sectors in size. Now the SSD needs to start writing from its RAM buffer
> to flash at least after 20 sectors of data have been processed. If you
> are lucky, and everything was written in sequence, and well aligned,
> then the SSD may just need to erase and overwrite flash blocks that were
> formerly used for the same logical sectors. But if you are unlucky, the
> logical sectors to write are spread across different flash erase blocks.
> Thus the SSD can at best only mark them "unused" and has to write the
> data to a different (hopefully completely free) erase block. Again, if
> lucky (or heavily over-provisioned), you had >= 100 free erase blocks
> available when you started writing, and after they were written, 100
> other erase blocks that held the older data can be freed after all 1000
> sectors have been written. But if you are unlucky, not that many free
> erase blocks were available when starting to write. Then, to write the
> new data, the SSD needs to read data from non-completely-free erase
> blocks, fill the unused sectors within them with the new data, and write
> back the erase-blocks - which means much lower performance, and more wear.
> Now the same procedure with a "TRIM": After laying out the logical
> sectors to write to (but before writing to them), the filesystem can
> issue a "discard" on all those sectors. This will enable the SSD to mark
> all 100 erase blocks as completely free - even without additional
> "re-arranging". The following write operation to 1000 sectors may
> require erase-before write (if no pre-existing completely free
> erase-blocks can be used), but that is much better than having to do
> "read-modify-erase-write" cycles to the flash (and a larger number of
> that, since data has to be copied that the SSD cannot know to be obsolete).
>
> So: While re-arranging of valid data into erase-blocks may be expensive
> enough to do it only "batched" from time to time, even the simple
> marking of sectors as discarded can help the performance and endurance
> of a SSD.
>

Again, I think your arguments only work on very artificial data.  But 
perhaps this is close to your real-world usage patterns.

>> It is /always/ more efficient
>> for the FS to simply write new data to the same block, rather than
>> TRIM'ing it first.
>
> Depends on how expensive the marking of sectors as free is for the SSD,
> and how likely newly written data that fits into the SSDs cache will
> cause the freeing of complete erase blocks.
>
>
>> TRIM is a very expensive command
>
> That seems to depend a lot on the firmware of different drives.
> But I agree that it might not be a good idea to rely on it being cheap.
>
>  From the behaviour of the SSDs we like best it seems that TRIM is often
> only causing cheap "marking as free" operations, while sometimes, every
> few weeks, the SSD is actually doing a lot of re-arranging ("garbage
> collecting"?) stuff after the discards have been issued.
> (Certainly also depends a lot on the usage pattern.)
>

My main point about TRIM being expensive is the effect it has on the 
block IO queue, regardless of the implementation in the SSD.  Again, 
this is less relevant to batched TRIMs during low-use times.

>> I believe that there has been work on a similar system
>> in XFS
>
> Yes, XFS supports that now, but alas, we cannot use it with MD, as MD
> will discard the discards :-)
>
>
>> What will make a big difference to using SSD's in md raid is the
>> sync/no-sync tracking. This will
>> avoid a lot of unnecessary writes, especially with a new array, and
>> leave the SSD with more free
>> blocks (at least until the disk is getting full of data).
>
> Hmmm... the sync/no-sync tracking will save you exactly one write to all
> sectors. That's certainly a good thing, but since a single "fstrim"
> after the sync will restore the "good performance" situation, I don't
> consider that an urgent feature.
>

I really hope your SSD's return zeros for TRIM'ed blocks, and that you 
are sure all your TRIMs are in full raid stripes - otherwise you will 
/seriously/ mess up your raid arrays.

One definite problem with RAID on SSD's is that this first write will 
mean that the SSD has no more free erase blocks than if the filesystem 
were full, as the SSD doesn't know the blocks can be recycled.  Of 
course, it will see that pretty quickly as soon as the filesystem writes 
real data, but it will still have extra waste.  For mirrored drives, 
this may mean a difference in speed in the two drives as one has more 
freedom for garbage collection than the other (for RAID5, this effect is 
spread evenly over the disks).

>
>> Filesystems already heavily re-use blocks, in the aim
>> of preferring faster outer tracks on HD's, and minimizing head
>> movement. So when a file is erased,
>> there's a good chance that those same logical blocks will be re-used
>> soon - TRIM is of no benefit in
>> that case.
>
> It is of benefit - to the performance of exactly those writes that go to
> the formerly used logical blocks.
>
>
>> btrfs is ready for some uses, but is not mature and real-world tested
>> enough for serious systems
>> (and its tools are still lacking somewhat).
>
> Let's not divert the discussion too much. I'll happily re-try btrfs when
> the developers say it's not experimental anymore, and when there's a
> "fsck"-like utility to check its integrity.
>
> Regards,
>
> Lutz Vieweg
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-18 20:18         ` David Brown
@ 2011-07-19  9:29           ` Lutz Vieweg
  2011-07-19 10:22             ` David Brown
  0 siblings, 1 reply; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-19  9:29 UTC (permalink / raw)
  To: linux-raid

On 07/18/2011 10:18 PM, David Brown wrote:
> You don't need to fill an erase block for writing - writes are done as write blocks (I think 4K is
> the norm).

You are right on that. Those sectors in a partially used
erase block that have not been written to since the last erase of the
whole erase block can be written to just as well as sectors in completely
empty erase blocks.


> My main point about TRIM being expensive is the effect it has on the block IO queue, regardless of
> the implementation in the SSD.

Because of those effects on the block-IO-queue, the user-space work-around
we implemented to discard the SSDs our RAID-1s consist of will not discard
"one area on all SSDs at a time", but rather iterate first through all
unused areas on one SSD, then iterate through the same list of areas on the
second SSD.

The effect of this is very much to our liking: While we can see
near-100%-utilization on one SSD at a time during the discards,
the other SSD will happily service the readers, and even the writes that
go to the /dev/md* device are buffered in main memory long enough that we
do not really see a significantly bad impact on the service.
(This might be different, though, if the discards were done
during peak-write-load times of the day.)


> I really hope your SSD's return zeros for TRIM'ed blocks

For RAID-1, the only consequence of not doing so is that "data-check" runs may result
in a > 0 mismatch_cnt. It does not destroy any of your data, and as long as I have
two SSDs in a RAID, both of which give a non-error result when reading a sector, I would
have no indication of "which of the returned sector contents to prefer", anyway.

(I admit that for health monitoring it is useful to have a meaningful mismatch_cnt.)

> and that you are sure all your TRIMs are
> in full raid stripes - otherwise you will /seriously/ mess up your raid arrays.

Again, for RAID0/1 (even 10) I don't see why this would harm any data.

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19  9:29           ` Lutz Vieweg
@ 2011-07-19 10:22             ` David Brown
  2011-07-19 13:41               ` Lutz Vieweg
  2011-07-19 14:19               ` Tom De Mulder
  0 siblings, 2 replies; 45+ messages in thread
From: David Brown @ 2011-07-19 10:22 UTC (permalink / raw)
  To: linux-raid

On 19/07/2011 11:29, Lutz Vieweg wrote:
> On 07/18/2011 10:18 PM, David Brown wrote:
>> You don't need to fill an erase block for writing - writes are done
>> as write blocks (I think 4K is the norm).
>
> You are right on that. Those sectors in a partially used erase block
> that have not been written to since the last erase of the whole erase
> block can be written to as good as sectors in completely empty erase
> blocks.
>
>
>> My main point about TRIM being expensive is the effect it has on
>> the block IO queue, regardless of the implementation in the SSD.
>
> Because of those effects on the block-IO-queue, the user-space
> work-around we implemented to discard the SSDs our RAID-1s consist of
> will not discard "one area on all SSDs at a time", but rather iterate
> first through all unused areas on one SSD, then iterate through the
> same list of areas on the second SSD.
>

Do you take the arrays off-line during this process, or at least make
them read-only?  If not, how do you ensure that the lists are valid?

> The effect of this is very much to our liking: While we can see
> near-100%-utilization on one SSD at a time during the discards, the
> other SSD will happily service the readers, and even the writes that
> go to the /dev/md* device are buffered in main memory long enough
> that we do not really see a significantly bad impact on the service.
> (This might be different, though, if the discards were done during
> peak-write-load times of the day.)
>
>
>> I really hope your SSD's return zeros for TRIM'ed blocks
>
> For RAID-1, the only consequence of not doing so is just that
> "data-check" runs may result in a > 0 mismatch_cnt. It does not
> destroy any of your data, and as long as I have two SSDs in a RAID,
> both of which give a non-error result when reading a sector, I would
> have no indication of "which of the returned sector contents to
> prefer", anyway.
>
> (I admit that for health monitoring it is useful to have a meaningful
>  mismatch_cnt.)
>
>> and that you are sure all your TRIMs are in full raid stripes -
>> otherwise you will /seriously/ mess up your raid arrays.
>
> Again, for RAID0/1 (even 10) I don't see why this would harm any
> data.
>

Fair enough for RAID1.  Just don't try it with RAID5!



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 10:22             ` David Brown
@ 2011-07-19 13:41               ` Lutz Vieweg
  2011-07-19 15:06                 ` David Brown
  2011-07-19 14:19               ` Tom De Mulder
  1 sibling, 1 reply; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-19 13:41 UTC (permalink / raw)
  To: linux-raid

On 07/19/2011 12:22 PM, David Brown wrote:
>> Because of those effects on the block-IO-queue, the user-space
>> work-around we implemented to discard the SSDs our RAID-1s consist of
>> will not discard "one area on all SSDs at a time", but rather iterate
>> first through all unused areas on one SSD, then iterate through the
>> same list of areas on the second SSD.
>
> Do you take the arrays off-line during this process, or at least make
> them read-only?

No, we keep them online and writeable.

> If not, how do you ensure that the lists are valid?

The discard procedure works as follows (a rough sketch in C is included below):

- use SYS_fallocate to allocate the free space on the device (minus
   some safety margin for the writes that will happen during the procedure)
   for a temporary file (notice that with fallocate on XFS, you can
   allocate space for a file without actually ever writing to it)

- use ioctl FIEMAP to get a list of the logical blocks that were
   allocated

- use ioctl BLKDISCARD to discard these blocks

- remove the temporary file

Since the blocks to discard are allocated for the temporary
file during the procedure, they will not be used otherwise.
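
Boiled down to a single C program, the core of the procedure looks
roughly like this (a simplified sketch, not our actual tool: error
handling, the safety margin calculation, looping over more than one
batch of extents and - importantly - the translation of the physical
offsets through the MD layout to the member devices are all left out;
the paths and the reserved size are placeholders):

    #define _GNU_SOURCE
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>        /* BLKDISCARD, FS_IOC_FIEMAP */
    #include <linux/fiemap.h>    /* struct fiemap, struct fiemap_extent */

    int main(void)
    {
        /* 1. reserve the free space with a temporary file (fallocate) */
        int tmp = open("/mnt/raid/discard.tmp", O_CREAT | O_RDWR, 0600);
        if (tmp < 0) { perror("open tmp"); return 1; }
        off_t reserve = 200LL * 1024 * 1024 * 1024; /* free space minus margin */
        if (fallocate(tmp, 0, 0, reserve) != 0) { perror("fallocate"); return 1; }

        /* 2. ask the filesystem where those blocks ended up (FIEMAP) */
        struct { struct fiemap fm; struct fiemap_extent ext[256]; } buf;
        memset(&buf, 0, sizeof(buf));
        buf.fm.fm_length = FIEMAP_MAX_OFFSET;
        buf.fm.fm_extent_count = 256;
        if (ioctl(tmp, FS_IOC_FIEMAP, &buf.fm) != 0) { perror("FIEMAP"); return 1; }

        /* 3. discard the corresponding ranges on one RAID member device;
         *    the second member is done after the first one has finished */
        int dev = open("/dev/sda", O_RDWR);
        if (dev < 0) { perror("open dev"); return 1; }
        for (unsigned i = 0; i < buf.fm.fm_mapped_extents; i++) {
            uint64_t range[2] = { buf.ext[i].fe_physical, buf.ext[i].fe_length };
            if (ioctl(dev, BLKDISCARD, &range) != 0)
                perror("BLKDISCARD");
        }

        /* 4. give the space back to the filesystem */
        close(dev);
        close(tmp);
        unlink("/mnt/raid/discard.tmp");
        return 0;
    }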

Obviously, we would still prefer using "fstrim", because then
there would be no need for that temporary file, the "safety margin"
and a temporary high fill level of the filesystem.

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 10:22             ` David Brown
  2011-07-19 13:41               ` Lutz Vieweg
@ 2011-07-19 14:19               ` Tom De Mulder
  2011-07-20  7:42                 ` David Brown
  2011-07-20 12:13                 ` Werner Fischer
  1 sibling, 2 replies; 45+ messages in thread
From: Tom De Mulder @ 2011-07-19 14:19 UTC (permalink / raw)
  To: linux-raid


In case people are interested, I ran more benchmarks. The impact of TRIM 
on an over-provisioned drive is remarkable: a 25% performance loss when 
using Postmark.

Because this isn't really on-topic for the MD mailing list, I've put it 
somewhere else:

http://tdm27.wordpress.com/2011/07/19/some-solid-state-drive-benchmarks/

My next goal, when I have the time, is to compare different amounts of 
over-provisioning.

--
Tom De Mulder <tdm27@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 19/07/2011 : The Moon is Waning Gibbous (75% of Full)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 13:41               ` Lutz Vieweg
@ 2011-07-19 15:06                 ` David Brown
  2011-07-20 10:39                   ` Lutz Vieweg
  0 siblings, 1 reply; 45+ messages in thread
From: David Brown @ 2011-07-19 15:06 UTC (permalink / raw)
  To: linux-raid

On 19/07/2011 15:41, Lutz Vieweg wrote:
> On 07/19/2011 12:22 PM, David Brown wrote:
>>> Because of those effects on the block-IO-queue, the user-space
>>> work-around we implemented to discard the SSDs our RAID-1s consist of
>>> will not discard "one area on all SSDs at a time", but rather iterate
>>> first through all unused areas on one SSD, then iterate through the
>>> same list of areas on the second SSD.
>>
>> Do you take the arrays off-line during this process, or at least make
>> them read-only?
>
> No, we keep them online and writeable.
>
>> If not, how do you ensure that the lists are valid?
>
> The discard procedure works by..:
>
> - use SYS_fallocate to allocate the free space on the device (minus
> some safety margin for the writes that will happen during the procedure)
> for a temporary file (notice that with fallocate on XFS, you can
> allocate space for a file without actually ever writing to it)
>
> - use ioctl FIEMAP to get a list of the logical blocks that were
> allocated
>
> - use ioctl BLKDISCARD to discard these blocks
>
> - remove the temporary file
>
> Since the blocks to discard are allocated for the temporary
> file during the procedure, they will not be used otherwise.
>
> Obviously, we would still prefer using "fstrim", because then
> there would be no need for that temporary file, the "safety margin"
> and a temporary high fill level of the filesystem.
>
> Regards,
>
> Lutz Vieweg
>

It certainly sounds like a safe procedure, but I can see why you feel 
it's not quite as elegant as it could be.  You will also be "discarding" 
blocks that have never been written (at least, not since the last 
discard...) - is there much overhead in that?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 14:19               ` Tom De Mulder
@ 2011-07-20  7:42                 ` David Brown
  2011-07-20 12:20                   ` Lutz Vieweg
  2011-07-20 12:13                 ` Werner Fischer
  1 sibling, 1 reply; 45+ messages in thread
From: David Brown @ 2011-07-20  7:42 UTC (permalink / raw)
  To: linux-raid

On 19/07/2011 16:19, Tom De Mulder wrote:
>
> In case people are interested, I ran more benchmarks. The impact of TRIM
> on an over-provisioned drive is remarkable: a 25% performance loss when
> using Postmark.
>
> Because this isn't really on-topic for the MD mailing list, I've put it
> somewhere else:
>

It is a little off-topic, perhaps, but still of interest to many RAID 
users precisely because of the myths and inaccurate data surrounding 
TRIM.  There are too many people who think that TRIM is essential to 
SSD's, that RAID doesn't support TRIM, and that therefore you shouldn't 
use RAID and SSD's together.

> http://tdm27.wordpress.com/2011/07/19/some-solid-state-drive-benchmarks/
>

To try to explain your results - first it's easy to see why md raid1 
with discard is a little slower than md raid1 without discard - the raid 
layer ignores the discards, so they can't help or hinder much, and the 
filesystem is doing a bit of extra work (sending the discards) to no 
purpose.

It is also easy to see why a single SSD with no discards is about the 
same speed.  You are using RAID1 - reads and writes are not striped in 
any way, so the speed is the same as for a single disk.  If the test 
accessed multiple files in parallel (especially reads), you'd see faster 
reads.

The telling figure here, though, is that TRIM made the single drive 
significantly slower.

> My next goal, when I have the time, is to compare different amounts of
> over-provisioning.
>

Also try using RAID10,far for your arrays.  That will work the SSD's 
harder, and perhaps give a better comparison.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 15:06                 ` David Brown
@ 2011-07-20 10:39                   ` Lutz Vieweg
  0 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-20 10:39 UTC (permalink / raw)
  To: linux-raid

On 07/19/2011 05:06 PM, David Brown wrote:
> It certainly sounds like a safe procedure, but I can see why you feel it's not quite as elegant as
> it could be. You will also be "discarding" blocks that have never been written (at least, not since
> the last discard...) - is there much overhead in that?

Luckily the SSDs we use do not require significant time to process a discard
on areas that were already free - e.g. discarding ~ 250G of SSD space that is
already empty this way takes only ~ 10 seconds.

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-19 14:19               ` Tom De Mulder
  2011-07-20  7:42                 ` David Brown
@ 2011-07-20 12:13                 ` Werner Fischer
  2011-07-20 12:25                   ` Lutz Vieweg
  1 sibling, 1 reply; 45+ messages in thread
From: Werner Fischer @ 2011-07-20 12:13 UTC (permalink / raw)
  To: linux-raid

On Tue, 2011-07-19 at 15:19 +0100, Tom De Mulder wrote:
> In case people are interested, I ran more benchmarks. The impact of TRIM 
> on an over-provisioned drive is remarkable: a 25% performance loss when 
> using Postmark.
> 
> Because this isn't really on-topic for the MD mailing list, I've put it 
> somewhere else:
> 
> http://tdm27.wordpress.com/2011/07/19/some-solid-state-drive-benchmarks/
> 
> My next goal, when I have the time, is to compare different amounts of 
> over-provisioning.

There is a paper from Intel "Over-provisioning an Intel® SSD" (analyzing
X25-M 160 GB Gen.2 SSDs):
http://cache-www.intel.com/cd/00/00/45/95/459555_459555.pdf

On page 10 of this Intel presentation they mention that a spare area
of more than 27% of native capacity has diminishing returns for such an SSD:
http://maltiel-consulting.com/Enterprise_Data_Integrity_Increasing_Endurance.pdf

Regards,
Werner

-- 
: Werner Fischer
: Technology Specialist
: Thomas-Krenn.AG | The server-experts
: http://www.thomas-krenn.com | http://www.thomas-krenn.com/wiki

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-20  7:42                 ` David Brown
@ 2011-07-20 12:20                   ` Lutz Vieweg
  0 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-20 12:20 UTC (permalink / raw)
  To: linux-raid

On 07/20/2011 09:42 AM, David Brown wrote:
>> http://tdm27.wordpress.com/2011/07/19/some-solid-state-drive-benchmarks/
>
> The telling figure here, though, is that TRIM made the single drive significantly slower.

More precisely, online-TRIM of ext4 on Intel SSDs seems to be a bad combination.

I think it's clear you cannot gain much from TRIM if you're willing to spend
the money for 2 times overprovisioning, anyway. You can lose significantly from
online-trim when the filesystem issues a lot of TRIM commands all the time and
when the SSD is slow to process them.

TRIM gains you an advantage with less over-provisioning, and is better
done in batches after significant amounts of data have been written/deleted.

When you try with different levels of over-provisioning, also try with
batched discards (fstrim) between runs of your benchmark.

Regards,

Lutz Vieweg



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Software RAID and TRIM
  2011-07-20 12:13                 ` Werner Fischer
@ 2011-07-20 12:25                   ` Lutz Vieweg
  0 siblings, 0 replies; 45+ messages in thread
From: Lutz Vieweg @ 2011-07-20 12:25 UTC (permalink / raw)
  To: linux-raid

On 07/20/2011 02:13 PM, Werner Fischer wrote:
> There is a paper from Intel "Over-provisioning an Intel® SSD" (analyzing
> X25-M 160 GB Gen.2 SSDs):
> http://cache-www.intel.com/cd/00/00/45/95/459555_459555.pdf
>
> On page 10 of this Intel presentation they mention that a spare area
>> 27% of native capacity has diminishing returns for such an SSD:
> http://maltiel-consulting.com/Enterprise_Data_Integrity_Increasing_Endurance.pdf

(This latter document is password protected.)

The first document, though, claims almost linear benefit (regarding IOs/sec)
from much higher amounts of over-provisioning. Alas, their chart does not extend
into the region where saturation of the effect must occur for sure.

Regards,

Lutz Vieweg



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread

Thread overview: 45+ messages
2011-06-28 15:31 Software RAID and TRIM Tom De Mulder
2011-06-28 16:11 ` Mathias Burén
2011-06-29 10:32   ` Tom De Mulder
2011-06-29 10:45     ` NeilBrown
2011-06-29 11:10       ` Tom De Mulder
2011-06-29 11:48         ` Scott E. Armitage
2011-06-29 12:46           ` Roberto Spadim
2011-06-29 12:46       ` David Brown
2011-06-30  0:28         ` NeilBrown
2011-06-30  7:50           ` David Brown
2011-06-29 13:39       ` Namhyung Kim
2011-06-30  0:27         ` NeilBrown
2011-07-17 22:11       ` Lutz Vieweg
2011-07-17 21:57     ` Lutz Vieweg
2011-06-29 10:33   ` Tom De Mulder
2011-06-29 12:42     ` David Brown
2011-06-29 12:55       ` Tom De Mulder
2011-06-29 13:02         ` Roberto Spadim
2011-06-29 13:10         ` David Brown
2011-06-30  5:51         ` Mikael Abrahamsson
2011-07-04  9:13           ` Tom De Mulder
2011-07-04 16:26             ` Werner Fischer
2011-07-17 22:31               ` Lutz Vieweg
2011-07-17 22:16         ` Lutz Vieweg
2011-07-17 22:00     ` Lutz Vieweg
2011-06-28 16:17 ` Johannes Truschnigg
2011-06-28 16:40 ` David Brown
2011-07-17 21:52   ` Lutz Vieweg
2011-07-18  5:14     ` Mikael Abrahamsson
2011-07-18 10:35     ` David Brown
2011-07-18 10:48       ` Tom De Mulder
2011-07-18 18:09       ` Lutz Vieweg
2011-07-18 20:18         ` David Brown
2011-07-19  9:29           ` Lutz Vieweg
2011-07-19 10:22             ` David Brown
2011-07-19 13:41               ` Lutz Vieweg
2011-07-19 15:06                 ` David Brown
2011-07-20 10:39                   ` Lutz Vieweg
2011-07-19 14:19               ` Tom De Mulder
2011-07-20  7:42                 ` David Brown
2011-07-20 12:20                   ` Lutz Vieweg
2011-07-20 12:13                 ` Werner Fischer
2011-07-20 12:25                   ` Lutz Vieweg
2011-07-18 10:53     ` Tom De Mulder
2011-07-18 12:13       ` Werner Fischer
