From: Doug Dumitru <doug@easyco.com>
To: Chris Worley <worleys@gmail.com>
Cc: "Scott E. Armitage" <launchpad@scott.armitage.name>,
	Roberto Spadim <roberto@spadim.com.br>,
	David Brown <david@westcontrol.com>,
	linux-raid@vger.kernel.org
Subject: Re: SSD - TRIM command
Date: Wed, 9 Feb 2011 11:15:48 -0800
Message-ID: <AANLkTi=+w-q8s-y-K-24ZwSLsLrJ44Oh9LiF3ReeyYdP@mail.gmail.com>
In-Reply-To: <AANLkTim-S0nTC8-r1U2xQ9Gw3EVAZHDjSxT=tnNnHmCT@mail.gmail.com>

I work with SSD arrays all the time, so I have a couple of thoughts
about trim and md.

'trim' is still necessary.  SandForce controllers are "better" at
this, but still need free space to do their work.  I had a set of SF
drives drop to 22 MB/sec writes because they were full and scrambled.
It takes a lot of effort to get them that messed up, but it can still
happen.  Trim brings them back.

The bottom line is that SSDs do block re-organization on the fly and
free space makes the re-org more efficient.  More efficient means
faster and, just as importantly, less wear amplification.

Most SSDs (and I think the latest trim spec) are deterministic on
trim'd sectors.  If you trim a sector, they read that sector as zeros.
This makes raid much "safer".

raid/0,1,10 should be fine to echo discard commands down to the
downstream drives in the bio request.  It is then up to the physical
device driver to turn the discard bio request into an ATA (or SCSI)
trim.  Most block devices don't seem to understand discard requests
yet, but this will get better over time.
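
As a rough illustration of what "echoing the discard down" involves for
raid/0, the sketch below splits one logical discard range into per-drive
extents.  It assumes a plain striped layout with equal-sized drives, where
logical chunk i sits on drive i % n_disks; the function name and its
parameters are hypothetical and this is not how md itself is written.

```python
def split_discard_raid0(start, length, chunk_size, n_disks):
    """Map a logical discard range onto (disk, disk_offset, length) extents.

    Assumption: simple raid/0 striping, logical chunk i lives on disk
    i % n_disks at chunk row i // n_disks.  All values are in sectors.
    """
    extents = []
    pos = start
    end = start + length
    while pos < end:
        chunk_index = pos // chunk_size
        within = pos % chunk_size
        # Never cross a chunk boundary in a single per-disk extent.
        run = min(chunk_size - within, end - pos)
        disk = chunk_index % n_disks
        disk_offset = (chunk_index // n_disks) * chunk_size + within
        extents.append((disk, disk_offset, run))
        pos += run
    return extents
```

For raid/1 the same range would simply be sent unchanged to every mirror.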

raid/4,5,6 is a lot more complicated.  With raid/4,5 with an even
number of drives, you can trim whole stripes safely.  Pieces of
stripes get interesting because you have to treat a trim as a write of
zeros and re-calc parity.  raid/6 will always have parity issues
regardless of how many drives there are.  Even worse is that
raid/4,5,6 parity read/modify/write operations tend to chatter the FTL
(Flash Translation Layer) logic and make matters worse (often much
worse).  If you are not streaming long linear writes, raid/4,5,6 in a
heavy write environment is probably a very bad idea for most SSDs.
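
To make the "trim as a write of zeros" point concrete, here is a minimal
sketch of the parity update for a single-parity (raid/4,5 style) stripe.
It assumes the drive really does return zeros for a trimmed chunk; the
helper names are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_after_trim(old_parity, old_data):
    """Parity of a stripe after one data chunk has been trimmed.

    Assumption: the trimmed chunk reads back as zeros, so this is the
    usual read/modify/write rule P_new = P_old ^ D_old ^ D_new with
    D_new set to all zeros.
    """
    zeros = bytes(len(old_data))
    return xor_bytes(xor_bytes(old_parity, old_data), zeros)
```

Note that the old data (or the rest of the stripe) still has to be read to
fix up parity, which is the read/modify/write traffic described above.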

Another issue with trim is how "async" it behaves.  You can trim a lot
of data to a drive, but it is hard to tell when the drive actually is
ready afterwards.  Some drives also choke on trim requests that come
at them too fast or requests that are too long.  The behavior can be
quite random.  So then comes the issue of how many "user knobs" to
supply to tune what trims where.  Again, raid/0,1,10 are pretty easy.
Raid/4,5,6 really requires that you know the precise geometry and
control the IO.  Way beyond what ext4 understands at this point.
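
One of the simpler "knobs" would be a cap on the size of each trim request.
The sketch below only shows the splitting step; max_len is a hypothetical
tunable, and any pacing between requests is left to the caller.

```python
def chunk_discard(start, length, max_len):
    """Yield (start, length) pieces of a discard range, each at most
    max_len sectors long.  max_len is an assumed per-drive limit."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    pos = start
    end = start + length
    while pos < end:
        run = min(max_len, end - pos)
        yield (pos, run)
        pos += run
```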

Trim can also be "faked" with some drives.  Again, looking at the
SandForce-based drives, these drives internally de-dupe so you can fake
write data and help the drives get free space.  Do this by filling the
drive with zeros (i.e., dd if=/dev/zero of=big.file bs=1M), do a sync,
and then delete the big.file.  This works through md, across SANs,
from XEN virtuals, or wherever.  With SandForce drives, this is not as
effective as a trim, but better than nothing.  Unfortunately, only
SandForce drives and Flash SuperCharger understand zeros this way.  A
filesystem option that "zeros discarded sectors" would actually make
as much sense in some deployment settings as the discard option (not
sure, but ext# might already have this).  NTFS has actually supported
this since XP as a security enhancement.
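
The dd / sync / delete sequence can also be scripted.  The sketch below is
a rough Python equivalent, not a tested tool: the function name is
hypothetical, and max_bytes is an added safety limit that the dd command
above does not have (leave it as None to fill until the filesystem is
full).

```python
import errno
import os


def zero_fill_and_delete(path, block_size=1 << 20, max_bytes=None):
    """Write zeros to path, flush them to the device, then delete the file.

    Stops at max_bytes if given, otherwise when the filesystem reports
    ENOSPC.  Returns the number of bytes written.
    """
    block = bytes(block_size)
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while max_bytes is None or written < max_bytes:
            want = block_size if max_bytes is None else min(block_size, max_bytes - written)
            try:
                n = os.write(fd, block[:want])
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    break  # filesystem is full, which is the goal here
                raise
            if n == 0:
                break
            written += n
        os.fsync(fd)  # equivalent of the "do a sync" step
    finally:
        os.close(fd)
        os.remove(path)
    return written
```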

Doug Dumitru
EasyCo LLC

ps:  My background with this has been the development of Flash
SuperCharger.  I am not trying to run an advert here, but the care and
feeding of SSDs can be interesting.  Flash SuperCharger breaks most of
these rules, but it does know the exact geometry of what it is driving
and plays excessive games to drive SSDs at their exact "sweet spot".
One of our licensees just sent me some benchmarks at > 500,000 4K
random writes/sec for a moderate sized array running raid/5.

pps:  Failures of SSDs are different than HDDs.  SSDs can and do fail
and need raid for many applications.  If you need high write IOPS, it
pretty much has to be raid/1,10 (unless you run our Flash SuperCharger
layer).

ppps:  I have seen SSDs silently return corrupted data.  Disks do this
as well.  A paper from 2 years ago quoted disk silent error rates as
high as 1 bad block every 73TB read.  Very scary stuff, but probably
beyond the scope of what md can address.
